Load Balancers – HAProxy and ELB



I used to wonder how sites like Google handle the huge number of requests they receive. Later I learned about load balancing. With load balancing, we keep multiple servers in the back end and route the incoming requests across them. This gives faster responses as well as high availability, so load balancers play a very important role. There are many open source load balancers as well as paid services. HAProxy is one of the open source load balancers, and Amazon provides a load balancer as a service, known as Elastic Load Balancer (ELB). With a load balancer, we can handle a very large number of requests in a reliable and efficient way.

We can also put a load balancer in front of Impala to distribute the requests hitting the Impala servers. For on-premise environments we can configure HAProxy, and for cloud environments we can use ELB. ELB is a ready-to-use service: we just have to specify the ports to be forwarded and the machines that receive the traffic. HAProxy is a simple application that is available in the Linux repositories, and it is also very easy to configure.
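To make the idea of routing requests across back-end servers concrete, here is a minimal sketch of round-robin selection, one of the simplest balancing strategies. It is only an illustration and not how HAProxy or ELB are implemented; the class name and the host names (backend-1 and so on) are hypothetical.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand out back-end servers in rotation, skipping ones marked as down."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("at least one back-end server is required")
        self._backends = list(backends)
        self._healthy = set(self._backends)
        self._rotation = cycle(self._backends)

    def mark_down(self, backend):
        # A real balancer would decide this from periodic health checks.
        self._healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self._backends:
            self._healthy.add(backend)

    def next_backend(self):
        # Look at each back end at most once per call so we never loop forever.
        for _ in range(len(self._backends)):
            candidate = next(self._rotation)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy back-end servers available")


# Hypothetical host names, used only for this demonstration.
balancer = RoundRobinBalancer(["backend-1", "backend-2", "backend-3"])
balancer.mark_down("backend-2")
picks = [balancer.next_backend() for _ in range(4)]
print(picks)
```

Because backend-2 is marked as down, the requests alternate between the two remaining servers. This is where the high availability comes from: clients keep getting answers even when one back end fails.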


Hive error in a Sentry-enabled cluster – “add jar” command throws “Insufficient privileges to execute add”

Apache Sentry is a system for enforcing fine-grained, role-based authorization to data and metadata stored on a Hadoop cluster. It is very useful for securing a cluster. Using Sentry, we can configure fine-grained access to databases, directories and tables in Hive and Impala. Before Sentry, the only way to limit access was HDFS directory permissions, and that was not very effective.

In a Sentry-enabled cluster, adding jars with the “add jar” command fails with the exception below.

"Insufficient privileges to execute add"

You cannot run the add jar command even as an admin user, because Sentry blocks the privilege to add jars from Hue or Beeline. The solution is to have an administrator add the jar globally using the hive.aux.jars.path property.
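As a rough illustration, the property entry for hive-site.xml could be generated like this. It is a sketch under assumptions: the jar locations are hypothetical, the value is assumed to be a comma-separated list of jar paths, and on a managed cluster the setting is normally changed through the cluster manager instead of by editing the file by hand.

```python
import xml.etree.ElementTree as ET


def aux_jars_property(jar_paths):
    """Render a <property> element that sets hive.aux.jars.path."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = "hive.aux.jars.path"
    # Assumption: the property takes a comma-separated list of jar paths.
    ET.SubElement(prop, "value").text = ",".join(jar_paths)
    return ET.tostring(prop, encoding="unicode")


# Hypothetical jar locations, used only for this demonstration.
snippet = aux_jars_property([
    "file:///opt/hive/aux/my-udfs.jar",
    "file:///opt/hive/aux/my-serde.jar",
])
print(snippet)
```

The Hive services usually have to be restarted before they pick up the new jars.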

Impala Scratch directory issue

While running queries over large data in Impala, we may sometimes get an error like this.

WARNINGS: Create file /tmp/impala-scratch/94869b99d0d6457:765d2bc009a914ad_94869b99d0d6457:765d2bc009a914af_516bff1b-7342-434e-8c95-c777bb7f237e failed with errno=2 description=Error(2): No such file or directory

Backend 1:Create file /tmp/impala-scratch/94869b99d0d6457:765d2bc009a914ad_94869b99d0d6457:765d2bc009a914af_516bff1b-7342-434e-8c95-c777bb7f237e failed with errno=2 description=Error(2): No such file or directory

One of my friends faced this issue, and on investigation I found that it is caused by the unexpected deletion of files inside the Impala scratch directory. The intermediate files used during large sort, join, aggregation or analytic function operations are stored in this scratch directory. By default, the Impala scratch directory is /tmp/impala-scratch. These intermediate files are removed once the query finishes. The best solution for this problem is to change the scratch directory to some other directory. This can be done by starting the Impala daemon with the option --scratch_dirs="path_to_directory". This directory is on the local Linux file system, not in HDFS. Impala will not start if it does not have proper read/write access to the files in the scratch directory.

If you are using Impala on an EMR cluster, modify the startup options by making the change in the bootstrap action. To change this setting on an existing EMR cluster, stop the service nanny on all the nodes and restart Impala with the scratch directory property. If the service nanny is running, you will not be able to restart Impala with the new argument, because the service nanny will restart the service before you do. 🙂
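Since Impala refuses to start without read/write access to the scratch directory, it can help to check a candidate directory first. The following is a minimal sketch: the function names are my own, and the check must be run as the same OS user that runs the Impala daemon to be meaningful.

```python
import os
import tempfile


def check_scratch_dir(path):
    """Return a list of problems found with `path` as a scratch directory."""
    problems = []
    if not os.path.isdir(path):
        problems.append(f"{path} does not exist or is not a directory")
        return problems
    if not os.access(path, os.R_OK | os.W_OK | os.X_OK):
        problems.append(f"{path} is not readable and writable by the current user")
        return problems
    # Permission bits can be misleading, so also try to create a real file.
    try:
        with tempfile.NamedTemporaryFile(dir=path):
            pass
    except OSError as exc:
        problems.append(f"could not create a file in {path}: {exc}")
    return problems


def scratch_dirs_flag(paths):
    """Build the startup option; several directories are joined with commas."""
    return "--scratch_dirs=" + ",".join(paths)


# Demonstration on a throwaway directory; substitute your real scratch path.
with tempfile.TemporaryDirectory() as demo_dir:
    print(check_scratch_dir(demo_dir))
    print(scratch_dirs_flag([demo_dir]))
```

An empty list means no problem was found. Choosing a directory outside /tmp also reduces the chance that something else cleans up the files while a query is still running.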

Service Nanny in AWS EMR

Service nanny is a service that runs on all the nodes of an AWS EMR cluster and controls the operation of the daemons on each node. If a process gets killed, for example by the OOM killer or because of overload, the service nanny restarts it immediately and ensures that the service stays alive. This keeps the cluster services running despite unexpected exits. So even if you kill or stop a process yourself, it will be restarted automatically.

Recently I faced an issue with Impala on AWS EMR. I was getting the Impala scratch directory error described in this post. I was using a small 3-node EMR cluster. Instead of creating a new cluster, I thought of restarting the Impala daemon with additional arguments. But I was not able to do this, because the service nanny was starting the daemon before I could start it myself. So I stopped the service nanny on all the nodes, restarted Impala with the extra arguments, and then started the service nanny again.

We can modify the service nanny's behavior by editing the config files in the /etc/service-nanny/ directory. There is a config file for each service controlled by the service nanny. You can add, remove or modify the control actions by adding, removing or modifying these config files.
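The restart behavior can be pictured with a small watchdog loop. This is only a sketch of the general technique, not the actual service nanny implementation. The function name is hypothetical, and a real watchdog would loop forever instead of stopping after a fixed number of restarts.

```python
import subprocess
import sys
import time


def supervise(command, max_restarts=3, delay_seconds=0.1):
    """Run `command` and start it again each time it exits.

    Stops after `max_restarts` restarts and returns the exit codes seen.
    """
    exit_codes = []
    for attempt in range(max_restarts + 1):
        process = subprocess.Popen(command)
        exit_codes.append(process.wait())  # block until the child exits
        if attempt < max_restarts:
            time.sleep(delay_seconds)  # short pause before restarting
    return exit_codes


# A child process that "crashes" immediately with exit code 1.
crashing_child = [sys.executable, "-c", "import sys; sys.exit(1)"]
codes = supervise(crashing_child, max_restarts=2)
print(codes)
```

Note that the watchdog always relaunches the command it was configured with. A daemon started by hand with different arguments loses the race unless the watchdog is stopped first or its configuration is changed.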

Facebook Open Sourced Presto

Facebook has open sourced its data processing engine, Presto. Presto is a distributed query engine based on ANSI SQL. It is highly optimized and currently runs over more than 300 petabytes of data, which may make it one of the largest big data processing systems. Presto is totally different from MapReduce: it processes data in memory. According to the Facebook newsletter and the Presto website, it is 10 times faster than Hive/MapReduce. Hive itself came from Facebook, so Presto should comfortably beat Hive. Hive queries ultimately run as multiple MapReduce jobs, which takes more time.

From my point of view, the real competition is between Cloudera Impala and Presto. Performance figures for Impala on huge datasets in production are not available yet, because it is still a young technology from Cloudera, whereas Presto is already tested and running in production on huge datasets. Another interesting fact about Presto is that we can deploy it on our existing infrastructure and Hadoop cluster, because Presto supports HDFS as its underlying data storage. It supports other storage systems too, so it is flexible. Leading internet companies including Airbnb and Dropbox are using Presto. The Presto code and further details are available at this link.

I have deployed Presto and Impala on a small cluster of 8 nodes. I have not had enough time to explore Presto in depth yet, but I am planning to explore it more in the coming days. 🙂