Unauthorized connection for super-user: hue from IP “x.x.x.x”

If you get the following error in Hue:

Unauthorized connection for super-user: hue from IP “x.x.x.x”

add the following properties to the core-site.xml of your Hadoop cluster and restart the cluster.

<property>
<name>hadoop.proxyuser.hue.groups</name>
<value>*</value>
</property>

<property>
<name>hadoop.proxyuser.hue.hosts</name>
<value>*</value>
</property>

You may face a similar error with Oozie. In that case, add a similar configuration for the oozie user in core-site.xml.

<property>
<name>hadoop.proxyuser.oozie.groups</name>
<value>*</value>
</property>

<property>
<name>hadoop.proxyuser.oozie.hosts</name>
<value>*</value>
</property>
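Since the Hue and Oozie blocks differ only in the user name, the pair of properties can be generated. The following is a minimal Python sketch, not part of any Hadoop tooling: the helper names `proxyuser_properties` and `render` and the host name `oozie-host.example.com` are made up for illustration. Hadoop accepts a comma-separated list in place of `*` for both properties, so the helper takes the hosts and groups as parameters in case you want something narrower than a wildcard.

```python
import xml.etree.ElementTree as ET


def proxyuser_properties(user, hosts="*", groups="*"):
    """Build the two <property> elements for hadoop.proxyuser.<user>.*.

    hosts and groups are passed through unchanged, so they can be "*"
    or a comma-separated list such as "host1,host2".
    """
    props = []
    for suffix, value in (("groups", groups), ("hosts", hosts)):
        prop = ET.Element("property")
        ET.SubElement(prop, "name").text = f"hadoop.proxyuser.{user}.{suffix}"
        ET.SubElement(prop, "value").text = value
        props.append(prop)
    return props


def render(props):
    """Serialize the elements as text that can be pasted into core-site.xml."""
    return "\n".join(ET.tostring(p, encoding="unicode") for p in props)


if __name__ == "__main__":
    # Same wildcard settings as in the post.
    print(render(proxyuser_properties("hue")))
    # Hypothetical narrower variant: only one host may proxy as oozie.
    print(render(proxyuser_properties("oozie", hosts="oozie-host.example.com")))
```

The output is compact (one property per line) and still has to be pasted inside the `<configuration>` element of core-site.xml by hand.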

Enabling Log Aggregation in YARN

If you see a message similar to “Log Aggregation not enabled” while checking the details of a YARN application, you can follow the steps below to enable it. This issue occurs on EMR because log aggregation is not enabled by default in most of the AMIs. It is very simple to enable: add the following configuration to the yarn-site.xml of all the YARN hosts and restart the NodeManagers. (A full cluster restart is not required; restarting all the NodeManagers is enough.)

<property>
    <name>yarn.log-aggregation-enable</name>
    <value>true</value>
</property>

<property>
    <description>Where to aggregate logs to.</description>
    <name>yarn.nodemanager.remote-app-log-dir</name>
    <value>/tmp/logs</value>
</property>

<property>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>259200</value>
</property>

<property>
    <name>yarn.log-aggregation.retain-check-interval-seconds</name>
    <value>3600</value>
</property>
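The retention values above are in seconds: 259200 seconds is 3 days, and the check interval of 3600 seconds is 1 hour. As a rough sanity check before restarting the NodeManagers, a small Python sketch like the one below can read a yarn-site.xml and summarize the log aggregation settings. The function names are hypothetical, and the fallback values in the code are what I understand the stock YARN defaults to be, so treat them as assumptions.

```python
import xml.etree.ElementTree as ET

SECONDS_PER_DAY = 24 * 60 * 60


def read_properties(xml_text):
    """Return a {name: value} dict from Hadoop-style configuration XML."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value") for p in root.iter("property")}


def log_aggregation_summary(props):
    """Summarize the log aggregation settings found in props.

    Fallbacks are assumed YARN defaults: disabled, /tmp/logs, and -1
    (meaning no retention limit is configured).
    """
    enabled = props.get("yarn.log-aggregation-enable", "false").strip().lower() == "true"
    retain = int(props.get("yarn.log-aggregation.retain-seconds", "-1"))
    return {
        "enabled": enabled,
        "remote_dir": props.get("yarn.nodemanager.remote-app-log-dir", "/tmp/logs"),
        "retain_days": retain / SECONDS_PER_DAY if retain >= 0 else None,
    }


if __name__ == "__main__":
    sample = """
    <configuration>
      <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
      </property>
      <property>
        <name>yarn.log-aggregation.retain-seconds</name>
        <value>259200</value>
      </property>
    </configuration>
    """
    print(log_aggregation_summary(read_properties(sample)))
```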

Installing Cloudera Manager in an existing hadoop cluster

Cloudera Manager is an infrastructure management and monitoring tool provided by Cloudera. It has become an excellent tool for managing big data infrastructure, and it has greatly reduced the pain of administrators. Almost everything an administrator needs is integrated into this software, and it is very user friendly. Cloudera Manager became this powerful only recently, so a lot of existing clusters are still running without it. If you want to manage an existing cluster using Cloudera Manager, the following steps may help you. For this you have to completely uninstall the existing Hadoop setup. No data loss will happen, because we are not touching any data. The configurations will also remain the same. These are just pointers.

1) Stop all the services
2) Back up the Hive metastore, the NameNode metadata and all the other required metastores (e.g. Hue, Oozie)
3) Back up all the configurations
4) Note down the existing storage directories
5) Uninstall all the Hadoop services (never touch the data)
6) Install the Cloudera Manager Server and Agents
7) Install all the services (use the same versions as before to make the installation smoother)
8) Add the configurations (use the same configurations as before; there is an option to add XML configs in CM)
9) Point the Cloudera Manager configurations to the existing storage directories
10) Point the new installation to the existing metastores (Hive, Oozie, Hue etc.)
11) Start all the services (don’t format the NameNode)
12) Test the cluster
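For step 4, the storage directories can be read out of the backed-up hdfs-site.xml instead of being copied by hand. Below is a minimal Python sketch, assuming the directories are set through the usual HDFS properties (`dfs.namenode.name.dir` and `dfs.datanode.data.dir`, or the older `dfs.name.dir` and `dfs.data.dir`). The function name and the sample paths are made up for illustration, and other services keep their directories under different properties that this does not cover.

```python
import xml.etree.ElementTree as ET

# HDFS properties that hold storage directories (newer and older names).
STORAGE_KEYS = (
    "dfs.namenode.name.dir",
    "dfs.datanode.data.dir",
    "dfs.name.dir",
    "dfs.data.dir",
)


def storage_directories(xml_text):
    """Return {property name: [directories]} for the storage properties found."""
    root = ET.fromstring(xml_text)
    found = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        if name in STORAGE_KEYS:
            value = prop.findtext("value") or ""
            # The value is a comma-separated list of directories.
            found[name] = [d.strip() for d in value.split(",") if d.strip()]
    return found


if __name__ == "__main__":
    # Hypothetical sample; in practice read the backed-up hdfs-site.xml.
    sample = """
    <configuration>
      <property>
        <name>dfs.namenode.name.dir</name>
        <value>/data/1/nn,/data/2/nn</value>
      </property>
      <property>
        <name>dfs.datanode.data.dir</name>
        <value>/data/1/dn, /data/2/dn</value>
      </property>
    </configuration>
    """
    for key, dirs in storage_directories(sample).items():
        print(key, dirs)
```

The resulting list is what has to be entered again in step 9.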

Hive error in a Sentry-enabled cluster: “add jar” command throws “Insufficient privileges to execute add”

Apache Sentry is a system for enforcing fine-grained, role-based authorization to data and metadata stored on a Hadoop cluster. It is very useful for securing a cluster. Using Sentry, we can configure fine-grained access to databases, directories and tables in Hive and Impala. Before Sentry, the only way to limit access was HDFS directory permissions, and that was not very effective.

In a Sentry-enabled cluster, adding jars with the “add jar” command fails with the exception below.

"Insufficient privileges to execute add"

You will not be able to run the add jar command even as an admin user. Sentry restricts the privilege to add jars from Hue or Beeline. The solution is to have an administrator add the jar globally using hive.aux.jars.path.

Hadoop Cluster Migrator

The Hadoop Cluster Migrator tool provides a unified interface for copying data from one cluster to another. Traditionally, the DistCp tool provided by Hadoop is used to migrate and copy data from one Hadoop cluster to another. However, DistCp works only when the connectivity between the source and target clusters is well established, without any firewall rules blocking it. In production scenarios, the edge nodes isolate the clusters from each other, and DistCp can’t be used for data transfer or cluster backup. This is where Hadoop Cluster Migrator from Knowledge Lens can be very handy.

Hadoop Cluster Migrator is a cluster-agnostic tool that supports migration across different distributions and different versions of Hadoop. Currently we support the MapR, Cloudera, Hortonworks, EMC Pivotal and Apache distributions of Hadoop, in both Kerberos-enabled and Kerberos-disabled mode.

This completely Java-based tool provides large-scale data transfer between clusters, with a transfer rate in the range of 10GB/s depending on the available bandwidth. The tool is completely restartable and resumes from the point where the last transfer stopped.

For more details, refer to: Hadoop Cluster Migrator

What is EMR?

EMR is a cloud service provided by Amazon. Its full form is Elastic MapReduce.

Using this service, we can launch Hadoop clusters of our desired size in a few minutes.

We can simply increase or decrease the number of nodes in the cluster while it is running, without any disturbance. That is why it is called Elastic. It is very simple to operate and doesn’t require much administration skill. We pay only for what we use, and there is no need for a server room, cooling, power backup etc. We get everything very fast for an affordable amount. We can configure Hadoop and Hadoop ecosystem components such as Hive, Pig and Impala in an EMR cluster.

Shark and Spark are now also available on EMR. If we need any additional services installed in our cluster, we can write our own custom bootstrap script that installs them and add the script while launching the cluster.

There are three types of nodes in an EMR cluster: Master, Core and Task.

The Master node runs the master daemons of the Hadoop cluster: NameNode and JobTracker for MRv1, or NameNode and ResourceManager in the case of YARN. Core nodes run DataNode and TaskTracker for MRv1, or DataNode and NodeManager for YARN. Task nodes run only the processing daemons, i.e. TaskTracker or NodeManager. After launching a cluster we can increase the number of Core nodes and Task nodes, but we can decrease only the number of Task nodes. We can’t reduce the number of Core nodes, because Core nodes run the DataNodes that store the HDFS data, and decreasing the number of DataNodes may result in data loss.
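The resize rule described above can be written down as a small check. This is only a Python sketch of the rule as stated in this post, not EMR's API: `ClusterSize` and `validate_resize` are hypothetical names, and no AWS calls are made.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterSize:
    """Number of Core and Task nodes in a running cluster."""
    core: int
    task: int


def validate_resize(current, requested):
    """Return requested if the resize follows the rule described above.

    Core nodes hold HDFS data, so their count may only stay the same or
    grow. Task nodes hold no data, so their count may go up or down.
    """
    if requested.core < current.core:
        raise ValueError(
            f"cannot shrink core nodes from {current.core} to {requested.core}: "
            "they run DataNodes and removing them may lose data"
        )
    if requested.task < 0:
        raise ValueError("task node count cannot be negative")
    return requested


if __name__ == "__main__":
    running = ClusterSize(core=4, task=2)
    # Growing core nodes and dropping all task nodes is allowed.
    print(validate_resize(running, ClusterSize(core=6, task=0)))
    # Shrinking core nodes is rejected.
    try:
        validate_resize(running, ClusterSize(core=3, task=2))
    except ValueError as err:
        print("rejected:", err)
```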

A super cool Python library called Boto is available for working with EMR.

Why can’t EMR be launched in all types of VPCs?

To launch an EMR cluster, the VPC should have an internet gateway and a subnet. So if internet access is restricted in the VPC, EMR cannot be launched. The reason is that, while launching, EMR contacts some remote locations to download the required software and installation scripts. If internet access is not available, those connections are blocked and the installation fails.

Namenode High Availability

Last week I tried NameNode HA using the Cloudera Distribution of Hadoop (CDH 4.5.0).

It is very easy to configure, and automatic failover works smoothly. I tested the failover by pulling the power cable of one of the NameNodes, and it worked. For more details, visit the official website of Cloudera.
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Installation-Guide/CDH4-Installation-Guide.html

I tried this HA setup on a 64 TB Hadoop cluster.

Deployment and Management of Hadoop Clusters