
Installing Cloudera Manager in an existing hadoop cluster

Cloudera Manager is an infrastructure management and monitoring tool provided by Cloudera. It has become an excellent tool for managing big data infrastructure. The pain of administrators has been reduced by about 80% with Cloudera Manager: almost everything an administrator needs is integrated into this software, and it is very user friendly. Cloudera Manager became this powerful only recently, so a lot of existing clusters are still running without it. If you want to manage an existing cluster using Cloudera Manager, the following steps may help you. For this you have to completely uninstall the existing hadoop set up. No data loss will happen because we are not touching any data, and the configurations will remain the same. These are just pointers.

1) Stop all the services
2) Back up the hive metastore, the namenode metadata and all the other required metastores (e.g. Hue, Oozie). A rough sketch of this backup is given after the list.
3) Back up all the configurations
4) Note down the existing storage directories
5) Uninstall all the hadoop services (never touch the data)
6) Install the Cloudera Manager server and agents
7) Install all the services (use the same versions as before to make the installation smoother)
8) Add the configurations (use the same configurations as before; there is an option to add XML configs in CM)
9) Point the storage directories in the Cloudera Manager configuration to the existing locations
10) Point the new installation to the existing metastores (hive, oozie, hue etc.)
11) Start all the services (do not format the namenode)
12) Test the cluster
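
For example, steps 2 and 3 could look roughly like this. This is only a sketch; the metadata path, the metastore database name and the credentials are assumptions and will differ in your cluster.

# back up the namenode metadata directory (the path /data/dfs/nn is just an example)
tar -czf nn-metadata-backup.tar.gz /data/dfs/nn
# dump the hive metastore database (database name and user are assumptions)
mysqldump -u hiveuser -p metastore > hive-metastore-backup.sql
# back up the existing configuration directory
tar -czf hadoop-conf-backup.tar.gz /etc/hadoop/conf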

Migrating Namenode from one host to another host

The namenode is the heart of a hadoop cluster, so it is usually installed on better hardware than the other nodes. If you want to migrate the namenode from one host to another, the following steps are required. This is a rare scenario.

Manual Approach

Method 1: (by migrating the hard drive)

  • Stop all the running jobs in the cluster
  • Put the namenode into safe mode
    • hdfs dfsadmin -safemode enter
  • Execute the following command to save the current namespace to the storage directories and reset the edit logs.
    • hdfs dfsadmin -saveNamespace
  • Stop the entire cluster
  • Remove the hard disk from the old namenode host and attach it to the new namenode host
  • Release the IP address from the old namenode host and assign it to the new namenode host
  • Start the new namenode (DO NOT PERFORM FORMAT)
  • Start all the services

Method 2: (new hard drive)

  • Stop all the running jobs in the cluster
  • Put the namenode into safe mode
    • hdfs dfsadmin -safemode enter
  • Execute the following command to save the current namespace to the storage directories and reset the edit logs.
    • hdfs dfsadmin -saveNamespace
  • Stop the entire cluster
  • Login to the namenode host.
  • Navigate to the namenode storage directories.
  • Copy the namenode metadata. It is always better to keep this as a compressed file. Note down the folder and file permissions and ownership. (A sketch of this copy is given after the list.)
  • Take a back up of the configuration files.
  • Install namenode of the same version as that of the existing system to the new machine.
  • Ensure that the IP address of the old host is released and assigned to the new host.
  • Copy the configuration files and metadata to the new namenode host
  • Create namenode storage directory structure in the new host.
  • Maintain the same folder permissions and ownership in the new host also.
  • If there are any changes in namenode directory structure, make the corresponding changes in config files.
  • In case of a Kerberized cluster, create the appropriate principals for the new host and place the proper keytabs.
  • Start the new namenode. (DO NOT PERFORM FORMAT)
  • Start the remaining services.
  • Test the working of the cluster by executing file system operations as well as MR operations.
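
A rough sketch of the metadata copy described above, assuming the namenode storage directory is /data/dfs/nn (adjust it to whatever dfs.name.dir points to in your cluster):

# on the old namenode host: archive the metadata, preserving permissions
cd /data/dfs
tar -cpzf nn-metadata.tar.gz nn
scp nn-metadata.tar.gz root@new-namenode-host:/data/dfs/

# on the new namenode host: restore and fix the ownership
cd /data/dfs
tar -xpzf nn-metadata.tar.gz
chown -R hdfs:hadoop /data/dfs/nn   # use the same user/group as on the old host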

Automated Approach in a cluster managed using Cloudera Manager (CM 5.4 and above)

If you are using Cloudera Manager 5.4 or above, there is a feature called Namenode Role Migration that helps us migrate the namenode from one host to another. This requires HDFS HA to be enabled.

Namenode High Availability

Last week I tried Namenode HA using Cloudera Distribution of Hadoop (CDH 4.5.0).

It is very easy to configure, and automatic failover works smoothly. I tested the redundancy by pulling the power cable of one of the namenodes, and the failover was successful. For more details, visit the official website of Cloudera.
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Installation-Guide/CDH4-Installation-Guide.html

I have tried this HA in a 64 TB hadoop cluster.

Migrating hive from one hadoop cluster to another cluster

Recently I migrated a hive installation from one cluster to another. I couldn't find any
documentation about this kind of migration, so I did it based on my own experience and knowledge.
Hive stores its metadata, i.e. the information about the tables, in a database.
For development/production grade installations, we normally use a mysql/oracle/postgresql database.
Here I am explaining the migration of hive together with its metastore database in mysql.
The metadata contains the information about the tables, while the contents of the tables are stored in hdfs.
So the metadata contains hdfs URIs and other details. If we migrate hive from one cluster to another
cluster, we have to point the metadata to the hdfs of the new cluster. If we don't do this, it will keep
pointing to the hdfs of the old cluster.

For migrating a hive installation, we have to do the following things.

1) Install hive in the new hadoop cluster
2) Transfer the data present in the hive warehouse directory (/user/hive/warehouse) to the new hadoop cluster
3) Take a dump of the mysql metastore
4) Install mysql in the new hadoop cluster
5) Open the hive mysql metastore dump using a text editor such as notepad, notepad++ etc., search for hdfs://ip-address-old-namenode:port, replace it with hdfs://ip-address-new-namenode:port and save it.

Here ip-address-old-namenode is the IP address of the namenode of the old hadoop cluster and ip-address-new-namenode is the IP address of the namenode of the new hadoop cluster.

6) After doing the above steps, restore the edited mysql dump into the mysql of the new hadoop cluster. (A sketch of steps 3, 5 and 6 is given after this list.)
7) Configure hive as normal and do the hive schema upgrades if needed
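
A minimal sketch of steps 3, 5 and 6, assuming a mysql metastore database named metastore and a metastore user named hiveuser (the names, hosts and ports are assumptions):

# on the old cluster: dump the metastore database
mysqldump -u hiveuser -p metastore > hive_metastore.sql

# rewrite the namenode uri inside the dump (sed does the same job as a text editor)
sed -i 's|hdfs://ip-address-old-namenode:port|hdfs://ip-address-new-namenode:port|g' hive_metastore.sql

# on the new cluster: restore the edited dump
mysql -u hiveuser -p metastore < hive_metastore.sql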

This is a solution that I discovered when I faced these migration issues. I don't know whether any other
standard methods are available, but this worked for me perfectly. 🙂

Deployment and Management of Hadoop Clusters

Back Up Mechanism for Namenode

The namenode is the single point of failure in a hadoop cluster, because it stores the metadata of the entire hadoop file system.
So extra care should be given to maintaining it, and we use the best hardware for the namenode machine.
Even with the best hardware, complete protection cannot be guaranteed, because hardware issues can happen at any time. So a backup for the namenode is very necessary.
One of the methods is to create a simple backup store by mounting a partition of another machine, located in a different place, on the namenode machine.
The backup machine should have the same hardware/software specification as the namenode machine and should have hadoop installed in the same way as the namenode machine, but the hadoop services are not started on it.
In case of a failure, we can start the namenode on this backup machine and it runs like the normal namenode. The only thing we need to do is assign the IP address/hostname of the actual namenode to the backup namenode.

In hdfs-site.xml, we give an additional value to the dfs.name.dir property,
i.e. the actual location followed by the backup location.

Eg:

<property>
<name>dfs.name.dir</name>
<value>/app/hadoop/name,/app/hadoop/backup</value>
<description>
Determines where on the local filesystem the DFS name node should store the name table(fsimage). If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy. 
</description>
</property>

Here /app/hadoop/name is the actual namenode storage location and /app/hadoop/backup is the location where the partition is mounted for storing the namenode backup.
In case of a failure of the first namenode machine, the namenode data will be safe on the second (backup) machine, so we can start the namenode there.
The second machine is placed in a different location and is provided with a different power supply, so the two machines have different failure dependencies, which makes the backup effective.
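
As an illustration, the backup location can be an NFS export from the backup machine mounted on the namenode. The hostname and paths below are assumptions.

# on the namenode host, mount the directory exported by the backup machine
mkdir -p /app/hadoop/backup
mount -t nfs backup-host:/app/hadoop/backup /app/hadoop/backup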

Recovery of deleted files in Hadoop

There may be incidents in which we accidentally delete necessary files from hadoop; sometimes the entire file system may get deleted. The steps below may help you with the recovery process.

For this recovery method, trash should be enabled in hdfs. Trash can be enabled by setting the property fs.trash.interval to a value greater than 0; by default the value is zero. Its value is the number of minutes after which a trash checkpoint gets deleted. If zero, the trash feature is disabled. We have to set this property in core-site.xml.

<property>
  <name>fs.trash.interval</name>
  <value>30</value>
  <description>Number of minutes after which the checkpoint
  gets deleted.
  If zero, the trash feature is disabled.
  </description>
</property>

There is one more property related to the one above, fs.trash.checkpoint.interval. It is the number of minutes between trash checkpoints and should be smaller than or equal to fs.trash.interval. Every time the checkpointer runs, it creates a new checkpoint out of the current trash and removes checkpoints created more than fs.trash.interval minutes ago. The default value of this property is zero.

<property>
  <name>fs.trash.checkpoint.interval</name>
  <value>15</value>
  <description>Number of minutes between trash checkpoints.
  Should be smaller or equal to fs.trash.interval.
  Every time the checkpointer runs it creates a new checkpoint 
  out of current and removes checkpoints created more than 
  fs.trash.interval minutes ago.
  </description>
</property>

If the above properties are enabled in your cluster, the deleted files will be present in the .Trash directory in hdfs. You have time to recover the files until the checkpoint containing them is deleted, i.e. up to fs.trash.interval minutes; after that the deleted files will no longer be present in .Trash, so recover them before that happens. If this property is not enabled in your cluster, you can enable it now for future recovery. 🙂
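
For example, a file deleted by the user training can be restored like this before the checkpoint expires (the file name is hypothetical):

# deleted files are kept under the user's .Trash directory in hdfs
hadoop fs -ls /user/training/.Trash/Current/user/training/
# move the file back to its original location
hadoop fs -mv /user/training/.Trash/Current/user/training/important.txt /user/training/important.txt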

HDFS Operations Using Java Program

We are familiar with Hadoop Distributed File System operations such as copyFromLocal, copyToLocal, mv, cp, rmr etc.
Here I am explaining how to do these operations using the Java API. For now I am covering only the programs for the copyFromLocal and copyToLocal operations.

Here I used the Eclipse IDE for programming, which is installed on my Windows desktop machine.
I have a hadoop cluster, and the cluster machines and my desktop machine are in the same network.

First create a java project and inside that create a folder named conf. Copy the hadoop configuration files (core-site.xml, mapred-site.xml, hdfs-site.xml) from your hadoop installation to this conf folder.

Create another folder named source, which we will use as the input location, and put a text file inside that source folder.
One thing you have to remember is that the source and destination locations should have appropriate permissions; otherwise reads/writes will be blocked.

Copying a File from Local to HDFS

The command is
hadoop fs -copyFromLocal <local-source> <hdfs-destination>

package com.amal.hadoop;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * @author amalgjose
 *
 */
public class CopyFromLocal {

	public static void main(String[] args) throws IOException {
		
		Configuration conf = new Configuration();
		// Load the cluster configuration copied into the local conf folder
		conf.addResource(new Path("conf/core-site.xml"));
		conf.addResource(new Path("conf/mapred-site.xml"));
		conf.addResource(new Path("conf/hdfs-site.xml"));
		FileSystem fs = FileSystem.get(conf);
		// Local source folder and HDFS destination directory
		Path sourcePath = new Path("source");
		Path destPath = new Path("/user/training");
		// Check that the destination directory exists in HDFS
		if(!(fs.exists(destPath)))
		{
			System.out.println("No such destination exists: "+destPath);
			return;
		}

		// Copy the local file/folder into HDFS
		fs.copyFromLocalFile(sourcePath, destPath);
		
	}
}

Copying a File from HDFS to Local

The command is
hadoop fs -copyToLocal <hdfs-source> <local-destination>

package com.amal.hadoop;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
/**
 * @author amalgjose
 *
 */
public class CopyToLocal {
	public static void main(String[] args) throws IOException {

		Configuration conf = new Configuration();
		// Load the cluster configuration copied into the local conf folder
		conf.addResource(new Path("conf/core-site.xml"));
		conf.addResource(new Path("conf/mapred-site.xml"));
		conf.addResource(new Path("conf/hdfs-site.xml"));
		FileSystem fs = FileSystem.get(conf);
		// HDFS source path and local destination folder
		Path sourcePath = new Path("/user/training");
		Path destPath = new Path("destination");
		// Check that the source exists in HDFS
		if(!(fs.exists(sourcePath)))
		{
			System.out.println("No such source exists: "+sourcePath);
			return;
		}

		// Copy the HDFS file/folder to the local file system
		fs.copyToLocalFile(sourcePath, destPath);
		
	}
}

Rhipe Installation

Rhipe was first developed by Saptarshi Guha.
Rhipe needs R and Hadoop, so first install R and hadoop. The installation of R and hadoop is well explained in my previous posts. The latest version of Rhipe as of now is Rhipe-0.73.1, and the latest available version of R is R-3.0.0. If you are using CDH4 (the Cloudera distribution of hadoop), use Rhipe-0.73 or a later version, because older versions may not work with CDH4.
Rhipe is an R and Hadoop integrated programming environment; it integrates R with Hadoop. Rhipe is very good for statistical and analytical calculations on very large data, because R is integrated with hadoop, so the processing runs in distributed mode, i.e. mapreduce.
Further explanations of Rhipe are available at http://www.datadr.org/

Prerequisites

Hadoop, R, protocol buffers and rJava should be installed before installing Rhipe.
We are installing Rhipe in a hadoop cluster, so a submitted job may execute on any of the tasktracker nodes. Therefore we have to install R and Rhipe on all the tasktracker nodes; otherwise you will face an exception such as "Cannot find R". A small loop for doing this is sketched below.
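
For example, a loop like the one below can push the installation to every tasktracker node. This is only a sketch; it assumes a file named tasktracker_hosts containing the hostnames, passwordless ssh as root, and that R itself is already installed on each node.

for host in $(cat tasktracker_hosts); do
  scp Rhipe_0.73.1.tar.gz root@$host:/tmp/
  ssh root@$host "R CMD INSTALL /tmp/Rhipe_0.73.1.tar.gz"
done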

Installing Protocol Buffer

Download the protocol buffer 2.4.1 from the below link

http://protobuf.googlecode.com/files/protobuf-2.4.1.tar.gz

tar -xzvf protobuf-2.4.1.tar.gz

cd protobuf-2.4.1

chmod -R 755 protobuf-2.4.1

./configure

make

make install

Set the environment variable PKG_CONFIG_PATH

nano /etc/bashrc

export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig

save and exit

Then execute the following command to check the installation

pkg-config --modversion protobuf

This will show the version number 2.4.1
Then execute

pkg-config --libs protobuf

This will display the following

-pthread -L/usr/local/lib -lprotobuf -lz -lpthread

If these two commands work fine, protobuf is properly installed.

Set the environment variables for hadoop

For example

nano /etc/bashrc

export HADOOP_HOME=/usr/lib/hadoop

export HADOOP_BIN=/usr/lib/hadoop/bin

export HADOOP_CONF_DIR=/etc/hadoop/conf

save and exit

Then


cd /etc/ld.so.conf.d/

nano Protobuf-x86.conf

/usr/local/lib   # add this value as the content of Protobuf-x86.conf

Save and exit

/sbin/ldconfig

Installing rJava

Download the rJava tarball from the below link.

http://cran.r-project.org/web/packages/rJava/index.html

The latest version of rJava available as of now is rJava_0.9-4.tar.gz

install rJava using the following command

R CMD INSTALL rJava_0.9-4.tar.gz

Installing Rhipe

Rhipe can be downloaded from the following link
https://github.com/saptarshiguha/RHIPE/blob/master/code/Rhipe_0.73.1.tar.gz

R CMD INSTALL Rhipe_0.73.1.tar.gz

This will install Rhipe

After this type R in the terminal

You will enter into R terminal

Then type

library(Rhipe)

#This will display

------------------------------------------------
| Please call rhinit() else RHIPE will not run |
------------------------------------------------

rhinit()

#This will display

Rhipe: Detected CDH4 jar files, using RhipeCDH4.jar
Initializing Rhipe v0.73
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/client-0.20/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/client/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Initializing mapfile caches

Now you can execute your Rhipe scripts.

Hadoop Installation in Isolated Environments

Introduction

Apache Hadoop is an open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license. It supports the running of applications on large clusters of commodity hardware. Hadoop was derived from Google's MapReduce and Google File System (GFS) papers.

Hadoop development needs a hadoop cluster.
A trial hadoop cluster can be set up in minutes. In my previous blog post, I described setting up a hadoop cluster using a tarball.

But for production environments, a tarball installation is not a good approach,
because it becomes complex when installing the other hadoop ecosystem components.
If we have a high speed internet connection, we can install hadoop directly from the internet using yum install in minutes.
But most production environments are isolated, so an internet connection may not be available.
There we can perform the yum install by creating a local yum repository.
The creation of a local yum repository is explained well in my previous blog post "Creating A Local YUM Repository".
Yum install works with the RedHat and CentOS linux distributions.
Here I am explaining the installation of the Cloudera Distribution of Hadoop.

Prerequisites

OPERATING SYSTEM

RedHat or CentOS (32 or 64 bit)

PORTS

The ports necessary for hadoop should be opened. So we need to set appropriate firewall rules. The ports used by hadoop are listed in the last part of this post.

If you are not interested in setting up firewall rules, you can simply switch off the firewall.

The command for turning off the firewall is

service iptables stop

JAVA

Sun java is required.

Download java from oracle website (32 or 64 bit depending on OS)

Install the java.

Simply installing it and setting JAVA_HOME may not make the newly installed java the default one.

It may still point to openjdk if that is present.

So, to point to the new java, do the following steps.

alternatives --config java

This will show the list of java versions installed on the machine and which one it is currently pointing to.

It will ask you to choose a java from the list.

Exit from this by pressing Ctrl+C.

To add our sun java to this list, do the following step.

/usr/sbin/alternatives --install /usr/bin/java java <JAVA_HOME>/bin/java 2

This will add our newly installed java to the list.

Then do

alternatives --config java

and choose the newly installed java. Now java -version will show sun java.

SETTING UP THE LOCAL HADOOP REPOSITORY

Download the Cloudera rpm repository from a place where you have internet access.

Download the repository corresponding to your OS version.
The repo files corresponding to the different operating systems are listed below. Copy the repo file and download the repository.

For your OS version, use the corresponding Cloudera repo file (the download links are on the Cloudera website):

Red Hat/CentOS/Oracle 5    -> Red Hat/CentOS/Oracle 5 repo file
Red Hat/CentOS 6 (32-bit)  -> Red Hat/CentOS/Oracle 6 repo file
Red Hat/CentOS 6 (64-bit)  -> Red Hat/CentOS/Oracle 6 repo file

You can download the packages rpm by rpm or do a repo-sync.

Repo-sync is explained in my previous post Creating A Local YUM Repository.

Once this is done, create a local repository in one of the machines.

Then create a repo file corresponding to the newly created repository and add that repo file to all the cluster machines.
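
For example, the repo file placed on every cluster machine could look like this (the repository server hostname and path are assumptions):

cat > /etc/yum.repos.d/cloudera-cdh4-local.repo <<'EOF'
[cloudera-cdh4-local]
name=Local CDH4 repository
baseurl=http://repo-server/cdh4/
enabled=1
gpgcheck=0
EOF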

After this we can do yum install just like on a machine that has internet access.

Now do a yum clean all in all the cluster machines.

Do the following steps on the corresponding nodes. The steps below explain the installation of MRv1 only.

Installation

NAMENODE MACHINE

yum install hadoop-hdfs-namenode

SECONDARY NAMENODE

yum install hadoop-hdfs-secondarynamenode

DATANODE

yum install hadoop-hdfs-datanode

JOBTRACKER

yum install hadoop-0.20-mapreduce-jobtracker

TASKTRACKER

yum install hadoop-0.20-mapreduce-tasktracker

IN ALL CLIENT MACHINES

yum install hadoop-client

Normally we run the datanode and tasktracker on the same node, i.e. they are co-located for data locality.

Now edit the core-site.xml, mapred-site.xml and hdfs-site.xml in all the machines.

Configurations

Sample configurations are given below.

For a production set up, you should tune additional properties.

core-site.xml


<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A Base dir for storing other temp directories</description>
</property>

<property>
<name>fs.default.name</name>
<value>hdfs://<namenode-hostname>:9000</value>
<description>The name of default file system</description>
</property>
</configuration>

mapred-site.xml

<configuration>
<property>
<name>mapred.job.tracker</name>
<value><jobtracker-hostname>:9001</value>
<description>Job Tracker port</description>
</property>

<property>
<name>mapred.local.dir</name>
<value>/app/hadoop/mapred_local</value>
<description>local dir for mapreduce jobs</description>
</property>


<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>6</value>
  <description>The maximum number of map tasks that will be run
  simultaneously by a task tracker.
  </description>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
  <description>The maximum number of reduce tasks that will be run
  simultaneously by a task tracker.
  </description>
</property>
</configuration>

hdfs-site.xml

<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
<description>replication factor</description>
</property>
</configuration>

Then create hadoop.tmp.dir on all the machines. Hadoop stores its files in this folder.

Here we are using the location /app/hadoop/tmp

mkdir -p /app/hadoop/tmp

mkdir /app/hadoop/mapred_local

This type of installation automatically creates two users

1) hdfs

2) mapred

The hdfs directories should be owned by hdfs and the mapreduce local directory by mapred, so we need to change the ownership.

chown -R hdfs:hadoop /app/hadoop/tmp

chmod -R 777 /app/hadoop/tmp

chown mapred:hadoop /app/hadoop/mapred_local

The properties

mapred.tasktracker.map.tasks.maximum :- this sets the number of map slots on each node.

mapred.tasktracker.reduce.tasks.maximum :- this sets the number of reduce slots on each node.

This number is derived from the available RAM and the JVM heap size of each task slot. The default heap size of a task JVM is 200 MB. So if you have 4 GB of RAM free after leaving room for the OS and the other processes, we can have roughly 4*1024 MB / 200 MB task slots on that node.

i.e. 4*1024/200 ≈ 20

So we can have about 20 task slots, which we divide into map slots and reduce slots.
Usually we give a higher number of map slots than reduce slots.
In hdfs-site.xml we set the replication factor. The default value is 3.

Formatting Namenode

Now go to the namenode machine and login as root user.

Then from cli, switch to hdfs user.

su - hdfs

Then format the namenode.

hadoop namenode -format

Starting Services

STARTING NAMENODE

In the namenode machine, execute the following command as root user.

/etc/init.d/hadoop-hdfs-namenode start

You can check whether the service is running or not by using the jps command.

jps will work only if sun java is installed and added to the PATH.

STARTING SECONDARY NAMENODE

In the Secondary Namenode machine, execute the following command as root user.

/etc/init.d/hadoop-hdfs-secondarynamenode start

STARTING DATANODE

In the Datanode machines, execute the following command as root user.

/etc/init.d/hadoop-hdfs-datanode start

Now hdfs is started, and we will be able to execute hdfs operations.
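
You can do a quick check as the hdfs user, for example:

su - hdfs

hadoop fs -ls /

hadoop dfsadmin -report   # lists the live datanodes and the capacity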

STARTING JOBTRACKER

In the jobtracker machine login as root user, then switch to hdfs user.

su - hdfs

Then create the following hdfs directory structure and permissions.

hadoop fs -mkdir /app

hadoop fs -mkdir /app/hadoop

hadoop fs -mkdir /app/hadoop/tmp

hadoop fs -mkdir /app/hadoop/tmp/mapred

hadoop fs -mkdir /app/hadoop/tmp/mapred/staging

hadoop fs -chmod -R 1777 /app/hadoop/tmp/mapred/staging

hadoop fs -chown mapred:hadoop /app/hadoop/tmp/mapred

After doing this start the jobtracker

/etc/init.d/hadoop-0.20-mapreduce-jobtracker start

STARTING TASKTRACKER

In the tasktracker nodes, start the tasktracker by executing the following command

/etc/init.d/hadoop-0.20-mapreduce-tasktracker start

You can check the namenode webUI using a browser.

The URL is

http://<namenode-hostname>:50070

The Jobtracker web UI is

http://<jobtracker-hostname>:50030

If hostname resolution is not configured correctly, hostname:port may not work.

In such situations you can use http://ip-address:port

Now our hadoop cluster is ready for use.
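
As a quick smoke test of both hdfs and mapreduce, you can run something like the following as the hdfs user. The examples jar path is the usual CDH MRv1 location, but verify it on your machines.

su - hdfs

hadoop fs -mkdir /tmp/smoketest

hadoop fs -put /etc/hosts /tmp/smoketest/

hadoop fs -ls /tmp/smoketest

hadoop jar /usr/lib/hadoop-0.20-mapreduce/hadoop-examples.jar pi 2 10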

With this method, we can create a hadoop cluster of any size within a short time.

Here we have the entire hadoop ecosystem repository, so installing other components such as hive, pig, hbase, sqoop etc. can be done very easily.

Ports Used By Hadoop

(The original post listed the default ports used by the hadoop daemons as an image, which is not reproduced here. The ports referenced in this setup are 9000 for the namenode, 9001 for the jobtracker, 50070 for the namenode web UI and 50030 for the jobtracker web UI.)