December 8, 2012
History of Hadoop
Hadoop is an open source framework written in Java for processing massive and complex data sets in parallel. Doug Cutting, the developer of Hadoop, named it after his son's toy elephant. It evolved to support Lucene and Nutch after Google released a paper about GFS in 2003. It works with unindexed, unstructured and unsorted data. The main parts of Hadoop are
HDFS (Hadoop Distributed File System)
HDFS is a distributed, scalable filesystem which stores metadata in the NameNode and application data in the DataNodes. Fault tolerance is achieved through block replication, with a replication factor of 3.
- Distributed : Data is split into blocks and stored on different data nodes, so that computation can run close to the data and finish faster.
- Scalable : Resources can be scaled according to demand by distributing computation and storage across many servers. Because data is broken down into blocks, MapReduce programs can process it in smaller chunks, which increases scalability.
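As a rough illustration of how a file is split into blocks and replicated, the sketch below computes the number of blocks and the total number of stored block copies for a given file size. The 64 MB block size is an assumption (it is the usual Hadoop 1.x default and is not stated in this post); the replication factor of 3 is the one mentioned above. The function names are made up for this example.

```shell
#!/bin/sh
# Sketch: how many HDFS blocks a file needs and how many block copies are stored.
# Assumptions: BLOCK_SIZE_MB=64 (usual Hadoop 1.x default), REPLICATION=3.
BLOCK_SIZE_MB=64
REPLICATION=3

# block_count FILE_SIZE_MB -> number of blocks (the last block may be partly filled)
block_count() {
    echo $(( ($1 + BLOCK_SIZE_MB - 1) / BLOCK_SIZE_MB ))
}

# replica_count FILE_SIZE_MB -> total block copies stored across the data nodes
replica_count() {
    echo $(( $(block_count "$1") * REPLICATION ))
}

echo "A 200 MB file needs $(block_count 200) blocks and $(replica_count 200) block copies."
```

With a single node, as in the setup described later, the replication factor is lowered to 1, so the number of copies equals the number of blocks.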
MapReduce
MapReduce is a framework for processing very large data sets in parallel on a cluster. It has two phases.
- Map phase : In this phase the data that needs to be processed is separated out.
- Reduce phase : In this phase the output of the map phase is collected and analysed.
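The two phases can be imitated on a single machine with an ordinary shell pipeline. This is only an analogy, not how Hadoop runs jobs: the map step emits one word per line, sort plays the role of the shuffle that brings equal keys together, and the reduce step counts each group. The function names map_words and reduce_counts are invented for this sketch.

```shell
#!/bin/sh
# Map: split the input into one word (key) per line, dropping empty lines.
map_words() {
    tr -s '[:space:]' '\n' | grep -v '^$'
}

# Reduce: the input arrives sorted, so equal words are adjacent; count each group
# and print "word<TAB>count".
reduce_counts() {
    uniq -c | awk '{ print $2 "\t" $1 }'
}

# Usage: map, shuffle (sort), reduce
printf 'hadoop stores data\nhadoop processes data\n' | map_words | sort | reduce_counts
```

In a real cluster the map step runs on many nodes at once, each on its own blocks, and the framework does the sorting and grouping between the two phases.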
A cluster is a group of computers connected via a network. Similarly, a Hadoop cluster is a number of systems connected together, which completes the picture of distributed computing. Hadoop uses a master-slave architecture.
Components required in the cluster
The name node is the master server of the cluster. It does not store any file itself, but it knows where the blocks are stored in the child nodes, can give pointers to them and can re-assemble the files. The name node comes with two structures, the FSImage and the edit log.
- Highly memory intensive
- Keeping it safe and isolated is necessary
- Manages the file system namespaces
Child nodes are attached to the main node.
- A data node has a configuration file that makes it available in the cluster. It also stores information about the storage capacity of that particular data node (e.g. 5 out of 10 is available).
- Data nodes are independent, since they do not point to any other data nodes.
- Manages the storage attached to the node.
- There will be multiple data nodes in a cluster.
Job Tracker
- Schedules and assigns tasks to the different datanodes.
- Work flow
  - Takes the request.
  - Assigns the task.
  - Validates the requested work.
  - Checks whether all the data nodes are working properly.
  - If not, reschedules the tasks.
The Job Tracker and the Task Trackers work in a master-slave model. Every datanode has a task tracker, which actually performs whatever task is assigned to it by the Job Tracker.
Secondary Name Node
The secondary name node is not a redundant namenode. It performs checkpointing and housekeeping tasks periodically.
Types of Hadoop Installations
- Standalone (local) mode: It is used to run Hadoop directly on your local machine. By default Hadoop is configured to run in this mode. It is used for debugging purposes.
- Pseudo-distributed mode: It is used to simulate a multi-node installation on a single node. We can use a single server instead of installing Hadoop on different servers.
- Fully distributed mode: In this mode Hadoop is installed on all the servers which are part of the cluster. One machine needs to be designated as the NameNode and another one as the JobTracker. The rest act as DataNodes and TaskTrackers.
How to make a Single node Hadoop Cluster
A Single node cluster is a cluster where all the Hadoop daemons run on a single machine. The development can be described as several steps.
Hadoop is meant to be deployed on Unix-like platforms, which include Linux and operating systems such as Mac OS X. Larger Hadoop production deployments mostly run on CentOS, Red Hat and similar distributions.
GNU/Linux is used as the development and production platform. Hadoop has been demonstrated on Linux clusters with more than 4000 nodes.
Win32 can be used as a development platform, but it is not used as a production platform. To develop a cluster on Windows, we need Cygwin.
Ubuntu is a common Linux distribution with an interface similar to Windows, so we will describe the details of a Hadoop deployment on Ubuntu. It is better to use the latest stable version of the OS.
This document deals with setting up a cluster on the Ubuntu Linux platform, version 12.04.1 LTS, 64 bit.
- Java JDK
The recommended and tested versions of Java are listed below. You can choose any of them.
Jdk 1.6.0_20, Jdk 1.6.0_21, Jdk 1.6.0_24, Jdk 1.6.0_26, Jdk 1.6.0_28, Jdk 1.6.0_31
*Source: Apache Software Foundation wiki. Test results announced by Cloudera, MapR and Hortonworks.
- SSH must be installed.
- SSHD must be running.
This is used by the Hadoop scripts to manage remote Hadoop daemons.
- Download the latest stable version of Hadoop.
Here we are using Hadoop 1.0.3.
Now we have a Linux machine and the required software ready, so we can start the setup. Open the terminal and follow the steps described below.
Checking whether the OS is 64 bit or 32 bit
>$ uname -m
If the output contains 64 (for example x86_64), all the software (Java, SSH) must be the 64 bit build. If it shows a 32 bit architecture (for example i686), use the 32 bit builds. This is very important.
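The check above can be wrapped in a small helper that turns the uname -m output into the package width to download. This is a minimal sketch: the function name arch_bits is invented here, and only the common x86 machine names are handled; anything else is reported as unknown rather than guessed.

```shell
#!/bin/sh
# arch_bits MACHINE -> prints 64 or 32 for common `uname -m` values, else "unknown".
arch_bits() {
    case "$1" in
        x86_64|amd64) echo 64 ;;
        i386|i486|i586|i686) echo 32 ;;
        *) echo unknown ;;
    esac
}

machine=$(uname -m)
echo "uname -m reports $machine; package width to download: $(arch_bits "$machine")"
```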
For setting up Hadoop, we need Java. It is recommended to use Sun Java 1.6.
To check whether Java is already installed:
>$ java -version
This will show the details about Java, if it is already installed.
If it is not there, we have to install it.
Download a stable version of Java as described above.
The downloaded file may be a .bin file or a .tar file.
To install a .bin file, go to the directory containing the binary file.
>$ sudo chmod u+x <filename>.bin
>$ ./<filename>.bin
If it is a tar ball
>$ sudo tar xf <filename>.tar
Then set JAVA_HOME in the .bashrc file.
To edit the $HOME/.bashrc file:
>$ sudo nano $HOME/.bashrc
Add the following lines:
# Set Java Home
export JAVA_HOME=<path from root to that java directory>
export PATH=$PATH:$JAVA_HOME/bin
Now close the terminal, open it again and check whether the Java installation is correct.
>$ java -version
This will show the details if Java is installed correctly.
Now we are ready with Java installed.
Adding a user for using Hadoop
We have to create a separate user account for running Hadoop. This is recommended, because it isolates the Hadoop installation from other software and other users on the same machine.
>$ sudo addgroup hadoop
>$ sudo adduser --ingroup hadoop user
Here we created a user "user" in a group "hadoop".
If you are not able to use sudo as user in the following steps, add user to the sudoers file.
>$ sudo nano /etc/sudoers
Then add the following line
user ALL=(ALL) ALL
This will give user the root privileges.
If you are not interested in giving root privileges this way, edit the line in the sudoers file as below
# Allow members of group sudo to execute any command
%sudo ALL=(ALL:ALL) ALL
Installing SSH server.
Hadoop requires SSH access to manage the nodes.
In a multi-node cluster, these are the remote machines and the local machine.
In a single node cluster, SSH is needed to access localhost as user.
If the SSH server is not installed, install it before going further.
Download the correct version (64 bit or 32 bit) of openssh-server.
Here we are using a 64 bit OS, so I downloaded the OpenSSH server for 64 bit.
The download link is
The downloaded file may be a .deb file.
To install a .deb file
>$ sudo dpkg -i <filename>.deb
This will install the .deb file.
Now we have SSH up and running.
As the first step, we have to generate an SSH key for the user
user@ubuntu:~$ su - user
user@ubuntu:~$ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key (/home/user/.ssh/id_rsa):
Created directory '/home/user/.ssh'.
Your identification has been saved in /home/user/.ssh/id_rsa.
Your public key has been saved in /home/user/.ssh/id_rsa.pub.
The key fingerprint is:
9d:47:ab:d7:22:54:f0:f9:b9:3b:64:93:12:75:81:27 user@ubuntu
The key's randomart image is:
[........]
user@ubuntu:~$
We want the key to be unlocked without our interaction, so we create the RSA key pair with an empty password. This is done in the second line. If an empty password is not given, we have to enter the password every time Hadoop interacts with its nodes. This is not desirable, so we give an empty password.
The next step is to enable SSH access to our local machine with the key created in the previous step.
user@ubuntu:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
The last step is to test the SSH setup by connecting to our local machine as user. This step is also necessary to save our local machine's host key fingerprint to the known_hosts file of user.
user@ubuntu:~$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established.
RSA key fingerprint is 76:d7:61:86:ea:86:8f:31:89:9f:68:b0:75:88:52:72.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'localhost' (RSA) to the list of known hosts.
Ubuntu 12.04.1 ...
user@ubuntu:~$
There is no use in enabling IPv6 on our Ubuntu box, because we are not connected to any IPv6 network, so we can disable it. The effect on performance may vary.
To disable IPv6 on Ubuntu, go to
>$ cd /etc/
Open the file sysctl.conf
>$ sudo nano sysctl.conf
Add the following lines to the end of this file
#disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
Reboot the machine to make the changes take effect.
For checking whether IPv6 is enabled or not, we can use the following command.
>$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6
If the value is '0', IPv6 is enabled.
If it is '1', IPv6 is disabled.
We need the value to be ‘1’.
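The same check can be scripted so that the result is spelled out instead of read off as a number. This is a small sketch; the function name ipv6_state is invented for this example, and the file path is the one used in the command above.

```shell
#!/bin/sh
# ipv6_state VALUE -> interprets the content of the disable_ipv6 flag (0 or 1).
ipv6_state() {
    case "$1" in
        1) echo disabled ;;
        0) echo enabled ;;
        *) echo unknown ;;
    esac
}

FLAG_FILE=/proc/sys/net/ipv6/conf/all/disable_ipv6
if [ -r "$FLAG_FILE" ]; then
    echo "IPv6 is $(ipv6_state "$(cat "$FLAG_FILE")")"
else
    # The file is absent on systems without IPv6 support in the kernel.
    echo "Cannot read $FLAG_FILE"
fi
```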
The requirements for installing Hadoop are now ready, so we can start the Hadoop installation.
Right now the latest stable version of Hadoop available is hadoop 1.0.3.
So we are using this tar ball.
We create a directory named 'utilities' in the home directory of user.
In practice you can choose any directory. It is good to keep a clean and uniform directory structure during installation, and it helps when you deal with multi-node clusters.
>$ cd utilities
>$ sudo tar -xvf hadoop-1.0.3.tar.gz
>$ sudo chown -R user:hadoop hadoop-1.0.3
Here the 2nd line will extract the tar ball.
The 3rd line will give the ownership of hadoop-1.0.3 to user.
Setting HADOOP_HOME in $HOME/.bashrc
Add the following lines to the .bashrc file
# Set Hadoop_Home
export HADOOP_HOME=/home/user/utilities/hadoop-1.0.3
# Adding bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin
Note: If you edit the $HOME/.bashrc file, only the user who owns it gets the benefit.
To apply the settings globally to all users, make the same changes in the /etc/bash.bashrc file.
Then JAVA_HOME and HADOOP_HOME will be available to all users.
Follow the same procedure when setting up Java.
In Hadoop, we can find three configuration files: core-site.xml, mapred-site.xml and hdfs-site.xml.
If we open these files, the only thing we can see is an empty configuration tag <configuration></configuration>
What actually happens behind the curtain is that Hadoop assumes default values for a lot of properties. If we want to override them, we can edit these configuration files.
The default values are available in three files
core-default.xml, mapred-default.xml, hdfs-default.xml
These are available in the locations
utilities/hadoop-1.0.3/src/core, utilities/hadoop-1.0.3/src/mapred, utilities/hadoop-1.0.3/src/hdfs.
If we open these files, we can see all the default properties.
Setting JAVA_HOME for hadoop directly
Open the hadoop-env.sh file. You will see a JAVA_HOME entry with a path.
The location of hadoop-env.sh file is
Edit that JAVA_HOME and give the correct path in which java is installed.
>$ sudo nano hadoop-1.0.3/conf/hadoop-env.sh
# The Java implementation to use
export JAVA_HOME=<path from root to java directory>
Editing the Configuration files
All these files are present in the directory
Here we configure the directory where Hadoop stores its data files, the network ports it listens on, etc.
By default Hadoop keeps its local file system data and HDFS data under hadoop.tmp.dir.
Here we are using the directory /app/hadoop/tmp for storing temporary directories.
For that, create the directory and set its ownership and permissions for user
>$ sudo mkdir -p /app/hadoop/tmp
>$ sudo chown user:hadoop /app/hadoop/tmp
>$ sudo chmod 750 /app/hadoop/tmp
Here the first line will create the directory structure.
The second line will give the ownership of that directory to user.
The third line will set the rwx permissions.
Setting the ownership and permissions is very important. If you forget this, you will run into exceptions while formatting the namenode.
Open the core-site.xml file. You will see empty configuration tags.
Add the following lines between the configuration tags.
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
  <description>The name of the default file system.</description>
</property>
In the mapred-site.xml file, add the following between the configuration tags.
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:9001</value>
  <description>The host and port that the MapReduce job tracker runs at.</description>
</property>
In the hdfs-site.xml file, add the following between the configuration tags.
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication</description>
</property>
Here we are giving replication as 1, because we have only one machine.
We can increase this as the number of nodes increases.
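To show what the finished files look like with the properties placed between the configuration tags, here is a minimal sketch that writes all three site files. The helper name write_site_file and the CONF_DIR variable are invented for this example. CONF_DIR defaults to a fresh temporary directory so that nothing is overwritten by accident; to use the result, set CONF_DIR to your hadoop-1.0.3/conf directory (existing files with these names would then be replaced). The description elements are left out for brevity.

```shell
#!/bin/sh
# Sketch: write the three site files with the property values used in this post.
# CONF_DIR is an assumption; it defaults to a temporary directory.
CONF_DIR=${CONF_DIR:-$(mktemp -d)}

# write_site_file NAME BODY -> writes BODY wrapped in <configuration> tags
write_site_file() {
    cat > "$CONF_DIR/$1" <<EOF
<?xml version="1.0"?>
<configuration>
$2
</configuration>
EOF
}

write_site_file core-site.xml "  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>"

write_site_file mapred-site.xml "  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>"

write_site_file hdfs-site.xml "  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>"

echo "Wrote site files to $CONF_DIR"
```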
Formatting the Hadoop Distributed File System via NameNode.
The first step for starting our Hadoop installation is to format the distributed file system. This should be done before first use. Be careful not to format an already running cluster, because all the data will be lost.
user@ubuntu:~$ $HADOOP_HOME/bin/hadoop namenode -format
The output will look like this
09/10/12 12:52:54 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = ubuntu/127.0.1.1
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 0.20.2
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0.3 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
************************************************************/
09/10/12 12:52:54 INFO namenode.FSNamesystem: fsOwner=user,hadoop
09/10/12 12:52:54 INFO namenode.FSNamesystem: supergroup=supergroup
09/10/12 12:52:54 INFO namenode.FSNamesystem: isPermissionEnabled=true
09/10/12 12:52:54 INFO common.Storage: Image file of size 96 saved in 0 seconds.
09/10/12 12:52:54 INFO common.Storage: Storage directory .../hadoop-user/dfs/name has been successfully formatted.
09/10/12 12:52:54 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1
************************************************************/
Starting Our single-node Cluster
Here we have only one node, so all the Hadoop daemons run on a single machine.
We can start all the daemons by running a shell script.
This will start up all the Hadoop daemons (Namenode, Datanode, Jobtracker and Tasktracker) on our machine.
The output when we run this is shown below.
user@ubuntu:/home/user/utilities/hadoop-1.0.3$ bin/start-all.sh
starting namenode, logging to /home/user/utilities/hadoop-1.0.3/bin/../logs/hadoop-user-namenode-ubuntu.out
localhost: starting datanode, logging to /home/user/utilities/hadoop-1.0.3/bin/../logs/hadoop-user-datanode-ubuntu.out
localhost: starting secondarynamenode, logging to /home/user/utilities/hadoop-1.0.3/bin/../logs/hadoop-user-secondarynamenode-ubuntu.out
starting jobtracker, logging to /home/user/utilities/hadoop-1.0.3/bin/../logs/hadoop-user-jobtracker-ubuntu.out
localhost: starting tasktracker, logging to /home/user/utilities/hadoop-1.0.3/bin/../logs/hadoop-user-tasktracker-ubuntu.out
user@ubuntu$
You can check the processes running on the machine by using jps.
user@ubuntu:/home/user/utilities/hadoop-1.0.3$ jps
1127 TaskTracker
2339 JobTracker
1943 DataNode
2098 SecondaryNameNode
2378 Jps
1455 NameNode
Note: If jps is not working, you can use another Linux command.
ps -ef | grep user
You can also check for each daemon
ps -ef | grep <daemonname>
e.g. ps -ef | grep namenode
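Instead of reading the jps listing by eye, the check can be scripted. The sketch below reads jps output on standard input and prints every expected daemon that is missing. The function name missing_daemons is invented here, and the list of five daemons is the one shown in the jps output above.

```shell
#!/bin/sh
# missing_daemons: reads `jps` output ("PID Name" per line) on stdin and prints
# every expected Hadoop 1.x daemon that does not appear in it.
missing_daemons() {
    running=$(awk '{ print $2 }')
    for daemon in NameNode DataNode SecondaryNameNode JobTracker TaskTracker; do
        # -x matches whole lines, so NameNode does not match SecondaryNameNode
        echo "$running" | grep -qx "$daemon" || echo "$daemon"
    done
}

# On a live machine (requires jps from the JDK on the PATH):
#   jps | missing_daemons
# Demonstration with canned output in which only two daemons are up:
printf '1455 NameNode\n1943 DataNode\n2378 Jps\n' | missing_daemons
```

An empty result means all five daemons are running. If a daemon is listed as missing, its log file under the logs directory is the first place to look.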
Stopping Our single-node Cluster
For stopping all the daemons running in the machine
Run the command
The output will be like this
user@ubuntu:~/utilities/hadoop-1.0.3$ bin/stop-all.sh
stopping jobtracker
localhost: stopping tasktracker
stopping namenode
localhost: stopping datanode
localhost: stopping secondarynamenode
user@ubuntu:~/utilities/hadoop-1.0.3$
Then check with jps
>$ jps
2378 Jps
Testing the set up
Now our installation part is complete.
The next step is to test the installed setup.
Start the Hadoop cluster again by using start-all.sh.
Checking with HDFS
- Make a directory in hdfs
hadoop fs -mkdir /user/user/trial
If it succeeds, list the created directory.
hadoop fs -ls /user/user
The output will be like this
drwxr-xr-x - user supergroup 0 2012-10-10 18:08 /user/user/trial
If you get output like this, HDFS is working fine.
- Copy a file from the local Linux file system
hadoop fs -copyFromLocal utilities/hadoop-1.0.3/conf/core-site.xml /user/user/trial/
Check for the file in HDFS
hadoop fs -ls /user/user/trial/
-rw-r--r-- 1 user supergroup 557 2012-10-10 18:20 /user/user/trial/core-site.xml
If the output is like this, the copy succeeded.
Checking with a MapReduce job
MapReduce jars for testing are shipped with Hadoop itself.
So we can use that jar. There is no need to import another.
To check MapReduce, we can run a wordcount MapReduce job.
Go to $HADOOP_HOME
>$ hadoop jar hadoop-examples-1.0.3.jar
The output will be like this
An example program must be given as the first argument.
Valid program names are:
aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
dbcount: An example job that count the pageview counts from a database.
grep: A map/reduce program that counts the matches of a regex in the input.
join: A job that effects a join over sorted, equally partitioned datasets
multifilewc: A job that counts words from several files.
pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
pi: A map/reduce program that estimates Pi using monte-carlo method.
randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
randomwriter: A map/reduce program that writes 10GB of random data per node.
secondarysort: An example defining a secondary sort to the reduce.
sleep: A job that sleeps at each map and reduce task.
sort: A map/reduce program that sorts the data written by the random writer.
sudoku: A sudoku solver.
teragen: Generate data for the terasort
terasort: Run the terasort
teravalidate: Checking results of terasort
wordcount: A map/reduce program that counts the words in the input files.
These are the programs contained in that jar. We can choose any of them.
Here we are going to run the wordcount program.
The input file is the one that we already copied from the local file system to HDFS.
Run the following command to execute the wordcount
>$ hadoop jar hadoop-examples-1.0.3.jar wordcount /user/user/trial/core-site.xml /user/user/trial/output/
The output will be like this
12/10/10 18:42:30 INFO input.FileInputFormat: Total input paths to process : 1
12/10/10 18:42:30 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/10/10 18:42:30 WARN snappy.LoadSnappy: Snappy native library not loaded
12/10/10 18:42:31 INFO mapred.JobClient: Running job: job_201210041646_0003
12/10/10 18:42:32 INFO mapred.JobClient: map 0% reduce 0%
12/10/10 18:42:46 INFO mapred.JobClient: map 100% reduce 0%
12/10/10 18:42:58 INFO mapred.JobClient: map 100% reduce 100%
12/10/10 18:43:03 INFO mapred.JobClient: Job complete: job_201210041646_0003
12/10/10 18:43:03 INFO mapred.JobClient: Counters: 29
12/10/10 18:43:03 INFO mapred.JobClient: Job Counters
12/10/10 18:43:03 INFO mapred.JobClient: Launched reduce tasks=1
12/10/10 18:43:03 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=12386
12/10/10 18:43:03 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/10/10 18:43:03 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/10/10 18:43:03 INFO mapred.JobClient: Launched map tasks=1
12/10/10 18:43:03 INFO mapred.JobClient: Data-local map tasks=1
12/10/10 18:43:03 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=10083
12/10/10 18:43:03 INFO mapred.JobClient: File Output Format Counters
12/10/10 18:43:03 INFO mapred.JobClient: Bytes Written=617
12/10/10 18:43:03 INFO mapred.JobClient: FileSystemCounters
12/10/10 18:43:03 INFO mapred.JobClient: FILE_BYTES_READ=803
12/10/10 18:43:03 INFO mapred.JobClient: HDFS_BYTES_READ=688
12/10/10 18:43:03 INFO mapred.JobClient: FILE_BYTES_WRITTEN=44801
12/10/10 18:43:03 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=617
12/10/10 18:43:03 INFO mapred.JobClient: File Input Format Counters
12/10/10 18:43:03 INFO mapred.JobClient: Bytes Read=557
12/10/10 18:43:03 INFO mapred.JobClient: Map-Reduce Framework
12/10/10 18:43:03 INFO mapred.JobClient: Map output materialized bytes=803
12/10/10 18:43:03 INFO mapred.JobClient: Map input records=18
12/10/10 18:43:03 INFO mapred.JobClient: Reduce shuffle bytes=803
12/10/10 18:43:03 INFO mapred.JobClient: Spilled Records=90
12/10/10 18:43:03 INFO mapred.JobClient: Map output bytes=746
12/10/10 18:43:03 INFO mapred.JobClient: CPU time spent (ms)=3320
12/10/10 18:43:03 INFO mapred.JobClient: Total committed heap usage (bytes)=233635840
12/10/10 18:43:03 INFO mapred.JobClient: Combine input records=48
12/10/10 18:43:03 INFO mapred.JobClient: SPLIT_RAW_BYTES=131
12/10/10 18:43:03 INFO mapred.JobClient: Reduce input records=45
12/10/10 18:43:03 INFO mapred.JobClient: Reduce input groups=45
12/10/10 18:43:03 INFO mapred.JobClient: Combine output records=45
12/10/10 18:43:03 INFO mapred.JobClient: Physical memory (bytes) snapshot=261115904
12/10/10 18:43:03 INFO mapred.JobClient: Reduce output records=45
12/10/10 18:43:03 INFO mapred.JobClient: Virtual memory (bytes) snapshot=2876592128
12/10/10 18:43:03 INFO mapred.JobClient: Map output records=48
user@ubuntu:~/utilities/hadoop-1.0.3$
If the program executed successfully, the output will be in the file
/user/user/trial/output/part-r-00000 in hdfs
Check the output
>$ hadoop fs -cat /user/user/trial/output/part-r-00000
If the word counts are printed, our installation works with MapReduce.
Thus we have checked our installation.
So our single node Hadoop cluster is ready.
- For downloading hadoop tar ball.
- For downloading open-ssh server
- For downloading jdk 1.6