
Unauthorized connection for super-user: hue from IP “x.x.x.x”

If you are getting the following error in Hue:

Unauthorized connection for super-user: hue from IP “x.x.x.x”

add the following properties to the core-site.xml of your Hadoop cluster and restart the cluster:

<property>
<name>hadoop.proxyuser.hue.groups</name>
<value>*</value>
</property>

<property>
<name>hadoop.proxyuser.hue.hosts</name>
<value>*</value>
</property>

You may face a similar error with Oozie also. In that case, add a similar configuration for the oozie user in core-site.xml:

<property>
<name>hadoop.proxyuser.oozie.groups</name>
<value>*</value>
</property>

<property>
<name>hadoop.proxyuser.oozie.hosts</name>
<value>*</value>
</property>
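Hadoop reads these proxyuser settings when the namenode starts, so restarting the cluster is the safest option. As a hedged alternative on newer Hadoop versions (for example a CDH4 / Hadoop 2.x style cluster), you may be able to refresh the proxyuser configuration without a full restart by running the following as the HDFS superuser; if the command is not available in your version, simply restart as described above.

hdfs dfsadmin -refreshSuperUserGroupsConfiguration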


Migrating Hive from one Hadoop cluster to another

Recently I migrated a Hive installation from one cluster to another. I couldn't find any document describing this migration, so I did it based on my own experience and knowledge.
Hive stores its metadata, i.e. the information about the tables, in a database called the metastore. For development/production grade installations, we normally use MySQL, Oracle or PostgreSQL. Here I am explaining the migration of Hive with its metastore database in MySQL.
The metadata contains the information about the tables, while the contents of the tables are stored in HDFS. So the metadata contains HDFS URIs and other details. If we migrate Hive from one cluster to another, we have to point the metadata to the HDFS of the new cluster; otherwise it will still point to the HDFS of the old cluster.

To migrate a Hive installation, we have to do the following.

1) Install Hive in the new Hadoop cluster.
2) Transfer the data present in the Hive warehouse directory (/user/hive/warehouse) to the new Hadoop cluster.
3) Take a dump of the MySQL metastore database.
4) Install MySQL in the new Hadoop cluster.
5) Open the Hive MySQL metastore dump using a text editor such as notepad or notepad++, search for hdfs://ip-address-old-namenode:port, replace it with hdfs://ip-address-new-namenode:port and save it (see the sketch after this list).

Here ip-address-old-namenode is the IP address of the namenode of the old Hadoop cluster and ip-address-new-namenode is the IP address of the namenode of the new Hadoop cluster.

6) After doing the above steps, restore the edited MySQL dump into the MySQL of the new Hadoop cluster.
7) Configure Hive as normal and do the Hive schema upgrades if needed.
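A rough sketch of steps 3, 5 and 6 from the command line is given below. The metastore database name "metastore", the MySQL root user and the port 8020 are only assumptions for illustration; use the names, credentials and port of your own setup. The sed command is simply a scripted alternative to editing the dump in a text editor.

# on the old cluster: take the metastore dump (step 3)
mysqldump -u root -p metastore > metastore_dump.sql

# point the metadata to the new namenode (step 5)
sed -i 's|hdfs://ip-address-old-namenode:8020|hdfs://ip-address-new-namenode:8020|g' metastore_dump.sql

# on the new cluster: restore the edited dump (step 6)
mysql -u root -p metastore < metastore_dump.sql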

This is a solution that I discovered when I faced these migration issues. I don't know whether any other standard methods are available, but this worked perfectly for me. 🙂

Rhipe Installation

Rhipe was first developed by Saptarshi Guha.
Rhipe needs R and Hadoop, so first install R and Hadoop; their installation is explained in my previous posts. The latest version of Rhipe as of now is Rhipe-0.73.1, and the latest available version of R is R-3.0.0. If you are using CDH4 (the Cloudera distribution of Hadoop), use Rhipe-0.73 or a later version, because older versions may not work with CDH4.
Rhipe is an R and Hadoop integrated programming environment. It is very good for statistical and analytical computations on very large data sets: because R is integrated with Hadoop, the processing runs in distributed mode, i.e. as MapReduce.
Further explanations of Rhipe are available at http://www.datadr.org/

Prerequisites

Hadoop, R, protocol buffers and rJava should be installed before installing Rhipe.
We are installing Rhipe in a Hadoop cluster, so a submitted job may execute on any of the tasktracker nodes. Therefore we have to install R and Rhipe on all the tasktracker nodes; otherwise you will face an exception such as "Cannot find R". A rough way to do this on all nodes is sketched below.
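This sketch assumes the tasktracker hostnames are listed one per line in a file called nodes.txt and that passwordless ssh as root is available to each node; adapt it to your environment. It simply copies the Rhipe package to every node and installs it there (R must already be installed on each node).

# copy the Rhipe package to every tasktracker node and install it there
for host in $(cat nodes.txt); do
  scp Rhipe_0.73.1.tar.gz $host:/tmp/
  ssh $host "R CMD INSTALL /tmp/Rhipe_0.73.1.tar.gz"
done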

Installing Protocol Buffer

Download protocol buffer 2.4.1 from the link below

http://protobuf.googlecode.com/files/protobuf-2.4.1.tar.gz

tar -xzvf protobuf-2.4.1.tar.gz

chmod -R 755 protobuf-2.4.1

cd protobuf-2.4.1

./configure

make

make install

Set the environment variable PKG_CONFIG_PATH

nano /etc/bashrc

export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig

save and exit
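To make the new variable visible in your current shell without logging out again, you can reload the file (assuming you edited /etc/bashrc as above):

source /etc/bashrc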

Then execute the following commands to check the installation

pkg-config --modversion protobuf

This will show the version number 2.4.1
Then execute

pkg-config --libs protobuf

This will display the following:

-pthread -L/usr/local/lib -lprotobuf -lz -lpthread

If these two commands work fine, it means that protobuf is properly installed.

Set the environment variables for hadoop

For example

nano /etc/bashrc

export HADOOP_HOME=/usr/lib/hadoop

export HADOOP_BIN=/usr/lib/hadoop/bin

export HADOOP_CONF_DIR=/etc/hadoop/conf

save and exit

Then


cd /etc/ld.so.conf.d/

nano Protobuf-x86.conf

/usr/local/lib   # add this value as the content of Protobuf-x86.conf

Save and exit

/sbin/ldconfig

Installing rJava

Download the rJava tarball from the below link.

http://cran.r-project.org/web/packages/rJava/index.html

The latest version of rJava available as of now is rJava_0.9-4.tar.gz

Install rJava using the following command

R CMD INSTALL rJava_0.9-4.tar.gz
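If the rJava installation fails with Java-related errors, R may not be aware of your Java installation. Assuming JAVA_HOME is already set, one thing you can try is reconfiguring R's Java settings and re-running the install:

R CMD javareconf

R CMD INSTALL rJava_0.9-4.tar.gz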

Installing Rhipe

Rhipe can be downloaded from the following link
https://github.com/saptarshiguha/RHIPE/blob/master/code/Rhipe_0.73.1.tar.gz

R CMD INSTALL Rhipe_0.73.1.tar.gz

This will install Rhipe

After this type R in the terminal

You will enter into R terminal

Then type

library(Rhipe)

#This will display

------------------------------------------------

| Please call rhinit() else RHIPE will not run |

------------------------------------------------

rhinit()

#This will display

Rhipe: Detected CDH4 jar files, using RhipeCDH4.jar
Initializing Rhipe v0.73
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/client-0.20/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/client/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Initializing mapfile caches

Now you can execute your Rhipe scripts.

R Installation in Linux Platforms

R is free software for statistical and analytical computations. It is also a very good tool for graphical computations and is used in a wide range of areas. R allows us to carry out statistical analysis in an interactive mode.
To use R, we first need to install it on our computer. R can be installed on Windows, Linux, Mac OS etc.

On Linux platforms, we usually install it by compiling the tarball.
The latest stable version of R as of now is R-3.0.0.
Installation of R on Linux platforms is explained below.

Installation Using rpm

If you are using the Red Hat or CentOS distribution of Linux, then you can install either from tarballs or from rpm packages, though the latest versions may not be available as rpm.
Installation using rpm is simple.
Just download the rpm files along with their dependencies and install each using the command

rpm -ivh <rpm-name>

Installation Using yum

If you have an Internet connection,
then the installation is very simple.
Just run the following command.

yum install R-core R-2*

Installing R using tarball

R is available as a tarball which is compatible with all Linux platforms, and the latest versions of R are available in this form.

The installation steps are given below.
Get the latest R tar file for Linux from http://ftp.iitm.ac.in/cran/

Extract the tarball

tar -xzvf R-xxx.tar.gz

Change the permission of the extracted directory

chmod -R 755 R-xxx

Then go inside the extracted R directory and do the following steps

cd R-xxx
./configure --enable-R-shlib

The above step may fail because of the lack of dependent libraries in your OS.
If it is failing, install the dependent libraries and do the above step again.
If this is done successfully, do the following steps.

make

make install

 

After this, set R_HOME and PATH in /etc/bashrc (Red Hat or CentOS), /etc/bash.bashrc (Ubuntu) or ~/.bashrc (if no root privilege)

export R_HOME=<path to R installation>
export PATH=$PATH:$R_HOME/bin

 

Then do the corresponding command

source /etc/bashrc   (for Red Hat or CentOS)

or

source /etc/bash.bashrc   (for Ubuntu)

or

source ~/.bashrc   (if no root privilege)

 

Check R installation

Type R in your terminal

If the R prompt comes up, your R installation is successful.

You can quit from R by using the command q()
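You can also verify the installation from the shell without entering the R prompt:

R --version

This prints the installed R version (for example, R version 3.0.0) if R is on the PATH.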

Hadoop Installation in Isolated Environments

Introduction

Apache Hadoop is an open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license. It supports the running of applications on large clusters of commodity hardware. Hadoop was derived from Google’s MapReduce and Google File System(GFS) papers.

Hadoop development needs a Hadoop cluster.
A trial Hadoop cluster can be set up in minutes. In my previous blog post, I have described Hadoop cluster setup using a tarball.

But for production environments, tarball installation is not a good approach, because it becomes complex when installing other Hadoop ecosystem components.
If we have a high-speed internet connection, we can install Hadoop directly from the internet using yum install in minutes.
But most production environments are isolated, so an internet connection may not be available.
There we can perform this yum install by creating a local yum repository.
Creation of a local yum repository is explained in my previous blog post "Creating A Local YUM Repository".
Yum install works with Red Hat or CentOS Linux distributions.
Here I am explaining the installation of the Cloudera Distribution of Hadoop.

Prerequisites

OPERATING SYSTEM

RedHat or CentOS (32 or 64 bit)

PORTS

The ports necessary for hadoop should be opened. So we need to set appropriate firewall rules. The ports used by hadoop are listed in the last part of this post.

If you are not interested in doing that, you can simply switch off the firewall.

The command for turning off the firewall is

service iptables stop

JAVA

Sun java is required.

Download java from oracle website (32 or 64 bit depending on OS)

Install the java.

Simply installing it and setting JAVA_HOME may not make the newly installed Java the default one.

It may still point to OpenJDK if it is present.

So, to point to the new Java,

do the following steps.

alternatives --config java

This will show a list of the Java installations on the machine and which one it is currently pointing to.

It will ask you to choose a Java from the list.

Exit from this by pressing ctrl+c.

To add our Sun Java to this list, do the following step.

/usr/sbin/alternatives --install /usr/bin/java java <JAVA_HOME>/bin/java 2

This will add our newly installed java to the list.

Then do

alternatives --config java

and choose the newly installed Java. Now java -version will show Sun Java.

SETTING UP THE LOCAL HADOOP REPOSITORY

Download the Cloudera rpm repository from a place where you have internet access.

Download the repository corresponding to your OS version.
The repo files for the different OS versions (Red Hat/CentOS/Oracle 5, Red Hat/CentOS 6 32-bit, and Red Hat/CentOS 6 64-bit) are available from Cloudera; copy the repo file matching your OS version and use it to download the repository.

You can download it rpm by rpm or do a repo-sync.

Repo-sync is explained in my previous post Creating A Local YUM Repository.

Once this is done, create a local repository in one of the machines.

Then create a repo file corresponding to the newly created repository and add that repo file to all the cluster machines.
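A minimal sketch of such a repo file is given below. The repository id, name and baseurl here are assumptions for illustration; point baseurl to wherever your local repository is actually served from (for example over HTTP from the repository machine).

cat > /etc/yum.repos.d/cloudera-local.repo <<'EOF'
[cloudera-local]
name=Local Cloudera CDH Repository
baseurl=http://repo-server/cdh/
enabled=1
gpgcheck=0
EOF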

After this, we can do yum install just like on a machine with internet access.

Now do a yum clean all in all the cluster machines.

Do the following steps on the corresponding nodes. These steps explain the installation of MRv1 only.

Installation

NAMENODE MACHINE

yum install hadoop-hdfs-namenode

SECONDARY NAMENODE

yum install hadoop-hdfs-secondarynamenode

DATANODE

yum install hadoop-hdfs-datanode

JOBTRACKER

yum install hadoop-0.20-mapreduce-jobtracker

TASKTRACKER

yum install hadoop-0.20-mapreduce-tasktracker

IN ALL CLIENT MACHINES

yum install hadoop-client

Normally we run the datanode and tasktracker on the same node, i.e. they are co-located for data locality.

Now edit the core-site.xml, mapred-site.xml and hdfs-site.xml in all the machines.

Configurations

Sample configurations are given below.

For a production setup, you may need to set other properties as well.

core-site.xml


<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp</value>
<description>A Base dir for storing other temp directories</description>
</property>

<property>
<name>fs.default.name</name>
<value>hdfs://<namenode-hostname>:9000</value>
<description>The name of default file system</description>
</property>
</configuration>

mapred-site.xml

<configuration>
<property>
<name>mapred.job.tracker</name>
<value><jobtracker-hostname>:9001</value>
<description>Job Tracker port</description>
</property>

<property>
<name>mapred.local.dir</name>
<value>/app/hadoop/mapred_local</value>
<description>local dir for mapreduce jobs</description>
</property>


<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>6</value>
  <description>The maximum number of map tasks that will be run
  simultaneously by a task tracker.
  </description>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
  <description>The maximum number of reduce tasks that will be run
  simultaneously by a task tracker.
  </description>
</property>
</configuration>

hdfs-site.xml

<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
<description>replication factor</description>
</property>
</configuration>

Then create hadoop.tmp.dir in all the machines. Hadoop stores its files in this folder.

Here we are using the location /app/hadoop/tmp

mkdir -p /app/hadoop/tmp

mkdir /app/hadoop/mapred_local

This type of installation automatically creates two users

1) hdfs

2) mapred

The hadoop.tmp.dir should be owned by hdfs and the mapred.local.dir by mapred, so we need to change the ownership

chown -R hdfs:hadoop /app/hadoop/tmp

chmod -R 777 /app/hadoop/tmp

chown mapred:hadoop /app/hadoop/mapred_local

The properties

mapred.tasktracker.map.tasks.maximum : this sets the number of map slots in each node.

mapred.tasktracker.reduce.tasks.maximum : this sets the number of reduce slots in each node.

These numbers are set by a calculation based on the available RAM and the JVM size of each task slot. The default size of a task slot is 200 MB. So if you have 4 GB of RAM free after accounting for the OS and other processes, you can have 4*1024 MB / 200 MB task slots in that node,

i.e. 4*1024/200 ≈ 20

So we can have about 20 task slots, which we can divide into map slots and reduce slots.
Usually we give a higher number of map slots than reduce slots.
In hdfs-site.xml we set the replication factor. The default value is 3.

Formatting Namenode

Now go to the namenode machine and login as root user.

Then from the CLI, switch to the hdfs user.

su - hdfs

Then format the namenode.

hadoop namenode -format

Starting Services

STARTING NAMENODE

In the namenode machine, execute the following command as root user.

/etc/init.d/hadoop-hdfs-namenode start

You can check whether the service is running or not by using the command jps.

Jps will work only if sun java is installed and added to path.

STARTING SECONDARY NAMENODE

In the Secondary Namenode machine, execute the following command as root user.

/etc/init.d/hadoop-hdfs-secondarynamenode start

STARTING DATANODE

In the Datanode machines, execute the following command as root user.

/etc/init.d/hadoop-hdfs-datanode start

Now HDFS is started and we will be able to execute HDFS tasks.

STARTING JOBTRACKER

In the jobtracker machine login as root user, then switch to hdfs user.

su - hdfs

Then create the following hdfs directory structure and permissions.

hadoop fs -mkdir /app

hadoop fs -mkdir /app/hadoop

hadoop fs -mkdir /app/hadoop/tmp

hadoop fs -mkdir /app/hadoop/tmp/mapred

hadoop fs -mkdir /app/hadoop/tmp/mapred/staging

hadoop fs -chmod -R 1777 /app/hadoop/tmp/mapred/staging

hadoop fs -chown mapred:hadoop /app/hadoop/tmp/mapred

After doing this start the jobtracker

/etc/init.d/hadoop-0.20-mapreduce-jobtracker start

STARTING TASKTRACKER

In the tasktracker nodes, start the tasktracker by executing the following command

/etc/init.d/hadoop-0.20-mapreduce-tasktracker start

You can check the namenode webUI using a browser.

The URL is

http://<namenode-hostname>:50070

The Jobtracker web UI is

http://<jobtracker-hostname>:50030

If hostname resolution has not happened correctly, hostname:port may not work.

In these situations you can use http://ip-address:port
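If you prefer to fix the hostname resolution instead, one simple option (assuming you are not running your own DNS) is to add the mappings to /etc/hosts on every machine that needs to reach the web UIs, using your actual IP addresses and hostnames:

nano /etc/hosts

x.x.x.x   namenode-hostname
y.y.y.y   jobtracker-hostname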

Now our hadoop cluster is ready for use.

With this method, we can create a Hadoop cluster of any size within a short time.

Since we have the entire Hadoop ecosystem repository locally, installing other components such as Hive, Pig, HBase, Sqoop etc. can be done very easily.

Ports Used By Hadoop

(Image: table of the default ports used by Hadoop.)
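For quick reference, the commonly used default ports in Hadoop 1.x / CDH MRv1 are roughly the following (your distribution or configuration may differ):

Namenode web UI : 50070
Namenode IPC (fs.default.name) : 8020 or 9000, as configured
Datanode data transfer : 50010
Datanode IPC : 50020
Datanode web UI : 50075
Secondary namenode web UI : 50090
Jobtracker web UI : 50030
Tasktracker web UI : 50060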

Making a Pseudo-distributed Hadoop Cluster

Hadoop Cluster

History of Hadoop

Hadoop is an open source framework written in Java for processing large and complex data sets in parallel. Doug Cutting, the developer of Hadoop, named it after his son's toy elephant. It evolved to support Lucene and Nutch after the release of Google's paper about GFS in 2003. It works with unindexed, unstructured and unsorted data. The main parts of Hadoop are

HDFS (Hadoop Distributed File System)

HDFS is a distributed, scalable filesystem which stores metadata in the NameNode and application data in the DataNodes. Fault tolerance is achieved through a replication factor of 3 by default.

  1. Distributed : Data is split into blocks and stored in different data nodes, and computation is done close to the data for faster execution.
  2. Scalable : Resources can be scaled on demand by distributing computation and storage across many servers. Data is broken down into blocks which can be processed in smaller chunks by MapReduce programs, thus increasing scalability.

Map Reduce

It is a framework used for parallel processing of huge data sets using clusters. It has two phases.

  1. Map phase : In this phase the data that needs to be processed is separated out and processed in parallel.
  2. Reduce phase : In this phase the data from the map phase is collected and the analysis is done.

Cluster

A cluster is a group of computers connected via a network. Similarly, a Hadoop cluster is a number of systems connected together, which completes the picture of distributed computing. Hadoop uses a master-slave architecture.

Components required in the cluster

NameNode

The name node is the master server of the cluster. It does not store any file data itself, but it knows where the blocks are stored in the child nodes, can give pointers to them and can re-assemble files. The namenode maintains two structures, the FSImage and the edit log.

Features

  1. Highly memory intensive
  2. Keeping it safe and isolated is necessary
  3. Manages the file system namespaces

DataNodes

Child nodes are attached to the main node.

Features:

  1. A data node has a configuration file to make itself available in the cluster. It also stores data regarding the storage capacity (for example, 5 out of 10 blocks available) of that particular data node.
  2. Data nodes are independent, since they do not point to any other data nodes.
  3. A data node manages the storage attached to its node.
  4. There will be multiple data nodes in a cluster.

Job Tracker

The job tracker schedules and assigns tasks to the different datanodes. Its work flow is:

  1. Takes the request.
  2. Assigns the task.
  3. Validates the requested work.
  4. Checks whether all the data nodes are working properly.
  5. If not, reschedules the tasks.

Task Tracker

The job tracker and task trackers work in a master-slave model. Every datanode has a task tracker which actually performs whatever task is assigned to it by the job tracker.

Secondary Name Node

The secondary namenode is not a redundant namenode; it actually performs checkpointing and housekeeping tasks periodically.

Types of Hadoop Installations

  1. Standalone (local) mode: Used to run Hadoop directly on your local machine. By default Hadoop is configured to run in this mode. It is used for debugging purposes.
  2. Pseudo-distributed mode: Used to simulate a multi-node installation using a single-node setup, so we can use a single server instead of installing Hadoop on different servers.
  3. Fully distributed mode: In this mode Hadoop is installed on all the servers which are part of the cluster. One machine needs to be designated as the NameNode and another one as the JobTracker. The rest act as DataNodes and TaskTrackers.

How to make a Single node Hadoop Cluster

A single node cluster is a cluster where all the Hadoop daemons run on a single machine. The setup can be described in several steps.

Prerequisites

OS Requirements

Hadoop is meant to be deployed on Unix-like platforms, which includes operating systems like Mac OS. Larger Hadoop production deployments are mostly on CentOS, Red Hat etc.

GNU/Linux is used as the development and production platform. Hadoop has been demonstrated on Linux clusters with more than 4000 nodes.

Win32 can be used as a development platform, but it is not used as a production platform. For developing a cluster on Windows, we need Cygwin.

Since Ubuntu is a common Linux distribution with interfaces similar to Windows, we'll describe the details of Hadoop deployment on Ubuntu; it is better to use the latest stable version of the OS.

This document deals with setting up the cluster on the Ubuntu Linux platform, version 12.04.1 LTS 64-bit.

Softwares Required

  • Java JDK

The recommended and tested versions of Java are listed below; you can choose any of the following

Jdk 1.6.0_20

Jdk 1.6.0_21

Jdk 1.6.0_24

Jdk 1.6.0_26

Jdk 1.6.0_28

Jdk 1.6.0_31

*Source: Apache Software Foundation wiki. Test results announced by Cloudera, MapR, HortonWorks.

  • SSH must be installed.
  • SSHD must be running.

This is used by the Hadoop scripts to manage remote Hadoop daemons.

  • Download the latest stable version of Hadoop.

Here we are using Hadoop 1.0.3.

Now we are ready with a Linux machine and the required software, so we can start the setup. Open the terminal and follow the steps described below.

Step 1

Checking whether the OS is 64 bit or 32 bit


>$ uname -m

If it shows 64, then all the software (Java, ssh) must be 64-bit. If it shows 32, then use the 32-bit software. This is very important.

Step 2

Installing  Java.

For setting up hadoop, we need java. It is recommended to use sun java 1.6.

For checking whether the java is already installed or not


>$ java -version


This will show the details about Java, if it is already installed.

If it is not there, we have to install it.

Download a stable version of java as described above.

The downloaded file may be a .bin file or a .tar.gz file.

For installing a .bin file, go to the directory containing the binary file.


>$ sudo chmod u+x <filename>.bin

>$ ./<filename>.bin


If it is a tarball


>$ sudo chmod u+x <filename>.tar.gz

>$ sudo tar xzf <filename>.tar.gz

Then set the JAVA_HOME in .bashrc file

Go to $HOME/.bashrc file

For editing .bashrc file


>$ sudo nano $HOME/.bashrc

# Set Java Home

export JAVA_HOME=<path from root to that java directory>

export PATH=$PATH:$JAVA_HOME/bin

Now close the terminal, re-open again and check whether the java installation is correct.


>$ java -version

This will show the details if Java is installed correctly.

Now we are ready with java installed.

Step 3

Adding a user for using Hadoop

We have to create a separate user account for running Hadoop. This is recommended because it isolates the Hadoop installation from other software and other users on the same machine.


>$ sudo addgroup hadoop

>$ sudo adduser --ingroup hadoop user

Here we created a user "user" in a group "hadoop".

Step 4

In the following steps, if you are not able to do sudo as user, then add user to the sudoers file.

For that


>$ sudo nano /etc/sudoers

Then add the following


user   ALL=(ALL) ALL

This will give user the root privileges.

If you are not interested in adding a separate entry for user, you can instead rely on the sudo group entry below and add user to the sudo group.


# Allow members of group sudo to execute any command

%sudo   ALL=(ALL:ALL) ALL

Step 5

Installing SSH server.

Hadoop requires SSH access to manage the nodes.

In the case of a multi-node cluster, these are the remote machines and the local machine.

In a single-node cluster, SSH is needed to access localhost for the user user.

If ssh server is not installed, install it before going further.

Download the correct version (64bit or 32 bit) of open-ssh-server.

Here we are using a 64-bit OS, so I downloaded the openssh server for 64-bit.

The download link is

http://www.ubuntuupdates.org/package/core/precise/main/base/openssh-server

The downloaded file may be a .deb file.

For installing a .deb file


>$ sudo chmod u+x <filename>.deb

>$ sudo dpkg -i <filename>.deb

This will install the .deb file.
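Alternatively, if the machine has internet access, installing the ssh server through the package manager is simpler than handling the .deb file manually:

>$ sudo apt-get install openssh-server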

Step 6

Configuring SSH

Now we have SSH up and running.

As the first step, we have to generate an SSH key for the user


user@ubuntu:~$ su - user

user@ubuntu:~$ ssh-keygen -t rsa -P ""

Generating public/private rsa key pair.

Enter file in which to save the key (/home/user/.ssh/id_rsa):

Created directory '/home/user/.ssh'.

Your identification has been saved in /home/user/.ssh/id_rsa.

Your public key has been saved in /home/user/.ssh/id_rsa.pub.

The key fingerprint is:

9d:47:ab:d7:22:54:f0:f9:b9:3b:64:93:12:75:81:27 user@ubuntu

The key’s randomart image is:

[........]

user@ubuntu:~$

Here the key needs to be unlocked without our interaction, so we are creating an RSA key pair with an empty passphrase; this is done in the second line. If an empty passphrase is not given, we would have to enter it every time Hadoop interacts with its nodes. This is not desirable, so we are giving an empty passphrase.

The next step is to enable SSH access to our local machine with the key created in the previous step.


user@ubuntu:~$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys


The last step is to test the SSH setup by connecting to our local machine with user. This step is necessary to save our local machine's host key fingerprint to the user user's known_hosts file.


user@ubuntu:~$ ssh localhost

The authenticity of host 'localhost (127.0.0.1)' can't be established.

RSA key fingerprint is 76:d7:61:86:ea:86:8f:31:89:9f:68:b0:75:88:52:72.

Are you sure you want to continue connecting (yes/no)? yes

Warning: Permanently added 'localhost' (RSA) to the list of known hosts.

Ubuntu 12.04.1

...

user@ubuntu:~$

Step 7

Disabling IPv6

There is no use in enabling IPv6 on our Ubuntu box, because we are not connected to any IPv6 network, so we can disable IPv6; this can also help avoid Hadoop binding to IPv6 addresses on some setups.

For disabling IPv6 on Ubuntu , go to


>$ cd /etc/

Open the file sysctl.conf


>$ sudo nano sysctl.conf

Add the following lines to the end of this file


#disable ipv6

net.ipv6.conf.all.disable_ipv6 = 1

net.ipv6.conf.default.disable_ipv6 = 1

net.ipv6.conf.lo.disable_ipv6 = 1

Reboot the machine to make the changes take effect
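If you prefer not to reboot, the same settings can be applied immediately by reloading /etc/sysctl.conf:

>$ sudo sysctl -p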

For checking whether IPv6 is enabled or not, we can use the following command.


>$ cat  /proc/sys/net/ipv6/conf/all/disable_ipv6

If the value is '0', IPv6 is enabled.

If it is '1', IPv6 is disabled.

We need the value to be '1'.

The requirements for installing Hadoop are now ready, so we can start the Hadoop installation.

Step 8

Hadoop Installation

Right now the latest stable version of Hadoop available is hadoop 1.0.3.

So we are using this tar ball.

We create a directory named 'utilities' in the user's home directory.

Practically, you can choose any directory. It is good to keep a uniform directory structure during the installation; this helps when you deal with multi-node clusters.


>$ cd utilities

>$ sudo tar -xvf  hadoop-1.0.3.tar.gz

>$ sudo chown -R user:hadoop hadoop-1.0.3

Here the 2nd line will extract the tarball.

The 3rd line will give the ownership of hadoop-1.0.3 to user.

Step 9

Setting HADOOP_HOME in $HOME/.bashrc

Add the following lines in the .bashrc file


# Set Hadoop_Home

export HADOOP_HOME=/home/user/utilities/hadoop-1.0.3

# Adding bin/ directory to PATH

export PATH=$PATH:$HADOOP_HOME/bin

Note: If you edit this $HOME/.bashrc file, only the user doing this will get the benefit.

To make this apply globally to all users,

edit the /etc/bash.bashrc file and make the same changes.

Thus JAVA_HOME and HADOOP_HOME will be available to all users.

Follow the same procedure while setting up Java also.

Step 10

Configuring Hadoop

In hadoop, we can find three configuration files core-site.xml, mapred-site.xml, hdfs-site.xml.

If we open these files, the only thing we can see is an empty configuration tag <configuration></configuration>.

What actually happens behind the curtain is that Hadoop assumes default values for a lot of properties. If we want to override them, we can edit these configuration files.

The default values are available in three files

core-default.xml, mapred-default.xml, hdfs-default.xml

These are available in the locations

utilities/hadoop-1.0.3/src/core, utilities/hadoop-1.0.3/src/mapred,

utilities/hadoop-1.0.3/src/hdfs.

If we open these files, we can see all the default properties.
Setting JAVA_HOME for hadoop directly

Open the hadoop-env.sh file; you can see a JAVA_HOME entry with a path.

The location of hadoop-env.sh file is

hadoop-1.0.3/conf/hadoop-env.sh

Edit that JAVA_HOME and give the correct path in which java is installed.


>$ sudo  nano hadoop-1.0.3/conf/hadoop-env.sh
#The Java Implementation to use

export JAVA_HOME=<path from root to java directory>

Editing the Configuration files

All these files are present in the directory

hadoop-1.0.3/conf/

Here we are configuring the directory where Hadoop stores its data files, the network ports it listens to, etc.

By default Hadoop stores its local file system and HDFS data in hadoop.tmp.dir.

Here we are using the directory /app/hadoop/tmp for storing the temporary directories.

For that, create the directory and set the ownership and permissions to user


>$ sudo mkdir -p /app/hadoop/tmp

>$ sudo chown user:hadoop /app/hadoop/tmp

>$ sudo chmod 750 /app/hadoop/tmp

Here the first line creates the directory structure.

The second line gives the ownership of that directory to user.

The third line sets the rwx permissions.

Setting the ownership and permissions is very important; if you forget this, you will run into exceptions while formatting the namenode.

1. core-site.xml

Open the core-site.xml file, you can see empty configuration tags.

Add the following lines between the configuration tags.


<property>

<name>hadoop.tmp.dir</name>

<value>/app/hadoop/tmp</value>

<description>

A base for other temporary directories.

</description>

</property>

<property>

<name>fs.default.name</name>

<value>hdfs://localhost:9000</value>

<description>The name of the default file system.</description>

</property>

2. mapred-site.xml

In the mapred-site.xml add the following between the configuration tags.


<property>

<name>mapred.job.tracker</name>

 <value>localhost:9001</value>

 <description> The host and port that the MapReduce job tracker runs </description>

</property>
3. hdfs-site.xml

In the hdfs-site.xml add the following between the configuration tags.


<property>

<name>dfs.replication</name>

<value>1</value>

<description>Default block replication</description>

</property>

Here we are giving replication as 1, because we have only one machine.

We can increase this as the number of nodes increases.

Step 11

Formatting the Hadoop Distributed File System via  NameNode.

The first step in starting our Hadoop installation is to format the distributed file system. This should be done before first use. Be careful not to format an already running cluster, because all the data will be lost.


user@ubuntu:~$ $HADOOP_HOME/bin/hadoop namenode -format

The output will look like this


09/10/12 12:52:54 INFO namenode.NameNode: STARTUP_MSG:

/************************************************************

STARTUP_MSG: Starting NameNode

STARTUP_MSG:   host = ubuntu/127.0.1.1

STARTUP_MSG:   args = [-format]

STARTUP_MSG:   version = 0.20.2

STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0.3 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010

************************************************************/

09/10/12 12:52:54 INFO namenode.FSNamesystem: fsOwner=user,hadoop

09/10/12 12:52:54 INFO namenode.FSNamesystem: supergroup=supergroup

09/10/12 12:52:54 INFO namenode.FSNamesystem: isPermissionEnabled=true

09/10/12 12:52:54 INFO common.Storage: Image file of size 96 saved in 0 seconds.

09/10/12 12:52:54 INFO common.Storage: Storage directory .../hadoop-user/dfs/name has been successfully formatted.

09/10/12 12:52:54 INFO namenode.NameNode: SHUTDOWN_MSG:

/************************************************************

SHUTDOWN_MSG: Shutting down NameNode at ubuntu/127.0.1.1

************************************************************/

Step 12

Starting Our single-node Cluster

Here we have only one node. So all the hadoop daemons are running on a single machine.

So we can start all the daemons by running a shell script.


user@ubuntu:~$ $HADOOP_HOME/bin/start-all.sh

This will start up all the Hadoop daemons (Namenode, Datanode, Secondary Namenode, Jobtracker and Tasktracker) on our machine.

The output when we run this is shown below.


user@ubuntu:/home/user/utilities/hadoop-1.0.3$ bin/start-all.sh

starting namenode, logging to /home/user/utilities/hadoop-1.0.3/bin/../logs/hadoop-user-namenode-ubuntu.out

localhost: starting datanode, logging to /home/user/utilities/hadoop-1.0.3/bin/../logs/hadoop-user-datanode-ubuntu.out

localhost: starting secondarynamenode, logging to /home/user/utilities/hadoop-1.0.3/bin/../logs/hadoop-user-secondarynamenode-ubuntu.out

starting jobtracker, logging to /home/user/utilities/hadoop-1.0.3/bin/../logs/hadoop-user-jobtracker-ubuntu.out

localhost: starting tasktracker, logging to /home/user/utilities/hadoop-1.0.3/bin/../logs/hadoop-user-tasktracker-ubuntu.out

user@ubuntu$

You can check the processes running on the machine by using jps.


user@ubuntu:/home/user/utilities/hadoop-1.0.3$ jps

1127 TaskTracker

2339 JobTracker

1943 DataNode

2098 SecondaryNameNode

2378 Jps

1455 NameNode

Note: If jps is not working, you can use another Linux command.

ps -ef | grep user

You can check for each daemon also

ps -ef | grep <daemon-name>   e.g. namenode

Step 13

Stopping Our single-node Cluster

For stopping all the daemons running in the machine

Run the command


>$ stop-all.sh

The output will be like this


user@ubuntu:~/utilities/hadoop-1.0.3$ bin/stop-all.sh

stopping jobtracker

localhost: stopping tasktracker

stopping namenode

localhost: stopping datanode

localhost: stopping secondarynamenode

user@ubuntu:~/utilities/hadoop-1.0.3$

Then check with jps


>$ jps

2378 Jps

Step 14

Testing the set up

Now our installation part is complete

The next step is to test the installed set up.

Restart the hadoop cluster again by using start-all.sh

Checking with HDFS

  1. Make a directory in hdfs

    hadoop fs -mkdir /user/user/trial
    
    

    If it succeeds, list the created directory.

    
    hadoop fs -ls /
    
    

    The output will be like this

    
    drwxr-xr-x   - user supergroup          0 2012-10-10 18:08 /user/user/trial
    
    

    If you get output like this, HDFS is working fine.

    2. Copy a file from the local Linux file system

    hadoop fs -copyFromLocal utilities/hadoop-1.0.3/conf/core-site.xml /user/user/trial/
    
    

    Check for the file in HDFS

    
    hadoop fs -ls /user/user/trial/

    -rw-r--r--   1 user supergroup        557 2012-10-10 18:20 /user/user/trial/core-site.xml
    
    

    If the output is like this, the copy was successful.

    Checking with a MapReduce job

    Mapreduce jars for testing are available with the hadoop itself.

    So we can use that jar. No need to import another.

    For checking with MapReduce, we can run a wordcount MapReduce job.

    Go to $HADOOP_HOME

    Then run

    
    >$ hadoop jar hadoop-examples-1.0.3.jar
    
    

    The output will be like this

    
    An example program must be given as the first argument.
    
    Valid program names are:
    
    aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
    
    aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
    
    dbcount: An example job that count the pageview counts from a database.
    
    grep: A map/reduce program that counts the matches of a regex in the input.
    
    join: A job that effects a join over sorted, equally partitioned datasets
    
    multifilewc: A job that counts words from several files.
    
    pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
    
    pi: A map/reduce program that estimates Pi using monte-carlo method.
    
    randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
    
    randomwriter: A map/reduce program that writes 10GB of random data per node.
    
    secondarysort: An example defining a secondary sort to the reduce.
    
    sleep: A job that sleeps at each map and reduce task.
    
    sort: A map/reduce program that sorts the data written by the random writer.
    
    sudoku: A sudoku solver.
    
    teragen: Generate data for the terasort
    
    terasort: Run the terasort
    
    teravalidate: Checking results of terasort
    
    wordcount: A map/reduce program that counts the words in the input files.
    
    

    The programs shown above are contained inside that jar; we can choose any of them.

    Here we are going to run the wordcount program.

    The input file used is the file that we already copied from local to HDFS.

    Run the following command to execute the wordcount

    
    >$ hadoop jar hadoop-examples-1.0.3.jar wordcount user/user/trial/core-site.xml user/user/trial/output/
    
    The output will be like this
    
    12/10/10 18:42:30 INFO input.FileInputFormat: Total input paths to process : 1
    
    12/10/10 18:42:30 INFO util.NativeCodeLoader: Loaded the native-hadoop library
    
    12/10/10 18:42:30 WARN snappy.LoadSnappy: Snappy native library not loaded
    
    12/10/10 18:42:31 INFO mapred.JobClient: Running job: job_201210041646_0003
    
    12/10/10 18:42:32 INFO mapred.JobClient:  map 0% reduce 0%
    
    12/10/10 18:42:46 INFO mapred.JobClient:  map 100% reduce 0%
    
    12/10/10 18:42:58 INFO mapred.JobClient:  map 100% reduce 100%
    
    12/10/10 18:43:03 INFO mapred.JobClient: Job complete: job_201210041646_0003
    
    12/10/10 18:43:03 INFO mapred.JobClient: Counters: 29
    
    12/10/10 18:43:03 INFO mapred.JobClient:   Job Counters
    
    12/10/10 18:43:03 INFO mapred.JobClient:     Launched reduce tasks=1
    
    12/10/10 18:43:03 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=12386
    
    12/10/10 18:43:03 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
    
    12/10/10 18:43:03 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
    
    12/10/10 18:43:03 INFO mapred.JobClient:     Launched map tasks=1
    
    12/10/10 18:43:03 INFO mapred.JobClient:     Data-local map tasks=1
    
    12/10/10 18:43:03 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=10083
    
    12/10/10 18:43:03 INFO mapred.JobClient:   File Output Format Counters
    
    12/10/10 18:43:03 INFO mapred.JobClient:     Bytes Written=617
    
    12/10/10 18:43:03 INFO mapred.JobClient:   FileSystemCounters
    
    12/10/10 18:43:03 INFO mapred.JobClient:     FILE_BYTES_READ=803
    
    12/10/10 18:43:03 INFO mapred.JobClient:     HDFS_BYTES_READ=688
    
    12/10/10 18:43:03 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=44801
    
    12/10/10 18:43:03 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=617
    
    12/10/10 18:43:03 INFO mapred.JobClient:   File Input Format Counters
    
    12/10/10 18:43:03 INFO mapred.JobClient:     Bytes Read=557
    
    12/10/10 18:43:03 INFO mapred.JobClient:   Map-Reduce Framework
    
    12/10/10 18:43:03 INFO mapred.JobClient:     Map output materialized bytes=803
    
    12/10/10 18:43:03 INFO mapred.JobClient:     Map input records=18
    
    12/10/10 18:43:03 INFO mapred.JobClient:     Reduce shuffle bytes=803
    
    12/10/10 18:43:03 INFO mapred.JobClient:     Spilled Records=90
    
    12/10/10 18:43:03 INFO mapred.JobClient:     Map output bytes=746
    
    12/10/10 18:43:03 INFO mapred.JobClient:     CPU time spent (ms)=3320
    
    12/10/10 18:43:03 INFO mapred.JobClient:     Total committed heap usage (bytes)=233635840
    
    12/10/10 18:43:03 INFO mapred.JobClient:     Combine input records=48
    
    12/10/10 18:43:03 INFO mapred.JobClient:     SPLIT_RAW_BYTES=131
    
    12/10/10 18:43:03 INFO mapred.JobClient:     Reduce input records=45
    
    12/10/10 18:43:03 INFO mapred.JobClient:     Reduce input groups=45
    
    12/10/10 18:43:03 INFO mapred.JobClient:     Combine output records=45
    
    12/10/10 18:43:03 INFO mapred.JobClient:     Physical memory (bytes) snapshot=261115904
    
    12/10/10 18:43:03 INFO mapred.JobClient:     Reduce output records=45
    
    12/10/10 18:43:03 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=2876592128
    
    12/10/10 18:43:03 INFO mapred.JobClient:     Map output records=48
    
    user@ubuntu:~/utilities/hadoop-1.0.3$
    
    

    If the program executed successfully, the output will be in

    user/user/trial/output/part-r-00000 file in hdfs

    Check the output

    
    >$ hadoop fs -cat user/user/trial/output/part-r-00000
    
    

    If the output is displayed, then our installation is a success with MapReduce.

    Thus we have checked our installation.

    So our single-node Hadoop cluster is ready.

    References

    1. For downloading the Hadoop tarball:

    http://apache.techartifact.com/mirror/hadoop/common/hadoop-1.0.3/

    2. For downloading the openssh server:

    http://www.ubuntuupdates.org/package/core/precise/main/base/openssh-server

    3. For downloading JDK 1.6:

    http://www.oracle.com/technetwork/java/javasebusiness/downloads/java-archive-downloads-javase6-419409.html