Hadoop Installation in Isolated Environments


Apache Hadoop is an open-source software framework, licensed under the Apache v2 license, that supports data-intensive distributed applications. It supports running applications on large clusters of commodity hardware. Hadoop was derived from Google's MapReduce and Google File System (GFS) papers.

Hadoop development needs a Hadoop cluster. A trial Hadoop cluster can be set up in minutes; in my previous blog post, I explained a Hadoop cluster set-up using tarballs.

But for production environments, tarball installation is not a good approach, because it becomes complex when installing other Hadoop ecosystem components. If we have a high-speed internet connection, we can install Hadoop directly from the internet using yum install in minutes. But most production environments are isolated, so an internet connection may not be available. There we can perform the yum install by creating a local yum repository. Creating a local yum repository is explained well in my previous blog "Creating A Local YUM Repository". Yum install works with RedHat or CentOS Linux distributions. Here I explain the installation of the Cloudera Distribution of Hadoop (CDH).



The OS should be RedHat or CentOS (32 or 64 bit).


The ports necessary for Hadoop should be opened, so we need to set appropriate firewall rules. The ports used by Hadoop are listed in the last part of this post.

If you are not interested in setting firewall rules, you can simply switch off the firewall.

The command for turning off the firewall is:

service iptables stop


Sun Java is required. Download Java from the Oracle website (32 or 64 bit, depending on the OS) and install it.

Simply installing it and setting JAVA_HOME may not make the newly installed Java the default one; the system may still point to OpenJDK if it is present. To point to the new Java, do the following steps.

alternatives --config java

This will show the list of Javas installed on the machine, which one is currently the default, and it will ask you to choose one from the list. Exit from this by pressing Ctrl+C.

To add our Sun Java to this list, do the following step:

/usr/sbin/alternatives --install /usr/bin/java java <JAVA_HOME>/bin/java 2

This will add the newly installed Java to the list. Then do

alternatives --config java

and choose the newly installed Java. Now java -version will show Sun Java.


Download the Cloudera rpm repository from a place where you have internet access.

Download the repo file corresponding to your OS version; the repo files for the different operating systems are listed below. Copy the repo file and download the repository.

For OS version Red Hat/CentOS/Oracle 5, use the Red Hat/CentOS/Oracle 5 repo file.
For OS version Red Hat/CentOS 6 (32-bit), use the Red Hat/CentOS/Oracle 6 repo file.
For OS version Red Hat/CentOS 6 (64-bit), use the Red Hat/CentOS/Oracle 6 repo file.

You can download it rpm by rpm, or do a reposync. Reposync is explained in my previous post Creating A Local YUM Repository.

Once this is done, create a local repository on one of the machines. Then create a repo file corresponding to the newly created repository and add that repo file to all the cluster machines. After this we can do yum install just as on a machine with internet access.

Now do a yum clean all on all the cluster machines.

Do the following steps on the corresponding nodes. These steps cover the installation of MRv1 only.



On the Namenode machine:

yum install hadoop-hdfs-namenode

On the Secondary Namenode machine:

yum install hadoop-hdfs-secondarynamenode

On the Datanode machines:

yum install hadoop-hdfs-datanode

On the Jobtracker machine:

yum install hadoop-0.20-mapreduce-jobtracker

On the Tasktracker machines:

yum install hadoop-0.20-mapreduce-tasktracker

On client machines:

yum install hadoop-client

Normally we run the datanode and the tasktracker on the same nodes, i.e. they are co-located for data locality.
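The installation steps above can be sketched as a small helper that maps each node role to the packages it needs. This is an illustrative sketch, not part of the Cloudera packaging: the role names are my own, the datanode role bundles the tasktracker package because the two are co-located, and echo is used so the script prints the yum commands instead of running them.

```shell
# Map each cluster role to its CDH MRv1 package(s); swap "echo" for
# direct execution when running on the real nodes as root.
packages_for_role() {
  case "$1" in
    namenode)           echo "hadoop-hdfs-namenode" ;;
    secondarynamenode)  echo "hadoop-hdfs-secondarynamenode" ;;
    datanode)           echo "hadoop-hdfs-datanode hadoop-0.20-mapreduce-tasktracker" ;;
    jobtracker)         echo "hadoop-0.20-mapreduce-jobtracker" ;;
    client)             echo "hadoop-client" ;;
    *)                  echo "unknown role: $1" >&2; return 1 ;;
  esac
}

# Print the install command for every role (dry run).
for role in namenode secondarynamenode datanode jobtracker client; do
  echo "yum install -y $(packages_for_role "$role")"
done
```

On each node you would run only the line that matches its role; the -y flag makes yum non-interactive.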

Now edit the core-site.xml, mapred-site.xml and hdfs-site.xml on all the machines.

Sample configurations are given below. For a production set-up you will want to set other properties as well. The hostnames and slot counts in capitals are placeholders; the paths are the ones used later in this post, and 8020/8021 are the CDH default ports.

core-site.xml:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base dir for storing other temp directories</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://NAMENODE-HOST:8020</value>
  <description>The name of the default file system</description>
</property>

mapred-site.xml:

<property>
  <name>mapred.job.tracker</name>
  <value>JOBTRACKER-HOST:8021</value>
  <description>Job Tracker host and port</description>
</property>

<property>
  <name>mapred.local.dir</name>
  <value>/app/hadoop/mapred_local</value>
  <description>Local dir for mapreduce jobs</description>
</property>

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>MAP-SLOTS</value>
  <description>The maximum number of map tasks that will be run
  simultaneously by a task tracker.</description>
</property>

<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>REDUCE-SLOTS</value>
  <description>The maximum number of reduce tasks that will be run
  simultaneously by a task tracker.</description>
</property>

hdfs-site.xml:

<property>
  <name>dfs.replication</name>
  <value>3</value>
  <description>Replication factor</description>
</property>

Then create the hadoop.tmp.dir directory on all the machines; Hadoop stores its files in this folder. Here we are using the location /app/hadoop/tmp.

mkdir -p /app/hadoop/tmp

mkdir /app/hadoop/mapred_local

This type of installation automatically creates two users:

1) hdfs

2) mapred

The directories should be owned by the correct users, so we need to change the ownership:

chown -R hdfs:hadoop /app/hadoop/tmp

chmod -R 777 /app/hadoop/tmp

chown mapred:hadoop /app/hadoop/mapred_local
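The directory and permission steps can be wrapped in one script. A minimal sketch, with BASE parameterised to a scratch location so it can be tried without root; on the real nodes you would run it as root with BASE=/app/hadoop, and the chown commands (echoed here because they need the hdfs and mapred users created by the RPMs) would be executed for real.

```shell
# Local directory set-up for the Hadoop data paths.
BASE="${BASE:-/tmp/hadoop-dirs-demo}"   # use BASE=/app/hadoop on real nodes

mkdir -p "$BASE/tmp" "$BASE/mapred_local"
chmod -R 777 "$BASE/tmp"

# Ownership changes need root plus the hdfs/mapred users created by the
# RPM installation, so they are only echoed in this sketch:
echo "chown -R hdfs:hadoop $BASE/tmp"
echo "chown mapred:hadoop $BASE/mapred_local"
```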

The properties:

mapred.tasktracker.map.tasks.maximum: sets the number of map slots in each node.

mapred.tasktracker.reduce.tasks.maximum: sets the number of reduce slots in each node.

These numbers are set by a calculation on the available RAM and the JVM size of each task slot. The default size of a task slot is 200 MB. So if you have 4 GB of RAM free after meeting OS requirements and other processes, that node can have 4*1024 MB / 200 MB task slots,

i.e. 4*1024/200 = 20 (rounding down)

So we can have 20 task slots, which we divide into map slots and reduce slots. Usually we give a higher number of map slots than reduce slots.

In hdfs-site.xml we set the replication factor. The default value is 3.
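The slot arithmetic above can be written out explicitly. A sketch assuming the numbers from the text: 4 GB of free RAM, the 200 MB default slot size, and an illustrative 3:1 split between map and reduce slots.

```shell
# Task-slot calculation: free RAM divided by per-slot JVM size.
free_ram_mb=$((4 * 1024))     # RAM left after OS and other processes
slot_size_mb=200              # default task-slot size

total_slots=$((free_ram_mb / slot_size_mb))   # integer division
map_slots=$((total_slots * 3 / 4))            # favour map slots
reduce_slots=$((total_slots - map_slots))

echo "total=$total_slots map=$map_slots reduce=$reduce_slots"
```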

Formatting Namenode

Now go to the namenode machine and log in as the root user. Then, from the CLI, switch to the hdfs user:

su - hdfs

Then format the namenode:

hadoop namenode -format

Starting Services


On the Namenode machine, execute the following command as the root user.

/etc/init.d/hadoop-hdfs-namenode start

You can check whether the service is running by using the jps command.

jps will work only if Sun Java is installed and added to the PATH.


On the Secondary Namenode machine, execute the following command as the root user.

/etc/init.d/hadoop-hdfs-secondarynamenode start


On the Datanode machines, execute the following command as the root user.

/etc/init.d/hadoop-hdfs-datanode start

Now HDFS is started, and we will be able to execute HDFS tasks.


On the jobtracker machine, log in as the root user, then switch to the hdfs user:

su - hdfs

Then create the following HDFS directory structure and permissions:

hadoop fs -mkdir /app

hadoop fs -mkdir /app/hadoop

hadoop fs -mkdir /app/hadoop/tmp

hadoop fs -mkdir /app/hadoop/tmp/mapred

hadoop fs -mkdir /app/hadoop/tmp/mapred/staging

hadoop fs -chmod -R 1777 /app/hadoop/tmp/mapred/staging

hadoop fs -chown mapred:hadoop /app/hadoop/tmp/mapred
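The hadoop fs sequence above can be scripted as a loop. Shown here in dry-run form, with echo standing in for the real hadoop binary, since it assumes a running HDFS and the hdfs user; drop the echo to execute it on the cluster.

```shell
# Dry run: "echo hadoop" prints each command instead of executing it.
hadoop="echo hadoop"

for dir in /app /app/hadoop /app/hadoop/tmp \
           /app/hadoop/tmp/mapred /app/hadoop/tmp/mapred/staging; do
  $hadoop fs -mkdir "$dir"
done
$hadoop fs -chmod -R 1777 /app/hadoop/tmp/mapred/staging
$hadoop fs -chown mapred:hadoop /app/hadoop/tmp/mapred
```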

After doing this, start the jobtracker:

/etc/init.d/hadoop-0.20-mapreduce-jobtracker start

On the tasktracker nodes, start the tasktracker by executing the following command:

/etc/init.d/hadoop-0.20-mapreduce-tasktracker start

You can check the Namenode web UI using a browser. The URL is

http://NAMENODE-HOST:50070 (default port)

The Jobtracker web UI is at

http://JOBTRACKER-HOST:50030 (default port)

If hostname resolution has not been set up correctly, hostname:port may not work. In these situations you can use http://IP-ADDRESS:port instead.

Now our Hadoop cluster is ready for use. With this method, we can create a Hadoop cluster of any size within a short time. Since we have the entire Hadoop ecosystem repository locally, installing other components such as Hive, Pig, HBase, Sqoop etc. can be done very easily.

Ports Used By Hadoop

The default ports used by the daemons installed above are:

NameNode: 8020 (filesystem metadata/IPC) and 50070 (web UI)
Secondary NameNode: 50090 (web UI)
DataNode: 50010 (data transfer), 50020 (IPC) and 50075 (web UI)
JobTracker: 8021 (IPC) and 50030 (web UI)
TaskTracker: 50060 (web UI)

Open these ports in the firewall, plus any ports you set explicitly in the configuration files.

Creating A Local YUM Repository


People working on Linux may be familiar with the yum command. yum install <package name> is a command frequently used for installing packages from a remote repository.

YUM stands for Yellowdog Updater, Modified. It is a program that manages updates, installation and removal for RPM (RedHat Package Manager) based systems.

yum install picks the repository URL from /etc/yum.repos.d/, downloads the package and installs it on the machine.

Normally yum works on machines with internet access. But if we want to install packages in isolated environments, a normal yum install will not work, because the remote repository may not be accessible there. In these cases, we have to set up a local yum repository.

A local repository is an exact copy of the remote repository, made available inside the isolated environment. In most companies we need to set up a local repository for doing yum installations.

Creating a local yum repository is very simple. This document helps you to create one.


A RHEL or CentOS Linux machine


1) Installing and Starting the Webserver (httpd)

We need a webserver for serving the repository, so ensure that httpd is installed and running on the repository machine. If it is not there, download the httpd package and install it manually; if it is not running, start it manually.

rpm -ivh <package-name>

Then start the httpd service:

/etc/init.d/httpd start  or  service httpd start

2)      Creating a YUM Repository

For creating a repository, we need to install two packages

a)      createrepo

b)      yum-utils

Download and install these packages manually.

After this, create a folder with the name of your repository in /var/www/html/.

mkdir /var/www/html/<repo-name>

Note: the /var/www/html folder structure will be available only after starting the httpd service.

Copy the packages that you want to be available in the repository into the /var/www/html/<repo-name> directory (subfolders are fine).

After this, go inside the <repo-name> directory and execute the following command:

$ createrepo .

This will create the repodata of the repository.

Then try this repository from your web browser:

http://<ipaddress of webserver machine>/<repo-name>/

You will be able to see all the packages. You can add as many packages as you want into the repository, so that they can be made available to all the machines via this URL.

Note: the repository directory permissions should be set so that everybody can access it. It is better to give 755 permission.

3)      Creating a repo file

The packages present in the local repository can be made available to other machines by creating a repo file for this repository on the respective machines. In the repo file we mention the URL of the repository. This repo file is created in the /etc/yum.repos.d/ directory.

Create a file <repo-name>.repo, add the below contents and save it (the name in square brackets is the repository id):

[<repo-name>]
name=Local Yum Repository
baseurl=http://<ipaddress of webserver machine>/<path to repo>
gpgcheck=0
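Creating the repo file can be scripted, which is handy when it has to be pushed to every cluster machine. A sketch with made-up defaults: the repo name, the 192.0.2.10 example address and the /tmp output directory are all placeholders, and on a real node you would write straight into /etc/yum.repos.d/.

```shell
# Generate a yum .repo file for the local repository.
REPO_NAME="${REPO_NAME:-localrepo}"
REPO_URL="${REPO_URL:-http://192.0.2.10/localrepo}"  # placeholder address
OUT_DIR="${OUT_DIR:-/tmp/repo-demo}"                 # /etc/yum.repos.d on real nodes

mkdir -p "$OUT_DIR"
cat > "$OUT_DIR/$REPO_NAME.repo" <<EOF
[$REPO_NAME]
name=Local Yum Repository
baseurl=$REPO_URL
gpgcheck=0
EOF
```

Distribute the generated file to all cluster machines, then run yum clean all on each.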

After adding the repo file, execute the following command:

$ yum clean all

This will clear all the cached data about the previously configured repositories; it is just like refreshing.

With this you can create a repository, keep packages inside it and make it available to multiple machines. If you want a local copy of a remote repository, you can follow the steps below.

4) Syncing a remote repository with the local repository

If you want to create a local copy of a remote repository, we can create it using the reposync command. This needs an internet connection, because we are copying the repository from a remote location to our local machine.

Create a repo file in /etc/yum.repos.d/ with the URL of the remote repository:

[<repo-name>]
name=Remote repository
baseurl=http://<url of the remote repository>
gpgkey=http://<gpg-key url>
gpgcheck=1

Then create a directory for that repository, go into that directory and run the following commands:

yum clean all

reposync -r <repo-name>

Here <repo-name> is the repository id mentioned in the repo file; reposync downloads the packages into a subdirectory with that name. This will download the remote repository to our local machine, which will take time depending upon the size of the repository and the speed of the internet connection.

Copy the downloaded repository to the /var/www/html/ directory and run createrepo in it; then the repository can be accessed from any machine (follow steps 2 and 3).