Hadoop Cluster Migrator

The Hadoop Cluster Migrator tool provides a unified interface for copying data from one cluster to another. Traditionally, the DistCp tool provided by Hadoop is used to migrate and copy data from one Hadoop cluster to another. However, DistCp works only when connectivity between the source and target clusters is well established, without any firewall rules blocking it. But in production scenarios, edge nodes isolate the clusters from each other and DistCp can't be used to transfer data or back up the cluster. This is where Hadoop Cluster Migrator from Knowledge Lens can be very handy.

Hadoop Cluster Migrator is a cluster-agnostic tool that supports migration across different distributions and different versions of Hadoop. Currently we support the MapR, Cloudera, Hortonworks, EMC Pivotal and Apache distributions of Hadoop, in both Kerberos-enabled and disabled modes.

This completely Java-based tool provides large-scale data transfer between clusters, with transfer rates in the range of 10 GB/s depending upon the available bandwidth. The tool is fully restartable and resumes from the point where the last transfer stopped.

For more details, refer to: Hadoop Cluster Migrator



Migrating Hive from one Hadoop cluster to another

Recently I migrated a Hive installation from one cluster to another. I couldn't find any document about this migration, so I did it with my own experience and knowledge.
Hive stores its metadata in a database, i.e. it stores data about the tables in some database. For development/production-grade installations, we normally use MySQL/Oracle/PostgreSQL databases. Here I am explaining the migration of Hive with its metastore database in MySQL.
The metadata contains information about the tables; the contents of the tables themselves are stored in HDFS. So the metadata contains HDFS URIs and other details. If we migrate Hive from one cluster to another, we have to point the metadata to the HDFS of the new cluster. If we don't do this, it will still point to the HDFS of the old cluster.

For migrating a Hive installation, we have to do the following things:

1) Install Hive in the new Hadoop cluster.
2) Transfer the data present in the Hive warehouse directory (/user/hive/warehouse) to the new Hadoop cluster.
3) Take the MySQL metastore dump.
4) Install MySQL in the new Hadoop cluster.
5) Open the Hive MySQL metastore dump in a text editor such as Notepad or Notepad++ (or use sed, as in the sketch after step 7), search for hdfs://ip-address-old-namenode:port, replace it with hdfs://ip-address-new-namenode:port, and save it.

Here ip-address-old-namenode is the IP address of the namenode of the old Hadoop cluster, and ip-address-new-namenode is the IP address of the namenode of the new Hadoop cluster.

6) After doing the above steps, restore the edited MySQL dump into the MySQL of the new Hadoop cluster.
7) Configure Hive as normal and perform the Hive schema upgrades if needed.
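For reference, here is a rough command-level sketch of steps 2 through 7. The database name (metastore), user name, dump file name and URIs below are assumptions and will differ per setup; also note that distcp in step 2 works only if the two clusters can actually reach each other.

# Step 2: copy the warehouse directory to the new cluster
hadoop distcp hdfs://ip-address-old-namenode:port/user/hive/warehouse hdfs://ip-address-new-namenode:port/user/hive/warehouse

# Step 3: dump the metastore database on the old cluster
mysqldump -u hiveuser -p metastore > metastore-dump.sql

# Step 5: rewrite the namenode URI (sed does the same job as a text editor)
sed -i 's|hdfs://ip-address-old-namenode:port|hdfs://ip-address-new-namenode:port|g' metastore-dump.sql

# Step 6: restore the edited dump on the new cluster
mysql -u hiveuser -p metastore < metastore-dump.sql

# Step 7: upgrade the metastore schema if the Hive versions differ
# (schematool ships with Hive 0.12 and later; older versions provide upgrade SQL scripts instead)
schematool -dbType mysql -upgradeSchema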

This is a solution that I discovered when I faced the migration issues. I don't know whether any other standard methods are available.
This worked for me perfectly. 🙂

Upgrading Hadoop Clusters

The other day, my friends and I tried a Hadoop cluster upgrade.

We tried two upgrades and both were successful.

One was from a CDH3u1 cluster to CDH 4.3.0, and the other from CDH 4.1.2 to CDH 4.3.0.
For upgrading, we need to upgrade both the Hadoop installation and the filesystem.
It was a nice experience.

The steps we followed are listed below.

First we checked the filesystem for missing blocks and created a report of the entire filesystem.

From the superuser (hdfs), we executed the commands

hadoop dfsadmin -report > reportold.log

hadoop fsck / > fsckold.log

With this we will get the reports and status of the entire filesystem.

We can keep these for future comparison.

If any issues are found in the reports, take the necessary actions to fix them.

If everything is fine, we can move further with our upgrade process.

After this, we stopped all the processes.

To ensure no accidental data loss, we backed up our namenode and datanode storage, i.e. dfs.name.dir and dfs.data.dir.

After that we copied the Hadoop configuration files and saved them in a different location for later use.
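As an illustration, the backup could look something like this. The storage paths below are assumptions (they are whatever dfs.name.dir and dfs.data.dir point to in your hdfs-site.xml), and /etc/hadoop/conf is the usual CDH configuration directory:

# back up the namenode metadata (run on the namenode; path is an assumption)
tar -czf dfs-name-backup.tar.gz /data/dfs/nn

# back up the datanode storage (run on every datanode; path is an assumption)
tar -czf dfs-data-backup.tar.gz /data/dfs/dn

# save the old configuration files for later reuse
cp -r /etc/hadoop/conf /root/hadoop-conf-backup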

Then we uninstalled the entire Hadoop installation (old version).

Care should be taken to keep the contents of dfs.name.dir and dfs.data.dir safe.

We created a CDH 4.3.0 local repository and installed CDH 4.3.0 on all the machines, just like the old version. The installation steps are mentioned in my previous posts.

Creating A Local YUM Repository

Hadoop Installation

Then we added the configuration files that we had copied from the older installation.

We pointed the dfs.name.dir and dfs.data.dir to the correct locations.
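For example, the relevant properties in hdfs-site.xml would look something like this; the paths are assumptions. (In CDH4 these property names are deprecated in favour of dfs.namenode.name.dir and dfs.datanode.data.dir, but the old names still work.)

<property>
  <name>dfs.name.dir</name>
  <value>/data/dfs/nn</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/data/dfs/dn</value>
</property>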

After doing this, execute the following command on the namenode machine.

/etc/init.d/hadoop-hdfs-namenode upgrade

Or

service hadoop-hdfs-namenode upgrade

This will start the namenode and upgrade the Hadoop filesystem to the newer version.

After this, start all the other daemons and check whether everything is working fine or not.
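For instance, with the CDH 4.3.0 packages (MRv1), starting the remaining daemons would look roughly like this; the exact service names depend on which packages are installed:

service hadoop-hdfs-datanode start            # on every datanode
service hadoop-hdfs-secondarynamenode start   # on the secondary namenode machine
service hadoop-0.20-mapreduce-jobtracker start    # on the jobtracker machine
service hadoop-0.20-mapreduce-tasktracker start   # on every tasktracker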

Check the filesystem using the commands below (execute these commands as the superuser):

hadoop dfsadmin -report > reportnew.log

hadoop fsck / > fscknew.log

Compare reportnew.log and fscknew.log with reportold.log and fsckold.log.
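A quick way to do the comparison, assuming the logs were generated as above:

diff reportold.log reportnew.log
diff fsckold.log fscknew.log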

Note: If we are not satisfied with the upgrade, we can roll back to the previous version. This can be done by uninstalling the newer version, installing the older version, and executing the command

/etc/init.d/hadoop-hdfs-namenode rollback

This can be done only once, and cannot be done once the upgrade is finalized.

If both the reports are the same and there is no problem of missing blocks, we can finalize our upgrade.

Stop all the daemons and execute the following command on the namenode machine

/etc/init.d/hadoop-hdfs-namenode finalizeUpgrade
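Finalization can also be triggered through dfsadmin while the namenode is running:

hadoop dfsadmin -finalizeUpgrade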

Once the upgrade is finalized, we cannot roll back.

Note: From our experience, we found that the CDH 4.1.2 filesystem and the CDH 4.3.0 filesystem are compatible, i.e. we found CDH 4.3.0 working properly on CDH 4.1.2's filesystem without executing the upgrade command.