Decommissioning a Datanode in a Hadoop cluster

Sometimes we may need to remove a node from a Hadoop cluster without losing the data stored on it.
For this we have to follow the decommissioning procedure.
Decommissioning excludes a node from the cluster only after the data present on the decommissioning node has been replicated to the other active nodes.

The decommissioning procedure is very simple, and the steps are explained below.
First, stop the tasktracker on the node to be decommissioned.
Then, on the namenode machine, add the property below to hdfs-site.xml:

<property>
  <name>dfs.hosts.exclude</name>
  <value>/etc/hadoop/conf/dfs.exclude</value>
</property>

Here dfs.exclude is a file that we have to create and place in a safe location; it is better to keep it in HADOOP_CONF_DIR (/etc/hadoop/conf).

Create a file named dfs.exclude and add the hostnames of the machines that need to be decommissioned, one per line.

Eg: dfs.exclude

hostname1
hostname2
hostname3

After doing this, execute the following command as the superuser on the namenode machine.

hadoop dfsadmin -refreshNodes

After this, check the namenode UI, i.e. http://namenode:50070.
You will be able to see the machines listed under decommissioning nodes.
The decommissioning process will take some time.
After the re-replication is complete, the machine will be moved to the decommissioned nodes list.
At that point, the decommissioned node can be safely removed from the cluster. 🙂
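
For reference, here is the whole flow as commands. This is only a minimal sketch: it assumes an MRv1 cluster with the standard Hadoop scripts on the path and the exclude file kept in /etc/hadoop/conf, so the exact service commands may differ in your distribution.

# On the node being decommissioned: stop the tasktracker so no new tasks are scheduled there.
hadoop-daemon.sh stop tasktracker

# On the namenode: add the host to the exclude file referenced by dfs.hosts.exclude.
echo "hostname1" >> /etc/hadoop/conf/dfs.exclude

# Tell the namenode to re-read its include/exclude files and begin decommissioning.
hadoop dfsadmin -refreshNodes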

Changing the Replication factor of an existing Hadoop cluster

I have an 8 node cluster. The storage space was almost full and I wanted to free up some space, so I decided to reduce the replication factor from 3 to 2.
For that I edited the dfs.replication property in the hdfs-site.xml of all the nodes and restarted HDFS. But this sets the replication factor to 2 only for newly created files. So, in order to change the replication factor of all the existing data in the cluster to 2, run the following command as the superuser.

hadoop fs -setrep -R -w 2 /

After doing these steps, all the existing HDFS data will carry only two replicas.
Similarly, you can change the replication factor to any number. 🙂
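
To confirm that the change has taken effect, something like the following can be used. This is only a sketch: the file path is a hypothetical example and the exact fsck output format varies between Hadoop versions.

# Check the replication factor of a particular file (%r prints the replication).
hadoop fs -stat %r /user/hadoop/sample.txt

# fsck reports the average and default block replication for the whole namespace.
hadoop fsck / | grep -i replication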

Big Data Trainings

Big Data has arrived!! If you are an IT professional who wishes to change your career path to Big Data and become a Big Data expert in a month, you have come to the right place.
We provide personalized Hadoop training with hands-on, real-life use cases. Our mission is to ensure that you are a Big Data expert within a month.

MyBigDataCoach provides expert professional coaching on Big Data Technologies, Data Science and Analytics.
Trainings are conducted by senior experts in Big Data Technologies.

We also provide corporate training on Big Data technologies in India.

You can register for our online trainings on various Big Data Technologies.

Happy learning!

For further queries, contact us at:

Email: bigdatacoach@gmail.com
www.mybigdatacoach.com
Facebook: https://www.facebook.com/mybigdatacoach
Mobile: +91-9645191674

Prerequisites to attend

Basic knowledge of Linux commands, core Java and writing SQL queries

Course Contents:

Day 1
- Introduction to Big Data and Hadoop (Common)
  - Technology Landscape
  - Why Big Data?
  - Difference between Big Data and traditional BI
  - Fundamentals of High Scalability
  - Distributed Systems and Challenges
  - Key Fundamentals for Big Data
  - Big Data Use Cases
  - End to End production use case deployed on Hadoop
  - When to use Hadoop and when not to
Day 2
- HDFS Fundamentals
  - Fundamentals behind HDFS Design
  - Key Characteristics of HDFS
  - HDFS Daemons
  - HDFS Commands
  - Anatomy of File Read and Write in HDFS
  - HDFS File System Metadata
  - How replication happens in Hadoop
  - How the replication strategy is defined and how the network topology can be defined
  - Limitations of HDFS
  - When to use HDFS and when not to
Day 3
- Map-Reduce Fundamentals
  - What is Map-Reduce
  - Examples of Map-Reduce Programs
  - How to think in Map-Reduce
  - What is feasible in Map-Reduce and what is not
  - End to End flow of Map-Reduce in Hadoop
Day 4
- YARN
  - Architecture difference between MRv1 and YARN
  - Introduction to the Resource Manager
  - Node Manager Responsibilities
  - Application Manager
  - Proxy Server
  - Job History Server
  - Running Map-Reduce programs in YARN
Day 5 and Day 6
- Hadoop Administration Part 1
  - Hadoop Installation and Configuration
  - YARN Installation and Configuration
  - Hadoop Name Node
    - HDFS Name Node Metadata Structure
    - FSImage and Edit Logs
    - Viewing Name Node Metadata and Edit Logs
    - HDFS Name Node Federation
    - Federation and Block Pool IDs
    - Tracing HDFS Blocks
  - Name Node Sizing
    - Memory calculations for HDFS Metadata
    - Selecting the optimal Block Size
  - Secondary Name Node
    - Checkpoint process in detail
  - Hadoop Map-Reduce
    - Tracing a Map-Reduce Execution from the Admin View
    - Logs and the History Viewer
Day 7
- Hadoop Administration Part 2
  - Hadoop Configurations
  - High Availability of the Name Node
  - Configuring Hadoop Security
  - NameNode Safemode and the conditions for the namenode to be in Safemode
  - Name Node High Availability
  - DistCp commands in Hadoop
  - File Formats in Hadoop (RC, ORC, Sequence File, Avro etc.)
Day 8
- Hadoop Ecosystem Components
  - Role of each ecosystem component
  - How does it all fit together
- Hive
  - Introduction
  - Concepts of the Metastore
  - Installation
  - Configuration
  - Basics of Hive
  - What Hive cannot do
  - When not to use Hive
Day 9
- Pig
  - Introduction
  - Installation and Configuration
  - Basics of Pig
  - Hands-on Example
Day 10
- Oozie
  - Introduction
  - Installation and Configuration
  - Running workflows in Oozie with Hive, Map-Reduce, Pig and Sqoop
Day 11
- Flume
  - Introduction
  - Installation and Configuration
  - Running Flume examples with Hive, HBase etc.
Day 12
- Hue
  - Introduction
  - Hue Installation and Configuration
  - Using Hue
- ZooKeeper
  - Introduction
  - Installation and Configuration
  - Examples in ZooKeeper
- Sqoop
  - Introduction to Sqoop
  - Installation and Configuration
  - Examples for Sqoop
Day 13
- Monitoring
  - Monitoring Hadoop processes
- Hadoop Schedulers
  - FIFO Scheduler
  - Capacity Scheduler
  - Fair Scheduler
  - Difference between the Fair and Capacity Schedulers
  - Hands-on with Scheduler Configuration
- Cluster Planning and Sizing
  - Hardware Selection Considerations
  - Sizing
  - Operating System Considerations
  - Kernel Tuning
  - Network Topology Design
  - Hadoop Filesystem Quotas
  - Hands-on with a few Hadoop tuning configurations
  - Hands-on sizing of a 100 TB cluster
Day 14
- Hadoop Maintenance
  - Logging and Audit Trails
  - File System Maintenance
  - Backup and Restore
  - DistCp
  - Balancing
  - Failure Handling
  - Map-Reduce System Maintenance
  - Upgrades
  - Performance Benchmarking and Testing
- Hadoop Cluster Monitoring
  - Installation of Nagios and Ganglia
  - Configuring Nagios and Ganglia
  - Collecting Hadoop Metrics
  - REST interface for metrics collection
  - JMX JSON Servlet
  - Cluster Health Monitoring
  - Configuring Alerts for Clusters
  - Overall Cluster Health Monitoring
  - Introduction to Cloudera Manager
Day 15
- Advanced Development for Hadoop
  - Java API for HDFS Interactions
  - File Read and Write to HDFS
  - WebHDFS API and interacting with Hadoop using WebHDFS
  - Different protocols used for interacting with HDFS
  - Hadoop RPC and security around RPC
  - Communication between the Client and the Data Node
  - Hands-on examples writing different file formats to HDFS
Day 16
- Hadoop Map-Reduce API
  - InputFormats and Record Readers
  - Splittable and Non-Splittable Files
  - Mappers
  - Combiners
  - Partitioners
  - Sorters
  - Reducers
  - OutputFormats and Record Writers
  - Implementing custom Input Formats for PST and PDF
- MapReduce Execution Framework
  - Counters
  - Inside the MapReduce Daemons
  - Failure Handling
  - Speculative Execution
Day 17
- Sqoop
  - Difference between Sqoop and Sqoop2
  - The various parameters in Export
  - The various parameters in Import
  - Typical challenges with Sqoop operations
  - How to tune Sqoop performance
Day 18 and Day 19
- MapReduce Examples and Design Patterns
- Pig UDFs
- Hive SerDes
- Hive UDF, UDAF and UDTF
- These days also act as buffers for any spill-over sessions!!
Day 20: Hadoop Design and Architecture
- Security
  - Security Design for HDFS
  - Kerberos Fundamentals
  - Setting up a KDC
  - Configuring a Secured Hadoop Cluster
  - Setting up multi-realm authentication for production deployments
  - Typical production deployment challenges with respect to Hadoop Security
  - Role of the HttpFS proxy for corporate firewalls
  - Role of Cloudera Sentry and Knox
- Common Failures and Problems
  - File system related issues
  - Map-Reduce related issues
  - Maintenance related issues
  - Monitoring related issues
Day 21
- Hive
  - Hive UDF, UDAF and UDTF
  - Writing custom UDFs
  - SerDes and the role of a SerDe
  - Writing a SerDe
  - Advanced Analytical Functions
  - Real Time Query
  - Difference between Stinger and Impala
  - Key Emerging Trends
  - Implementing Updates and Deletes in Hive
Day 22
- Pig
  - Architecture of Pig
  - Advanced Pig Join Types
  - Advanced Pig Latin Commands
  - Pig Macros and their Limitations
  - Typical Issues with Pig
  - When to use Pig and when not to
Day 23
- Oozie
  - Architecture and Fundamentals
  - Installing and Configuring Oozie
  - Oozie Workflows
  - Coordinator Jobs
  - Bundle Jobs
  - Different patterns in Oozie Scheduling
  - How to troubleshoot in Oozie
  - How to handle different libraries in Oozie
  - Hands-on example with Oozie
- Hue
  - Architecture and Fundamentals
  - Installing and Configuring Hue
  - Executing Pig, Hive and Map-Reduce through Hue using Oozie
  - Various features of Hue
  - Integration of Hue users with Enterprise Identity Management systems
Day 24
- Flume
  - Flume Architecture
  - Complex and Multiplexing Flows in Flume
  - Avro RPC
  - Configuring and running Flume agents with the various supported sources (NetCat, JMS, Exec, Thrift, Avro)
  - Configuring and running Flume agents with the various supported sinks (HDFS, Logger, Avro, HBase, FileRoll, ElasticSearch etc.)
  - Understanding batch load to HDFS
  - Examples with Flume in real project scenarios for
    - Log Analytics
    - Machine data collection with SNMP sources
    - Social Media Analytics
  - Typical challenges with Flume operations
  - Integration with Hive and HBase
  - Implementing Custom Flume Sources and Sinks
  - Flume Security with Kerberos
Day 25
- ZooKeeper
  - Architecture
  - High Scalability with ZooKeeper
  - Common recipes with ZooKeeper
  - Leader Election
  - Distributed Transaction Management
  - Node Failure Detection and Cluster Membership Management
  - Coordination Services
  - Cluster deployment recipe with ZooKeeper
  - Typical challenges with ZooKeeper operations
- YARN
  - YARN Architecture and Advanced Concepts
Day 26
- End to End POC Design
  - Live example of an end to end POC that uses all the ecosystem components

Simple Hive JDBC Client

Here I am explaining a sample Hive JDBC client. With this we can fire Hive queries from Java programs. The only requirement is that the Hive server has to be running; by default, the Hive server listens on port 10000. The sample program is given below. The program is self explanatory and you can rewrite it to execute any type of Hive query. For this program you need the Hive JDBC driver jar and its dependencies in the classpath.
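
If the Hive server is not already running, it can be started as shown below. This is only a sketch: it assumes the original HiveServer1 (which is what the org.apache.hadoop.hive.jdbc.HiveDriver used in this client talks to) and that the hive binary is on the path.

# Start HiveServer1; by default it listens on port 10000.
hive --service hiveserver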

import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

  /*
   * @author
   * Amal G Jose
   * 
   */

public class HiveJdbc {
  private static String driver = "org.apache.hadoop.hive.jdbc.HiveDriver";

  /**
   * @param args
   * @throws SQLException
   */
  public static void main(String[] args) throws SQLException {
    // Load the Hive JDBC driver class.
    try {
      Class.forName(driver);
    } catch (ClassNotFoundException e) {
      e.printStackTrace();
      System.exit(1);
    }

    // Change "localhost" to the host on which the Hive server is running.
    Connection connect = DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
    Statement state = connect.createStatement();
    String tableName = "test";

    // Drop the table if it already exists, then recreate it.
    state.executeQuery("drop table " + tableName);
    ResultSet res = state.executeQuery("create table " + tableName + " (key int, value string)");

    // Query to show tables
    String show = "show tables";
    System.out.println("Running: " + show);
    res = state.executeQuery(show);
    if (res.next()) {
      System.out.println(res.getString(1));
    }

    // Query to describe the table
    String describe = "describe " + tableName;
    System.out.println("Running: " + describe);
    res = state.executeQuery(describe);
    while (res.next()) {
      System.out.println(res.getString(1) + "\t" + res.getString(2));
    }

    connect.close();
  }
}

An Introduction to Apache Hive

Hive is an important member of the Hadoop ecosystem and runs on top of Hadoop. Hive uses an SQL-like query language to process the data in HDFS, which is much simpler than writing many lines of MapReduce code in a programming language such as Java. Hive was developed by Facebook with the vision of helping their SQL experts handle big data without much difficulty. Hive queries are easy to learn for people who don't know any programming language, and people with SQL experience can start writing Hive queries straight away. The queries fired into Hive ultimately run as MapReduce jobs.

Hive runs in two execution modes: local and distributed.

In local mode, a Hive query runs as a single process and uses the local file system. In distributed mode, the mappers and reducers run as separate processes and use the Hadoop distributed file system.

The installation of Hive is explained in my previous post, Hive Installation.

Hive stores its data in HDFS and the table details (metadata) in a database. By default the metadata is stored in a Derby database, but this is suitable only for playground setups and cannot be used for multi-user environments. For multi-user environments we can use databases such as MySQL, PostgreSQL or Oracle to store the Hive metadata. The data itself is stored in HDFS, in a location called the Hive warehouse directory, which is defined by the property hive.metastore.warehouse.dir. By default this is /user/hive/warehouse.
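
Since the warehouse is just a directory in HDFS, it can be browsed like any other path. For example, assuming the default location:

# List the databases and tables stored under the default Hive warehouse directory.
hadoop fs -ls /user/hive/warehouse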

We can fire queries into Hive using the command line interface or using clients written in different programming languages. The Hive server exposes a Thrift service, making Hive accessible from various programming languages.

The simplicity and power of Hive can be illustrated by comparing the word count program written in Java with the same job written as a Hive query.

The word count program written in Java is explained in my previous post A Simple Mapreduce Program – Wordcount. For that we have to write many lines of code, which takes time and also requires good programming knowledge.

The same word count can be done with just a few lines of Hive query.

CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'text' OVERWRITE INTO TABLE docs;
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
(SELECT explode(split(line, '\s')) AS word FROM docs) word
GROUP BY word
ORDER BY word;
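
As a usage note, the queries above can be saved into a script file and run non-interactively through the Hive CLI; the file name here is just an example.

# Run the word count script from the shell.
hive -f wordcount.q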

Hadoop MapReduce – A Good Presentation…

This is a video I found on YouTube that explains Hadoop MapReduce very clearly using the wordcount example.

Migrating Hive from one Hadoop cluster to another cluster

Recently I migrated a Hive installation from one cluster to another cluster. I couldn't find any document describing this migration, so I did it based on my own experience and knowledge.
Hive stores its metadata, i.e. the details about the tables, in a database. For development and production grade installations we normally use MySQL, Oracle or PostgreSQL. Here I am explaining the migration of Hive together with its metastore database in MySQL.
The metadata contains the information about the tables, while the contents of the tables are stored in HDFS, so the metadata contains HDFS URIs and other details. If we migrate Hive from one cluster to another cluster, we therefore have to point the metadata to the HDFS of the new cluster; if we don't do this, it will keep pointing to the HDFS of the old cluster.

To migrate a Hive installation, we have to do the following:

1) Install Hive in the new Hadoop cluster.
2) Transfer the data present in the Hive warehouse directory (/user/hive/warehouse) to the new Hadoop cluster.
3) Take a dump of the MySQL metastore database.
4) Install MySQL in the new Hadoop cluster.
5) Open the Hive MySQL metastore dump in a text editor such as Notepad or Notepad++, search for hdfs://ip-address-old-namenode:port, replace it with hdfs://ip-address-new-namenode:port and save the file (a command-line sketch of these steps is given below). Here ip-address-old-namenode is the IP address of the namenode of the old Hadoop cluster and ip-address-new-namenode is the IP address of the namenode of the new Hadoop cluster.
6) After doing the above steps, restore the edited MySQL dump into the MySQL instance of the new Hadoop cluster.
7) Configure Hive as normal and do the Hive schema upgrades if needed.
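
The following is a rough command-line sketch of steps 2, 3, 5 and 6. It assumes the metastore database is named metastore, the MySQL user is hiveuser and both namenodes listen on port 8020; these names and ports are only assumptions and must be adjusted to your environment. It also uses sed instead of a text editor for the search and replace.

# Step 2: copy the warehouse data to the new cluster (run from the new cluster).
hadoop distcp hdfs://ip-address-old-namenode:8020/user/hive/warehouse hdfs://ip-address-new-namenode:8020/user/hive/warehouse

# Step 3: dump the metastore database on the old cluster.
mysqldump -u hiveuser -p metastore > metastore_dump.sql

# Step 5: rewrite the HDFS URIs so that they point to the new namenode.
sed -i 's|hdfs://ip-address-old-namenode:8020|hdfs://ip-address-new-namenode:8020|g' metastore_dump.sql

# Step 6: restore the edited dump into MySQL on the new cluster.
mysql -u hiveuser -p -e "CREATE DATABASE metastore"
mysql -u hiveuser -p metastore < metastore_dump.sql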

This is a solution that I discovered when I faced the migration issue. I don't know whether any other standard methods are available.
This worked for me perfectly. 🙂