Undeleting and purging KeyTrustee Key Provider methods via the REST interface

HDFS data encryption is an excellent feature that was added recently. With it, we can encrypt the data stored in hdfs and create multiple encryption zones, each with a different encryption key, so that the data in hdfs is properly secured. For more details, you can visit these references: Reference1, Reference2.

I am using a cluster installed with CDH. I created some encryption keys and zones.
The commands I used for creating a key and an encryption zone are given below.

# As the normal user, create a new encryption key
hadoop key create amalKey
 

# As the super user, create a new empty directory and make it an encryption zone
hadoop fs -mkdir /user/amal
hdfs crypto -createZone -keyName amalKey -path /user/amal
 

# chown it to the normal user
hadoop fs -chown amal:hadoop /user/amal
 

# As the normal user, put a file in, read it out
hadoop fs -put test.txt /user/amal/
hadoop fs -cat /user/amal/test.txt
 

After some days, I deleted the encryption zone and the encryption key as well.
The command I used for deleting the encryption key is given below.

hadoop key delete <key-name>

After the deletion, I tried creating a key with the same name, but I got an exception saying that the key still exists in a disabled state. When I list the keys, I am not able to see it. The exception I got is given below.

amalKey has not been created. java.io.IOException: HTTP status [500], exception [com.cloudera.keytrustee.TrusteeKeyProvider$DuplicateKeyException], message [Key with name "amalKey" already exists in "com.cloudera.keytrustee.TrusteeKeyProvider@6d88562. Key exists but has been disabled. Use undelete to enable.]
java.io.IOException: HTTP status [500], exception [com.cloudera.keytrustee.TrusteeKeyProvider$DuplicateKeyException], message [Key with name "amalKey" already exists in "com.cloudera.keytrustee.TrusteeKeyProvider@6d88562. Key exists but has been disabled. Use undelete to enable.]
at org.apache.hadoop.util.HttpExceptionUtils.validateResponse(HttpExceptionUtils.java:159)
at org.apache.hadoop.crypto.key.kms.KMSClientProvider.call(KMSClientProvider.java:545)
at org.apache.hadoop.crypto.key.kms.KMSClientProvider.call(KMSClientProvider.java:503)
at org.apache.hadoop.crypto.key.kms.KMSClientProvider.createKeyInternal(KMSClientProvider.java:676)
at org.apache.hadoop.crypto.key.kms.KMSClientProvider.createKey(KMSClientProvider.java:684)
at org.apache.hadoop.crypto.key.KeyShell$CreateCommand.execute(KeyShell.java:483)
at org.apache.hadoop.crypto.key.KeyShell.run(KeyShell.java:79)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.crypto.key.KeyShell.main(KeyShell.java:515)

The error message says to use the purge option to permanently delete the key and undelete to recover a deleted key, but I was not able to find these options in the hadoop key command. Googling did not help me figure out the issue either. Finally I got guidance from someone at Cloudera to execute the purge and undelete operations through the REST API of KeyTrustee, along with a nice explanation of the issue. I am briefly putting the solution for this exception below.

The delete operation on the Trustee key provider is a “soft delete”, meaning that it is possible to “undelete” the key. It is also possible to “purge” the key to delete it permanently. Because these operations are not part of the standard Hadoop key provider API, they are not currently exposed through the Hadoop KeyShell (hadoop key). However, you can call these operations directly via the Trustee key provider REST API.

See the examples below.

Use KeyShell to list existing keys

$ ./bin/hadoop key list -provider kms://http@localhost:16000/kms
 
Listing keys for KeyProvider: KMSClientProvider[http://localhost:16000/kms/v1/]
amal-testkey-1

Use KeyShell to delete an existing key

$ ./bin/hadoop key delete amal-testkey-1 -provider kms://http@localhost:16000/kms
 
Deleting key: amal-testkey-1 from KeyProvider: KMSClientProvider[http://localhost:16000/kms/v1/]
amal-testkey-1 has been successfully deleted.
KMSClientProvider[http://localhost:16000/kms/v1/] has been updated.

Use KeyShell to verify the key was deleted

$ ./bin/hadoop key list -provider kms://http@localhost:16000/kms
 
Listing keys for KeyProvider: KMSClientProvider[http://localhost:16000/kms/v1/]
 

Use the KeyTrustee key provider REST API to undelete the deleted key

$ curl -L -d "trusteeOp=undelete" "http://localhost:16000/kms/v1/trustee/key/amal-testkey-1?user.name=amal&trusteeOp=undelete"

Use KeyShell to verify the key was restored

$ ./bin/hadoop key list -provider kms://http@localhost:16000/kms
 
Listing keys for KeyProvider: KMSClientProvider[http://localhost:16000/kms/v1/]
amal-testkey-1

Use the KeyTrustee key provider REST API to purge the restored key

$ curl -L -d "trusteeOp=purge" "http://localhost:16000/kms/v1/trustee/key/amal-testkey-1?user.name=amal&trusteeOp=purge"

Use KeyShell to verify the key was deleted

$ ./bin/hadoop key list -provider kms://http@localhost:16000/kms
 
Listing keys for KeyProvider: KMSClientProvider[http://localhost:16000/kms/v1/]
 

Use the KeyTrustee key provider REST API to attempt to undelete the purged key

$ curl -L -d "trusteeOp=undelete" "http://localhost:16000/kms/v1/trustee/key/amal-testkey-1?user.name=amal&trusteeOp=undelete"

{
  "RemoteException" : {
    "message" : "Key with name amal-testkey-1 not found in com.cloudera.keytrustee.TrusteeKeyProvider@6d88562",
    "exception" : "IOException",
    "javaClassName" : "java.io.IOException"
  }
}

Configure ACLs for KeyTrustee undelete, purge and migrate operations

ACLs for the KeyTrustee-specific undelete, purge and migrate operations are configured in kts-acls.xml. Place this file in the same location as your kms-acls.xml file. See the example below.

<property>
  <name>keytrustee.kms.acl.UNDELETE</name>
  <value>*</value>
  <description>
    ACL for undelete-key operations.
  </description>
</property>

<property>
  <name>keytrustee.kms.acl.PURGE</name>
  <value>*</value>
  <description>
    ACL for purge-key operations.
  </description>
</property>

<property>
  <name>keytrustee.kms.acl.MIGRATE</name>
  <value>*</value>
  <description>
    ACL for migrate-key operations.
  </description>
</property>
 

Note: In kerberized environments, the request is a little different. It will be in the following format:

Eg:
curl -L --negotiate -u [username] -d "trusteeOp=undelete" "http://localhost:16000/kms/v1/trustee/key/amal-testkey-1?user.name=[username]&trusteeOp=undelete"

Changing the data type mapping in sqoop

Sqoop is very helpful for importing data from an RDBMS into hadoop. The hive import feature creates a hive table corresponding to the RDBMS table and imports the data. By default, sqoop creates the hive table based on the predefined data type conversion logic built into sqoop. We have an option to override the default conversion and explicitly specify the data type required in the hive table, by adding one of the options below.

--map-column-java <mapping> Override mapping from SQL to Java type for configured columns.
--map-column-hive <mapping> Override mapping from SQL to Hive type for configured columns.

For example, if we have a field called ‘id‘ in an SQL table which is of integer data type and we want it as a string column in hive, we can add the following option.

sqoop import ... --map-column-hive id=string
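For context, a fuller invocation might look like the sketch below; the JDBC connection string, credentials and table name are placeholders for illustration, not values from the original post.

sqoop import \
  --connect jdbc:mysql://dbserver/salesdb \
  --username dbuser -P \
  --table customers \
  --hive-import \
  --map-column-hive id=string

Note that --map-column-hive changes only the column type in the generated hive table, while --map-column-java overrides the Java type sqoop uses for that column in its generated import code.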

Introduction to Apache Spark

Big Data is very hot in the market, and Hadoop is one of its top-rated technologies. Hadoop became very popular because of its elegant design, its capability to handle large structured/unstructured/semi-structured data and its strong community support. Hadoop is a batch processing framework that can process data of any size, and the one thing it guarantees is that it will not fail because of load. Initially the requirement was to handle large data without failure, which led to the design and development of frameworks such as hadoop. After that, people started thinking about the performance improvements that could be made in this processing, which led to the development of a technology called spark.

Spark is an open source technology for processing large data in a distributed manner, with some extra features compared to mapreduce, and its processing speed is higher than that of mapreduce. Most current cluster programming models are based on an acyclic data flow from stable storage to stable storage, and acyclic data flow is inefficient for applications that repeatedly reuse a working set of data. The main motivation behind the development of spark was the inefficient handling of two types of applications in the existing computing frameworks: iterative algorithms and interactive data mining tools. With those frameworks, applications reload data from stable storage on each query; if the same data is reloaded multiple times, it consumes more time and slows down the processing, and with large data the time loss is high. If instead we store the intermediate results of a process in memory and share the in-memory copy of the results across the cluster, the time delay is smaller, which results in better performance. The inability to keep intermediate results in memory is one of the major drawbacks of most popular distributed data processing technologies, and the need for in-memory computation arises mainly in iterative algorithms and data mining applications. Spark achieves this in-memory computation with RDDs (Resilient Distributed Datasets), which form the backbone of spark.

An RDD is a distributed memory abstraction that lets programmers perform in-memory computations on very large clusters in a fault-tolerant manner. It is a read-only, partitioned collection of records: an immutable collection of objects spread across the cluster, carrying enough information about how it was derived from other datasets to be recomputed if a partition is lost.
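To make the RDD idea a little more concrete, here is a minimal sketch using Spark's Java API; the class name, the sample data and the local[*] master are illustrative assumptions, not taken from any particular application.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RDDExample {
  public static void main(String[] args) {
    // Local master only for illustration; on a cluster this would point to YARN or standalone
    SparkConf conf = new SparkConf().setAppName("rdd-example").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // Build an RDD from an in-memory collection; in a real job this would usually
    // come from HDFS, e.g. sc.textFile("hdfs:///path/to/data")
    JavaRDD<String> lines = sc.parallelize(Arrays.asList(
        "spark keeps data in memory",
        "hadoop mapreduce writes intermediate data to disk",
        "spark rdds are immutable and partitioned"));

    // A derived RDD; Spark only records the lineage here, nothing is computed yet
    JavaRDD<String> sparkLines = lines.filter(line -> line.contains("spark"));

    // cache() asks Spark to keep the computed partitions in memory, so repeated
    // actions on the same working set do not recompute or re-read them
    sparkLines.cache();

    System.out.println("count = " + sparkLines.count()); // first action: computes and caches
    System.out.println("again = " + sparkLines.count()); // second action: served from the cache

    sc.stop();
  }
}

If a worker does not have enough RAM, persist(StorageLevel.MEMORY_AND_DISK()) can be used in place of cache() so that partitions which do not fit in memory are kept on disk rather than recomputed.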


Spark is rich with several features because of the modules built on top of it:

  • Spark Streaming: processing real-time data streams
  • Spark SQL and DataFrames: support for structured data and relational queries
  • MLlib: built-in machine learning library
  • GraphX: Spark’s new API for graph processing
  • Bagel (Pregel on Spark): older, simple graph processing model

Is spark a replacement for hadoop?

Spark is not a replacement for hadoop. It can work along with hadoop, use hadoop’s file system (HDFS) as the storage layer and run on an existing hadoop cluster; in fact, spark has become one of the most active projects in the hadoop ecosystem. The comparison applies only to the processing layer, mapreduce. As per the current test results, spark performs much better than mapreduce and has several advantages over it. Spark is still under active development and more features are coming up. Realtime stream processing is also better in spark than in the other hadoop ecosystem components. A detailed performance report of spark is available at the following url.

http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html

Is Spark free or Licensed?

Spark is a 100% open source project. It is now an Apache project with several committers and contributors across the world.

What are the attractive features of spark in comparison with mapreduce?

  • Spark has Java, Scala and Python APIs.
  • Programming spark is simpler than programming mapreduce, which reduces development time.
  • The performance of spark is better than that of mapreduce. It is best suited for computations such as realtime processing and iterative computations on the same data.
  • Caching is one of the main features of spark. Spark stores intermediate results in memory across its distributed workers, whereas mapreduce stores intermediate results on disk; this in-memory caching makes spark faster. Spark streaming also provides realtime processing of data on the fly, which is missing in mapreduce.
  • With spark, it is possible to do batch processing, stream processing, graph processing and machine learning in the same cluster, which gives better resource utilization and easier resource management.
  • Spark has an excellent feature of spilling data partitions to disk if a node does not have sufficient RAM to hold them.
  • All these features make spark a very powerful member of the bigdata technology stack, and it will be one of the hottest technologies set to capture the market.

Rhipe on AWS (YARN and MRv1)

Rhipe is an R library that runs on top of hadoop. Rhipe uses the hadoop streaming concept for running R programs on hadoop; to know more about Rhipe, please check my older post. My previous post on Rhipe gave the basic explanation and the installation steps for running Rhipe on cdh4 (MRv1). Now YARN has become popular and almost everyone is using it, so a lot of people have asked me for assistance with installing Rhipe on YARN. Rhipe works very well on YARN. Here I am just giving a pointer on how to install Rhipe on AWS (Amazon Web Services). I checked this script and it is working fine. It contains the bootstrap script and installables that install Rhipe automatically on AWS.

For those who are new to AWS, I will explain the basics of AWS EMR and bootstrap scripts. Amazon Web Services provides a lot of cloud services. Among them, Elastic Mapreduce (EMR) is a service that provides a hadoop cluster. It is one of the best solutions for users who don’t want to maintain a data center or take on the headaches of hadoop administration. AWS provides a list of components to install in the hadoop cluster, which we can choose while creating the cluster through the web console. Examples of such components are hive, pig, impala, hue, hbase, hunk etc. But in most cases, users may require some extra software as well, and this requirement depends on the user. If the user tries to install the extra services manually on the cluster, it will take a lot of time. The automated cluster launch takes less than 10 minutes (I tried with around 100 nodes), but installing software on all of these nodes manually would take several hours. For this problem, amazon provides a solution: the user can supply custom shell scripts, and these scripts will be executed on all the nodes while the hadoop cluster is being installed. Such a script is called a bootstrap script. Here we are installing Rhipe using a bootstrap script.

For users who want to install Rhipe on AWS Hadoop MRv1, you can follow this url. Please ensure that you are using the correct AMI. An AMI (Amazon Machine Image) is just a version of the image that they provide. For those who want to install Rhipe on AWS Hadoop MRv2 (YARN), you can follow this url. This works perfectly with AWS AMI 3.2.1. You can download the github repo locally and put it in your S3, then launch the cluster by specifying the details mentioned in the installation doc.
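As a purely illustrative sketch (the bucket name, key pair name and instance details are placeholders, not the actual values from the Rhipe repository), a bootstrap script stored in S3 can be attached to an EMR cluster launched from the AWS CLI roughly like this:

# hypothetical example: launch an EMR cluster (AMI 3.2.1) with a custom bootstrap script
aws emr create-cluster \
  --name "rhipe-yarn-cluster" \
  --ami-version 3.2.1 \
  --instance-type m3.xlarge \
  --instance-count 5 \
  --ec2-attributes KeyName=my-key-pair \
  --bootstrap-actions Path=s3://my-bucket/install-rhipe.sh,Name=InstallRhipe

EMR runs the bootstrap script on every node before the cluster is handed over, which is what makes the automated Rhipe installation possible.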

For non-aws users

For users who want to install Rhipe on YARN on any hadoop cluster, you can either build Rhipe for your version of hadoop and put that jar inside the Rhipe directory, or directly try the ready-made Rhipe for YARN. All the Rhipe versions are available in a common repository, and you can download the installable from this location. Follow the steps mentioned in the shell scripts present in the given repository. This is a manual activity and you have to do it on all the nodes in your hadoop cluster.

GLens – An Environment Saviour

GLens enables Real Time Data Acquisition and Monitoring for industrial emissions, effluent and ambient pollution. Analyzers of any make or model with RS232/RS485/RS222/Modbus interfaces are seamlessly integrated in a plug-and-play model. It provides pre-built reports, alarms and alerts as per the regulatory standards for pollution monitoring.
The ad-hoc reporting module provides the capability to analyze the collected parameters at up to 2-second granularity and to produce forecasts using Holt-Winters forecast models. GLens integrates with any analyzer (make or model) in plug-and-play mode, enabling real time monitoring of the parameters instantaneously.
With an open data exchange format, any analyzer, manufacturer or industry can bring these monitored parameters onto the platform.
For more details, visit the following link: GLens

Sample program with Hadoop Counters and Distributed Cache

Counters are a very useful feature in hadoop. They help us track global events in a job, i.e. across the map and reduce phases.
When we execute a mapreduce job, we can see a lot of counters listed in the logs. Other than the default built-in counters, we can create our own custom counters, and they will be listed along with the built-in counters.
This helps us in several ways. Here I am explaining a scenario where I use a custom counter to count the number of good words and stop words in the given text files. The stop words in this program are provided at run time using the distributed cache.
This is a mapper-only job: the property job.setNumReduceTasks(0) makes it a mapper-only job.

Here I am introducing another feature in hadoop called the distributed cache.
The distributed cache distributes application-specific, read-only files efficiently to all the nodes running the application.
My requirement is to filter the stop words from the input text files. The stop word list may vary, so if I hard-code the list in my program, I have to update the code every time the list changes, which is not a good practice. Instead, I used the distributed cache: the file containing the stop words is loaded into the distributed cache, which makes it available to the mappers as well as the reducers. In this program, we don’t require any reducer.

The code is attached below. You can also get it from GitHub.
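The embedded code may not render here, so below is a minimal sketch of what such a mapper-only job can look like. It is not the exact code from the GitHub repository; the class name (Skipper, guessed from the sample jar name further below), the counter names and the argument handling are assumptions.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Skipper extends Configured implements Tool {

  // Custom counters: they appear in the job log next to the built-in counters
  public enum WordCounter { GOOD_WORDS, STOP_WORDS }

  public static class SkipMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    private final Set<String> stopWords = new HashSet<String>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
      // Files added to the distributed cache are symlinked into the task's
      // working directory under their base name, so we can read them locally.
      URI[] cacheFiles = context.getCacheFiles();
      if (cacheFiles != null) {
        for (URI uri : cacheFiles) {
          BufferedReader reader = new BufferedReader(new FileReader(new Path(uri.getPath()).getName()));
          String line;
          while ((line = reader.readLine()) != null) {
            stopWords.add(line.trim().toLowerCase());
          }
          reader.close();
        }
      }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String word : value.toString().split("\\s+")) {
        if (word.isEmpty()) {
          continue;
        }
        if (stopWords.contains(word.toLowerCase())) {
          context.getCounter(WordCounter.STOP_WORDS).increment(1); // skip stop words
        } else {
          context.getCounter(WordCounter.GOOD_WORDS).increment(1);
          context.write(new Text(word), NullWritable.get());
        }
      }
    }
  }

  @Override
  public int run(String[] args) throws Exception {
    // Expected arguments: -skip <stop-word-file-in-hdfs> <input> <output>
    Job job = Job.getInstance(getConf(), "skipper");
    job.setJarByClass(Skipper.class);
    job.setMapperClass(SkipMapper.class);
    job.setNumReduceTasks(0);                    // mapper-only job
    job.addCacheFile(new Path(args[1]).toUri()); // stop-word file goes into the distributed cache
    FileInputFormat.addInputPath(job, new Path(args[2]));
    FileOutputFormat.setOutputPath(job, new Path(args[3]));
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new Skipper(), args));
  }
}

With a sketch like this, the GOOD_WORDS and STOP_WORDS counters show up in the job log under their enum group, alongside the built-in counters.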

Create a java project with the above java classes and add the dependent java libraries (they will be present in your hadoop installation). Export the project as a runnable jar and execute it. The file containing the stop words should be present in hdfs, with one stop word per line. A sample format is given below.

is

the

am

are

with

was

were

Sample command to execute the program is given below.

hadoop jar <jar-name> -skip <stop-word-file-in-hdfs> <input-data-location> <output-location>

Eg: hadoop jar Skipper.jar -skip /user/hadoop/skip/skip.txt /user/hadoop/input /user/hadoop/output

In the job logs, you can also see the custom counters. A sample log is attached below.

[Image: sample job log listing the built-in and custom counters]

Hadoop Cluster Migrator

The Hadoop Cluster Migrator tool provides a unified interface for copying data from one cluster to another. Traditionally, the DistCp tool provided by Hadoop is used to migrate and copy data from one Hadoop cluster to another. However, DistCp works only when connectivity between the source and target clusters is well established, without any firewall rules blocking it. In production scenarios, edge nodes often isolate the clusters from each other, and DistCp can’t be used to transfer data or back up the cluster. This is where Hadoop Cluster Migrator from Knowledge Lens can be very handy.

Hadoop Cluster Migrator is a cluster-agnostic tool that supports migration across different distributions and versions of Hadoop. Currently we support the MapR, Cloudera, Hortonworks, EMC Pivotal and Apache distributions of Hadoop, with Kerberos both enabled and disabled.

This completely Java-based tool provides large-scale data transfer between clusters, with transfer rates in the range of 10 GB/s depending upon the available bandwidth. The tool is fully restartable and resumes from the point where the last transfer stopped.

For more details, refer to: Hadoop Cluster Migrator


What is Big Data? Why Big Data?

Big data is becoming a very hot topic in the industry, and every company is trying to stake a claim in providing solutions for it.
Why did big data become so popular?
Technology has become very advanced and we have reached a situation where data is able to speak. Previously we used data to find the results of events that had already happened; now we use data to predict events that are going to happen.
Consider a doctor who has learned everything through several years of education and practice; he does his consulting based on that experience. If we build a system that learns from historic data and predicts events based on its learning, it will be very helpful in a lot of areas. As the amount of useful data (data that contains details) increases, the accuracy of the prediction also increases. But once the data size exceeds certain limits, it becomes very difficult to process it using conventional data processing technologies, and this is where the problem of big data arises. Nowadays the demand for insight generation and prediction using big data is very high, because almost every system that interacts with people now gives recommendations, based on analysis of previous data, and the accuracy of those recommendations increases with the size of the data. For example, we get item recommendations from Amazon, flipkart and ebay, friend suggestions from facebook, google advertisements etc. All of these work by processing large data, which increases sales in the case of businesses and usability in the case of other systems.

To solve these big data problems, frameworks such as hadoop, storm, spark etc. evolved.
Most big data solutions are implementations of distributed storage, distributed processing, in-memory processing etc.
A lot of analytics now happens on social media data. Remember, you can’t do a Ctrl+Z on things that you have uploaded to social media.

Facebook Opensourced Presto

Facebook has opensourced its data processing engine ‘Presto’ to the world. Presto is a distributed query engine based on ANSI SQL. It is highly optimized and currently runs over more than 300 petabytes of data, which may make it one of the top big data processing systems. Presto is totally different from mapreduce: it is an in-memory data processing mechanism and is very much optimised. From the details given in the facebook announcement and the presto website, it is 10 times faster than Hive/mapreduce. Hive itself came from facebook, so presto will definitely beat hive; hive queries ultimately run as multiple mapreduce jobs, which takes more time. From my point of view, the real competition may be between Cloudera Impala and Presto. Impala’s performance on huge datasets is not yet available from any production environment because it is a budding technology from the cloudera family, whereas presto is already tested and running on huge production datasets. Another interesting fact about presto is that we can use the existing infrastructure and hadoop cluster to deploy it, because presto supports hdfs as its underlying data storage; it supports other storage systems also, so it is flexible. Leading internet companies including Airbnb and Dropbox are using Presto. Presto code and further details are available in this link.

I have deployed Presto and Impala on a small cluster of 8 nodes. I haven’t had enough time to explore presto further, but I am planning to do so in the coming days. 🙂

Big Data Trainings

Big Data has arrived!! If you are an IT professional who wishes to change your career path to Big Data and become a Big Data expert in a month, you have come to the right place.
We provide personalized Hadoop training with hands-on, real-life use cases. Our mission is to ensure that you are a Big Data expert within a month.

MyBigDataCoach provides expert professional coaching on Big Data Technologies, Data Science and Analytics.
Trainings are conducted by senior experts in Big Data Technologies.

We also provide corporate training on Big Data technologies in India.

You can register for our online trainings on various Big Data Technologies.

Happy learning!

For further queries, contact us at:

Email: bigdatacoach@gmail.com
www.mybigdatacoach.com
Facebook : https://www.facebook.com/mybigdatacoach
Mobile:+91-9645191674

Prerequisites to attend

Basic knowledge of Linux commands, core Java and writing SQL queries

Course Contents:

Day 1
 Introduction to Big Data and Hadoop (Common)
 Technology Landscape
 Why Big Data?
 Difference between Big Data and Traditional BI?
 Fundamentals about High Scalability.
 Distributed Systems and Challenges
 Key Fundamentals for Big Data
 Big Data Use Cases
 End to End production use case deployed for Hadoop
 When to use Hadoop and When not to?
Day 2
 HDFS Fundamentals
 Fundamentals behind HDFS Design
 Key Characteristics of HDFS
 HDFS Daemons
 HDFS Commands
 Anatomy of File Read and Write in HDFS
 HDFS File System Metadata
 How replication happens in Hadoop
 How is the replication strategy defined and how can the network topology be defined?
 Limitations of HDFS
 When to use HDFS and when not to?
Day 3
 Map Reduce Fundamentals
 What is Map-Reduce
 Examples of Map-Reduce Programs
 How to think in Map-Reduce
 What is feasible in Map-Reduce and What is not?
 End to End flow of Map-Reduce in Hadoop
Day 4
 YARN
 Architecture Difference between MRV1 and YARN
 Introduction to Resource Manager
 Node Manager Responsibility
 Application Manager
 Proxy Server
 Job History Server
 Running map-reduce programs in YARN
Day 5 and Day 6
 Hadoop Administration Part 1
 Hadoop Installation and Configuration
 YARN Installation and Configuration
 Hadoop Name Node
 HDFS Name Node Metadata Structure
 FSImage and Edit Logs
 Viewing Name Node Metadata and Edit Logs
 HDFS Name Node Federation
 Federation and Block Pool ID
 Tracing HDFS Blocks
 Name Node Sizing
 Memory calculations for HDFS Metadata
 Selecting the optimal Block Size
 Secondary Name Node
 Checkpoint process in details
 Hadoop Map-Reduce
 Tracing a Map-Reduce Execution from Admin View
 Logs and History Viewer
Day 7
 Hadoop Administration Part 2
 Hadoop Configurations
 High Availability of Name Node
 Configuring Hadoop Security
 NameNode Safemode and what are the conditions for namenode to be in Safemode?
 Name Node High Availability
 Distcp commands in Hadoop
 File Formats in Hadoop (RC, ORC, Sequence File, AVRO etc)
Day 8
 Hadoop Ecosystem Components
 Role of each ecosystem components
 How does it all fit together

 Hive
 Introduction
 Concepts on Meta-store
 Installation
 Configuration
 Basics of Hive
 What Hive cannot do?
 When to not use HIVE
Day 9
 PIG
 Introduction
 Installation and Configuration
 Basics of PIG
 Hands on Example
Day 10
 Oozie
 Introduction
 Installation and Configurations
 Running workflows in Oozie with HIVE, Map-Reduce, PIG, Sqoop
Day 11
 Flume
 Introduction
 Installation and Configurations
 Running flume examples with HIVE , Hbase etc
Day 12
 HUE
 Introduction
 HUE Installation and Configuration
 Using HUE
 Zookeeper
 Introduction
 Installation and Configurations
 Examples in Zookeeper
 Sqoop
 Introduction to Sqoop
 Installation and Configuration
 Examples for Sqoop
Day 13
 Monitoring
 Monitoring Hadoop process
 Hadoop Schedulers
 FIFO Scheduler
 Capacity Scheduler
 Fair Scheduler
 Difference between Fair and Capacity Schedulers
 Hands on with Scheduler Configuration
 Cluster Planning and Sizing
 Hardware Selection Consideration
 Sizing
 Operating Systems Consideration
 Kernel Tuning
 Network Topology Design
 Hadoop Filesystem Quota
 Hands on with Few of Hadoop Tuning configurations
 Hands on Sizing a 100 TB Cluster
Day 14
 Hadoop Maintenance
 Logging and Audit Trails
 File system Maintenance
 Backup and Restore
 DistCp
 Balancing
 Failure Handling
 Map-Reduce System Maintenance
 Upgrades
 Performance Benchmarking and Test
 Hadoop Cluster Monitoring
 Installation of Nagios and Ganglia
 Configuring Nagios and Ganglia
 Collecting Hadoop Metrics
 REST interface for metrics collection
 JMX JSON Servlet
 Cluster Health Monitoring
 Configuring Alerts for Clusters
 Overall Cluster Health Monitoring
 Introduction to Cloudera Manager
Day 15
 Advanced Developer for Hadoop
 Java API for HDFS Interactions
 File Read and Write to HDFS
 WebHDFS API and interacting with Hadoop using WebHDFS
 Different protocols used for interacting with HDFS
 Hadoop RPC and security around RPC
 Communication between Client and Data Node
 Hands on Examples with different file format write in HDFS
Day 16
 Hadoop Map-Reduce API
 InputFormat and Record Readers
 Splittable and Non Splittable Files
 Mappers
 Combiners
 Partitioners
 Sorters
 Reducers
 OutputFormats and Record Writers
 Implementing custom Input Formats for PST and PDF
 MapReduce Execution Framework
 Counters
 Inside MapReduce Daemons
 Failure Handling
 Speculative Execution
Day 17
 Sqoop
 Difference between Sqoop and Sqoop2
 What are the various parameters in Export
 What are the various parameters in Import
 Typical challenges with Sqoop operations
 How to tune Sqoop performance
Day 18 , 19
 MapReduce Examples and design patterns
 PIG UDF
 HIVE SerDe
 HIVE UDF, UDAF,UDTF
 Buffer time for any spill-over sessions!!
Day 20: Hadoop Design and Architecture
 Security
 Security Design for HDFS
 Kerberos Fundamentals
 Setting up KDC
 Configuring Secured Hadoop Cluster
 Setting up Multi-realm authentication for Production Deployment
 Typical product deployment challenges with respect to Hadoop Security
 Role of HttpFS proxy for corporate firewalls
 Role of Cloudera Sentry and Knox
 Common Failures and Problems
 File system related issues
 Map-Reduce related issues
 Maintenance related issues
 Monitoring related issues
Day 21
 HIVE
 Hive UDF, UDAF, UDTF
 Writing custom UDF
 SerDe and role of SerDe
 Writing SerDe
 Advanced Analytical Functions
 Real Time Query
 Difference between Stinger and Impala?
 Key Emerging Trends
 Implementing Updates and Deletes in HIVE
Day 22
 PIG
 Architecture for PIG
 Advanced PIG Join Types
 Advanced PIG Latin Commands
 PIG Macros and their Limitations
 Typical Issues with PIG
 When to use PIG and When not to?
Day 23
 Oozie
 Architecture and Fundamentals
 Installing and Configuring Oozie
 Oozie Workflows
 Coordinator Jobs
 Bundle Jobs
 Different patterns in Oozie Scheduling
 How to troubleshoot in Oozie
 How to handle different libraries in Oozie
 Hands on example with Oozie
 HUE
 Architecture and Fundamentals
 Installing and Configuring HUE
 Executing PIG, HIVE, Map-Reduce through HUE using Oozie
 Various features of HUE
 Integration of HUE users with Enterprise Identity Management systems
Day 24
 Flume
 Flume Architecture
 Complex and Multiplexing Flows in Flume
 AVRO-RPC
 Configuring and running flume agents for the various supported sources (NetCat, JMS, Exec, Thrift, AVRO)
 Configuring and running flume agents with various supported sinks (HDFS, Logger, AVRO, Hbase, FileRoll, ElasticSearch etc)
 Understanding Batch load to HDFS
 Example with Flume in real project scenarios for
 Log Analytics
 Machine data collection with SNMP sources
 Social Media Analytics
 Typical challenges with Flume operations
 Integration with HIVE and Hbase
 Implementing Custom Flume Sources and Sinks
 Flume Security with Kerberos
Day 25
 Zookeeper
 Architecture
 High Scalability with Zookeeper
 Common Recipes with Zookeeper
 Leader Election
 Distributed Transaction Management
 Node Failure Detections and Cluster Membership management
 Co-ordination Services
 Cluster Deployment recipe with Zookeeper
 Typical challenges with Zookeeper operations
 YARN
 YARN Architecture and Advanced Concepts
Day 26
 End to End POC Design
 Live Example of end to end POC which has all ecosystem components