CDH cluster installation failing in “distributing” stage- Failure due to stall on seeded torrent

I faced this issue while distributing the downloaded packages in cloudera manager.

The solution that worked for me is to add the IP Address – Hostname mapping in all the /etc/hosts files of all the cloudera manager server and agents

/etc/hosts   cdhdatanode1

ERROR Failed to collect NTP metrics – Cloudera Manager Agent

If you are facing an error like “Failed to collect NTP metrics”. The following solution might help you. This is because of the lack of ntp server in the server. The below solution will work for CentOS/RHEL systems. NTP will sync the system time with the network time.

yum install ntp

systemctl enable ntpd

systemctl restart ntpd

Undeleting and purging KeyTrustee Key Provider methods via the REST interface

HDFS Data encryption is an excellent feature that came recently. With this we can encrypt the data in hdfs. We can create multiple encryption zones with different encryption keys. In this way, we can secure the data in hdfs properly. For more details, you can visit these websites. Reference1, Reference2

I am using a cluster installed with CDH. I created some encryption keys and zones.
The command I used for creating a key is given below.

# As the normal user, create a new encryption key
hadoop key create amalKey

# As the super user, create a new empty directory and make it an encryption zone
hadoop fs -mkdir /user/amal
hdfs crypto -createZone -keyName amalKey -path /user/amal

# chown it to the normal user
hadoop fs -chown amal:hadoop /user/amal

# As the normal user, put a file in, read it out
hadoop fs -put test.txt /user/amal/
hadoop fs -cat /user/amal/test.tx

After some days, I deleted the encryption zone and I deleted the encryption key also.
The command I used for deleting the encryption key is given below

hadoop key delete <key-name>

After the deletion, I tried creating the key with the same name. But I got an exception that the key is still present in the disabled state. When I list the keys, I am not able to see the key. The exception that I got was given below.

amalKey has not been created. HTTP status [500], exception [com.cloudera.keytrustee.TrusteeKeyProvider$DuplicateKeyException], message [Key with name "amalKey" already exists in "com.cloudera.keytrustee.TrusteeKeyProvider@6d88562. Key exists but has been disabled. Use undelete to enable.] HTTP status [500], exception [com.cloudera.keytrustee.TrusteeKeyProvider$DuplicateKeyException], message [Key with name "amalKey" already exists in "com.cloudera.keytrustee.TrusteeKeyProvider@6d88562. Key exists but has been disabled. Use undelete to enable.]
at org.apache.hadoop.util.HttpExceptionUtils.validateResponse(
at org.apache.hadoop.crypto.key.kms.KMSClientProvider.createKeyInternal(
at org.apache.hadoop.crypto.key.kms.KMSClientProvider.createKey(
at org.apache.hadoop.crypto.key.KeyShell$CreateCommand.execute(
at org.apache.hadoop.crypto.key.KeyShell.main(

In the error logs, it says to use purge option to permanently delete the key and undelete to recover the deleted key. But I was not able to find these options with hadoop key command. I googled it and I couldn’t figure out this issue. Finally I got the guidance from one guy from cloudera to execute the purge & undelete commands through rest api of keytrustee and he gave a nice explanation for my issue. I am briefly putting the solution for this exception below.

The delete operation on the Trustee key provider is a “soft delete”, meaning that is possible to “undelete” the key. It is also possible to “purge” the key to delete it permanently. Because these operations are not part of the standard Hadoop key provider API, they are not currently exposed through Hadoop KeyShell (hadoop key). However, you can call these operations directly via the Trustee key provider REST API.

See the examples below.

Use KeyShell to list existing keys

$ ./bin/hadoop key list -provider kms://http@localhost:16000/kms
Listing keys for KeyProvider: KMSClientProvider[http://localhost:16000/kms/v1/]

Use KeyShell to delete an existing key

$ ./bin/hadoop key delete amal-testkey-1 -provider kms://http@localhost:16000/kms
Deleting key: ajy-testkey-1 from KeyProvider: KMSClientProvider[http://localhost:16000/kms/v1/]
amal-testkey-1 has been successfully deleted.
KMSClientProvider[http://localhost:16000/kms/v1/] has been updated.

Use KeyShell to verify the key was deleted

$ ./bin/hadoop key list -provider kms://http@localhost:16000/kms
Listing keys for KeyProvider: KMSClientProvider[http://localhost:16000/kms/v1/]

Use the KeyTrustee key provider REST API to undelete the deleted key

$ curl -L -d "trusteeOp=undelete" "http://localhost:16000/kms/v1/trustee/key/amal-testkey-1?"

Use KeyShell to verify the key was restored

$ ./bin/hadoop key list -provider kms://http@localhost:16000/kms
Listing keys for KeyProvider: KMSClientProvider[http://localhost:16000/kms/v1/]

Use the KeyTrustee key provider REST API to purge the restored key

$ curl L -d "trusteeOp=purge" "http://localhost:16000/kms/v1/trustee/key/amal-testkey-1?"

Use KeyShell to verify the key was deleted

$ ./bin/hadoop key list -provider kms://http@localhost:16000/kms
Listing keys for KeyProvider: KMSClientProvider[http://localhost:16000/kms/v1/]

Use the KeyTrustee key provider REST API to attempt to undelete the purged key

$ curl -L -d "trusteeOp=undelete" "http://localhost:16000/kms/v1/trustee/key/amal-testkey-1?"

"RemoteException" : {
"message" : "Key with name amal-testkey-1 not found in com.cloudera.keytrustee.TrusteeKeyProvider@6d88562",
"exception" : "IOException",
"javaClassName" : ""

Configure ACLs for KeyTrustee undelete, purge and migrate operations

ACLs for the KeyTrustee specific undelete, purge and migrate operations are configured in kts-acls.xml. Place this file in the same location as your kms-acls.xml file. See example below.

          ACL for undelete-key operations.
         ACL for purge-key operations.
      ACL for purge-key operations.

Note: In kerberized environments, the requests will be a little different. It will be in the following format

Eg :
curl -L --negotiate -u [username]  -d "trusteeOp=undelete" "http://localhost:16000/kms/v1/trustee/key/amal-testkey-1?{username}&trusteeOp=undelete"

Hadoop Distributions

Below are the companies offering commercial implementations and/or providing support for Apache Hadoop, which is the base for all the below.

  • Cloudera offers CDH (Cloudera’s Distribution including Apache Hadoop) and Cloudera Enterprise.
  • Hortonworks (formed by Yahoo and Benchmark Capital), whose focus is on making Hadoop more robust and easier to install, manage and use for enterprise users. Hortonworks provides Hortonworks Data Platform (HDP).
  • MapR Technologies offers distributed filesystem and MapReduce engine, the MapR Distribution for Apache Hadoop.
  • Oracle announced the Big Data Appliance, which integrates Cloudera’s Distribution Including Apache Hadoop (CDH).
  • IBM offers InfoSphere BigInsights based on Hadoop in both a basic and enterprise edition.
  • Greenplum, A Division of EMC, offers Hadoop in Community and Enterprise editions.
  • Intel – the Intel Distribution for Apache Hadoop is the product includes the Intel Manager for Apache Hadoop for managing a cluster.
  • Amazon Web Services – Amazon offers a version of Apache Hadoop on their EC2 infrastructure, sold as Amazon Elastic MapReduce.
  • VMware – Initiate Open Source project and product to enable easily and efficiently deploy and use Hadoop on virtual infrastructure.
  • Bigtop – project for the development of packaging and tests of the Apache Hadoop ecosystem.
  • DataStax – DataStax provides a product of Hadoop which fully integrates Apache Hadoop with Apache Cassandra and Apache Solr in its DataStax Enterprise platform.
  • Cascading – A popular feature-rich API for defining and executing complex and fault tolerant data processingworkflows on a Apache Hadoop cluster.
  • Mahout – Apache project using Hadoop to build scalable machine learning algorithms like canopy clustering, k-means and many more.
  • Cloudspace – uses Apache Hadoop to scale client and internal projects on Amazon’s EC2 and bare metal architectures.
  • Datameer – Datameer Analytics Solution (DAS) is a Hadoop-based solution for big data analytics that includes data source integration, storage, an analytics engine and visualization.
  • Data Mine Lab – Developing solutions based on Hadoop, Mahout, HBase and Amazon Web Services.
  • BigDataEdge (Infosys) – An Insight creation product which contains hundreds of components to get accurate insights with no pains.
  • Debian – A Debian package of Apache Hadoop is available.
  • HStreaming – offers real-time stream processing and continuous advanced analytics built into Hadoop, available as free community edition, enterprise edition, and cloud service.
  • Impetus
  • Karmasphere – Distributes Karmasphere Studio for Hadoop, which allows cross-version development and management of Apache Hadoop jobs.
  • Nutch – Apache Nutch, flexible web search engine software.
  • NGDATA – Makes available Lily Open Source that builds upon Hadoop, HBase and SOLR. Distributes Lily Enterprise.
  • Pentaho – Pentaho provides a complete, end-to-end open-source BI and offers an easy-to-use, graphical ETL tool that is integrated with Apache Hadoop for managing data and coordinating Hadoop related tasks in the broader context of ETL and Business Intelligence workflow.
  • Pervasive Software – Provides Pervasive DataRush, a parallel dataflow framework which improvesperformance of Apache Hadoop and MapReduce jobs by exploiting fine-grained parallelism on multicore servers.
  • Platform Computing – Provides an Enterprise Class MapReduce solution for Big Data Analytics with high scalability and fault tolerance. Platform MapReduce provides unique scheduling capabilities and its architecture is based on almost two decades of distributed computing research and development.
  • Sematext International – Provides consulting services around Apache Hadoop and Apache HBase, along with large-scale search using Apache Lucene, Apache Solr, and Elastic Search.
  • Talend – Talend Platform for Big Data includes support and management tools for all the major Apache Hadoop distributions. Talend Open Studio for Big Data is an Apache License Eclipse IDE, which provides a set of graphical components for HDFS, HBase, Pig, Sqoop and Hive.
  • Think Big Analytics – Offers expert consulting services specializing in Apache Hadoop, MapReduce and relateddata processing architectures.
  • Tresata – Financial Industry’s first software platform architected from the ground up on Hadoop. Data storage, processing, analytics and visualization all done on Hadoop.
  • WANdisco is a committed member & sponsor of the Apache Software community and has active committers on several projects including Apache Hadoop.