Rhipe on AWS (YARN and MRv1)

Rhipe is an R library that runs on top of Hadoop. Rhipe uses the Hadoop streaming concept to run R programs on Hadoop. To know more about Rhipe, please check my older post. That post covered the basics and the installation steps for running Rhipe on CDH4 (MRv1). YARN has since become popular and almost everyone is using it, so a lot of people have asked me for help installing Rhipe on YARN. Rhipe works very well on YARN. Here I am just giving a pointer on how to install Rhipe on AWS (Amazon Web Services). I checked this script and it works fine. It contains the bootstrap script and installables that install Rhipe automatically on AWS.

For those who are new to AWS, I will explain the basics of AWS EMR and bootstrap scripts. Amazon Web Services provides a lot of cloud services. Among them, Elastic MapReduce (EMR) is a service that provides a Hadoop cluster. It is one of the best solutions for users who don’t want to maintain a data center or take on the headaches of Hadoop administration.

AWS provides a list of components that can be installed in the Hadoop cluster. We can choose these while launching the cluster through the web console. Examples of such components are Hive, Pig, Impala, Hue, HBase, Hunk etc. But in most cases, the user may also require some extra software, depending on their needs. Installing that extra software manually on the cluster takes a lot of time. The automated cluster launch takes less than 10 minutes (I tried with around 100 nodes), but installing the software on all of those nodes manually would take several hours. Amazon provides a solution to this problem: the user can supply custom shell scripts, and these scripts are executed on all the nodes while Hadoop is being installed. Such a script is called a bootstrap script.

Here we are installing Rhipe using a bootstrap script. To install Rhipe on AWS Hadoop MRv1, you can follow this url. Please ensure that you are using the correct AMI. AMI stands for Amazon Machine Image; here it is just the version of the image that Amazon provides. To install Rhipe on AWS Hadoop MRv2 (YARN), you can follow this url. This works perfectly on AWS AMI 3.2.1. You can download the GitHub repo to your local machine and put it in your S3 bucket. Then launch the cluster by specifying the details mentioned in the installation doc.
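For readers who prefer to launch the cluster from code rather than the web console, here is a minimal sketch of how the bootstrap script could be described in a launch request. It only builds the dictionary of arguments in the shape that boto3's EMR `run_job_flow` call accepts; it does not contact AWS. The bucket name, script key, cluster name and instance types are assumptions for illustration, not values from the Rhipe repo, and a real launch needs additional settings (such as IAM roles) that are omitted here.

```python
def build_emr_request(bucket, script_key, ami_version="3.2.1", node_count=3):
    """Build keyword arguments for an EMR launch request (nothing is sent to AWS).

    bucket and script_key are hypothetical: they point at wherever you
    uploaded the bootstrap script in your own S3 bucket.
    """
    return {
        "Name": "rhipe-cluster",  # hypothetical cluster name
        "AmiVersion": ami_version,  # the post reports success with AMI 3.2.1
        "Instances": {
            "MasterInstanceType": "m3.xlarge",  # assumed instance type
            "SlaveInstanceType": "m3.xlarge",   # assumed instance type
            "InstanceCount": node_count,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "BootstrapActions": [
            {
                "Name": "install-rhipe",
                "ScriptBootstrapAction": {
                    # The script at this path is run on every node at launch.
                    "Path": f"s3://{bucket}/{script_key}",
                    "Args": [],
                },
            }
        ],
    }


if __name__ == "__main__":
    request = build_emr_request("my-bucket", "rhipe/bootstrap.sh")
    print(request["BootstrapActions"][0]["ScriptBootstrapAction"]["Path"])
```

If you do use boto3, a dictionary like this would be passed as keyword arguments to the EMR client's `run_job_flow`, after adding whatever the installation doc and your account setup require.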

For non-AWS users

To install Rhipe on YARN on any Hadoop cluster, you can either build Rhipe for your version of Hadoop and put that jar inside the Rhipe directory, or directly try the ready-made Rhipe for YARN. All the Rhipe versions are available in a common repository. You can download the installable from this location. You have to follow the steps in all the shell scripts present in the given repository. This is a manual activity, and you have to do it on all the nodes in your Hadoop cluster.
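Since the same steps have to be repeated on every node, a small helper can at least generate the commands for you. The following is only a sketch under assumptions: the host names, the `hadoop` user and the script path `/tmp/install_rhipe.sh` are hypothetical, and it presumes you have already copied the install script to each node and have SSH access. It builds the commands and prints them rather than running anything.

```python
import shlex


def build_ssh_commands(hosts, script_path, user="hadoop"):
    """Return one ssh command (as an argument list) per host.

    Each command runs the given script on the remote node with bash.
    """
    commands = []
    for host in hosts:
        # Quote the path so spaces or special characters survive the remote shell.
        remote = "bash " + shlex.quote(script_path)
        commands.append(["ssh", f"{user}@{host}", remote])
    return commands


if __name__ == "__main__":
    # Hypothetical node names; replace with the nodes of your own cluster.
    nodes = ["node1.example.com", "node2.example.com"]
    for cmd in build_ssh_commands(nodes, "/tmp/install_rhipe.sh"):
        print(" ".join(cmd))
```

Each argument list can be handed to `subprocess.run` once you are happy with the printed commands.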

Hadoop Versions

Apache Hadoop Versions

Hadoop 2.0.3-alpha (released on 14 February, 2013) 2.X.X – current alpha version
Hadoop 2.0.2-alpha (released on 9 October, 2012)
Hadoop 2.0.1-alpha (released on 26 July, 2012)
Hadoop 2.0.0-alpha (released on 23 May, 2012)

Hadoop 1.1.2 (released on 15 February, 2013) 1.1.X – current beta version
Hadoop 1.1.1 (released on 1 December, 2012)
Hadoop 1.1.0 (released on 13 October, 2012)
Hadoop 1.0.4 (released on 12 October, 2012) 1.0.X – current stable version
Hadoop 1.0.3 (released on 16 May, 2012)
Hadoop 1.0.2 (released on 3 Apr, 2012)
Hadoop 1.0.1 (released on 10 Mar, 2012)
Hadoop 1.0.0 (released on 27 December, 2011)

Hadoop 0.23.6 (released on 7 February, 2013) 0.23.X – similar to 2.X.X but missing NN HA
Hadoop 0.23.5 (released on 28 November, 2012)
Hadoop 0.23.4 (released on 15 October, 2012)
Hadoop 0.23.3 (released on 17 September, 2012)
Hadoop 0.23.1 (released on 27 Feb, 2012)
Hadoop 0.22.0 (released on 10 December, 2011) 0.22.X – does not include security
Hadoop 0.23.0 (released on 11 Nov, 2011)
Hadoop (released on 17 Oct, 2011)
Hadoop (released on 5 Sep, 2011)
Hadoop (released on 11 May, 2011) 0.20.203.X – old legacy stable version
Hadoop 0.21.0 (released on 23 August, 2010)
Hadoop 0.20.2 (released on 26 February, 2010) 0.20.X – old legacy version
Hadoop 0.20.1 (released on 14 September, 2009)
Hadoop 0.19.2 (released on 23 July, 2009)
Hadoop 0.20.0 (released on 22 April, 2009)
Hadoop 0.19.1 (released on 24 February, 2009)
Hadoop 0.18.3 (released on 29 January, 2009)
Hadoop 0.19.0 (released on 21 November, 2008)
Hadoop 0.18.2 (released on 3 November, 2008)
Hadoop 0.18.1 (released on 17 September, 2008)
Hadoop 0.18.0 (released on 22 August, 2008)
Hadoop 0.17.2 (released on 19 August, 2008)
Hadoop 0.17.1 (released on 23 June, 2008)
Hadoop 0.17.0 (released on 20 May, 2008)
Hadoop 0.16.4 (released on 5 May, 2008)
Hadoop 0.16.3 (released on 16 April, 2008)
Hadoop 0.16.2 (released on 2 April, 2008)
Hadoop 0.16.1 (released on 13 March, 2008)
Hadoop 0.16.0 (released on 7 February, 2008)
Hadoop 0.15.3 (released on 18 January, 2008)
Hadoop 0.15.2 (released on 2 January, 2008)
Hadoop 0.15.1 (released on 27 November, 2007)
Hadoop 0.14.4 (released on 26 November, 2007)
Hadoop 0.15.0 (released on 29 October 2007)
Hadoop 0.14.3 (released on 19 October, 2007)
Hadoop 0.14.1 (released on 4 September, 2007)

Hadoop Distributions

Below are the companies offering commercial implementations of and/or providing support for Apache Hadoop, which is the base for all of them.

  • Cloudera offers CDH (Cloudera’s Distribution including Apache Hadoop) and Cloudera Enterprise.
  • Hortonworks (formed by Yahoo and Benchmark Capital), whose focus is on making Hadoop more robust and easier to install, manage and use for enterprise users. Hortonworks provides Hortonworks Data Platform (HDP).
  • MapR Technologies offers a distributed filesystem and MapReduce engine, the MapR Distribution for Apache Hadoop.
  • Oracle announced the Big Data Appliance, which integrates Cloudera’s Distribution Including Apache Hadoop (CDH).
  • IBM offers InfoSphere BigInsights based on Hadoop in both a basic and enterprise edition.
  • Greenplum, A Division of EMC, offers Hadoop in Community and Enterprise editions.
  • Intel – the Intel Distribution for Apache Hadoop includes the Intel Manager for Apache Hadoop for managing a cluster.
  • Amazon Web Services – Amazon offers a version of Apache Hadoop on their EC2 infrastructure, sold as Amazon Elastic MapReduce.
  • VMware – initiated an open source project and product to enable easy and efficient deployment and use of Hadoop on virtual infrastructure.
  • Bigtop – project for the development of packaging and tests of the Apache Hadoop ecosystem.
  • DataStax – DataStax provides a product of Hadoop which fully integrates Apache Hadoop with Apache Cassandra and Apache Solr in its DataStax Enterprise platform.
  • Cascading – A popular feature-rich API for defining and executing complex and fault tolerant data processing workflows on an Apache Hadoop cluster.
  • Mahout – Apache project using Hadoop to build scalable machine learning algorithms like canopy clustering, k-means and many more.
  • Cloudspace – uses Apache Hadoop to scale client and internal projects on Amazon’s EC2 and bare metal architectures.
  • Datameer – Datameer Analytics Solution (DAS) is a Hadoop-based solution for big data analytics that includes data source integration, storage, an analytics engine and visualization.
  • Data Mine Lab – Developing solutions based on Hadoop, Mahout, HBase and Amazon Web Services.
  • BigDataEdge (Infosys) – An Insight creation product which contains hundreds of components to get accurate insights with no pains.
  • Debian – A Debian package of Apache Hadoop is available.
  • HStreaming – offers real-time stream processing and continuous advanced analytics built into Hadoop, available as free community edition, enterprise edition, and cloud service.
  • Impetus
  • Karmasphere – Distributes Karmasphere Studio for Hadoop, which allows cross-version development and management of Apache Hadoop jobs.
  • Nutch – Apache Nutch, flexible web search engine software.
  • NGDATA – Makes available Lily Open Source that builds upon Hadoop, HBase and SOLR. Distributes Lily Enterprise.
  • Pentaho – Pentaho provides a complete, end-to-end open-source BI and offers an easy-to-use, graphical ETL tool that is integrated with Apache Hadoop for managing data and coordinating Hadoop related tasks in the broader context of ETL and Business Intelligence workflow.
  • Pervasive Software – Provides Pervasive DataRush, a parallel dataflow framework which improves performance of Apache Hadoop and MapReduce jobs by exploiting fine-grained parallelism on multicore servers.
  • Platform Computing – Provides an Enterprise Class MapReduce solution for Big Data Analytics with high scalability and fault tolerance. Platform MapReduce provides unique scheduling capabilities and its architecture is based on almost two decades of distributed computing research and development.
  • Sematext International – Provides consulting services around Apache Hadoop and Apache HBase, along with large-scale search using Apache Lucene, Apache Solr, and Elastic Search.
  • Talend – Talend Platform for Big Data includes support and management tools for all the major Apache Hadoop distributions. Talend Open Studio for Big Data is an Apache License Eclipse IDE, which provides a set of graphical components for HDFS, HBase, Pig, Sqoop and Hive.
  • Think Big Analytics – Offers expert consulting services specializing in Apache Hadoop, MapReduce and related data processing architectures.
  • Tresata – Financial Industry’s first software platform architected from the ground up on Hadoop. Data storage, processing, analytics and visualization all done on Hadoop.
  • WANdisco is a committed member & sponsor of the Apache Software community and has active committers on several projects including Apache Hadoop.