R and Big Data

R programming is getting more attention these days. The reason, as far as I can tell, is that it can be used efficiently for big data analytics. R is a good statistical tool, and it applies well to big data analytics. Increasingly, either the system learns from data or we teach the system using data. With advanced analytics in R, it is easy to generate insights from large data sets. Many packages are now available that make R powerful enough to work on top of the latest big data technologies. Some of the libraries that I have noticed are listed below.

1) Rhipe: RHIPE (hree-pay’) is the R and Hadoop Integrated Programming Environment.
For more details Rhipe

2) Rhive : RHive is an R extension facilitating distributed computing via Apache Hive.
For more details Rhive

3) Rhbase : This R package provides basic connectivity to HBASE, using the Thrift server. R programmers can browse, read, write, and modify tables stored in HBASE.
For more details Rhbase

4) Rhdfs : This R package provides basic connectivity to the Hadoop Distributed File System. R programmers can browse, read, write, and modify files stored in HDFS.
For more details Rhdfs

5) Rmr : This R package allows an R programmer to perform statistical analysis via MapReduce on a Hadoop cluster.
For more details Rmr

6) Plyrmr : This R package enables the R user to perform common data manipulation operations, as found in popular packages such as plyr and reshape2, on very large data sets stored on Hadoop. Like rmr, it relies on Hadoop mapreduce to perform its tasks, but it provides a familiar plyr-like interface while hiding many of the mapreduce details.
For more details Plyrmr

7) Rmongo : MongoDB Database interface for R. The interface is provided via Java calls to the mongo-java-driver.
For more details Rmongo
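
Several of the packages above (rmr, plyrmr) build on the MapReduce model, so here is a rough picture of what that model does. This is a plain Python sketch that runs on a single machine, not rmr or Hadoop code; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and collect the emitted (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle(pairs):
    """Group values by key, which the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer once per key to that key's list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    # Emit (word, 1) for every word in the line.
    return [(word.lower(), 1) for word in line.split()]


def sum_reducer(key, values):
    # Add up all the counts emitted for one word.
    return sum(values)


def word_count(lines):
    return reduce_phase(shuffle(map_phase(lines, word_mapper)), sum_reducer)


counts = word_count(["R and Hadoop", "Hadoop and Hive"])
print(counts)
```

On a real cluster the map and reduce functions run in parallel on many nodes and the shuffle moves data over the network; the sketch only shows the flow of keys and values.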

Facebook Opensourced Presto

Facebook has open-sourced its data processing engine, Presto. Presto is a distributed query engine based on ANSI SQL. It is highly optimized and currently runs on more than 300 petabytes of data, which may make it one of the largest big data processing systems. Presto is totally different from MapReduce: it processes data in memory. According to the details given in the Facebook newsletter and on the Presto website, it is 10 times faster than Hive on MapReduce. Hive also came from Facebook, so in my view Presto will definitely beat Hive. Hive queries ultimately run as multiple MapReduce jobs, which takes more time.

From my point of view, the real competition may be between Cloudera Impala and Presto. Performance figures for Impala on huge datasets in production are not available yet, because it is a budding technology from the Cloudera family, whereas Presto is already tested and running in production on huge datasets.

Another interesting fact about Presto is that we can deploy it on existing infrastructure and Hadoop clusters, because Presto supports HDFS as its underlying data storage. It supports other storage systems as well, so it is flexible. Leading internet companies including Airbnb and Dropbox are using Presto. Presto code and further details are available at this link.
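
To picture why a query that runs as several separate MapReduce jobs can be slower than one that streams through memory, here is a toy Python sketch. It is only an analogy, not how Hive or Presto are actually implemented; the names run_staged, run_pipelined, keep_even, and square are made up for illustration. The staged version stores the full output of every step before starting the next one, while the pipelined version lets each row flow through all the steps.

```python
def run_staged(rows, stages):
    """Run each stage to completion and store its full output before the next starts."""
    rows_stored = 0
    data = list(rows)
    for stage in stages:
        data = list(stage(data))  # the whole result of this stage is stored
        rows_stored += len(data)
    return data, rows_stored


def run_pipelined(rows, stages):
    """Chain the stages lazily so rows stream through without stored intermediates."""
    stream = iter(rows)
    for stage in stages:
        stream = stage(stream)  # nothing is computed yet, only wired together
    return list(stream)  # rows are pulled through all stages here


def keep_even(rows):
    return (r for r in rows if r % 2 == 0)


def square(rows):
    return (r * r for r in rows)


stages = [keep_even, square]
staged_result, rows_stored = run_staged(range(10), stages)
pipelined_result = run_pipelined(range(10), stages)

print(staged_result == pipelined_result)
print(rows_stored)
```

Both versions give the same answer. The difference is that the staged run keeps a complete copy of the data after every step, which on a cluster means writing to and reading from disk between jobs.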

I have deployed Presto and Impala on a small cluster of 8 nodes. I haven’t had enough time to explore Presto further, but I plan to explore more in the coming days. 🙂