Advertisements

RJDBC java.lang.OutOfMemoryError

You might see the below error while making jdbc connections from R programs.

java.lang.OutOfMemoryError: Java heap space

If you face java heap size exceptions in RJDBC connections like above, simply increase the JAVA heap size from your R program. Sample snippet is given below.

options(java.parameters = "-Xmx8048m")
library("RJDBC")

or

options(java.parameters = "-Xmx8g")
library("RJDBC")

Hope this helps you.

Advertisements

Rhipe on AWS (YARN and MRv1)

Rhipe is an R library that runs on top of hadoop. Rhipe is using hadoop streaming concept for running R programs in hadoop. To know more about Rhipe, please check my older post. My previous post on Rhipe was the basic explanation and the installation steps for running Rhipe in cdh4(MRv1). Now yarn became popular and almost everyone are using YARN. So a lot of people asked me assistance for installing Rhipe in YARN. Rhipe works on yarn very well. Here I am just giving a pointer on how to install Rhipe on AWS (Amazon Web Services). I checked this script and it is working fine. This contains the bootstrap script and installables that installs Rhipe automatically in AWS. For those who are new to AWS, I will explain the basics of AWS EMR and bootstrap script. Amazon Web Services are providing a lot of cloud services. Among that Elastic Mapreduce(EMR) is a service that provides a hadoop cluster. This is one of the best solution for users who don’t want to maintain a data center and don’t want to take the headaches of hadoop administration. AWS is providing a list of components for installing in the hadoop cluster. Those services we can choose while installing the hadoop cluster through the web console. Examples for such components are hive, pig, impala, hue, hbase, hunk etc. But in most of the cases, user may require some extra softwares also. This extra requirement depends on user. If the user try to install the extra service manually in the cluster, it will take lot of time. The automated cluster launch will take less than 10 minutes.( I tried for around 100 nodes). But if you install the software in all of these nodes manually, it will take several hours. For this problem, amazon is providing a solution. User can provide any custom shell scripts and these scripts will be executed on all the nodes while installing the hadoop. This script is called bootstrap script. Here we are installing Rhipe using a bootstrap script. For users who want to install Rhipe on  AWS Hadoop MRv1, you can follow this url. Please ensure that you are using the correct AMI. AMI is Amazon Machine Image. This is just a version of the image that they are providing. For those users who want to install Rhipe on AWS Hadoop MRv2 (YARN), you can follow this url. This will work perfectly on AWS AMI 3.2.1. You can download the github repo in your local and put it your S3. Then launch the cluster by specifying the details mentioned in the installation doc.

For non-aws users

For those users who want to install Rhipe on yarn (Any hadoop cluster), you can either build the Rhipe for their corresponding version of hadoop and put that jar inside Rhipe directory or you can directly try using the ready made rhipe for YARN. All the Rhipe versions are available in a common repository. You can download the installable from this location. You have to follow the steps mentioned in the all the shell scripts present in the given repository. This is a manual activity and you have to do this activity on all the nodes in your hadoop cluster.

R and Big Data

Now R programming is getting more attention among people. The reason I found was that it can be used efficiently for big data analytics. R is a good statistical tool. Its applicability in big data analytics is very much. Now the system is trying to learn from data or else we are trying to teach the system using data. With advanced analytics with R programming, it is very easy to generate insights from large data. Now a lot of packages are available for R that makes it powerful and capable to work on top of latest Big data technologies. Some of the libraries that I have noticed are listed below.

1) Rhipe: RHIPE (hree-pay’) is the R and Hadoop Integrated Programming Environment.
For more details Rhipe

2) Rhive : RHive is an R extension facilitating distributed computing via Apache Hive.
For more details Rhive

3) Rhbase : This R package provides basic connectivity to HBASE, using the Thrift server. R programmers can browse, read, write, and modify tables stored in HBASE.
For more details Rhbase

4) Rhdfs : This R package provides basic connectivity to the Hadoop Distributed File System. R programmers can browse, read, write, and modify files stored in HDFS.
For more details Rhdfs

5) Rmr : This R package allows an R programmer to perform statistical analysis via MapReduce on a Hadoop cluster.
For more details Rmr

6) Plyrmr : This R package enables the R user to perform common data manipulation operations, as found in popular packages such as plyr and reshape2, on very large data sets stored on Hadoop. Like rmr, it relies on Hadoop mapreduce to perform its tasks, but it provides a familiar plyr-like interface while hiding many of the mapreduce details.
For more details Plyrmr

7) Rmongo : MongoDB Database interface for R. The interface is provided via Java calls to the mongo-java-driver.
For more details Rmongo

RStudio Installation

RStudio is an IDE for R. It will make the R programming more user friendly.

Here I am explaining the way to install RStudio in linux machines.
RStudio needs R. So before installing RStudio, we have to install R. Installation of R is explained in my previous post.
R-3.0.0 needs RStudio-0.97 or above.
We can download Rstudio server rpm from the below website http://www.rstudio.com/ide/download/
Then install it using the command

rpm –ivh <rpm-name>

Sometimes it may show dependency issues.
Install the necessary dependencies and go ahead.

After that start R-Studio server.

/etc/init.d/rstudio-server start

If the installation is done correctly it will start.
You can verify the installation using the command

sudo rstudio-server verify-installation

Then go to webbrowser (Mozilla, chrome) and type the url
http://ipaddress:8787

8787 is the default port, you can change this by editing the configuration file.
All the users in the linux except the system users(whose userid lower than 100) can use rstudio.
The credentials are the same as linux credentials.

Rhipe Installation

Rhipe was first developed by Saptarshi Guha.
Rhipe needs R and Hadoop. So first install R and hadooop. Installation of R and hadoop are well explained in my previous posts. The latest version of Rhipe as of now is Rhipe-0.73.1. and  latest available version of R is R-3.0.0. If you are using CDH4 (Cloudera distribution of hadoop) , use Rhipe-0.73 or later versions, because older versions may not work with CDH4.
Rhipe is an R and Hadoop integrated programming environment. Rhipe integrates R and Hadoop. Rhipe is very good for statistical and analytical calculations of very large data. Because here R is integrated with hadoop, so it will process in distributed mode, ie  mapreduce.
Futher explainations of Rhipe are available in http://www.datadr.org/

Prerequisites

Hadoop, R, protocol buffers and rJava should be installed before installing Rhipe.
We are installing Rhipe in a hadoop cluster. So the job submitted may execute in any of the tasktracker nodes. So we have to install R and Rhipe in all the tasktracker nodes, otherwise you will face an exception “Cannot find R” or something similar to that.

Installing Protocol Buffer

Download the protocol buffer 2.4.1 from the below link

http://protobuf.googlecode.com/files/protobuf-2.4.1.tar.gz

tar -xzvf protobuf-2.4.1.tar.gz

cd protobuf-2.4.1

chmod -R 755 protobuf-2.4.1

./configure

make

make install

Set the environment variable PKG_CONFIG_PATH

nano /etc/bashrc

export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig

save and exit

Then executed the following commands to check the installation

pkg-config --modversion protobuf

This will show the version number 2.4.1
Then execute

pkg-config --libs protobuf

This will display the following things

-pthread -L/usr/local/lib -lprotobuf -lz –lpthread

If these two are working fine, This means that the protobuf is properly installed.

Set the environment variables for hadoop

For example

nano /etc/bashrc

export HADOOP_HOME=/usr/lib/hadoop

export HADOOP_BIN=/usr/lib/hadoop/bin

export HADOOP_CONF_DIR=/etc/hadoop/conf

save and exit

Then


cd /etc/ld.so.conf.d/

nano Protobuf-x86.conf

/usr/local/lib   # add this value as the content of Protobuf-x86.conf

Save and exit

/sbin/ldconfig

Installing rJava

Download the rJava tarball from the below link.

http://cran.r-project.org/web/packages/rJava/index.html

The latest version of rJava available as of now is rJava_0.9-4.tar.gz

install rJava using the following command

R CMD INSTALL rJava_0.9-4.tar.gz

Installing Rhipe

Rhipe can be downloaded from the following link
https://github.com/saptarshiguha/RHIPE/blob/master/code/Rhipe_0.73.1.tar.gz

R CMD INSTALL Rhipe_0.73.1.tar.gz

This will install Rhipe

After this type R in the terminal

You will enter into R terminal

Then type

library(Rhipe)

#This will display

------------------------------------------------

| Please call rhinit() else RHIPE will not run |

————————————————

rhinit()

#This will display

Rhipe: Detected CDH4 jar files, using RhipeCDH4.jar
Initializing Rhipe v0.73
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/client-0.20/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/client/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Initializing mapfile caches

Now you can execute you Rhipe scripts.

R Installation in Linux Platforms

R is a free software that is for statistical and analytical computations. It is a very good tool for graphical computations also.
R is used in a wide range of areas. R allows us to carryout statistical analysis in an interactive model.
To use R, first we need to install R program in our computer. R can be installed windows, Linux, Mac OS etc.

In Linux platforms, we usually install by compiling the tarball.
The latest stable version of R as of now is R-3.0.0
Installation of R in Linux platforms is explained below.

Installation Using rpm

If you are using Redhat or CentOS distribution of linux, then you can either install using tarballs or rpm.
But latest versions may not be available as rpm.
Installation using rpm is simple.
Just download the rpm files with dependencies and install each using the command

rpm –ivh <rpm-name>

Installation Using yum

If you are having Internet connection,
Then installation is very simple.
Just do the following commands.

yum install R-core R-2*

Installing R using tarball

R is available as tarball which is compatible with all  linux platforms.
Latest versions of R are available as tarball.

The installation steps are given below.
Get the latest R tar file  for Linux from http://ftp.iitm.ac.in/cran/

Extract the tarball

tar   –xzvf   R-xxx.tar.gz

Change the permission of the extracted file

chmod –R 755 R-xxx

then go inside the extracted R directory and do the following steps

 cd   R.xxx
./configure  --enable-R-shlib

The above step may fail because of the lack of dependent libraries in your OS.
If it is failing, install the dependent libraries and do the above step again.
If this is done successfully, do the following steps.

make

make install

 

After this set the R_HOME and PATH in /etc/bashrc (Redhat or CentOS) or ~/.bashrc (if no root privilege) or /etc/bash.bashrc (Ubuntu)

export R_HOME= <path to R installation>
export PATH=$PATH:$R_HOME/bin

 

Then do the following command

source /etc/bashrc  (For Redhat or CentOS)

Or

source /etc/bash.bashrc    (If no root privilege)

Or

source ~/.bashrc   ( For Ubuntu)

 

Check R installation

Type R in your terminal

If R prompt is coming, your R installation is successful.

You can quit from R by using the command q()