## Basic statistics using Python

Python ships with a built-in `statistics` module (part of the standard library since Python 3.4) that makes common statistical calculations straightforward.

The most commonly used functions are described below.

### Arithmetic Mean

The arithmetic mean is the average of a group of values. It is defined as:

`Mean = Sum of group of values / Total number of values in the group`

Mean vs Average: What’s the Difference?

Answer: In everyday usage there is no difference. "Average" normally refers to the arithmetic mean.

Suppose we have a list of values as shown below.

`values = [1,2,3,4,5,6,7,8]`

To calculate the mean without using any built-in function, we can use the following snippet:

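```
values = [1, 2, 3, 4, 5, 6, 7, 8]

# Accumulate the total by hand (named `total` so the built-in sum() is not shadowed)
total = 0
for value in values:
    total += value

mean = total / len(values)
print("Sum -->:", total)
print("Total Count-->:", len(values))
print("Arithmetic Mean-->:", mean)
```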

The program above involves several steps. Instead of writing this logic by hand, we can calculate the mean with `statistics.mean()`:

```
import statistics

values = [1, 2, 3, 4, 5, 6, 7, 8]
print("Arithmetic Mean--> ", statistics.mean(values))
```
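As a quick sanity check, the hand-written loop and `statistics.mean()` can be compared side by side. This is only an illustrative sketch: the helper name `manual_mean` is hypothetical and is not part of the `statistics` module.

```python
import statistics


def manual_mean(values):
    """Sum the values by hand and divide by how many there are."""
    total = 0
    for value in values:
        total += value
    return total / len(values)


values = [1, 2, 3, 4, 5, 6, 7, 8]

by_hand = manual_mean(values)
built_in = statistics.mean(values)

print("By hand  -->", by_hand)    # 4.5
print("Built-in -->", built_in)   # 4.5
```

One practical difference: for an empty list the hand-written version fails with `ZeroDivisionError`, while `statistics.mean()` raises `statistics.StatisticsError`.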

### Mode

The mode is the most frequently occurring value in a data set. It can be calculated with the `statistics.mode()` function:

```
import statistics

values = [1, 2, 2, 2, 2, 2, 2, 1, 2, 3, 4, 5, 2, 3, 4, 5, 6, 66, 6, 6, 6, 6]
print(statistics.mode(values))
```
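To see why a particular value is reported as the mode, it can help to look at the frequency counts directly. The sketch below uses `collections.Counter` for that, and assumes Python 3.8 or later for `statistics.multimode()`, which returns every value that ties for the highest count. The `tied` list is made-up illustration data.

```python
import statistics
from collections import Counter

values = [1, 2, 2, 2, 2, 2, 2, 1, 2, 3, 4, 5, 2, 3, 4, 5, 6, 66, 6, 6, 6, 6]

# Count how often each value occurs, most frequent first.
counts = Counter(values)
print(counts.most_common(2))       # [(2, 8), (6, 5)]
print(statistics.mode(values))     # 2

# Illustrative data in which two values are equally frequent.
tied = [1, 1, 2, 2, 3]
print(statistics.multimode(tied))  # [1, 2]
```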

### Median

The median is the middle value of a numerical data set. It is found by ordering the data from lowest to highest and picking the value in the exact middle. If the number of values is odd, the median is the single middle value of the ordered list. If the number of values is even, the median is the mean of the two middle values.

It can be calculated with the `statistics.median()` function:

```
import statistics

values = [21, 1, 2, 3, 4, 5, 6, 7, 8, 24, 29, 50]
print("Arithmetic Median--> ", statistics.median(values))
```
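The odd and even cases described above can be written out by hand to see what `statistics.median()` is doing. This is a minimal sketch; `manual_median` is a hypothetical helper name, and the 11-value list is simply the example data with its last element dropped.

```python
import statistics


def manual_median(values):
    """Sort the data, then return the middle value (odd count)
    or the mean of the two middle values (even count)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2


even_values = [21, 1, 2, 3, 4, 5, 6, 7, 8, 24, 29, 50]  # 12 values
odd_values = even_values[:-1]                           # 11 values

print(manual_median(even_values), statistics.median(even_values))  # 6.5 6.5
print(manual_median(odd_values), statistics.median(odd_values))    # 6 6
```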

## RStudio Installation

RStudio is an IDE for R that makes R programming more user-friendly.

This post explains how to install RStudio Server on Linux machines.
RStudio needs R, so R must be installed first. Installation of R is explained in my previous post.
R-3.0.0 needs RStudio-0.97 or above.
Download the RStudio Server rpm package, then install it using the command

```
rpm -ivh <rpm-name>
```

If the command reports dependency issues, install the missing dependencies and run it again.

After that, start the RStudio server:

```
/etc/init.d/rstudio-server start
```

If the installation was done correctly, the server will start.
You can verify the installation using the command

```
sudo rstudio-server verify-installation
```

Then open a web browser (Mozilla, Chrome) and enter the server URL.

8787 is the default port; you can change it by editing the configuration file.
All Linux users except system users (those with a user ID lower than 100) can use RStudio.
The credentials are the same as the Linux credentials.

## Rhipe Installation

Rhipe was first developed by Saptarshi Guha.
Rhipe needs R and Hadoop, so install R and Hadoop first. Installation of R and Hadoop is explained in my previous posts. The latest version of Rhipe as of now is Rhipe-0.73.1, and the latest available version of R is R-3.0.0. If you are using CDH4 (Cloudera's distribution of Hadoop), use Rhipe-0.73 or later, because older versions may not work with CDH4.
Rhipe is an integrated programming environment for R and Hadoop. It is well suited to statistical and analytical calculations on very large data sets: because R is integrated with Hadoop, the work runs in distributed mode, i.e. as MapReduce jobs.
Further explanations of Rhipe are available at http://www.datadr.org/

### Prerequisites

Hadoop, R, Protocol Buffers and rJava should be installed before installing Rhipe.
We are installing Rhipe in a Hadoop cluster, so a submitted job may execute on any of the tasktracker nodes. R and Rhipe therefore have to be installed on all the tasktracker nodes; otherwise you will face an exception such as “Cannot find R” or something similar.

### Installing Protocol Buffers

```
tar -xzvf protobuf-2.4.1.tar.gz

# Fix permissions before entering the extracted directory
chmod -R 755 protobuf-2.4.1

cd protobuf-2.4.1

./configure

make

make install
```

Set the environment variable PKG_CONFIG_PATH by adding the export line below to /etc/bashrc:

```
nano /etc/bashrc

export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
```

Save and exit.

Then execute the following commands to check the installation:

```
pkg-config --modversion protobuf
```

This will show the version number 2.4.1.
Then execute

```
pkg-config --libs protobuf
```

This will display the following:

`-pthread -L/usr/local/lib -lprotobuf -lz -lpthread`

If both commands work, protobuf is properly installed.
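If the same check has to be repeated on several nodes, the two `pkg-config` calls can be wrapped in a small script. The following is a rough sketch, not part of the original setup: `pkg_config_value` is a hypothetical helper name, and the script assumes Python 3.7 or later is available on the machine.

```python
import shutil
import subprocess


def pkg_config_value(flag, package="protobuf"):
    """Return the output of `pkg-config <flag> <package>`,
    or None if pkg-config or the package cannot be found."""
    if shutil.which("pkg-config") is None:
        return None  # pkg-config itself is not installed
    result = subprocess.run(
        ["pkg-config", flag, package],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None  # package not visible on PKG_CONFIG_PATH
    return result.stdout.strip()


version = pkg_config_value("--modversion")
libs = pkg_config_value("--libs")

print("protobuf version:", version)
print("linker flags:", libs)
```

A value of `None` for either line suggests that `PKG_CONFIG_PATH` is not set in the current shell or that the protobuf installation did not complete.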

### Set the environment variables for Hadoop

For example:

```
nano /etc/bashrc
```

Save and exit.

Then register /usr/local/lib with the dynamic linker:

```
cd /etc/ld.so.conf.d/

nano Protobuf-x86.conf

/usr/local/lib   # add this value as the content of Protobuf-x86.conf

```

Save and exit, then run `ldconfig` to refresh the linker cache:

```
/sbin/ldconfig
```

### Installing rJava

Download the rJava source package from CRAN:

http://cran.r-project.org/web/packages/rJava/index.html

Install rJava using the following command:

```
R CMD INSTALL rJava_0.9-4.tar.gz
```

### Installing Rhipe

Download the Rhipe source package:

https://github.com/saptarshiguha/RHIPE/blob/master/code/Rhipe_0.73.1.tar.gz

Then install it:

```
R CMD INSTALL Rhipe_0.73.1.tar.gz
```

This will install Rhipe.

After this, type `R` in the terminal to enter the R console, then type

```
library(Rhipe)
```

This will display

```
------------------------------------------------
| Please call rhinit() else RHIPE will not run |
------------------------------------------------
```

```
rhinit()
```

This will display

```
Rhipe: Detected CDH4 jar files, using RhipeCDH4.jar
Initializing Rhipe v0.73
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/client-0.20/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/client/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
Initializing mapfile caches
```

Now you can execute your Rhipe scripts.

## R Installation in Linux Platforms

R is free software for statistical and analytical computing. It is also a very good tool for graphics.
R is used in a wide range of areas and allows us to carry out statistical analysis interactively.
To use R, we first need to install it on our computer. R can be installed on Windows, Linux, Mac OS, etc.

On Linux platforms, we usually install it by compiling the tarball.
The latest stable version of R as of now is R-3.0.0.
Installation of R on Linux platforms is explained below.

## Installation Using rpm

If you are using the Redhat or CentOS distribution of Linux, you can install R using either tarballs or rpm packages.
Installation using rpm is simple.
Just download the rpm files along with their dependencies and install each one using the command

```
rpm -ivh <rpm-name>
```

## Installation Using yum

If you have an Internet connection, installation is very simple.
Just run the following command.

```
yum install R-core R-2*
```

## Installing R using tarball

R is available as a source tarball that is compatible with all Linux platforms.

The installation steps are given below.
Get the latest R tar file for Linux from http://ftp.iitm.ac.in/cran/

Extract the tarball

```
tar -xzvf R-xxx.tar.gz
```

Change the permissions of the extracted directory

```
chmod -R 755 R-xxx
```

Then go inside the extracted R directory and run the configure script

```
cd R-xxx
./configure --enable-R-shlib
```

The above step may fail because of missing dependent libraries in your OS.
If it fails, install the dependent libraries and run the step again.
Once it completes successfully, run the following commands.

```
make

make install
```

After this, set R_HOME and PATH in /etc/bashrc (Redhat or CentOS), ~/.bashrc (if you have no root privilege) or /etc/bash.bashrc (Ubuntu)

```
export R_HOME=<path to R installation>
export PATH=$PATH:$R_HOME/bin
```

Then run the matching command to load the new settings

`source /etc/bashrc` (For Redhat or CentOS)

Or

`source /etc/bash.bashrc` (For Ubuntu)

Or

`source ~/.bashrc` (If no root privilege)