What is EMR?
EMR is a cloud service provided by Amazon. Its full form is Elastic MapReduce.
Using this service, we can launch Hadoop clusters of our desired size in a few minutes.
We can increase or decrease the number of nodes in the cluster while it is running, without disturbing it. That is why it is called Elastic. It is very simple to operate and doesn't require much administration skill. We pay only for what we use, and there is no need for a server room, cooling mechanism, power backup and so on. Everything is available quickly and at an affordable price. We can configure Hadoop and Hadoop ecosystem components such as Hive, Pig and Impala in an EMR cluster.
Shark and Spark are now also available with EMR. If we need any additional services installed in our cluster, we can write our own custom bootstrap script that installs those services and add the script while launching the cluster.
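As a rough illustration of attaching a custom bootstrap script, the sketch below builds a bootstrap action in the shape that the `BootstrapActions` parameter of boto3's `run_job_flow` call accepts. The helper name `bootstrap_action`, the bucket, the script path and the argument are all hypothetical and only serve as placeholders.

```python
def bootstrap_action(name, script_s3_path, args=None):
    """Build one bootstrap action entry for an EMR launch request.

    The returned dict follows the BootstrapActions shape used by
    boto3's run_job_flow; nothing is sent to AWS here.
    """
    return {
        "Name": name,
        "ScriptBootstrapAction": {
            "Path": script_s3_path,    # location of the script, usually in S3
            "Args": list(args or []),  # arguments passed to the script
        },
    }


# Hypothetical script and bucket, for illustration only.
actions = [
    bootstrap_action(
        "install-extra-services",
        "s3://my-example-bucket/bootstrap/install_services.sh",
        ["--verbose"],
    )
]
```

The `actions` list would then be passed along with the rest of the cluster settings when the cluster is launched.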
There are three types of nodes in an EMR cluster: Master, Core and Task.
The Master node runs the master daemons of the Hadoop cluster: NameNode and JobTracker for MRv1, or NameNode and ResourceManager for YARN. Core nodes run DataNode and TaskTracker for MRv1, or DataNode and NodeManager for YARN. Task nodes run only the processing daemons, i.e. TaskTracker or NodeManager. After launching a cluster we can increase the number of Core nodes and Task nodes, but we can decrease only the number of Task nodes. We can't reduce the number of Core nodes, because Core nodes run DataNodes, which store the HDFS data, and decreasing the number of DataNodes may result in data loss.
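The node roles and the resize rule described above can be summarised in a few lines of Python. This is only a model of the rule as stated in this post, not a call to any EMR API; the names `DAEMONS`, `stores_hdfs_data` and `can_resize` are made up for the example, and treating the master count as fixed is an assumption.

```python
# Daemons per node type, as listed in the text above.
DAEMONS = {
    "MRv1": {
        "master": ["NameNode", "JobTracker"],
        "core": ["DataNode", "TaskTracker"],
        "task": ["TaskTracker"],
    },
    "YARN": {
        "master": ["NameNode", "ResourceManager"],
        "core": ["DataNode", "NodeManager"],
        "task": ["NodeManager"],
    },
}


def stores_hdfs_data(node_type, framework="YARN"):
    """A node type holds HDFS blocks only if it runs a DataNode."""
    return "DataNode" in DAEMONS[framework][node_type]


def can_resize(node_type, current, requested, framework="YARN"):
    """Check a requested node count against the rule described in the text."""
    if requested < 0:
        return False
    if node_type == "master":
        # Assumption: the master count is treated as fixed.
        return requested == current
    if requested >= current:
        return True  # core and task nodes can always be added
    # Shrinking is allowed only for nodes that hold no HDFS data.
    return not stores_hdfs_data(node_type, framework)


print(can_resize("core", 4, 6))  # growing the core nodes
print(can_resize("core", 4, 2))  # shrinking the core nodes
print(can_resize("task", 4, 0))  # removing all task nodes
```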
A super cool Python library called Boto is available for working with EMR.
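Here is a minimal sketch of launching a cluster from Python. It uses boto3, the current AWS SDK for Python that succeeded the original Boto, which is a swap on my part. The instance type, release label, region and the helper names `build_cluster_request` and `launch_cluster` are assumptions for illustration. The role names are the EMR default roles, which must already exist in the account.

```python
def build_cluster_request(name, core_count, task_count,
                          instance_type="m5.xlarge",
                          release_label="emr-6.15.0",
                          applications=("Hadoop", "Hive"),
                          subnet_id=None):
    """Build keyword arguments for boto3's EMR run_job_flow call."""
    groups = [
        {"Name": "master", "InstanceRole": "MASTER",
         "InstanceType": instance_type, "InstanceCount": 1},
        {"Name": "core", "InstanceRole": "CORE",
         "InstanceType": instance_type, "InstanceCount": core_count},
    ]
    if task_count > 0:
        # Task nodes are optional, so the group is added only when needed.
        groups.append({"Name": "task", "InstanceRole": "TASK",
                       "InstanceType": instance_type,
                       "InstanceCount": task_count})
    instances = {"InstanceGroups": groups,
                 "KeepJobFlowAliveWhenNoSteps": True}
    if subnet_id:
        instances["Ec2SubnetId"] = subnet_id  # launch inside a VPC subnet
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "Applications": [{"Name": app} for app in applications],
        "Instances": instances,
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def launch_cluster(request, region_name="us-east-1"):
    """Send the request to EMR; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so building a request needs no SDK
    emr = boto3.client("emr", region_name=region_name)
    return emr.run_job_flow(**request)["JobFlowId"]


request = build_cluster_request("demo-cluster", core_count=2, task_count=3)
# cluster_id = launch_cluster(request)  # uncomment to really launch (costs money)
```

Keeping the request builder separate from the call that contacts AWS makes it easy to inspect or log the settings before anything is launched.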
Why can't EMR be launched in all types of VPCs?
To launch an EMR cluster, the VPC must have an internet gateway and a subnet. If internet access is restricted in the VPC, the cluster cannot be launched. The reason is that, while launching, EMR contacts some remote locations to download the required software and installation scripts. If the internet is not available, those connections are blocked and the installation fails.
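Before launching, it can help to check whether the chosen subnet actually has a route to an internet gateway. The sketch below assumes the VPC's route tables have already been fetched in the shape returned by the EC2 `describe_route_tables` call; the function names and the sample IDs are hypothetical. It checks only for an internet gateway route, following the requirement described above.

```python
def has_default_igw_route(route_table):
    """True if the route table sends 0.0.0.0/0 to an internet gateway."""
    return any(
        route.get("DestinationCidrBlock") == "0.0.0.0/0"
        and str(route.get("GatewayId") or "").startswith("igw-")
        for route in route_table.get("Routes", [])
    )


def subnet_has_internet_route(route_tables, subnet_id):
    """Check the route table that applies to the given subnet."""
    explicit = [
        rt for rt in route_tables
        if any(a.get("SubnetId") == subnet_id
               for a in rt.get("Associations", []))
    ]
    main = [
        rt for rt in route_tables
        if any(a.get("Main") for a in rt.get("Associations", []))
    ]
    # A subnet without an explicit association uses the main route table.
    candidates = explicit or main
    return any(has_default_igw_route(rt) for rt in candidates)


# Made-up sample data in the describe_route_tables response shape.
SAMPLE_ROUTE_TABLES = [
    {"Associations": [{"Main": True}],
     "Routes": [{"DestinationCidrBlock": "10.0.0.0/16",
                 "GatewayId": "local"}]},
    {"Associations": [{"Main": False, "SubnetId": "subnet-public"}],
     "Routes": [{"DestinationCidrBlock": "10.0.0.0/16",
                 "GatewayId": "local"},
                {"DestinationCidrBlock": "0.0.0.0/0",
                 "GatewayId": "igw-example"}]},
]

print(subnet_has_internet_route(SAMPLE_ROUTE_TABLES, "subnet-public"))
print(subnet_has_internet_route(SAMPLE_ROUTE_TABLES, "subnet-private"))
```

In the sample data, `subnet-private` has no explicit association, so it falls back to the main route table, which has only the local route.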