
A Brief Overview of Elasticsearch

This is the era of NoSQL databases, and Elasticsearch is one of the most popular among them. It is a document database that stores records as JSON. Once you get the basics, Elasticsearch is very easy to work with, and you can become productive with it in just a few days. You should know JSON before proceeding. Here I explain a quick installation and the basic operations in Elasticsearch, which may help beginners get started. You don't need a server for learning Elasticsearch; you can install it on your desktop or laptop.

Step 1:
Download the Elasticsearch archive from the Elasticsearch website:
https://www.elastic.co/downloads/elasticsearch

Extract the file, go to the bin folder, and run the startup script.
Linux users should execute the elasticsearch script; Windows users should execute the elasticsearch.bat file.
Now your Elasticsearch instance will be up, and by default the data will be stored under the folder $ELASTICSEARCH_HOME/data.
Check the URL http://localhost:9200 in your browser. If JSON like the following appears, your Elasticsearch instance is running properly.

{
  "status" : 200,
  "name" : "Armor",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "1.4.0",
    "build_hash" : "bc94bd81298f81c656893ab1ddddd30a99356066",
    "build_timestamp" : "2014-11-05T14:26:12Z",
    "build_snapshot" : false,
    "lucene_version" : "4.10.2"
  },
  "tagline" : "You Know, for Search"
}
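You can also check this response from a script instead of a browser. Below is a minimal Python sketch that parses the status response; the response body is the sample shown above, and in a real script you would fetch it from http://localhost:9200 yourself (e.g. with urllib.request.urlopen).

```python
import json

# Sample response body returned by http://localhost:9200, as shown above.
# In a real script you would fetch it, e.g.:
#   urllib.request.urlopen("http://localhost:9200").read()
response_text = '''{
  "status": 200,
  "name": "Armor",
  "cluster_name": "elasticsearch",
  "version": {
    "number": "1.4.0",
    "build_hash": "bc94bd81298f81c656893ab1ddddd30a99356066",
    "build_timestamp": "2014-11-05T14:26:12Z",
    "build_snapshot": false,
    "lucene_version": "4.10.2"
  },
  "tagline": "You Know, for Search"
}'''

info = json.loads(response_text)
if info["status"] == 200:
    print("Elasticsearch %s is up on cluster '%s'"
          % (info["version"]["number"], info["cluster_name"]))
```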

Step 2:

Now we have to load some data into Elasticsearch. For that we need to send some REST requests.
Install a REST client plugin in your browser. (In Mozilla Firefox, an add-on named REST Client will help you; in Chrome, an extension named Postman will help you.)

Step 3:

Get some sample data to load. I am providing some sample data, but you can use something else as well. In Elasticsearch, we keep data under an index. An index is analogous to a database in an RDBMS. In an RDBMS we have databases, tables, and records; here we have indices, types, and documents. Every document has a unique key called the Id.

Here I am adding car details to an index “car”, type “details”, with Ids starting from 1.
To add the first record, send a PUT request with the following details:

URL : http://localhost:9200/car/details/1
METHOD : PUT
BODY:
{
  "carName": "Fabia",
  "manufacturer": "Skoda",
  "type": "mini"
}

Similarly, you can add the second record:

URL : http://localhost:9200/car/details/2
METHOD : PUT
BODY:
{
  "carName": "Yeti",
  "manufacturer": "Skoda",
  "type": "XUV"
}

The full dataset is available on GitHub. You can add the remaining records in the same way.
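The PUT requests above all follow the pattern http://localhost:9200/&lt;index&gt;/&lt;type&gt;/&lt;id&gt;, so loading a whole list of records is easy to script. Here is a small Python sketch that builds these requests; the host, port, and index/type names are the ones used in this post, and the function name is mine, not an Elasticsearch API.

```python
import json

BASE = "http://localhost:9200"  # default Elasticsearch HTTP endpoint

cars = [
    {"carName": "Fabia", "manufacturer": "Skoda", "type": "mini"},
    {"carName": "Yeti", "manufacturer": "Skoda", "type": "XUV"},
]

def index_request(index, doc_type, doc_id, doc):
    """Return the (url, body) pair for a PUT that indexes one document."""
    url = "%s/%s/%s/%s" % (BASE, index, doc_type, doc_id)
    return url, json.dumps(doc)

# Ids start from 1, matching the examples above.
requests_to_send = [index_request("car", "details", i + 1, doc)
                    for i, doc in enumerate(cars)]

for url, body in requests_to_send:
    # To actually send one, you could use the standard library:
    #   urllib.request.urlopen(
    #       urllib.request.Request(url, body.encode(), method="PUT"))
    print("PUT", url, body)
```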

Step 4:

If you want to update a record, just send a PUT request, similar to the data load, with the corresponding Id and the new record. It will
replace the document with the new record, and you will see a new version number. The old record will not be available after this.

E.g., suppose I want to change the record with Id 1, renaming the car from Fabia to FabiaNew.

URL : http://localhost:9200/car/details/1
METHOD : PUT
BODY:
{
  "carName": "FabiaNew",
  "manufacturer": "Skoda",
  "type": "mini"
}
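Note that a PUT is a full replacement, not a merge: any field missing from the new body is gone, and the document's version number is incremented. The Python sketch below illustrates these semantics with a toy in-memory store; it is an illustration of the behavior, not the real Elasticsearch API.

```python
def put(store, doc_id, doc):
    """Replace the whole document and bump its version, like an Elasticsearch PUT."""
    old = store.get(doc_id)
    version = old["_version"] + 1 if old else 1
    store[doc_id] = {"_source": dict(doc), "_version": version}
    return version

store = {}
# First load, like Step 3:
put(store, 1, {"carName": "Fabia", "manufacturer": "Skoda", "type": "mini"})
# Update, like Step 4 -- the whole document is replaced:
put(store, 1, {"carName": "FabiaNew", "manufacturer": "Skoda", "type": "mini"})

print(store[1]["_version"])            # 2 after the update
print(store[1]["_source"]["carName"])  # FabiaNew; the old name is gone
```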

Step 5:
To get all the indices from Elasticsearch, send the following GET request:

METHOD: GET
http://localhost:9200/_aliases

Step 6:
To retrieve data from Elasticsearch, we can query it. To get everything, issue one of the following requests (the first searches across all indices, the second only the car index):

METHOD: GET
http://localhost:9200/_search
http://localhost:9200/car/_search

Step 7:

For more detailed searches, you can try the following.

Query for getting all the vehicles with manufacturer “Skoda”:

METHOD: POST
http://localhost:9200/_search

{
  "query": {
    "query_string": {
      "fields": ["manufacturer"],
      "query": "Skoda"
    }
  }
}

Query for getting all the vehicles with manufacturer “Skoda” or “Renault”:

METHOD: POST
http://localhost:9200/_search

{
  "query": {
    "query_string": {
      "fields": ["manufacturer"],
      "query": "Skoda OR Renault"
    }
  }
}
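Both bodies share the same shape, so generating them programmatically is straightforward. Here is a small Python helper that builds a query_string body for any field/query pair; the helper name is mine, not an Elasticsearch API.

```python
import json

def query_string_body(fields, query):
    """Build the JSON body for a query_string search, as used above."""
    return {"query": {"query_string": {"fields": fields, "query": query}}}

skoda = query_string_body(["manufacturer"], "Skoda")
either = query_string_body(["manufacturer"], "Skoda OR Renault")

# POST either body to http://localhost:9200/_search to run the query.
print(json.dumps(either, indent=2))
```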

Sample program with Hadoop Counters and Distributed Cache

Counters are a very useful feature in Hadoop. They help us track global events in a job, i.e., across the map and reduce phases.
When we execute a MapReduce job, we can see a lot of counters listed in the logs. Besides the default built-in counters, we can create our own custom counters, which will be listed along with the built-in ones.
This helps us in several ways. Here I explain a scenario where I use a custom counter to count the number of good words and stop words in the given text files. The stop words in this program are provided at run time using the distributed cache.
This is a mapper-only job; setting job.setNumReduceTasks(0) makes it a mapper-only job.

Here I am introducing another Hadoop feature, called the distributed cache.
The distributed cache distributes application-specific read-only files efficiently throughout the application.
My requirement is to filter the stop words from the input text files. The stop word list may vary, so if I hard-code the list in my program, I have to update the code every time the list changes. This is not good practice. Instead, I load the file containing the stop words into the distributed cache, which makes it available to the mappers as well as the reducers. In this program, we don't require any reducer.
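The counting logic itself is simple. Here is the same idea in plain Python for illustration; in the real job this runs inside a Hadoop mapper, the two counts are custom Hadoop counters, and the stop-word set is read from the file in the distributed cache. The counter names below are my own illustrative choices.

```python
def count_words(lines, stop_words):
    """Count good words and stop words, like the custom Hadoop counters do."""
    counters = {"GOOD_WORDS": 0, "STOP_WORDS": 0}
    good = []
    for line in lines:
        for word in line.split():
            if word.lower() in stop_words:
                counters["STOP_WORDS"] += 1
            else:
                counters["GOOD_WORDS"] += 1
                good.append(word)
    return counters, good

# Stop words would normally come from the file in the distributed cache.
stop_words = {"is", "the", "am", "are", "with", "was", "were"}
counters, good = count_words(["hadoop is the elephant"], stop_words)
print(counters)  # {'GOOD_WORDS': 2, 'STOP_WORDS': 2}
```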

The code is attached below. You can also get it from GitHub.

Create a Java project with the above Java classes and add the dependent Java libraries (they are present in your Hadoop installation). Export the project as a runnable jar and execute it. The file containing the stop words should be present in HDFS, with one stop word per line. A sample format is given below.

is
the
am
are
with
was
were

A sample command to execute the program is given below.

hadoop jar <jar-name>  -skip  <stop-word-file-in hdfs>   <input-data-location>    <output-location>

Eg:  hadoop jar Skipper.jar  -skip /user/hadoop/skip/skip.txt     /user/hadoop/input     /user/hadoop/output

In the job logs, you can see the custom counters as well. I am attaching a sample log below.

Counters