Impala Scratch Directory Issue

While running queries over large data in Impala, we may sometimes get an error like this:

WARNINGS: Create file /tmp/impala-scratch/94869b99d0d6457:765d2bc009a914ad_94869b99d0d6457:765d2bc009a914af_516bff1b-7342-434e-8c95-c777bb7f237e failed with errno=2 description=Error(2): No such file or directory

Backend 1:Create file /tmp/impala-scratch/94869b99d0d6457:765d2bc009a914ad_94869b99d0d6457:765d2bc009a914af_516bff1b-7342-434e-8c95-c777bb7f237e failed with errno=2 description=Error(2): No such file or directory

One of my friends faced this issue, and on investigation I found that it is caused by the unexpected deletion of files inside the Impala scratch directory. The intermediate files used during large sort, join, aggregation, and analytic function operations are stored in this scratch directory, which by default is /tmp/impala-scratch. These directories are deleted after the query execution completes. The best solution for this problem is to change the scratch directory to some other location, which can be done by starting the Impala daemon with the option --scratch_dirs="path_to_directory". This directory lives in the local Linux file system, not in HDFS. Impala will not start if it does not have proper read/write access to the files in the scratch directory.
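For example, on a plain Linux setup you could create a dedicated scratch location and start the daemon against it like this (a sketch; /data/impala/scratch and the impala user are assumed, and how you pass daemon flags depends on your distribution):

# create a scratch directory on the local file system (assumed path)
sudo mkdir -p /data/impala/scratch
# make it writable by the user running the daemon (assumed to be impala)
sudo chown impala:impala /data/impala/scratch

# start the impala daemon pointing at the new scratch location
impalad --scratch_dirs=/data/impala/scratch

The --scratch_dirs option also accepts a comma-separated list of directories if you want to spread the spill files across multiple disks.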

If you are using Impala in an EMR cluster, make the changes in the bootstrap action to modify the startup options. If you want to modify this configuration in an existing EMR cluster, stop the service nanny on all the nodes and restart Impala with the new scratch directory property. If the service nanny is running, you will not be able to restart Impala with the new argument, because the service nanny will restart the service before your restart happens. 🙂
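A rough sketch of that procedure, assuming the service nanny and Impala are managed as ordinary services on each node (the actual service names and paths differ between AMI versions, so treat these as placeholders):

# on each node: stop the service nanny so it cannot restart impala behind your back (assumed service name)
sudo service service-nanny stop

# restart the impala daemon with the new scratch directory (assumed invocation)
sudo service impala stop
impalad --scratch_dirs=/data/impala/scratch &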


A Brief Overview of Elasticsearch

This is the era of NoSQL databases, and Elasticsearch is one of the popular ones. It is a document database that stores records as JSON. Once you get the basics of elasticsearch, it is very easy to work with, and you can become productive with it in just a few days. You should know JSON before proceeding with elasticsearch. Here I am explaining the quick installation and basic operations in elasticsearch, which may help beginners to get started. For learning elasticsearch, you don't need any server; you can use your desktop/laptop for installing it.

Step 1:
Download the elasticsearch package from the elasticsearch website:
https://www.elastic.co/downloads/elasticsearch

Extract the file and go to the bin folder. Linux users should execute the elasticsearch script, and Windows users should execute the elasticsearch.bat file.
Now your elasticsearch instance will be up, and by default its data will be stored under the folder $ELASTICSEARCH_HOME/data.
Check the URL http://localhost:9200 in your browser. If JSON like the sample below comes back, your elasticsearch instance is running properly:

{
  "status" : 200,
  "name" : "Armor",
  "cluster_name" : "elasticsearch",
  "version" : {
    "number" : "1.4.0",
    "build_hash" : "bc94bd81298f81c656893ab1ddddd30a99356066",
    "build_timestamp" : "2014-11-05T14:26:12Z",
    "build_snapshot" : false,
    "lucene_version" : "4.10.2"
  },
  "tagline" : "You Know, for Search"
}
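You can do the same check from the command line with curl; the ?pretty parameter just formats the response:

curl 'http://localhost:9200/?pretty'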

Step 2:

Now we have to load some data into elasticsearch, and for that we need to send REST requests.
Install a REST client plugin in your browser. (In Mozilla Firefox, a plugin named RESTClient will help you; in Chrome, an app named Postman will help you.)

Step 3:

Get some sample data for loading; I am providing some here, but you can use something else as well. In elasticsearch, we keep data under an index. An index is analogous to a database in an RDBMS: where an RDBMS has databases, tables, and records, elasticsearch has indexes, types, and documents. Every document has a unique key called the id.

Here I am adding car details to an index "car" under the type "details", giving the ids starting from 1.
To add the first record, send a PUT request with the following details:

URL : http://localhost:9200/car/details/1
METHOD : PUT
BODY:
{
  "carName": "Fabia",
  "manufacturer": "Skoda",
  "type": "mini"
}
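If you prefer the command line over a browser plugin, the same request with curl looks like this:

curl -XPUT 'http://localhost:9200/car/details/1' -d '{
  "carName": "Fabia",
  "manufacturer": "Skoda",
  "type": "mini"
}'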

Similarly, you can add the second record:

URL : http://localhost:9200/car/details/2
METHOD : PUT
BODY:
{
  "carName": "Yeti",
  "manufacturer": "Skoda",
  "type": "XUV"
}

The full dataset is available on GitHub, and you can add the remaining records in the same way.

Step 4:

If you want to update a record, just do a PUT request similar to the data load, with the corresponding id and the new record. It will replace the old data with the new record, and you can see an incremented version number. The old record will not be available after this.

Eg: Suppose I want to change the record with id 1, renaming the car from Fabia to FabiaNew.

URL : http://localhost:9200/car/details/1
METHOD : PUT
BODY:
{
  "carName": "FabiaNew",
  "manufacturer": "Skoda",
  "type": "mini"
}
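You can verify the update with a GET request on the same id; the _version field in the response should now be 2:

curl 'http://localhost:9200/car/details/1?pretty'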

Step 5:
To get all the indexes from elasticsearch, do the following GET request:

METHOD: GET
http://localhost:9200/_aliases
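With curl:

curl 'http://localhost:9200/_aliases?pretty'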

Step 6:
To get data from elasticsearch, we query it. To get everything, do the following requests; the first searches across all indexes, while the second searches only within the car index:

METHOD: GET
http://localhost:9200/_search
http://localhost:9200/car/_search
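With curl, for example:

curl 'http://localhost:9200/car/_search?pretty'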

Step 7:

For more detailed queries, you can try the following.

Query for getting all the vehicles with manufacturer "Skoda":

METHOD: POST
http://localhost:9200/_search

{"query":

{
"query_string" : {
"fields" : ["manufacturer"],
"query" : "Skoda"
}
}
}
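The same query sent with curl:

curl -XPOST 'http://localhost:9200/_search' -d '{
  "query": {
    "query_string": {
      "fields": ["manufacturer"],
      "query": "Skoda"
    }
  }
}'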

Query for getting all the vehicles with manufacturer Skoda or Renault:

METHOD: POST
http://localhost:9200/_search

{
  "query": {
    "query_string": {
      "fields": ["manufacturer"],
      "query": "Skoda OR Renault"
    }
  }
}

An Introduction to Apache Hive

Hive is an important member of the hadoop ecosystem. It runs on top of hadoop and uses an SQL-like query language to process data in HDFS. Hive is very simple compared to writing many lines of mapreduce code in programming languages such as Java. Hive was developed at Facebook with the vision of letting their SQL experts handle big data without much difficulty. Hive queries are easy to learn for people who don't know any programming language, and people with SQL experience can go straight into hive queries. The queries fired into hive will ultimately run as mapreduce jobs.

Hive runs in two execution modes: local and distributed.

In local mode, the hive queries run as a single process and use the local file system. In distributed mode, the mappers and reducers run as separate processes and use the hadoop distributed file system.
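For example, hive can be told to pick local mode automatically for small jobs through a standard property (a sketch; you can set it per session or in hive-site.xml):

# start the hive CLI with automatic local-mode execution for small jobs
hive --hiveconf hive.exec.mode.local.auto=true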

The installation of hive was explained well in my previous post Hive Installation.

Hive stores its contents in HDFS and the table details (metadata) in a database. By default the metadata is stored in a derby database, but this is suitable for playground setups only and cannot be used for multiuser environments. For multiuser environments, we can use databases such as mysql, postgresql, oracle etc. for storing the hive metadata. The data is stored in HDFS, inside a location called the hive warehouse directory, which is defined by the property hive.metastore.warehouse.dir. By default this will be /user/hive/warehouse.
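Since the warehouse is just a directory in HDFS, you can inspect it directly (assuming the default location):

hadoop fs -ls /user/hive/warehouse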

We can fire queries into hive using a command line interface or using clients written in different programming languages. The hive server exposes a thrift service, making hive accessible from various programming languages.
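For example, you can start the thrift service and connect to it with the beeline client that ships with hive (a sketch; 10000 is the default HiveServer2 port):

# start the hive thrift service (HiveServer2)
hive --service hiveserver2 &

# connect to it from the beeline command line client
beeline -u jdbc:hive2://localhost:10000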

The simplicity and power of hive can be explained by comparing the word count program written in Java with the same logic written as a hive query.

The word count program written in Java is explained in my previous post A Simple Mapreduce Program – Wordcount. For that, we have to write a lot of lines of code, it takes time, and it needs good programming knowledge as well.

The same word count can be done in just a few lines of hive query.

-- table holding one line of the input text per row
CREATE TABLE docs (line STRING);

-- load the input text from the HDFS path 'text' into the table
LOAD DATA INPATH 'text' OVERWRITE INTO TABLE docs;

-- split each line on whitespace, explode the words into rows, and count them
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
  (SELECT explode(split(line, '\\s')) AS word FROM docs) w
GROUP BY word
ORDER BY word;
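Once the query finishes, word_counts is an ordinary hive table, so you can inspect the most frequent words directly (a sketch; count is backquoted because it collides with the built-in function name):

hive -e 'SELECT word, `count` FROM word_counts ORDER BY `count` DESC LIMIT 10;'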