
Find the file name corresponding to a record in Hive

Every table in Hive has two virtual columns:

  • INPUT__FILE__NAME
  • BLOCK__OFFSET__INSIDE__FILE

INPUT__FILE__NAME gives the full path of the file that contains the record.

BLOCK__OFFSET__INSIDE__FILE gives the byte offset of the record within the file.

Suppose we want to find the name of the file corresponding to each record in a table. We can use the INPUT__FILE__NAME column for this. This feature is available in Hive 0.8.0 and later. A small example is given below.

Table Customer Details

create table customer_details ( name string, phone_number string) row format delimited fields terminated by ',';

Sample data set

datafile1.csv

amal,9876543210
sreesankar,8976543210
sahad,7896543210

datafile2.csv

sivaram,76896543210
rupak,7568943210
venkatesh,9087654321

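Assuming the two CSV files sit on the local filesystem (the /tmp paths below are illustrative, not from the original post), they can be loaded into the table like this:

```sql
-- Load both sample files into the customer_details table.
-- LOCAL means the paths refer to the local filesystem, not HDFS.
LOAD DATA LOCAL INPATH '/tmp/datafile1.csv' INTO TABLE customer_details;
LOAD DATA LOCAL INPATH '/tmp/datafile2.csv' INTO TABLE customer_details;
```

Each LOAD DATA statement copies one file into the table's warehouse directory, so the table ends up backed by two separate files.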
Query

select INPUT__FILE__NAME, name from customer_details;

This will give us the file name corresponding to each record. If you want to list all the files backing a Hive table, the query below will help you.

select distinct(INPUT__FILE__NAME) from customer_details;

Programmatic Data Upload to Amazon S3

S3 (Simple Storage Service) is a service provided by Amazon for storing data. It is very useful and inexpensive. Data can be uploaded to and downloaded from S3 very easily, using tools as well as programs. Here I am explaining a sample Python program for uploading a file to S3.

Files can be uploaded to S3 in two ways: the normal upload and the multipart upload. A normal upload sends the file serially, so it is slow and not suitable for large files. For large files, multipart upload is the better option: it divides the file into chunks, sends them in parallel, and S3 reassembles them into a single object.

This program uses the normal approach for sending files to S3. Here I used the boto library for uploading the files.
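A minimal sketch of such a normal (single-part) upload with boto is given below. The bucket name, key name, and credentials are placeholders, not values from the original post; substitute your own.

```python
def upload_to_s3(file_path, bucket_name, key_name, aws_key, aws_secret):
    """Upload a local file to S3 as a single object (normal upload)."""
    # boto is imported inside the function so the module can still be
    # loaded and inspected on machines where boto is not installed.
    from boto.s3.connection import S3Connection
    from boto.s3.key import Key

    conn = S3Connection(aws_key, aws_secret)
    bucket = conn.get_bucket(bucket_name)

    k = Key(bucket)
    k.key = key_name                          # object name inside the bucket
    k.set_contents_from_filename(file_path)   # serial upload of the whole file

# Example call (all arguments are placeholders):
# upload_to_s3('/home/coder/data.zip', 'my-bucket', 'backups/data.zip',
#              'ACCESS_KEY', 'SECRET_KEY')
```

set_contents_from_filename sends the entire file in one request, which is exactly why this approach becomes slow for very large files.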

Python code to find the md5 checksum of a file

Checksum calculation is an unavoidable and very important step wherever we transfer files or data. The simplest way to verify whether a file reached its destination intact is to compare the checksums of the source and target files. A checksum can be calculated in several ways. One is to calculate it over the entire file as a single block. Another is multipart checksum calculation, where we calculate the checksum of multiple small chunks of the file and then compute an aggregated checksum.

Here I am explaining the calculation of the checksum of a file using the simplest way. I am using the hashlib library in Python.

Suppose I have a zip file located at /home/coder/data.zip. Its checksum can be calculated as follows.

import hashlib

file_name = '/home/coder/data.zip'
# Open in binary mode: md5 operates on bytes, and text mode can corrupt binary data.
checksum = hashlib.md5(open(file_name, 'rb').read()).hexdigest()
print(checksum)

One common mistake I have seen among people is passing the file name directly to md5 without opening the file.

Eg: hashlib.md5(file_name).hexdigest()

In Python 2 this will also return a checksum, but it is the checksum of the file name string, not of the file's contents (in Python 3 it simply raises a TypeError, since md5 expects bytes). So always calculate the checksum as follows:

hashlib.md5(open(file_name, 'rb').read()).hexdigest()

This will return the correct checksum.
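Note that open(file_name, 'rb').read() loads the whole file into memory at once. For large files it is better to update the hash incrementally in small chunks; a minimal sketch (the 4 KB chunk size is an arbitrary choice):

```python
import hashlib

def md5_of_file(file_name, chunk_size=4096):
    """Compute the md5 of a file by feeding it to the hash in chunks,
    so the whole file never has to fit in memory."""
    md5 = hashlib.md5()
    with open(file_name, 'rb') as f:
        # iter() with a sentinel keeps calling f.read(chunk_size)
        # until it returns b'' at end of file.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            md5.update(chunk)
    return md5.hexdigest()
```

This produces exactly the same digest as hashing the whole file in one go, because md5.update() is cumulative.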

In Linux, you can also calculate the md5 checksum using a command-line utility.

> md5sum file_name