Add partitions to a hive table with S3 as the location

Recently I tried to add a partition to a hive table that uses S3 as its storage. The command I tried is given below.

ALTER table mytable ADD PARTITION (testdate='2015-03-05') location 's3a://XXXACCESS-KEYXXXXX:XXXSECRET-KEYXXX@bucket-name/DATA/mytable/testdate=2015-03-05';

I got the following exceptions:

Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.SentryFilterDDLTask. MetaException(message:Got exception: org.apache.hadoop.fs.FileAlreadyExistsException Can't make directory for path 's3a://XXXACCESS-KEYXXXXX:XXXSECRET-KEYXXX@bucket-name/DATA/mytable' since it is a file.) (state=08S01,code=1)
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:s3a://XXXACCESS-KEYXXXXX:XXXSECRET-KEYXXX@bucket-name/DATA/mytable/testdate=2015-03-05 is not a directory or unable to create one)

Solution:

Use s3n instead of s3a and it will work. So the S3 URL should be:

s3n://XXXACCESS-KEYXXXXX:XXXSECRET-KEYXXX@bucket-name/DATA/mytable/testdate=2015-03-05
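
For completeness, the full working command then looks like the one below; the access key, secret key and bucket name are the same placeholders as in the failing command above.

ALTER table mytable ADD PARTITION (testdate='2015-03-05') location 's3n://XXXACCESS-KEYXXXXX:XXXSECRET-KEYXXX@bucket-name/DATA/mytable/testdate=2015-03-05';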


Unauthorized connection for super-user: hue from IP “x.x.x.x”

If you are getting the following error in hue,

Unauthorized connection for superuser: hue from IP “x.x.x.x”

Add the following properties to the core-site.xml of your hadoop cluster and restart the cluster.

<property>
<name>hadoop.proxyuser.hue.groups</name>
<value>*</value>
</property>

<property>
<name>hadoop.proxyuser.hue.hosts</name>
<value>*</value>
</property>

You may face a similar error with Oozie also. In that case, add a similar configuration for the oozie user in core-site.xml.

<property>
<name>hadoop.proxyuser.oozie.groups</name>
<value>*</value>
</property>

<property>
<name>hadoop.proxyuser.oozie.hosts</name>
<value>*</value>
</property>

Find the file name corresponding to a record in hive

Every table in hive has two virtual columns. They are

  • INPUT__FILE__NAME
  • BLOCK__OFFSET__INSIDE__FILE

INPUT__FILE__NAME gives the name of the file that contains the record.

BLOCK__OFFSET__INSIDE__FILE is the current global file position.

Suppose we want to find the name of the file corresponding to each record in a table; we can use the INPUT__FILE__NAME column. This feature is available from Hive 0.8 onwards. A small example is given below.

Table: customer_details

create table customer_details ( name string, phone_number string) row format delimited fields terminated by ',';

Sample data set

datafile1.csv

amal,9876543210
sreesankar,8976543210
sahad,7896543210

datafile2.csv

sivaram,76896543210
rupak,7568943210
venkatesh,9087654321

Query

select INPUT__FILE__NAME, name from customer_details;

This will give us the file name corresponding to each record. If you want to get the distinct file names behind a hive table, the query below will help you.

select distinct(INPUT__FILE__NAME) from customer_details;
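
The other virtual column works the same way. As a small sketch on the same customer_details table, the query below returns the source file name and the record's offset inside that file along with the actual columns.

-- file name, position of the record inside that file, and the table columns
select INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE, name, phone_number from customer_details;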

Hive error in a sentry enabled cluster – “add jar” command throws “Insufficient privileges to execute add”

Apache Sentry is a system for enforcing fine grained, role based authorization on data and metadata stored in a hadoop cluster. It is a very useful system for securing a cluster. Using Sentry we can configure fine grained access to databases, directories and tables in hive and impala. Before Sentry, the only way to limit access was hdfs directory permissions, and that is not very effective.

In a sentry enabled cluster, while adding jars using the “add jar” command, you will face an exception like the one below.

"Insufficient privileges to execute add"

You will not be able to run the add jar command even as an admin user; Sentry restricts the privilege to add jars from hue or beeline. The solution for this problem is to have an administrator add the jar globally using the hive.aux.jars.path property.
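
As a rough sketch, the property goes into hive-site.xml and the hive services are restarted afterwards. The jar path below is only a placeholder; point it to wherever your jar actually lives.

<property>
<name>hive.aux.jars.path</name>
<value>file:///opt/hive/auxlib/my-udfs.jar</value>
</property>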

Facebook Opensourced Presto

Facebook opensourced its data processing technology ‘Presto’ to the world. Presto is a distributed query engine based on ANSI SQL. It is highly optimized and is currently running against more than 300 petabytes of data, which may make it one of the top big data processing systems. Presto is totally different from mapreduce; it is an in-memory data processing mechanism and is very much optimized. From the details given in the Facebook newsletter and the Presto website, it is 10 times faster than Hive. Hive also came from Facebook, so Presto will definitely beat Hive: Hive queries ultimately run as multiple mapreduce jobs, and that takes more time. From my point of view, the real competition may be between Cloudera Impala and Presto. Impala’s performance on huge datasets is not yet known from any production environment, because it is a budding technology from the Cloudera family, but Presto is already tested and running in a huge production environment. Another interesting fact about Presto is that we can use the already existing infrastructure and hadoop cluster for deploying it, because Presto supports hdfs as its underlying data storage. It supports other storage systems also, so it is flexible. Leading internet companies including Airbnb and Dropbox are using Presto. Presto code and further details are available in this link

I have deployed Presto and Impala on a small cluster of 8 nodes. I haven’t got enough time to explore Presto in depth yet; I am planning to explore it more in the coming days. 🙂

Simple Hive JDBC Client

Here I am explaining a sample hive jdbc client. With this we can fire hive queries from java programs. The only requirement is that the hive server must be running; by default, the hive server listens on port 10000. The sample program is given below. The program is self explanatory and you can rewrite it to execute any type of hive query. For this program you need the hive jdbc driver jar and its dependent hive/hadoop jars in the classpath.

import java.sql.SQLException;
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import java.sql.DriverManager;

  /*
   * @author
   * Amal G Jose
   * 
   */

public class HiveJdbc {
  private static String driver = "org.apache.hadoop.hive.jdbc.HiveDriver";

  /**
   * @param args
   * @throws SQLException
   */
  public static void main(String[] args) throws SQLException {
      try {
      Class.forName(driver);
    } catch (ClassNotFoundException e) {
      e.printStackTrace();
      System.exit(1);
    }

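    // NOTE: put your hive server's hostname before :10000 in the JDBC URL below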
    Connection connect = DriverManager.getConnection("jdbc:hive://:10000/default", "", "");
    Statement state = connect.createStatement();
    String tableName = "test";
    state.executeQuery("drop table " + tableName);
    ResultSet res = state.executeQuery("create table " + tableName + " (key int, value string)");
   
   // Query to show tables
    String show = "show tables";
    System.out.println("Running: " + show);
    res = state.executeQuery(show);
    if (res.next()) {
      System.out.println(res.getString(1));
    }

    // Query to describe table
    String describe = "describe " + tableName;
    System.out.println("Running: " + describe);
    res = state.executeQuery(describe);
    while (res.next()) {
      System.out.println(res.getString(1) + "\t" + res.getString(2));
    }

  }
}

An Introduction to Apache Hive

Hive is an important member of the hadoop ecosystem. It runs on top of hadoop and uses an SQL-like query language to process the data in hdfs. Hive is very simple compared to writing many lines of mapreduce code in programming languages such as Java. Hive was developed by facebook with the vision of supporting their SQL experts in handling big data without much difficulty. Hive queries are easy to learn for people who don’t know any programming language, and people with SQL experience can get started straight away. The queries fired into hive ultimately run as mapreduce jobs.

Hive runs in two execution modes: local and distributed.

In local mode, the hive queries run as a single process and use the local file system. In distributed mode, the mappers and reducers run as separate processes and use the hadoop distributed file system.
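
As a small illustration (these are standard hive properties, but verify the names against your hive version), you can let hive automatically run small queries in local mode instead of submitting mapreduce jobs to the cluster:

SET hive.exec.mode.local.auto=true;
-- jobs below these thresholds are considered small enough to run locally (example values)
SET hive.exec.mode.local.auto.inputbytes.max=134217728;
SET hive.exec.mode.local.auto.input.files.max=4;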

The installation of hive was explained well in my previous post Hive Installation.

Hive stores its contents in hdfs and the table details (metadata) in a database. By default the metadata is stored in a derby database, but this is only for play-around setups and cannot be used in multiuser environments. For multiuser environments, we can use databases such as mysql, postgresql, oracle etc. for storing the hive metadata. The data is stored in hdfs, inside a location called the hive warehouse directory, which is defined by the property hive.metastore.warehouse.dir. By default this is /user/hive/warehouse.

We can fire queries into hive using the command line interface or using clients written in different programming languages. The hive server exposes a thrift service, making hive accessible from various programming languages.

The simplicity and power of hive can be explained by comparing the word count program written in java program and in hive query.

The word count program written in java is explained well in my previous post A Simple Mapreduce Program – Wordcount. For that we have to write many lines of code, it takes time, and it also needs some good programming knowledge.

The same word count can be done in just a few lines of hive query.

CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'text' OVERWRITE INTO TABLE docs;
CREATE TABLE word_counts AS
SELECT word, count(1) AS count FROM
(SELECT explode(split(line, '\s')) AS word FROM docs) word
GROUP BY word
ORDER BY word;

Migrating hive from one hadoop cluster to another cluster

Recently I migrated a hive installation from one cluster to another cluster. I couldn't find any document about this migration, so I did it with my own experience and knowledge. Hive stores its metadata in a database, i.e. it stores the data about the tables in some database. For development/production grade installations, we normally use mysql/oracle/postgresql databases. Here I am explaining the migration of hive with its metastore database in mysql. The metadata contains the information about the tables; the contents of the tables are stored in hdfs. So the metadata contains hdfs URIs and other details. If we migrate hive from one cluster to another cluster, we have to point the metadata to the hdfs of the new cluster; if we don't do this, it will still point to the hdfs of the old cluster.

For migrating a hive installation, we have to do the following things.

1) Install hive in the new hadoop cluster.
2) Transfer the data present in the hive warehouse directory (/user/hive/warehouse) to the new hadoop cluster.
3) Take the mysql metastore dump.
4) Install mysql in the new hadoop cluster.
5) Open the hive mysql metastore dump using a text editor such as notepad, notepad++ etc., search for hdfs://ip-address-old-namenode:port, replace it with hdfs://ip-address-new-namenode:port and save it. Here ip-address-old-namenode is the IP address of the namenode of the old hadoop cluster and ip-address-new-namenode is the IP address of the namenode of the new hadoop cluster. (A SQL alternative to hand-editing the dump is sketched after these steps.)
6) After doing the above steps, restore the edited mysql dump into the mysql of the new hadoop cluster.
7) Configure hive as normal and do the hive schema upgrades if needed.
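
If you prefer not to edit the dump by hand, the same URI rewrite can be done with SQL after restoring the dump. This is only a sketch: the namenode addresses are the same placeholders as above, and the table/column names come from the standard hive metastore schema, so verify them against your metastore version before running anything.

-- point the restored metastore at the new cluster's namenode
UPDATE DBS SET DB_LOCATION_URI = REPLACE(DB_LOCATION_URI, 'hdfs://ip-address-old-namenode:port', 'hdfs://ip-address-new-namenode:port');
UPDATE SDS SET LOCATION = REPLACE(LOCATION, 'hdfs://ip-address-old-namenode:port', 'hdfs://ip-address-new-namenode:port');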

This is a solution that I discovered when I faced the migration issue. I don’t know whether any other standard methods are available, but this worked for me perfectly. 🙂

Hadoop commands in hive command line interface

We can execute hadoop commands in hive cli. It is very simple.
Just put an exclamation mark (!) before your hadoop command in hive cli and put a semicolon (;) after your command.

Example:

hive> !hadoop fs -ls / ;

drwxr-xr-x   - hdfs supergroup          0 2013-03-20 12:44 /app
drwxrwxrwx   - hdfs supergroup          0 2013-05-23 11:54 /tmp
drwxr-xr-x   - hdfs supergroup          0 2013-05-08 18:47 /user

Very simple.. 🙂

Making hive usable by multiple users in a hadoop cluster

By default, hive operations are limited to the superuser. If you are using cdh, then the superuser is hdfs.
The reason for this is the permissions of the hive warehouse directory.
By default, read/write permission on this directory is given only to the superuser.
So if we want to use hive from multiple users, we have to change the permissions of this directory accordingly.
If you want to make hive usable by all users, run the following commands.

hadoop fs -chmod -R 777 /user/hive/warehouse

hadoop fs -chmod -R 777 /tmp

If you organize the users into specific groups, you can instead give read/write permission to the group only, i.e. 775.