Hive error in a Sentry-enabled cluster – “add jar” command throws “Insufficient privileges to execute add”

Apache Sentry is a system for enforcing fine-grained, role-based authorization to data and metadata stored on a Hadoop cluster, and it is very useful for securing a cluster. Using Sentry we can configure fine-grained access to databases, directories, and tables in Hive and Impala. Before Sentry, the only way to limit access was HDFS directory permissions, and that approach is not very effective.

In a Sentry-enabled cluster, adding jars with the “add jar” command fails with an exception like the one below.

"Insufficient privileges to execute add"

The “add jar” command fails even for the admin user: Sentry blocks the privilege to add jars from Hue or Beeline. The solution is to have an administrator register the jar globally using the hive.aux.jars.path property.
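As a sketch, this is roughly how the global registration looks in hive-site.xml; the jar path below is a hypothetical example, and HiveServer2 must be restarted after the change:

```xml
<!-- hive-site.xml: register auxiliary jars globally at startup instead of
     using "add jar" per session. The path is a hypothetical example; the
     file must exist on the local file system of the HiveServer2 host. -->
<property>
  <name>hive.aux.jars.path</name>
  <value>file:///opt/hive/aux-jars/my-udf.jar</value>
</property>
```

With this in place, the classes in the jar are available to all sessions without anyone needing the blocked “add jar” command.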


Impala Scratch directory issue

While running queries over large data in Impala, you may sometimes get an error like this:

WARNINGS: Create file /tmp/impala-scratch/94869b99d0d6457:765d2bc009a914ad_94869b99d0d6457:765d2bc009a914af_516bff1b-7342-434e-8c95-c777bb7f237e failed with errno=2 description=Error(2): No such file or directory

Backend 1:Create file /tmp/impala-scratch/94869b99d0d6457:765d2bc009a914ad_94869b99d0d6457:765d2bc009a914af_516bff1b-7342-434e-8c95-c777bb7f237e failed with errno=2 description=Error(2): No such file or directory

One of my friends faced this issue, and on investigation I found that it was caused by the unexpected deletion of files inside the Impala scratch directory. The intermediate files used during large sort, join, aggregation, or analytic function operations are stored in this scratch directory. By default the Impala scratch directory is /tmp/impala-scratch, and these directories are deleted after query execution. The best solution for this problem is to change the scratch directory to some other location, which can be done by starting the Impala daemon with the option --scratch_dirs="path_to_directory". This directory lives on the local Linux file system, not in HDFS. Impala will not start if it does not have proper read/write access to the files in the scratch directory.
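As a rough sketch, assuming a hypothetical directory /data/impala/scratch that the impala user owns on each node:

```shell
# Hypothetical example: create a dedicated scratch directory on the local
# Linux file system of every node (not in HDFS) and give the impala user
# read/write access to it.
sudo mkdir -p /data/impala/scratch
sudo chown impala:impala /data/impala/scratch

# Start the Impala daemon pointing at the new scratch directory.
impalad --scratch_dirs=/data/impala/scratch
```

On package-based CDH installs this flag is typically added to IMPALA_SERVER_ARGS in /etc/default/impala rather than passed by hand.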

If you are using Impala in an EMR cluster, modify the start-up options in the bootstrap action. If you want to modify this configuration in an existing EMR cluster, stop service nanny on all the nodes and then restart Impala with the new scratch directory property. If service nanny is running, you will not be able to restart Impala with the new argument, because service nanny will restart the service before you do. 🙂

Introduction to Apache Spark

Big data is very hot in the market, and Hadoop is one of its top-rated technologies. Hadoop became very popular because of its elegant design, its ability to handle large structured, unstructured, and semi-structured data, and its strong community support. Hadoop is a batch processing framework that can process data of any size; the one thing it guarantees is that it will not fail because of load. Initially the requirement was to handle large data without failure, which led to the design and development of frameworks such as Hadoop. People then started thinking about performance improvements to this kind of processing, and that led to the development of a technology called Spark.

Spark is an open-source technology for processing large data in a distributed manner, with some extra features compared to MapReduce, and its processing speed is higher. Most cluster programming models of the time were based on directed acyclic data flow from stable storage to stable storage, and acyclic data flow is inefficient for applications that repeatedly reuse a working set of data. The main motivation behind the development of Spark was the inefficient handling of two types of applications in existing computing frameworks: iterative algorithms and interactive data mining tools.

With those frameworks, applications reload data from stable storage on each query. If the same data is reloaded multiple times, it consumes more time and slows down processing; with large data, the time loss is high. If we instead store the intermediate results of a process in memory and share the in-memory copy across the cluster, the delay is much smaller, which results in a performance improvement. The inability to keep intermediate results in memory is one of the major drawbacks of most popular distributed data processing technologies, and the need for in-memory computation is greatest in iterative algorithms and data mining applications. Spark achieves this in-memory computation with RDDs (Resilient Distributed Datasets), which form the backbone of Spark.

An RDD is a distributed memory abstraction that lets programmers perform in-memory computations on very large clusters in a fault-tolerant manner. An RDD is a read-only, partitioned collection of records, and it carries enough information about how it was derived from other datasets to be rebuilt on failure. RDDs are immutable collections of objects spread across a cluster.


Spark is rich with features because of the modules built on top of it:

  • Spark Streaming: processing real-time data streams
  • Spark SQL and DataFrames: support for structured data and relational queries
  • MLlib: built-in machine learning library
  • GraphX: Spark’s new API for graph processing
  • Bagel (Pregel on Spark): older, simple graph processing model

Is Spark a replacement for Hadoop?

Spark is not a replacement for Hadoop; it can work along with Hadoop, use Hadoop's file system (HDFS) as the storage layer, and run on an existing Hadoop cluster. Spark has become one of the most active projects in the Hadoop ecosystem. The comparison applies only to the processing layer, MapReduce, and as per current test results Spark performs much better than MapReduce. Spark has several advantages over MapReduce, is still under active development, and more features are coming. Real-time stream processing is better in Spark than in the other ecosystem components of Hadoop. A detailed performance report on Spark is available at the following URL.

http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html

Is Spark free or licensed?

Spark is a 100% open-source project. It is now an Apache project with several committers and contributors across the world.

What are the attractive features of Spark compared with MapReduce?

  • Spark has Java, Scala, and Python APIs.
  • Programming Spark is simpler than programming MapReduce, which reduces development time.
  • Spark's performance is better than MapReduce's. It is best suited for computations such as real-time processing and iterative computations over the same data.
  • Caching is one of Spark's main features. Spark stores intermediate results in memory across its distributed workers, whereas MapReduce stores them on disk; this in-memory caching is what makes Spark faster. Spark Streaming also provides real-time processing of data as it flows, which MapReduce lacks.
  • With Spark, it is possible to do batch processing, stream processing, graph processing, and machine learning on the same cluster, which gives better resource utilization and easier resource management.
  • Spark has an excellent feature of spilling data partitions to disk when a node does not have sufficient RAM to hold them.
  • All these features make Spark a very powerful member of the big data technology stack, and it is likely to be one of the hottest technologies in the market.