Introduction to Apache Spark

Big Data is very hot in market. Hadoop is one of the top rated technologies in big data. Hadoop became very popular in the market because of its elegant design, its capability to handle large structured/unstructured/semi-structured data and the better community support. Hadoop is a batch processing framework that can process data of any size. The only thing Hadoop guarantees is it will not fail because of load. Initially the requirement was to handle large data without any failure. This led to the design and development of frameworks such as hadoop. After that people started thinking about the performance improvements that can be made in this processing. This led to the development of a technology called spark.

Spark is an open source technology for processing large data in a distributed manner with some extra features compared to mapreduce. The processing speed of spark is higher than that of mapreduce. Most current cluster programming models are based on directed acyclic data flow from stable storage to stable storage. Acyclic data flow is inefficient for applications that repeatedly reuse a working set of data. The main motivation behind the development of spark is because of the inefficient handling of two types of applications such as Iterative Algorithms and Interactive data mining tools in the current computing frameworks. With current frameworks, applications reload data from stable storage on each query. If the reload of the same data happens multiple times, it will consume more time. This affects the processing speed. If this happens in case of large data processing, the time loss will be high. If we store the intermediate results of a process in memory and share the in-memory copy of results across the cluster resources, the time delay will be less which will results in performance improvement. In this way we can say that the performance improvement is higher with in-memory computations. The inability to keep intermediate results in memory is one of the major drawback in most of the popular distributed data processing technologies. This is requirement of in-memory computation is mainly in iterative Algorithms and data mining applications. Spark achieves this in memory computation with RDDs. The back end of spark is RDD (Resilient Distributed Datasets).

RDD is a distributed memory abstraction that helps programmers to perform in-memory computation on very large clusters in an error free manner. An RDD is a read-only, partitioned collection of records. An RDD has enough information about how it was derived from other datasets. RDDs are immutable collections of objects spread across a cluster.


Spark is rich with several features because of the modules build on spark.

  • Spark Streaming: processing real-time data streams
  • Spark SQL and DataFrames: support for structured data and relational queries
  • MLlib: built-in machine learning library
  • GraphX: Spark’s new API for graph processing
  • Bagel (Pregel on Spark): older, simple graph processing model

Is spark a replacement of hadoop ?

Spark is not a replacement for hadoop. It can work along with hadoop. It can use hadoop’s file system-HDFS as the storage layer. It can run on the existing hadoop cluster. Now spark became one of the most active projects in the hadoop ecosystem. The comparison happens only with the processing layer-Mapreduce. As per the current test results, spark is performing much better than mapreduce. Spark has several advantages of mapreduce. Spark is still under development and more features are coming up. The realtime stream processing is better in spark compared to other ecosystem components in hadoop. The detailed performance report of spark is available in the following url.

Is Spark free or Licensed?

Spark is a 100 % open source project. Now it became an apache project with several committers and contributors across the world.

What are the attracting features in spark in comparison with Mapreduce ?

  • Spark is having Java, Scala and Python APIs.
  • Programming spark is simpler as compared to programming mapreduce. This reduces the development time.
  • The performance of spark is better compared to mapreduce. It is best suited for computations such as realtime processing, iterative computations etc on similar data.
  • Caching is one main feature in spark. Spark stores the intermediate result in memory across its distributed workers. Mapreduce stores the intermediate results on disk. The in memory caching feature of spark makes it faster. The spark streaming provides a realtime data processing feature on the fly of data flow which is missing in case of mapreduce.
  • With spark, it is possible to obtain batch processing, streaming processing, graph processing and machine learning in the same cluster. This provides better resource utilization and easy resource management.
  • Spark has an excellent feature of spilling the data partitions to disk if the node is not having sufficient RAM for storing the data partitions.
  • All these features made spark a very powerful member in the bigdata technology stack and this will be the one of the hottest technologies that is going to capture the market.

About amalgjose
I am an Electrical Engineer by qualification, now I am working as a Software Engineer. I am very much interested in Electrical, Electronics, Mechanical and now in Software fields. I like exploring things in these fields. I like travelling, long drives and very much addicted to music.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: