Delta Lake is open source under the Apache 2.0 license, so it is free to use. It is supported in the recent versions of Apache Spark (2.4.2 and above). It is easy to set up and does not require any admin skills to configure. In Databricks, Delta Lake is available by default, so no installation or configuration is needed there.
To try out a basic example, launch pyspark or spark-shell with the Delta package added. No additional installation is needed. Just use one of the following commands.
For pyspark:
pyspark --packages io.delta:delta-core_2.11:0.4.0
For spark-shell:
bin/spark-shell --packages io.delta:delta-core_2.11:0.4.0
Either of the above commands adds the Delta package to the session and enables Delta Lake.
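If you prefer a standalone script over an interactive shell, the same package can be resolved through the session configuration instead of a shell flag. Here is a minimal sketch, assuming the script is launched with spark-submit or a pip-installed pyspark; the app name and output path are hypothetical:

from pyspark.sql import SparkSession

# Build a session that pulls in the Delta package at startup
spark = (SparkSession.builder
         .appName("delta-quickstart")  # hypothetical app name
         .config("spark.jars.packages", "io.delta:delta-core_2.11:0.4.0")
         .getOrCreate())

# Quick smoke test: write a tiny Delta table (hypothetical path)
spark.range(0, 5).write.format("delta").save("/tmp/delta-smoke-test")

With the package loaded, you can try out the following basic example in the pyspark shell.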
# Create a temporary dataset
data = spark.range(0, 50)
data.write.format("delta").save("/tmp/myfirst-delta-table")

# Read the data back
df = spark.read.format("delta").load("/tmp/myfirst-delta-table")
df.show()

# Overwrite the dataset with new values (this creates a new table version)
data = spark.range(51, 100)
data.write.format("delta").mode("overwrite").save("/tmp/myfirst-delta-table")

# Read the current data
df = spark.read.format("delta").load("/tmp/myfirst-delta-table")
df.show()

# Time travel: read the older version of the data
df = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/myfirst-delta-table")
df.show()
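Beyond plain reads and writes, Delta Lake 0.4.0 also added Python APIs for DML and utility operations on tables. A short sketch of how they can be used on the table created above, assuming the delta.tables module shipped with the 0.4.0 package; the update and delete conditions here are arbitrary, chosen only for illustration:

from delta.tables import DeltaTable

# Load the table we wrote above as a DeltaTable
deltaTable = DeltaTable.forPath(spark, "/tmp/myfirst-delta-table")

# Conditional update: add 100 to every even id (illustrative condition)
deltaTable.update(condition = "id % 2 == 0", set = {"id": "id + 100"})

# Conditional delete: drop the small ids (illustrative condition)
deltaTable.delete(condition = "id < 60")

# View the current contents as a DataFrame
deltaTable.toDF().show()

# Each write, update, and delete becomes a new version in the transaction log
deltaTable.history().show()

Every operation above is recorded as a new version of the table, which is what makes the versionAsOf time-travel read shown earlier possible.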