Azure Data Lake Storage is a scalable file system from Microsoft for storing large volumes of data, which makes it well suited for enterprise data lakes. This file system is very popular nowadays because of the huge Azure adoption happening across enterprises.

The ABFS connector and the Hadoop Azure Data Lake connector modules provide support for integration with Azure Data Lake Storage.
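
For reference, the ABFS connector addresses storage through URIs of the following form, where the container name, storage account name, and path are placeholders to be filled in with your own values:

abfs://<container>@<account-name>.dfs.core.windows.net/<path>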

These connectors are already present in HDInsight, the Hadoop distribution provided by Azure. So Azure HDInsight users do not have to make any changes in their system to interact with Azure Data Lake Storage (ADLS Gen2).
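
On clusters outside HDInsight, the ABFS connector also needs the storage account credentials before any file system call will succeed. A minimal sketch using shared-key authentication is given below; it assumes a SparkSession named spark already exists (created the same way as in the sample program further down), and ACCOUNTNAME and the account key are placeholders.

# A minimal sketch, assuming a SparkSession named 'spark' already exists.
# Shared-key authentication for the ABFS connector. Replace ACCOUNTNAME
# and <your-account-key> with your own values.
spark.sparkContext._jsc.hadoopConfiguration().set(
    'fs.azure.account.key.ACCOUNTNAME.dfs.core.windows.net',
    '<your-account-key>'
)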

For more details, refer to the Apache Hadoop website.

A sample PySpark program that interacts with Azure Data Lake Storage is given below. Here I am demonstrating delete and existence-check operations.

from pyspark.sql import SparkSession
# Author: Amal G Jose
# Reference: https://amalgjose.com
# prepare spark session
spark = SparkSession.builder.appName('filesystemoperations').getOrCreate()
# spark context
sc = spark.sparkContext
# set ADLS file system URI
sc._jsc.hadoopConfiguration().set('fs.defaultFS', 'abfs://CONTAINER@ACCOUNTNAME.dfs.core.windows.net/')
# FileSystem manager
fs = (sc._jvm.org
.apache.hadoop
.fs.FileSystem
.get(sc._jsc.hadoopConfiguration())
)
# Enter the ADLS path
path = "Your/adls/path"
# Delete the file or directory in ADLS. The second argument (True) enables recursive deletion
deletion_status = fs.delete(sc._jvm.org.apache.hadoop.fs.Path(path), True)
print("Deletion status –>", deletion_status)
# check whether the file or directory got deleted. This will return True if exists and False if does not
status = fs.exists(sc._jvm.org.apache.hadoop.fs.Path(path))
print("Status –>", status)