S3 is a popular object storage service from Amazon Web Services, and a common requirement is to download files from an S3 bucket into Azure Databricks. You can mount the object storage to the Databricks workspace, but in this example I show how to recursively download and sync the files from a folder within an AWS S3 bucket to DBFS. The program overwrites any files that already exist at the destination.
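
For reference, the mount approach would look roughly like the sketch below. This is only a minimal illustration, assuming access-key authentication; the bucket name, keys, and mount point are placeholders, and the secret key has to be URL-encoded because it is embedded in the source URI.

from urllib.parse import quote

aws_access_key = ""            # placeholder
aws_secret_key = ""            # placeholder
bucket_name = ""               # placeholder
mount_point = "/mnt/s3-data"   # any unused mount point

# URL-encode the secret key so special characters do not break the URI
encoded_secret = quote(aws_secret_key, safe="")

# Mount the bucket; its contents then appear under the mount point in DBFS
dbutils.fs.mount(f"s3a://{aws_access_key}:{encoded_secret}@{bucket_name}", mount_point)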

The program is given below. It uses the boto3 Python package to interact with AWS S3.
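
If boto3 is not already available on the cluster, it can be installed for the notebook session first (this assumes the code is run from a Databricks notebook):

%pip install boto3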

import os
import boto3

aws_access_key = ""
aws_secret_key = ""
bucket_name = ""
region_name = ""
# Path of the files in the S3 bucket
prefix_path = ""
# Path to be written in DBFS
destination_path_prefix = ""

# Create the S3 client with the AWS credentials
s3_client = boto3.client(
    's3',
    aws_access_key_id=aws_access_key,
    aws_secret_access_key=aws_secret_key,
    region_name=region_name
)

# List the objects under the given prefix
response = s3_client.list_objects_v2(
    Bucket=bucket_name,
    Prefix=prefix_path
)
print(response)

for obj in response['Contents']:
    key = obj['Key']
    print(key)
    # Skip "folder" placeholder keys that end with a slash
    if key.endswith('/'):
        continue
    # Download the object to the local file system of the driver node
    local_path = f"/tmp/{os.path.basename(key)}"
    s3_client.download_file(bucket_name, key, local_path)
    # Move the downloaded file into DBFS, overwriting any existing file
    destination_path = f"{destination_path_prefix}{os.path.basename(key)}"
    move_status = dbutils.fs.mv(f"file://{local_path}", destination_path, True)
    print("Source --> %s, Destination --> %s, Status=%s" % (key, destination_path, move_status))

I hope this example is useful. Feel free to comment below if you have any feedback or questions.