S3 is a popular object storage service from Amazon Web Services, and a common requirement is to download files from an S3 bucket into Azure Databricks. You can mount the object storage to the Databricks workspace, but in this example I show how to recursively download and sync the files from a folder within an AWS S3 bucket to DBFS. The program overwrites any files that already exist at the destination.
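
For reference, the mount approach would look roughly like the sketch below. This is only a minimal illustration, assuming access-key authentication; the bucket name, keys, and mount point are placeholders, and the secret key has to be URL-encoded because it is embedded in the source URI.

from urllib.parse import quote

aws_access_key = ""            # placeholder
aws_secret_key = ""            # placeholder
bucket_name = ""               # placeholder
mount_point = "/mnt/s3-data"   # any unused mount point

# URL-encode the secret key so special characters do not break the URI
encoded_secret = quote(aws_secret_key, safe="")

# Mount the bucket; its contents then appear under the mount point in DBFS
dbutils.fs.mount(f"s3a://{aws_access_key}:{encoded_secret}@{bucket_name}", mount_point)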

The program is given below. It uses the boto3 Python package to interact with AWS S3.
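
If boto3 is not already available on the cluster, it can be installed for the notebook session first (this assumes the code is run from a Databricks notebook):

%pip install boto3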

import os
import boto3

aws_access_key = ""
aws_secret_key = ""
bucket_name = ""
region_name = ""
# Path of the files in the S3 bucket
prefix_path = ""
# Path to be written in DBFS
destination_path_prefix = ""

# Create the S3 client with the AWS credentials
s3_client = boto3.client(
    's3',
    aws_access_key_id=aws_access_key,
    aws_secret_access_key=aws_secret_key,
    region_name=region_name
)

# List the objects under the given prefix
response = s3_client.list_objects_v2(
    Bucket=bucket_name,
    Prefix=prefix_path
)
print(response)

for obj in response['Contents']:
    key = obj['Key']
    print(key)
    # Skip "folder" placeholder keys that end with a slash
    if key.endswith('/'):
        continue
    # Download the object to the local file system of the driver node
    local_path = f"/tmp/{os.path.basename(key)}"
    s3_client.download_file(bucket_name, key, local_path)
    # Move the downloaded file into DBFS, overwriting any existing file
    destination_path = f"{destination_path_prefix}{os.path.basename(key)}"
    move_status = dbutils.fs.mv(f"file://{local_path}", destination_path, True)
    print("Source --> %s, Destination --> %s, Status=%s" % (key, destination_path, move_status))

I hope this example is useful. Feel free to comment below if you have any feedback or questions.