Recently I had to query a REST API that returns a response of around 4-5 GB, and upload the result to AWS S3. The main challenge was fitting this into an AWS Lambda function: Lambda has strict resource limits, and if the program exceeds them, it fails.

[Diagram: http_to_s3]

Initially I tried querying the REST API, collecting the complete response, and uploading it to S3 in one go. This approach failed because the data was very large and AWS Lambda has a maximum memory limit of 3 GB: the full response was loaded into memory, exceeded the limit, and crashed the function.

There are multiple ways to solve this problem; this is one of them. The idea is to read the response as a stream and upload it to S3 as a multipart upload. The program never saves the file locally or buffers it completely; it relays the data directly to S3. This works well for large payloads and is suitable for AWS Lambda functions or any environment with very limited memory and CPU.

The program uses the Python boto3 package and uploads the data in multiple parts, so it never holds a large amount of data in memory. It acts as a relay between the source and S3.
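To make "multipart" concrete: under the hood, boto3 splits the stream into fixed-size parts and uploads each part separately, so only one part ever sits in memory at a time. Here is a minimal sketch of that chunking step (`iter_parts` and `PART_SIZE` are my own illustrative names, not part of the boto3 API):

```python
import io

# boto3's default multipart_chunksize is 8 MB; each part is read,
# uploaded, and discarded before the next one is read.
PART_SIZE = 8 * 1024 * 1024

def iter_parts(stream, part_size=PART_SIZE):
    """Yield successive chunks of at most part_size bytes from a stream."""
    while True:
        chunk = stream.read(part_size)
        if not chunk:  # empty read means end of stream
            break
        yield chunk

# With a small part size for illustration: a 25-byte stream
# becomes three parts of 10, 10, and 5 bytes.
parts = list(iter_parts(io.BytesIO(b"x" * 25), part_size=10))
```

boto3's `upload_fileobj` does this splitting (plus concurrent part uploads) for you, which is why the program below can hand it the raw response stream directly.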

The program is very simple, and the code is shared below. I hope it helps someone.

import boto3
from boto3.s3.transfer import TransferConfig
import requests

authentication = {"USER": "", "PASSWORD": ""}
payload = {"query": "some query"}

session = requests.Session()
# stream=True keeps the response body on the socket instead of
# loading it all into memory
response = session.post(
    "URL",
    data=payload,
    auth=(authentication["USER"], authentication["PASSWORD"]),
    stream=True,
)

s3_bucket = "bucket_name"
s3_file_path = "path_in_s3"
s3 = boto3.client("s3")

with response as part:
    # Transparently decompress gzip/deflate content while reading the raw stream
    part.raw.decode_content = True
    # Switch to multipart upload once 10,000 bytes have been read,
    # uploading up to 4 parts concurrently
    conf = TransferConfig(multipart_threshold=10000, max_concurrency=4)
    s3.upload_fileobj(part.raw, s3_bucket, s3_file_path, Config=conf)
