S3 is a storage service provided by Amazon. We can use it to store, back up, or archive our data. S3 is accessed over the public network, so our data reaches S3 through the internet. While transferring data to S3, one important thing we have to ensure is the correctness of the data, because a file that gets corrupted in transit can be a big problem. The way to check this is to compare the S3 copy with the master copy. But how do we achieve this?
On a local file system we can compare files by calculating their checksums. But how do we do this in S3?
Calculating a checksum involves reading the complete file. Do we have a provision to calculate the checksum in S3?
Yes, we have. We don't have to calculate it again; instead we can use one of the properties of an S3 file to compare it with the source file. Every S3 file has a property called ETag. This ETag is a checksum that is calculated while the file is transferred to S3. The tricky part is the way in which the ETag is calculated. It can be calculated in different ways, so the ETag of a file may differ depending on how we transfer the file.
The idea is simple. The ETag of a file depends on the chunk size in which the file was transferred to S3. So to validate a file, we have to get the ETag of the S3 file and calculate a checksum of the local file using the same logic that was used to calculate the ETag in S3. For files uploaded to S3 in the normal way (as a single request), the ETag is simply equal to the normal MD5 checksum of the file. But if we use multipart upload, the ETag differs. Now the question arises: what is multipart upload?
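To make the chunk-size dependence concrete, here is a minimal sketch of how a multipart-style ETag can be reproduced locally. It assumes the commonly observed scheme, in which the ETag is the MD5 of the concatenated binary MD5 digests of the parts, followed by a hyphen and the number of parts. The function name `multipart_etag` and the 8 MB default part size are my assumptions for illustration; the part size has to match whatever the upload tool actually used, otherwise the result will not match.

```python
import hashlib


def multipart_etag(file_name, part_size=8 * 1024 * 1024):
    """Return an ETag-style string for a local file, assuming the scheme
    md5(md5(part 1) + md5(part 2) + ...) followed by "-<number of parts>"."""
    part_digests = []
    with open(file_name, "rb") as f:
        while True:
            part = f.read(part_size)
            if not part:
                break
            # Keep the raw 16-byte digest of each part, not the hex string
            part_digests.append(hashlib.md5(part).digest())
    if not part_digests:
        # Assumption: treat an empty file as a plain (non-multipart) upload
        return hashlib.md5(b"").hexdigest()
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return "%s-%d" % (combined, len(part_digests))
```

Calling `multipart_etag("sample.txt", part_size=4)` and then with a different `part_size` on the same file gives different strings, which is exactly why the chunk size matters when validating.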
In order to transfer a large file to S3, we can divide it into small parts, upload the parts in parallel, and have them assembled on the S3 side. If we transmit a single large file directly and some failure happens, the entire file transfer fails, and restarting it is also difficult. But if we divide the large file into smaller chunks and transfer them in parallel, both the transmission speed and the reliability increase. If the transfer of a chunk fails, we can retry that chunk alone, which improves restartability.
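The retry-per-chunk idea can be sketched without touching S3 at all. In the sketch below, `upload_in_parts` and its `send_part` argument are hypothetical names: `send_part(part_number, data)` stands in for whatever call really uploads one part and is expected to raise an exception on failure. For simplicity the parts are sent one after another here rather than in parallel.

```python
def upload_in_parts(file_name, send_part, part_size=5 * 1024 * 1024, max_retries=3):
    """Read a file in fixed-size parts and pass each part to send_part.

    send_part(part_number, data) is a hypothetical stand-in for the real
    upload call. A failed part is retried on its own, up to max_retries
    attempts, without re-sending the parts that already succeeded.
    Returns the number of parts sent.
    """
    part_number = 0
    with open(file_name, "rb") as f:
        while True:
            data = f.read(part_size)
            if not data:
                break
            part_number += 1
            for attempt in range(1, max_retries + 1):
                try:
                    send_part(part_number, data)
                    break  # this part succeeded, move on to the next one
                except Exception:
                    if attempt == max_retries:
                        raise  # give up after the last allowed attempt
    return part_number
```

Because only the failing part is retried, a failure late in a large file costs one chunk of re-transfer instead of the whole file.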
Here I am giving an example that fetches the ETag of a file in S3 and calculates the normal MD5 checksum of a local file, so that the two can be compared. The class name S3ChecksumValidator is only a placeholder.

import hashlib

import boto


class S3ChecksumValidator(object):  # placeholder class name
    def __init__(self):
        self.aws_access_key = "XXXXXXXXX"
        self.aws_secret_key = "XXXXXXXXX"
        self.s3_bucket = "checksum-test"
        self.s3_conn = boto.connect_s3(aws_access_key_id=self.aws_access_key,
                                       aws_secret_access_key=self.aws_secret_key)

    # Function to calculate the checksum of a local file
    def find_checksum(self, file_name):
        try:
            # Read in binary mode so the digest covers the exact bytes on disk
            with open(file_name, "rb") as f:
                checksum = hashlib.md5(f.read()).hexdigest()
            return checksum
        except Exception as e:
            print("Exception occurred while calculating checksum : " + str(e))

    # Function to get the ETag of a file in S3
    def find_etag(self, full_key_name):
        try:
            bucket = self.s3_conn.get_bucket(self.s3_bucket)
            # get_key fetches the existing object's metadata (new_key would not)
            key = bucket.get_key(full_key_name)
            s3_etag = key.etag.strip('"').strip("'")
            return s3_etag
        except Exception as e:
            print("Exception occurred while calculating S3 Etag : " + str(e))
Suppose I have an S3 bucket named checksum-test, and inside it a file named sample.txt, 100 MB in size, at the location file/sample.txt.
Then the bucket name is checksum-test.
The full key name is file/sample.txt.
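Once the ETag for that key has been fetched, the comparison with the local copy can be sketched as below. The helper name `etag_matches_md5` is an assumption of mine. It compares against the plain MD5 only, so when the ETag carries a hyphen and a part count (a sign of multipart upload, which is quite possible for a 100 MB file depending on the tool used) it returns None instead of a misleading False.

```python
import hashlib


def etag_matches_md5(file_name, s3_etag):
    """Compare the MD5 of a local file with an S3 ETag string.

    Returns True or False for a plain ETag, and None when the ETag has a
    "-<parts>" suffix, since a plain MD5 cannot be compared with it.
    """
    etag = s3_etag.strip('"').strip("'")
    if "-" in etag:
        return None
    md5 = hashlib.md5()
    with open(file_name, "rb") as f:
        # Read in 1 MB blocks so a large file is not loaded into memory at once
        for block in iter(lambda: f.read(1024 * 1024), b""):
            md5.update(block)
    return md5.hexdigest() == etag.lower()
```

For the example above, the local path of sample.txt and the ETag string of file/sample.txt would be the two arguments.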