
Ways to find the count of unique items in a list using Python

The other day I was looking for a way to find the count of unique items in a list. I found two solutions and thought they were worth sharing here.

Method 1:

This method works only with Python versions 2.7 and above, because collections.Counter was added in Python 2.7.
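A minimal sketch of Method 1 using collections.Counter (the sample data and variable names are my own, since the original snippet is not included here):

import collections

items = ['a', 'b', 'a', 'c', 'b', 'a']
counts = collections.Counter(items)   # count of each distinct item
print(counts)                         # Counter({'a': 3, 'b': 2, 'c': 1})
print(len(counts))                    # number of unique items: 3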

Method 2:

This is a very simple method using a Python dictionary, and it will work in all versions of Python.
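A minimal sketch of Method 2 using a plain dictionary (again, the data and names are my own):

items = ['a', 'b', 'a', 'c', 'b', 'a']
counts = {}
for item in items:
    counts[item] = counts.get(item, 0) + 1   # increment the count for this item
print(counts)       # {'a': 3, 'b': 2, 'c': 1} (order may vary)
print(len(counts))  # number of unique items: 3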


A simple program to begin with Python Tornado

Tornado is a powerful Python framework for dealing with HTTP requests. It lets us write responses to HTTP requests in a very simple and elegant way: we can write handlers for responding to HTTP requests very easily, it is simple to learn and use, and we can create an excellent application in very few lines of code. Here I am explaining a simple Tornado application. The code is attached below. To run this code, you need Tornado installed on your machine.
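The original attachment is not included here, so the following HelloTornado.py is a sketch reconstructed from the description that follows:

import tornado.httpserver
import tornado.ioloop
import tornado.options
import tornado.web
from tornado.options import define, options

# Command line option for the port, with 8888 as the default value
define("port", default=8888, help="port to run the application on", type=int)

class HelloWorldHandler(tornado.web.RequestHandler):
    def get(self):
        # Printed in the console for every GET request
        print("Someone called me")
        # Sent back as the HTTP response
        self.write("Welcome to Tornado..!!")

if __name__ == "__main__":
    tornado.options.parse_command_line()
    # r"/" routes requests with no path to HelloWorldHandler
    application = tornado.web.Application(handlers=[(r"/", HelloWorldHandler)])
    http_server = tornado.httpserver.HTTPServer(application)
    http_server.listen(options.port)
    tornado.ioloop.IOLoop.current().start()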

At the beginning of the code, you can see a few imports related to Tornado. These libraries are required for our Tornado application. After that you can see a define() call. define is imported from tornado.options. Using this we can take user-defined arguments from the command line. Here we are defining the port on which our Tornado application should run. If the user specifies the port on the command line, that value is used; otherwise the default value is used. Here I gave the default value as 8888.

The next part is a class named HelloWorldHandler, which extends Tornado's RequestHandler class. It is basically a handler, which means it will handle an HTTP request. This class is invoked based on the navigation rules that we define in the Tornado application. In this class there is only one method, get(), so this handler can handle only GET requests. In the get() method, we print the text "Someone called me" and write a response. So whenever this handler is called, the text "Someone called me" is printed in the console and self.write("Welcome to Tornado..!!") sends that string to the HTTP response.

The next part runs the Tornado application. The tornado.web.Application(handlers=[(r"/", HelloWorldHandler)]) call defines when to invoke the handler. The r"/" is a regex, so if the URL comes without any path, the request is routed to the HelloWorldHandler class. In the same way we can have a list of regex and handler class pairs; here we have only one.

Execution.

python HelloTornado.py --port 9090

This will run the application on port 9090. After this, open a web browser and visit http://localhost:9090. You will get the message "Welcome to Tornado..!!" on the screen. For every hit, you will see the message "Someone called me" printed in the console.

python HelloTornado.py

This will run the application on the default port that we specified. I specified 8888, so open the web browser and visit http://localhost:8888.

Note: If you are running the code on a different machine, use the IP address of that machine instead of localhost.

Happy Learning … 🙂

Python code for calculating the difference between two time stamps

I was searching for a way to find the difference between two timestamps. My requirement was to get the difference in terms of years, months, days, hours and minutes. I found a way to get it, and the code below contains the logic to produce the required output. I haven't seen this code anywhere on the internet, which is why I am posting it here in the hope that it will be helpful to someone.
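The original snippet is not included here; as one possible sketch, the version below uses relativedelta from the third-party python-dateutil package, rather than hand-rolled calendar logic, to break the difference into years, months, days, hours and minutes. The example timestamps are my own.

from datetime import datetime
from dateutil.relativedelta import relativedelta   # pip install python-dateutil

t1 = datetime(2013, 5, 10, 8, 30)
t2 = datetime(2015, 9, 23, 14, 45)

diff = relativedelta(t2, t1)   # calendar-aware difference
print("%d years, %d months, %d days, %d hours, %d minutes" % (
    diff.years, diff.months, diff.days, diff.hours, diff.minutes))
# 2 years, 4 months, 13 days, 6 hours, 15 minutes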

How to validate a file in S3

S3 is a storage service provided by Amazon. We can use it as a place to store, back up or archive our data. S3 is storage that is accessible from the public network, so the data reaches S3 through the internet. While transmitting data to S3, one important thing we have to ensure is the correctness of the data, because if the data gets corrupted while transferring, it will be a big problem. We can ensure correctness only by comparing the S3 copy with the master copy. But how do we achieve this?

On a local file system we can compare files by calculating checksums. But how do we do this in S3? Calculating a checksum involves reading the complete file. Do we have a provision to calculate the checksum in S3?

Yes, we do. We don't have to calculate it again; instead we can use one of the properties of an S3 object and compare it with the source file. Every S3 object has a property called ETag. This ETag is a checksum that is calculated while the file is transferred to S3. The tricky part is the way in which the ETag is calculated. It can be calculated in different ways, so the ETag of a file may differ depending upon the way we transfer the file.

The idea is simple. The ETag of a file depends on the chunk size in which the file gets transferred to S3. So to validate a file, we have to find the ETag of the S3 copy and calculate a checksum of the local file using the same logic that S3 used to calculate that ETag. For files uploaded to S3 in the normal way, the ETag calculation is simple: it is equal to the normal MD5 checksum. But if we use multipart upload, the ETag differs. Now the question arises: what is multipart upload?

In order to transfer large files to S3, we divide them into small parts, upload the parts in parallel, and S3 assembles them on its side. If we transmit a single large file directly and some failure happens, the entire file transfer fails and restarting is difficult. But if we divide the large file into smaller chunks and transfer them in parallel, the transmission speed increases and so does the reliability. If the transfer of a chunk fails, we can retry that chunk alone, which improves restartability.
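For reference, a multipart ETag can be reproduced locally if you know the part size: S3 takes the MD5 digest of each part, concatenates those binary digests, takes the MD5 of that, and appends a dash and the part count. Below is a minimal sketch of that logic; the 8 MB default part size is an assumption and must match the part size actually used for the upload.

import hashlib

def multipart_etag(file_path, part_size=8 * 1024 * 1024):
    """Calculate an S3 style multipart ETag for a local file."""
    part_digests = []
    with open(file_path, 'rb') as f:
        while True:
            part = f.read(part_size)
            if not part:
                break
            part_digests.append(hashlib.md5(part).digest())
    # MD5 of the concatenated part digests, plus '-' and the part count
    combined = hashlib.md5(b''.join(part_digests)).hexdigest()
    return '%s-%d' % (combined, len(part_digests))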

Here I am giving an example of checking the ETag of a file and comparing it with the normal MD5 checksum of the file.

Suppose I have an S3 bucket with the name checksum-test, and inside it a file with the name sample.txt, which is 100 MB in size, at the location file/sample.txt.

Then the bucket name is checksum-test and the full key name is file/sample.txt.
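A sketch of that comparison using boto3 (the local path is an assumption; the bucket and key come from the example above):

import hashlib
import boto3

bucket_name = 'checksum-test'
key_name = 'file/sample.txt'
local_path = '/tmp/sample.txt'   # assumed local copy of the file

s3 = boto3.client('s3')

# The ETag is returned wrapped in double quotes, so strip them
etag = s3.head_object(Bucket=bucket_name, Key=key_name)['ETag'].strip('"')

# Plain MD5 checksum of the local file, read in binary mode
with open(local_path, 'rb') as f:
    local_md5 = hashlib.md5(f.read()).hexdigest()

if '-' in etag:
    # A dash means the object was uploaded using multipart upload,
    # so a plain MD5 comparison will not match
    print('Multipart upload detected, ETag is %s' % etag)
elif etag == local_md5:
    print('Checksums match, the file is intact')
else:
    print('Checksum mismatch!')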

Python code to find the md5 checksum of a file

Checksum calculation is an unavoidable and very important step wherever we transfer files or data. The simplest way to check whether a file reached the destination properly is to compare the checksums of the source and target files. A checksum can be calculated in several ways. One is to calculate the checksum treating the entire file as a single block. Another is multipart checksum calculation, where we calculate the checksums of multiple small chunks of the file and finally calculate an aggregated checksum.
Here I am explaining the simplest way of calculating the checksum of a file, using the hashlib library in Python.
Suppose I have a zip file located at /home/coder/data.zip. The checksum of the file can be calculated as follows.

import hashlib
file_name = '/home/coder/data.zip'
# Read the file in binary mode so the checksum matches tools like md5sum
checksum = hashlib.md5(open(file_name, 'rb').read()).hexdigest()
print(checksum)

One common mistake I have seen among people is passing the file name directly without opening the file:

Eg: hashlib.md5(file_name).hexdigest()

This will also return a checksum, but it will be the checksum of the file name string, not a checksum based on the contents of the file. So always calculate the checksum as follows:

hashlib.md5(open(file_name, 'rb').read()).hexdigest()

This will return the checksum of the file's contents.

In Linux, you can also calculate the MD5 checksum using the md5sum command line utility.

> md5sum file_name

Python program to list all Redshift clusters across all regions

Here is a Python program to list the details of all the Redshift clusters across all regions.
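The original program is not included here; the sketch below uses boto3 to query every region and print the basic details of each cluster (the fields chosen are my own assumptions):

import boto3

# All regions visible to the account
regions = [r['RegionName'] for r in boto3.client('ec2').describe_regions()['Regions']]

for region in regions:
    redshift = boto3.client('redshift', region_name=region)
    try:
        clusters = redshift.describe_clusters()['Clusters']
    except Exception as error:
        print('Could not query region %s: %s' % (region, error))
        continue
    for cluster in clusters:
        print('%s  %s  %s  %s  %d nodes' % (
            region,
            cluster['ClusterIdentifier'],
            cluster['NodeType'],
            cluster['ClusterStatus'],
            cluster['NumberOfNodes']))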