Common dependencies to install PyCrypto package in CentOS/RHEL

The installation of pycrypto package may fail with errors like

“error: no acceptable C compiler found in $PATH”

“RuntimeError: autoconf error”

“fatal error: Python.h: No such file or directory”

“#include "Python.h"
compilation terminated.
error: command 'gcc' failed with exit status 1”

The solution for this issue is to install the following dependent packages.

yum install gcc

yum install gcc-c++

yum install python-devel

pip install pycrypto

Python code to list all the running EC2 instances across all regions in an AWS account

This code snippet lists all the running EC2 instances across all regions in an AWS account. I used the Python boto3 package to develop it. The code picks up the EC2 regions dynamically, so it keeps working without any modification even when a new region is added to AWS.

Note: Only the basic API calls needed to list the instance details are used in this program. Proper coding conventions are not followed. 🙂
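A minimal boto3 sketch of the approach described above. The function name list_running_instances and the client_factory parameter are hypothetical, introduced only for illustration. It assumes AWS credentials are already configured and uses us-east-1 only as the place to ask for the region list.

```python
def list_running_instances(client_factory=None):
    """Return (region, instance_id, instance_type) for every running instance."""
    if client_factory is None:
        # Imported lazily so the function can also be exercised with a stub.
        import boto3
        client_factory = boto3.client

    running = []
    # Ask EC2 for the regions available to this account instead of hard-coding them.
    regions = client_factory("ec2", region_name="us-east-1").describe_regions()["Regions"]
    for region in regions:
        name = region["RegionName"]
        ec2 = client_factory("ec2", region_name=name)
        # The paginator follows NextToken, so large accounts are listed completely.
        paginator = ec2.get_paginator("describe_instances")
        pages = paginator.paginate(
            Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
        )
        for page in pages:
            for reservation in page["Reservations"]:
                for instance in reservation["Instances"]:
                    running.append(
                        (name, instance["InstanceId"], instance["InstanceType"])
                    )
    return running
```

Calling list_running_instances() with no arguments uses boto3 directly; printing each returned tuple gives one line per running instance.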

Changing the python version in pyspark

pyspark picks one of the Python versions installed on the machine. In my case, I have Python 3, 2.7 and 2.6 installed, and pyspark was picking Python 3 by default. To change the Python version used by pyspark, set the following environment variable and then run pyspark.

export PYSPARK_PYTHON=python2.6

Similarly, we can configure any version of Python with pyspark. Ensure that python2.6, or whichever version you specify, is actually available on the machine.


Programmatic way to reboot EC2 instances

Sometimes we might have to reboot EC2 instances. If the requirement is to restart EC2 instances regularly, we can achieve it with a small piece of code. I came across a similar requirement, and a portion of the code I used is given below.
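A minimal sketch of such a reboot, assuming the boto3 library and already configured AWS credentials. The function name reboot_instances and its ec2_client parameter are hypothetical; the instance IDs and region are supplied by the caller.

```python
def reboot_instances(instance_ids, region_name, ec2_client=None):
    """Request a reboot of the given EC2 instances and return the IDs sent."""
    if ec2_client is None:
        # Imported lazily so the function can also be exercised with a stub.
        import boto3
        ec2_client = boto3.client("ec2", region_name=region_name)

    ids = list(instance_ids)
    if not ids:
        # Nothing to do; avoid sending an empty request.
        return []

    # The call only queues the reboot; the instances restart asynchronously.
    ec2_client.reboot_instances(InstanceIds=ids)
    return ids
```

To run this regularly, a script that calls the function could be scheduled with cron or a similar scheduler.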



How to hide or obfuscate Python source code?

Sometimes we may be required to ship applications without source code. In Java this is very easy and is widely done. What do we do if we want to hide our source code in Python?

I looked at several solutions for obfuscating the source code. One is pyminifier. This is a good tool. It renames the methods and variables so that the obfuscated code looks more complicated. But anyone who spends some time on it can still read it.

Another way to hide the source code is to use the compiler built into Python itself. This generates bytecode, which we can use for execution. Note that bytecode can still be decompiled, so this hides the source text rather than protecting it fully.

python -OO -m py_compile  <your>

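This will generate a .pyo file. Rename the .pyo file to a .py extension and you can use it for execution. It will work just like the actual code. (This applies to Python 2; Python 3.5 and later no longer produce .pyo files and instead write the optimized bytecode as a .pyc file under __pycache__.)

NB: If your program imports modules obfuscated like this, then you have to rename them with a .pyc suffix instead.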
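The same compile step can also be driven from Python code. This is a small sketch assuming Python 3, where py_compile.compile accepts an optimize argument and lets us choose the output file name directly; the function name compile_to_bytecode is hypothetical.

```python
import py_compile


def compile_to_bytecode(source_path, output_path):
    """Compile source_path to optimized bytecode written at output_path."""
    # optimize=2 corresponds to the -OO switch: asserts and docstrings are removed.
    # doraise=True raises py_compile.PyCompileError instead of printing the error.
    return py_compile.compile(
        source_path, cfile=output_path, doraise=True, optimize=2
    )
```

For example, compile_to_bytecode("hello.py", "hello.pyc") writes hello.pyc next to the source, assuming a file named hello.py exists.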


Programmatic Data Upload to Amazon S3

S3 (Simple Storage Service) is a service provided by Amazon for storing data. It is a very useful service at a low price. Data can be uploaded to and downloaded from S3 very easily using tools as well as programs. Here I explain a sample program for uploading a file to S3 using Python.

Files can be uploaded to S3 in two ways: normal upload and multipart upload. Normal upload sends the file serially and is not suitable for large files, because it takes more time. For large files, multipart upload is the best option. It divides the file into chunks, sends them in parallel and assembles them in S3.

This program uses the normal approach for sending files to S3. Here I used the boto library for uploading the files.
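A minimal sketch of a normal (single request) upload. Note that it uses the newer boto3 library rather than boto, and that the function name upload_file_to_s3 and its s3_client parameter are hypothetical. It assumes AWS credentials are configured and that the bucket already exists.

```python
def upload_file_to_s3(local_path, bucket, key, s3_client=None):
    """Upload one local file to S3 in a single PUT and return its ETag."""
    if s3_client is None:
        # Imported lazily so the function can also be exercised with a stub.
        import boto3
        s3_client = boto3.client("s3")

    # put_object sends the whole body in one request (no multipart).
    with open(local_path, "rb") as body:
        response = s3_client.put_object(Bucket=bucket, Key=key, Body=body)
    return response.get("ETag")
```

The bucket and key are whatever the caller chooses; for large files a multipart upload would be the better fit, as explained above.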


Ways to find out the count of unique items in a list using python

Yesterday I was looking for a way to find the count of unique items in a list. I found two solutions and thought they were worth sharing here.

Method 1:

This method works only with Python 2.7 and above, because the Counter class in the collections module was added in Python 2.7.
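A short sketch of this approach, assuming it is based on collections.Counter; the sample list is made up for illustration.

```python
from collections import Counter

items = ["apple", "banana", "apple", "cherry", "banana", "apple"]

# Counter maps each distinct item to the number of times it occurs.
counts = Counter(items)

print(counts["apple"])        # 3
print(len(counts))            # 3 distinct items
print(counts.most_common(1))  # [('apple', 3)]
```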

Method 2:

This is a very simple method that uses a plain Python dictionary. It works in all versions of Python.
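A sketch of the dictionary approach; the function name count_items and the sample list are made up for illustration.

```python
def count_items(items):
    """Return a dictionary mapping each distinct item to its count."""
    counts = {}
    for item in items:
        # get() returns 0 the first time an item is seen.
        counts[item] = counts.get(item, 0) + 1
    return counts


result = count_items(["apple", "banana", "apple", "cherry", "banana", "apple"])
print(result["apple"])  # 3
print(len(result))      # 3 distinct items
```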


A simple program to begin with python tornado

Python tornado is a powerful framework for dealing with HTTP requests. It lets us write handlers that respond to HTTP requests in a simple and elegant way. It is easy to learn and use, and we can create an excellent application in very few lines of code. Here I explain a simple tornado application. The code is attached below. To run this code, you need tornado installed on your machine.

At the beginning of the code, you can see a few imports related to tornado. These libraries are required for our tornado application. After that you can see a call to define. This define function is imported from the options module in tornado. Using it we can read user-defined arguments from the command line. Here we define the port on which our tornado application should run. If the user specifies the port on the command line, the application uses that value; otherwise it uses the default value. Here I set the default to 8888.

The next part is a class named HelloWorldHandler, which extends the tornado RequestHandler class. It is a handler, which means it handles an HTTP request. This class is invoked according to the routing rules that we define in the tornado application. The class has only one method, get(), so this handler can handle only GET requests. In the get method, we just print the text “Someone called me” and write a response. So whenever this handler is invoked, the text “Someone called me” is printed in the console and self.write("Welcome to Tornado..!!") sends that string in the HTTP response.

The next part runs the tornado application. The expression tornado.web.Application(handlers=[(r"/", HelloWorldHandler)]) defines when to invoke the handler. The r"/" is a regex, so if the URL contains only the root path, the request is routed to the HelloWorldHandler class. In the same way we can have a list of regex and handler class pairs. Here we have only one.


python --port 9090

This will run the application on port 9090. After this, open the web browser and check http://localhost:9090. You will get the message “Welcome to Tornado..!!” on the screen. For every hit, you can see the message “Someone called me” printed in the console.


This will run the application on the default port that we specified. I specified 8888, so open the web browser and check http://localhost:8888.

Note: If you are running the code on a different machine, use the IP address of that machine instead of localhost.

Happy Learning … 🙂


Python code for calculating the difference between two time stamps

I was searching for a way to find the difference between two timestamps. My requirement was to get the difference in terms of years, months, days, hours and minutes. I found a way to get it. The code below contains the logic to get the required output. I haven’t seen this code anywhere on the internet, which is why I am posting it here in the hope that it will be helpful for someone.
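One possible sketch of such a calculation using only the standard library. The function names time_difference and _add_months are hypothetical, and the rule for month ends (clamping to the last day of a shorter month) is an assumption of this sketch.

```python
import calendar
from datetime import datetime


def _add_months(dt, months):
    """Return dt shifted by whole months, clamping the day to the month length."""
    month_index = dt.month - 1 + months
    year = dt.year + month_index // 12
    month = month_index % 12 + 1
    day = min(dt.day, calendar.monthrange(year, month)[1])
    return dt.replace(year=year, month=month, day=day)


def time_difference(start, end):
    """Break the gap between two datetimes into years, months, days, hours, minutes."""
    if end < start:
        start, end = end, start
    # Count whole calendar months first, stepping back one if we overshoot.
    total_months = (end.year - start.year) * 12 + (end.month - start.month)
    if _add_months(start, total_months) > end:
        total_months -= 1
    # Whatever is left after the whole months is a plain timedelta.
    remainder = end - _add_months(start, total_months)
    years, months = divmod(total_months, 12)
    hours, seconds = divmod(remainder.seconds, 3600)
    return {
        "years": years,
        "months": months,
        "days": remainder.days,
        "hours": hours,
        "minutes": seconds // 60,
    }


fmt = "%Y-%m-%d %H:%M"
diff = time_difference(
    datetime.strptime("2012-01-15 10:30", fmt),
    datetime.strptime("2014-03-20 12:45", fmt),
)
print(diff)  # {'years': 2, 'months': 2, 'days': 5, 'hours': 2, 'minutes': 15}
```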


How to validate a file in S3

S3 is a storage service provided by Amazon. We can use it as a place to store, back up or archive our data. S3 is accessible from the public network, so the data reaches S3 through the internet. While transmitting data to S3, one important thing we have to ensure is the correctness of the data, because if the data gets corrupted in transit, it will be a big problem. This is possible only by comparing the S3 copy with the master copy. But how do we achieve this?

In a local file system we can compare files by calculating a checksum. But how do we do this in S3? Calculating a checksum involves reading the complete file. Do we have a provision to calculate the checksum in S3?

Yes, we have. We don’t have to calculate it again; we can use one of the properties of an S3 file to compare it with the source file. Every S3 file has a property called ETag. This ETag is a checksum that is calculated while the file is transferred to S3. The tricky part is the way in which the ETag is calculated. It can be calculated in different ways, so the ETag of a file may differ depending on the way we transfer the file.

The idea is simple. The ETag of a file depends on the chunk size in which the file is transferred to S3. So to validate a file, we have to find the ETag of the S3 file and calculate a checksum of the local file using the same logic that was used to calculate the ETag of that file in S3. The ETag calculation for files uploaded to S3 in the normal way is simple: it is equal to the normal MD5 checksum. But if we use multipart upload, the ETag differs. Now the question arises: what is multipart upload?

In order to transfer a large file to S3, we divide it into small parts, upload the parts in parallel and assemble them on the S3 side. If we transmit a single large file directly and some failure happens, the entire file transfer fails and restarting is also difficult. But if we divide the large file into smaller chunks and transfer them in parallel, both the transmission speed and the reliability increase. If the transfer of a chunk fails, we can retry that chunk alone, which improves restartability.
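The two local checksum calculations can be sketched as follows. The function names md5_etag and multipart_etag are hypothetical. The multipart formula used here (the MD5 of the concatenated per-part MD5 digests, followed by a dash and the number of parts) reflects commonly observed S3 behaviour and is an assumption of this sketch; part_size must match the chunk size used for the upload.

```python
import hashlib


def md5_etag(path, block_size=1024 * 1024):
    """Plain MD5 hex digest, expected to match the ETag of a normal upload."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in blocks so large files do not have to fit in memory.
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def multipart_etag(path, part_size):
    """Checksum in the assumed multipart style: md5(part digests) + '-' + part count."""
    part_digests = []
    with open(path, "rb") as handle:
        # Each read returns one part of the size used during the upload.
        for part in iter(lambda: handle.read(part_size), b""):
            part_digests.append(hashlib.md5(part).digest())
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return "{}-{}".format(combined, len(part_digests))
```

The value returned by either function can then be compared with the ETag reported by S3 after stripping the surrounding double quotes from the ETag string.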

Here I am giving an example of checking the Etag of a file and comparing it with the normal md5 checksum of the file.

Suppose I have an S3 bucket named checksum-test, and inside it a 100 MB file named sample.txt at the location file/sample.txt.

Then the bucket name is checksum-test
full key name will be file/sample.txt