Python code to list all the running EC2 instances across all regions in an AWS account

This code snippet lists all running EC2 instances across all regions in an AWS account. I have used the python boto3 package for developing the code. The code dynamically picks up all the available AWS EC2 regions, so it will keep working without any modification even if a new region gets added to AWS.

Note: This program contains only the basic API calls needed to list the instance details. Proper coding conventions are not followed. 🙂
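The original snippet is not embedded here, so the following is a minimal sketch of such a script using boto3. It assumes your AWS credentials are already configured (for example in ~/.aws/credentials).

import boto3

# Any region works for the describe_regions call; us-east-1 is an assumption
ec2_client = boto3.client('ec2', region_name='us-east-1')

# Dynamically fetch every available EC2 region
regions = [r['RegionName'] for r in ec2_client.describe_regions()['Regions']]

for region in regions:
    ec2 = boto3.resource('ec2', region_name=region)
    # Keep only instances that are currently in the 'running' state
    running = ec2.instances.filter(
        Filters=[{'Name': 'instance-state-name', 'Values': ['running']}])
    for instance in running:
        print("%s : %s (%s)" % (region, instance.id, instance.instance_type))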

Changing the python version in pyspark

pyspark will pick one version of python from the multiple versions installed on the machine. In my case, I have python 3, 2.7 and 2.6 installed, and pyspark was picking python 3 by default. To change the python version used by pyspark, set the following environment variable and then run pyspark.

export PYSPARK_PYTHON=python2.6

Similarly, we can configure any version of python with pyspark. Ensure that python2.6, or whichever interpreter you specify, is actually available on the machine.
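To verify which interpreter pyspark picked up, you can check the version from inside the pyspark shell:

# Inside the pyspark shell
import sys
print(sys.version)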

Programmatic way to reboot EC2 instances

Sometimes we might have to reboot EC2 instances. If the requirement is to restart EC2 instances regularly, we can achieve it by writing a small piece of code. I also came across a similar requirement and a portion of the code I used is given below.
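The original snippet is not attached here; a minimal sketch using boto3 is given below. The region and instance IDs are placeholder values.

import boto3

# Placeholder values: region and the IDs of the instances to reboot
ec2 = boto3.client('ec2', region_name='us-east-1')
instance_ids = ['i-0123456789abcdef0']

# Issues a reboot request for all the given instances
response = ec2.reboot_instances(InstanceIds=instance_ids)
print(response)

To reboot instances regularly, a script like this can simply be scheduled through cron or a similar scheduler.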

 

How to hide or obfuscate python source code?

Sometimes we may have the requirement to ship applications without exposing the source code. In Java this is very easy and widely done, since only the compiled bytecode needs to be distributed. But if we want to hide our source code in python, what do we do?

I checked several solutions for obfuscating the source code. One is pyminifier, which is a good tool. It renames the methods and variables, so the obfuscated code looks more complicated. But if you spend some time on it, the code can still be read.

A better way to hide the source code completely is to use python's built-in compiler. This generates a bytecode file, which we can then use for execution.

python -OO -m py_compile <your_code.py>

This will generate a .pyo file. Rename the .pyo file to a .py extension, and you can use it for execution. It will work just like the actual code.

NB: If your program imports modules obfuscated like this, then you have to rename them with a .pyc suffix instead, because the import machinery looks files up by extension.
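Putting it together, the full flow looks like this (myscript.py is a hypothetical file name; this assumes Python 2, where py_compile writes the .pyo file next to the source):

python -OO -m py_compile myscript.py
mv myscript.pyo myscript.py
python myscript.py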

Programmatic Data Upload to Amazon S3

S3 (Simple Storage Service) is a service provided by Amazon for storing data. It is very useful and inexpensive. Data can be uploaded to and downloaded from S3 very easily, using tools as well as programs. Here I am explaining a sample python program for uploading a file to S3.

Files can be uploaded to S3 in two ways. One is the normal upload and the other is the multipart upload. A normal upload sends the file serially and is not suitable for large files, as it takes more time. For large files, multipart upload is the best option: it divides the file into chunks, sends them in parallel and assembles them back on the S3 side.

This program uses the normal approach for sending files to S3. I used the boto library for the upload.
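The original program is not embedded here; a minimal sketch of a normal (single-part) upload with boto is shown below. The bucket name, key and file path are placeholder values.

import boto
from boto.s3.key import Key

# Credentials can also be picked up from the environment or ~/.boto
conn = boto.connect_s3()
bucket = conn.get_bucket('my-bucket')

# The key is the path of the object inside the bucket
key = Key(bucket)
key.key = 'backups/data.zip'

# Normal (serial) upload of a local file
key.set_contents_from_filename('/home/coder/data.zip')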

Ways to find out the count of unique items in a list using python

The other day I was looking for a way to find the count of unique items in a list. I found two solutions, and I thought they were worth sharing here.

Method 1:

This method works only with python versions 2.7 and above, because the Counter class in the collections library was only added in python 2.7.
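A sketch of this approach using collections.Counter (the sample list is just for illustration):

from collections import Counter

items = ['a', 'b', 'a', 'c', 'b', 'a']
counts = Counter(items)

print(counts)        # Counter({'a': 3, 'b': 2, 'c': 1})
print(len(counts))   # number of unique items: 3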

Method 2:

This is a very simple method using a python dictionary. It will work in all versions of python.
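A sketch of the dictionary approach, using the same sample list:

items = ['a', 'b', 'a', 'c', 'b', 'a']

counts = {}
for item in items:
    # get() returns 0 the first time an item is seen
    counts[item] = counts.get(item, 0) + 1

print(counts)        # {'a': 3, 'b': 2, 'c': 1} (order may vary)
print(len(counts))   # number of unique items: 3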

A simple program to begin with python tornado

Python tornado is a powerful framework for dealing with HTTP requests. Using tornado, we can write handlers that respond to HTTP requests in a very simple and elegant way. It is easy to learn and use, and we can create an excellent application in very few lines of code. Here I am explaining a simple tornado application; the code is attached below. To run this code, you need tornado installed on your machine.
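The attachment does not appear here, so below is a minimal sketch reconstructed from the walkthrough that follows. The file is assumed to be saved as HelloTornado.py.

import tornado.httpserver
import tornado.ioloop
import tornado.options
import tornado.web
from tornado.options import define, options

# "port" becomes a command-line option; the default value is 8888
define("port", default=8888, help="run on the given port", type=int)


class HelloWorldHandler(tornado.web.RequestHandler):
    def get(self):
        # Printed on the console for every request
        print("Someone called me")
        # Sent back as the HTTP response body
        self.write("Welcome to Tornado..!!")


if __name__ == "__main__":
    tornado.options.parse_command_line()
    # Route the bare "/" URL to HelloWorldHandler
    application = tornado.web.Application(handlers=[(r"/", HelloWorldHandler)])
    http_server = tornado.httpserver.HTTPServer(application)
    http_server.listen(options.port)
    tornado.ioloop.IOLoop.instance().start()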

In the beginning of the code, you can see a few imports related to tornado. These libraries are required for our tornado application. After that you can see a define function, which is imported from the options library in tornado. Using it we can get user-defined arguments from the command line. Here we are defining the port on which our tornado application should run. If the user specifies the port on the command line, it will use that value; otherwise it will use the default value. Here I gave the default value as 8888.

The next part is a class named HelloWorldHandler, which extends the tornado RequestHandler class. This is basically a handler, meaning it will handle an HTTP request. This class is invoked based on the navigation rules that we define in the tornado application. In this class there is only one method, get(), so this handler can handle only GET requests. In the get method, we print the text "Someone called me" and write a response. So whenever this class is called, the text "Someone called me" will be printed on the console, and self.write("Welcome to Tornado..!!") will send that string in the HTTP response.

The next part runs the tornado application. The tornado.web.Application(handlers=[(r"/", HelloWorldHandler)]) call defines when to invoke the handler. The r"/" is a regex, so if the URL comes without any path, the request will be routed to the HelloWorldHandler class. Similarly, we can have a list of regex and handler class pairs; here we have only one.

Execution.

python HelloTornado.py --port 9090

This will run the application on port 9090. After this, open a web browser and go to http://localhost:9090. You will get the message "Welcome to Tornado..!!" on the screen. For every hit, you can see the message "Someone called me" printed on the console.

python HelloTornado.py

This will run the application on the default port that we specified. I specified 8888, so open the web browser and check http://localhost:8888.

Note: If you are executing the code on a different machine, you should use the IP address of that machine instead of localhost.

Happy Learning … 🙂

Python code for calculating the difference between two time stamps

I was searching for a way to find the difference between two timestamps. My requirement was to get the difference in terms of years, months, days, hours and minutes, and I found a way to get it. The code below contains the logic to produce the required output. I haven't seen this code anywhere else on the internet, which is why I am posting it here, hoping it will be helpful for someone.
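The original snippet is not embedded here. As one way to get the same breakdown, here is a sketch using the python-dateutil package (an assumption on my part, not necessarily the original approach); the two timestamps are sample values.

from datetime import datetime
from dateutil.relativedelta import relativedelta

# Sample timestamps (placeholder values)
t1 = datetime.strptime('2014-01-15 10:20:00', '%Y-%m-%d %H:%M:%S')
t2 = datetime.strptime('2016-03-20 14:45:00', '%Y-%m-%d %H:%M:%S')

# relativedelta breaks the difference into calendar units
diff = relativedelta(t2, t1)
print("%d years %d months %d days %d hours %d minutes" % (
    diff.years, diff.months, diff.days, diff.hours, diff.minutes))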

How to validate a file in S3

S3 is a storage service provided by Amazon. We can use it as a place to store, back up or archive our data. S3 is storage that is accessible from the public network, so the data reaches S3 through the internet. While transmitting data to S3, one important thing we have to ensure is the correctness of the data, because if the data gets corrupted while transferring, it will be a big problem. Ensuring correctness is possible only by comparing the S3 copy with the master copy. But how do we achieve this?

In a local file system we can compare files by calculating a checksum, but how do we do this in S3? Calculating a checksum involves reading the complete file. Do we even have a provision to calculate the checksum in S3?

Yes, we have. We don't have to calculate it again; instead we can use one of the properties of an S3 file to compare it with the source file. Every S3 file has a property called ETag. This ETag is a checksum that is calculated while the file is transferred to S3. The tricky part is the way in which the ETag is calculated: it can be calculated in different ways, so the ETag of a file may differ depending on how the file was transferred.

The idea is simple. The ETag of a file depends on the chunk size in which the file was transferred to S3. So to validate a file, we find the ETag of the S3 file and calculate a checksum of the local file using the same logic that was used to calculate the ETag in S3. For files uploaded to S3 in the normal way, the ETag calculation is simple: it is equal to the normal md5 checksum. But if we use multipart upload, the ETag differs. Now the question arises: what is multipart upload?

In order to transfer large files to S3, we divide the file into small parts, upload the parts in parallel and assemble them on the S3 side. If we transmit a single large file directly and some failure happens, the entire file transfer fails and restarting is difficult. But if we divide the large file into smaller chunks and transfer them in parallel, the transmission speed increases and the reliability also increases. If the transfer of a chunk fails, we can retry that chunk alone, which improves restartability.
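For reference, the ETag of a multipart upload is the md5 of the concatenation of the per-part md5 digests, followed by a dash and the number of parts. A sketch of that calculation is below; the chunk size is an assumption and must match the part size that was actually used during the upload.

import hashlib

def multipart_etag(file_path, chunk_size=8 * 1024 * 1024):
    # md5 digest of each part, in upload order
    part_md5s = []
    with open(file_path, 'rb') as f:
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            part_md5s.append(hashlib.md5(data).digest())
    # md5 of the concatenated digests, plus "-<number of parts>"
    combined = hashlib.md5(b''.join(part_md5s))
    return '%s-%d' % (combined.hexdigest(), len(part_md5s))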

Here I am giving an example of fetching the ETag of a file in S3 and comparing it with the normal md5 checksum of the local file.

Suppose I have an S3 bucket named checksum-test, and inside it a 100 MB file named sample.txt at the location file/sample.txt.

Then the bucket name is checksum-test, and the full key name is file/sample.txt.
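The original snippet is not embedded here; a sketch using boto is shown below. The local file path is a placeholder.

import hashlib
import boto

conn = boto.connect_s3()
bucket = conn.get_bucket('checksum-test')
key = bucket.get_key('file/sample.txt')

# boto returns the ETag wrapped in double quotes, so strip them
s3_etag = key.etag.strip('"')

# md5 of the local master copy (placeholder path)
local_md5 = hashlib.md5(open('/home/coder/sample.txt', 'rb').read()).hexdigest()

if s3_etag == local_md5:
    print("File is valid")
else:
    print("Checksum mismatch, or the file was uploaded via multipart upload")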

Python code to find the md5 checksum of a file

Checksum calculation is an unavoidable and very important step wherever we transfer files or data. The simplest way to verify whether a file reached its destination properly is to compare the checksums of the source and target files. A checksum can be calculated in several ways. One is to calculate the checksum treating the entire file as a single block. Another is multipart checksum calculation, where we calculate the checksums of multiple small chunks of the file and finally compute an aggregated checksum.
Here I am explaining the simplest way to calculate the checksum of a file, using the hashlib library in python.
Suppose I have a zip file at the location /home/coder/data.zip. Its checksum can be calculated as follows.

import hashlib

# Open in binary mode so the raw bytes are hashed
file_name = '/home/coder/data.zip'
checksum = hashlib.md5(open(file_name, 'rb').read()).hexdigest()
print(checksum)

One common mistake I have seen among people is passing the file name directly without opening the file.

Eg: hashlib.md5(file_name).hexdigest()

This will also return a checksum, but it will be the checksum of the file name string, not a checksum calculated from the contents of the file. So always calculate the checksum as follows:

hashlib.md5(open(file_name, 'rb').read()).hexdigest()

This will return the exact checksum.
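For large files, reading the whole file into memory is not ideal. The same checksum can be computed incrementally by reading the file in blocks; the block size below is an arbitrary choice and does not affect the result.

import hashlib

def md5_of_file(file_name, block_size=4 * 1024 * 1024):
    # Hash the file a few MB at a time instead of loading it all into memory
    md5 = hashlib.md5()
    with open(file_name, 'rb') as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            md5.update(block)
    return md5.hexdigest()

print(md5_of_file('/home/coder/data.zip'))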

In linux, you can also calculate the md5 checksum using a command-line utility.

> md5sum file_name