How to convert or change the data type of columns in a Pandas dataframe?

Changing the data type of columns in a pandas dataframe is very easy. Here I am using the astype() function to perform the typecast operation. Refer to the following example. The type conversion happens on line number 10 of the code.

 

You can typecast as many columns as you want in a single call. For example, if you want to typecast the columns emp_id and salary, use the following syntax.

> df = df.astype({'salary': 'int', 'emp_id': 'int'})
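As a minimal sketch of the multi-column typecast described above, the snippet below builds a small dataframe and converts two columns in one call. The sample values and the name column are assumptions made up for illustration, and the dtype is written as int64 rather than int so the result does not depend on the platform.

```python
import pandas as pd

# Hypothetical employee data; the numbers are stored as strings on purpose.
df = pd.DataFrame({
    "emp_id": ["1001", "1002", "1003"],
    "name": ["Amal", "Edward", "Sabitha"],
    "salary": ["100000", "100001", "210000"],
})

# Typecast several columns at once by passing a {column: dtype} mapping.
df = df.astype({"salary": "int64", "emp_id": "int64"})

# Columns that are not in the mapping keep their original dtype.
print(df.dtypes)
```

Once the columns are numeric, operations such as df["salary"].sum() work on numbers instead of concatenating strings.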

 

Convert csv to json using pandas

The following sample programs show how to read a csv file and convert it into json data. Two programs are explained in this blog post. The first program expects the column names in the csv file, and the second program does not need column names in the file.

The first program expects the headers in the first line of the csv. If the headers are missing, we have to pass them explicitly in the program.

Sample Input

EMPID,FirstName,LastName,Salary
1001,Amal,Jose,100000
1002,Edward,Joe,100001
1003,Sabitha,Sunny,210000
1004,John,P,50000
1005,Mohammad,S,75000

Here the first line of the csv data is the header.

Sample Output

[{"EMPID":1001,"FirstName":"Amal","LastName":"Jose","Salary":100000},{"EMPID":1002,"FirstName":"Edward","LastName":"Joe","Salary":100001},{"EMPID":1003,"FirstName":"Sabitha","LastName":"Sunny","Salary":210000},{"EMPID":1004,"FirstName":"John","LastName":"P","Salary":50000},{"EMPID":1005,"FirstName":"Mohammad","LastName":"S","Salary":75000}]
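One possible way to get from the sample input to the sample output is sketched below. It assumes pandas read_csv and to_json with orient="records"; the csv text is inlined through io.StringIO only so that the snippet runs without a file on disk.

```python
import io

import pandas as pd

# The sample input from above, inlined so the sketch needs no file on disk.
csv_data = """EMPID,FirstName,LastName,Salary
1001,Amal,Jose,100000
1002,Edward,Joe,100001
1003,Sabitha,Sunny,210000
1004,John,P,50000
1005,Mohammad,S,75000
"""

# By default, read_csv treats the first line as the header row.
df = pd.read_csv(io.StringIO(csv_data))

# orient="records" produces a list with one json object per csv row.
json_text = df.to_json(orient="records")
print(json_text)
```

To read from a real file, pass its path to read_csv instead of the StringIO object.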

 

If the csv file contains a header row and you want to replace its column names, you should explicitly pass header=0 together with the new names. If headers are not present in the csv file, we have to explicitly pass the field names as a list to the argument names. Duplicates in this list are not allowed. A sample implementation is given below.
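The two cases can be sketched as follows. This is an illustration, not the original program: the replacement column names in new_names are hypothetical, and the csv text is inlined so the snippet is self-contained.

```python
import io

import pandas as pd

with_header = """EMPID,FirstName,LastName,Salary
1001,Amal,Jose,100000
1002,Edward,Joe,100001
"""

without_header = """1001,Amal,Jose,100000
1002,Edward,Joe,100001
"""

# Hypothetical column names chosen for this example.
new_names = ["id", "first_name", "last_name", "salary"]

# Case 1: the file has a header row. header=0 marks that row as the header,
# so the names list replaces it instead of the row being read as data.
renamed = pd.read_csv(io.StringIO(with_header), header=0, names=new_names)

# Case 2: the file has no header row. When names is passed without header,
# every line in the file is read as data.
no_header = pd.read_csv(io.StringIO(without_header), names=new_names)

print(renamed.to_json(orient="records"))
print(no_header.to_json(orient="records"))
```

Both calls should yield the same two records, which is the point of passing header=0 in the first case.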

 

What is Pandas?

Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language. Pandas comes with two primary data structures:

  • Series – (One dimensional)
  • DataFrame – (Two dimensional)
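The two structures listed above can be sketched in a few lines. The employee values are borrowed from the csv sample earlier in this post, and the variable names are only illustrative.

```python
import pandas as pd

# A Series is a one-dimensional labelled array.
salaries = pd.Series(
    [100000, 100001, 210000],
    index=[1001, 1002, 1003],
    name="Salary",
)

# A DataFrame is a two-dimensional table; each of its columns is a Series.
employees = pd.DataFrame(
    {
        "FirstName": ["Amal", "Edward", "Sabitha"],
        "Salary": [100000, 100001, 210000],
    },
    index=[1001, 1002, 1003],
)

print(salaries.ndim)     # number of dimensions of the Series
print(employees.ndim)    # number of dimensions of the DataFrame
print(employees.shape)   # (rows, columns)
```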

These two structures help us handle the majority of use cases. Those who are comfortable with the R programming language can easily implement their logic in a powerful way using Python pandas. Users get almost all the functionality present in R's dataframe. Pandas is built on top of the popular NumPy package.

Pandas has very good time series data handling and processing capabilities. We can avoid unnecessary loops and logic by using pandas. It is capable of doing

  • Frequency conversion (e.g. creating 5-minute data from a dataset with 1-second frequency)
  • Date range generation
  • Moving window statistics
  • Date shifting, etc.
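The time series features listed above can be tried out with a short sketch like the one below. The data is synthetic (a simple counter sampled once per second), and the window size and shift amount are arbitrary choices for illustration.

```python
import pandas as pd

# Date range generation: 600 timestamps at 1-second frequency.
index = pd.date_range("2024-01-01 00:00:00", periods=600, freq="s")

# Synthetic 1-second data: the values 0..599.
ticks = pd.Series(range(600), index=index)

# Frequency conversion: aggregate the 1-second data into 5-minute means.
five_min = ticks.resample("5min").mean()

# Moving window statistics: mean over a rolling window of 10 samples.
rolling_mean = ticks.rolling(window=10).mean()

# Date shifting: move every timestamp forward by one day, values unchanged.
shifted = ticks.shift(1, freq="D")

print(five_min)
print(rolling_mean.iloc[9])
print(shifted.index[0])
```

None of these steps needs an explicit Python loop, which is the main convenience pandas offers here.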

Since there is already so much documentation on pandas, I am not going to explain it in detail. I will cover some use cases with pandas implementations in further blog posts, where I will be using pandas and other scientific libraries extensively.