Data Manipulation in Python: Try pandas_ply

Data manipulation in plain Pandas

I previously wrote this article:

◆ Basic summary of data manipulation in Python Pandas Method list http://qiita.com/hik0107/items/d991cc44c2d1778bb82e

The methods in that article are the usual way to manipulate data with Pandas, but you may find the resulting code somewhat verbose or hard to read.

For those readers, I would like to introduce a package called "pandas_ply". It lets you handle data in a notation similar to R's dplyr, so it is especially recommended for anyone who has used dplyr.

In short, it is like a Pandas version of dplyr.

Even if you have never used dplyr, I think it is easier to use than native Pandas. Please give it a try.

Getting started with pandas_ply

◆ pandas_ply package https://pypi.python.org/pypi/pandas-ply

Install pandas_ply

pip install pandas_ply

Package preparation

`setup.py`


import pandas as pd
from pandas_ply import install_ply, X, sym_call

install_ply(pd)

After importing pandas, pass it to install_ply from pandas_ply. By design, this attaches the pandas_ply methods to pandas.

You are now ready.
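To give a feel for what "attaching methods to pandas" means, here is a minimal sketch of the general idea: an ordinary function is assigned as an attribute of the DataFrame class, after which it can be called like a method. The name where_all and the toy data are made up for this illustration, and the actual internals of pandas_ply may well differ.

```python
import pandas as pd


def where_all(df, *conditions):
    """Keep only the rows for which every boolean condition is True."""
    mask = pd.Series(True, index=df.index)
    for cond in conditions:
        mask &= cond
    return df[mask]


# Attach the function to the DataFrame class so it can be called as a method.
# "where_all" is a hypothetical name, not part of pandas or pandas_ply.
pd.DataFrame.where_all = where_all

# Toy data for illustration only (not taken from the Titanic dataset)
toy = pd.DataFrame({"Age": [5, 30, 40], "Sex": ["male", "male", "female"]})
result = toy.where_all(toy.Age > 10, toy.Sex == "male")
print(result)
```

Patching a library class like this affects every DataFrame in the process, which is presumably why pandas_ply asks you to call install_ply explicitly rather than doing it on import.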

Using it in practice

Detailed usage is documented here (in English): http://pythonhosted.org/pandas-ply/

I will use the following data, a well-known dataset from Kaggle: titanic - train.csv https://www.kaggle.com/c/titanic/data

`load.py`


csv_path = "/files_dir/train.csv"  # Specify the location of the csv file
data = pd.read_csv(csv_path, delimiter=",")

# Quick look at the data: always worth doing for a new dataset
print(data.head(10))
print(data.shape)
print(data.describe())
print(data.columns)

Data selection ply_select

You can refer to a column either as "column_name" or as X.column_name. You can also create new columns, much like mutate in dplyr.

`select.py`


data.ply_select("Name", "Age",
                gender = X.Sex,  # A column can be renamed
                is_adult = (X.Age >= 20)  # New columns can be defined as well
                )
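For comparison, the same selection can be sketched in native Pandas with column indexing plus assign. This assumes that ply_select returns only the columns that were listed (Name, Age, gender, is_adult), which is my reading of its documentation. The toy rows below are made up for illustration and are not real Titanic records.

```python
import pandas as pd

# Toy data with the same column names as the Titanic dataset (values are invented)
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Age": [8, 25, 40],
    "Sex": ["male", "female", "male"],
    "Fare": [7.25, 71.28, 8.05],
})

# Pick the existing columns first, then add the renamed and computed ones
selected = (
    toy[["Name", "Age"]]
    .assign(gender=toy["Sex"], is_adult=toy["Age"] >= 20)
)
print(selected)
```

Note that the DataFrame name has to be repeated for every column reference, which is the verbosity that the X notation avoids.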

Data selection ply_where

Use ply_where when you want to keep only the rows that meet certain conditions.

`where.py`


data.ply_where(X.Age>10, 
               X.Sex == "male",
               X.Embarked == "S"
               )  # Only rows that meet all of the conditions (combined with AND) are selected

If you write the same thing in Pandas' native way, it looks like this:

`where_equivalent.py`


data.query(" Age>10 & Sex=='male' & Embarked == 'S' ")
# Putting the whole condition inside a string is a little confusing

data.loc[(data.Age>10) & (data.Sex =='male') & (data.Embarked=='S')]
# The DataFrame name has to be written many times

Tastes differ, but I find the pandas_ply version relatively readable.

Adding a flag: in this case you may want to use apply

I described how to add a new column with ply_select above, but if you want to generate a new column under complicated conditions, it may be better to simply use Pandas' apply method.

For example, to add a new column called "Demographic" based on age and gender in the data above, you can write the following. Because the conditions for the new column live in a function, they stay easy to read.

`apply.py`


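def add_demographic(data_input):
    # Classify one row (one passenger) by age and gender
    if data_input.Age >= 20:
        demo = "adult_m" if data_input.Sex == "male" else "adult_f"
    else:
        # Rows with a missing Age also end up here, because NaN >= 20 is False
        demo = "boy_and_girl"

    return demo

data["Demographic"] = data.apply(add_demographic, axis=1)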
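If you want to try this without downloading the CSV, the sketch below runs the same kind of row-wise function on a few invented rows. It also shows a vectorized alternative using numpy.select, a different technique from apply that may be worth considering on larger data. The toy values are made up, and both versions put rows with a missing Age into "boy_and_girl".

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; the last row has a missing Age
toy = pd.DataFrame({
    "Age": [30.0, 45.0, 8.0, np.nan],
    "Sex": ["male", "female", "male", "female"],
})


def add_demographic(row):
    """Classify one row by age and gender."""
    if row.Age >= 20:
        return "adult_m" if row.Sex == "male" else "adult_f"
    return "boy_and_girl"


# Row-wise version, as in the article
by_apply = toy.apply(add_demographic, axis=1)

# Vectorized version: conditions are checked in order, first match wins
is_adult = (toy["Age"] >= 20).to_numpy()
is_male = (toy["Sex"] == "male").to_numpy()
by_select = pd.Series(
    np.select(
        [is_adult & is_male, is_adult],
        ["adult_m", "adult_f"],
        default="boy_and_girl",
    ),
    index=toy.index,
)

toy["Demographic"] = by_apply
print(toy)
```

The function version keeps the rules readable as plain if/else, while numpy.select trades some of that readability for avoiding a Python-level call per row.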

In closing

pandas_ply still appears to be under development, so use it with care. There is not much information available about this package, so please leave a comment if you are familiar with it.
