Data Manipulation in Python: Try pandas_ply

The usual way to manipulate data in Pandas

I previously wrote this article:

◆ Basic summary of data manipulation in Python Pandas Method list http://qiita.com/hik0107/items/d991cc44c2d1778bb82e

The methods described there are the common way to manipulate data with Pandas, but you may find the resulting code a bit verbose or hard to read.

For those people, I would like to introduce a package called "pandas_ply". It lets you handle data in a notation similar to R's dplyr, so it is especially recommended for anyone who has used dplyr.

Even if you have never used dplyr, I think it is easier to use than native Pandas. Please give it a try.

Getting started with pandas_ply

◆ pandas_ply package https://pypi.python.org/pypi/pandas-ply

Install pandas_ply

pip install pandas_ply

Package preparation

setup.py


import pandas as pd
from pandas_ply import install_ply, X, sym_call

install_ply(pd)

If you import pandas and then call install_ply(pd) from pandas_ply, the pandas_ply methods are attached to pandas objects.
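To give a feel for what "attaching methods to pandas" means, here is a minimal sketch of the general mechanism in plain Python. It is not pandas_ply's actual implementation: the method name where_all and the toy data are made up for illustration.

```python
import pandas as pd


def where_all(self, *conditions):
    """Keep only the rows for which every boolean Series in `conditions` is True."""
    mask = pd.Series(True, index=self.index)
    for cond in conditions:
        mask = mask & cond
    return self[mask]


# Attach the function to the DataFrame class so every DataFrame gains the method.
# (Hypothetical name; pandas_ply uses its own names such as ply_where.)
pd.DataFrame.where_all = where_all

# Made-up toy data, not the Titanic file.
toy = pd.DataFrame({"Age": [5, 30, 42], "Sex": ["male", "female", "male"]})
adult_men = toy.where_all(toy.Age > 10, toy.Sex == "male")
print(adult_men)
```

Patching a library class like this affects every DataFrame in the process, which is presumably why pandas_ply asks you to call install_ply explicitly rather than doing it on import.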

You are now ready.

Trying it out

Detailed usage is documented here (in English): http://pythonhosted.org/pandas-ply/

I will use the following data. It is the well-known Titanic dataset from Kaggle: titanic - train.csv https://www.kaggle.com/c/titanic/data

load.py


csv_path = "/files_dir/train.csv"  # Specify the location of the csv file
data = pd.read_csv(csv_path, delimiter=",")

# Take a quick look at the data: a must for any new dataset
print(data.head(10))
print(data.shape)
print(data.describe())
print(data.columns)

Column selection: ply_select

You can refer to a column either as "column_name" or as X.column_name. You can also create new columns, similar to mutate in dplyr.

select.py


data.ply_select("Name", "Age",
                gender = X.Sex,  # You can rename a column
                is_adult = (X.Age >= 20)  # You can also define a new column
                )
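For comparison, the same selection can be sketched in plain Pandas with column indexing and assign. The small DataFrame below is made-up sample data standing in for the Titanic file, and I make no claim about the column order that ply_select itself returns.

```python
import pandas as pd

# Made-up rows that only mimic the shape of the Titanic data.
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Age": [8, 35, 20],
    "Sex": ["male", "female", "male"],
    "Fare": [7.25, 71.28, 8.05],
})

selected = (
    toy[["Name", "Age"]]                # keep two existing columns
    .assign(
        gender=toy["Sex"],              # copy Sex under a new name
        is_adult=toy["Age"] >= 20,      # derive a new boolean column
    )
)
print(selected)
```

Note that the DataFrame name has to be repeated for every derived column, which is the verbosity that the X notation avoids.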

Row filtering: ply_where

Use ply_where when you want to keep only the rows that meet certain conditions.

where.py


data.ply_where(X.Age > 10,
               X.Sex == "male",
               X.Embarked == "S"
               )  # Only rows that meet all of the conditions (combined with AND) are selected

If you write the same thing in Pandas' native way, it looks like this:

where_equivalent.py


data.query(" Age>10 & Sex=='male' & Embarked == 'S' ")
# The condition lives inside a string, which is a little hard to read

data.loc[(data.Age>10) & (data.Sex =='male') & (data.Embarked=='S')]
# You have to write the DataFrame name many times
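As a quick sanity check, the following sketch runs the two native forms on a few made-up rows (not the real Titanic file) and confirms that they select the same data. It also shows one thing to keep in mind with this dataset: a comparison against a missing Age evaluates to False, so rows without an age are dropped by the filter.

```python
import numpy as np
import pandas as pd

# Made-up rows; the third one has a missing Age.
toy = pd.DataFrame({
    "Age": [5.0, 30.0, np.nan, 42.0, 25.0],
    "Sex": ["male", "male", "male", "female", "male"],
    "Embarked": ["S", "S", "S", "S", "C"],
})

# Condition written as a query string.
by_query = toy.query("Age > 10 and Sex == 'male' and Embarked == 'S'")

# Same condition written as a boolean mask.
by_mask = toy.loc[(toy.Age > 10) & (toy.Sex == "male") & (toy.Embarked == "S")]

print(by_query)
print(by_query.equals(by_mask))
```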

Tastes differ, but I find the pandas_ply version comparatively readable.

Adding a flag column: here you may want to use apply

I described above how to add a new column with ply_select, but if the new column depends on complicated conditions, it may be better to simply use Pandas' apply method.

For example, to add a new column called "Demographic" derived from age and sex in the data above, call apply as follows. Because the conditions for the new column live in a function, they stay easy to read.

apply.py


def add_demographic(data_input):
    # data_input is a single row of the DataFrame
    if data_input.Age >= 20:
        demo = "adult_m" if data_input.Sex == "male" else "adult_f"
    else:
        demo = "boy_and_girl"

    return demo

data["Demographic"] = data.apply(add_demographic, axis=1)
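One caveat with a function like this: a comparison such as Age >= 20 is False when Age is missing, so passengers without an age end up in the else branch. The sketch below is a variation, not the original function, that handles this case explicitly. The function name, the "unknown" label, and the sample rows are my own assumptions for illustration.

```python
import numpy as np
import pandas as pd


def add_demographic_with_unknown(row):
    """Classify one row; rows with a missing Age get their own label."""
    if pd.isna(row.Age):
        return "unknown"  # hypothetical label, not part of the original article
    if row.Age >= 20:
        return "adult_m" if row.Sex == "male" else "adult_f"
    return "boy_and_girl"


# Made-up rows; the last one has a missing Age.
toy = pd.DataFrame({
    "Age": [30.0, 22.0, 4.0, np.nan],
    "Sex": ["male", "female", "female", "male"],
})

# axis=1 passes each row to the function as a Series.
toy["Demographic"] = toy.apply(add_demographic_with_unknown, axis=1)
print(toy)
```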

In closing

pandas_ply still appears to be under development, so use it with care. There is not much information available about this package, so please leave a comment if you know it well.
