I previously wrote this article:
◆ Basic summary of data manipulation in Python Pandas Method list http://qiita.com/hik0107/items/d991cc44c2d1778bb82e
The approach above is the common way to manipulate data with Pandas, but you may find the resulting code a bit verbose or hard to read.
For those people, I would like to introduce a package called "pandas_ply". It is especially recommended for anyone who has used Dplyr, because it lets you work with data in a notation similar to R's Dplyr.
Even if you haven't used Dplyr, I think you will find pandas_ply easier to use than native Pandas. Please give it a try.
◆ pandas_ply package https://pypi.python.org/pypi/pandas-ply
Install pandas_ply
pip install pandas_ply
Package preparation
setup.py
import pandas as pd
from pandas_ply import install_ply, X, sym_call
install_ply(pd)
By design, importing pandas and then calling install_ply from pandas_ply on it adds the pandas_ply methods to pandas.
You are now ready.
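If you are curious how a call like install_ply(pd) can add methods to a library that is already imported, here is a minimal sketch of the general technique (attaching a function to a class at runtime). This is not the actual pandas_ply source; ToyFrame, toy_where, and install_toy are hypothetical names made up for this illustration.

```python
class ToyFrame:
    """Hypothetical stand-in for a DataFrame: just a list of row dicts."""

    def __init__(self, rows):
        self.rows = list(rows)


def toy_where(self, *predicates):
    """Keep the rows for which every predicate returns True (AND semantics)."""
    kept = [row for row in self.rows if all(p(row) for p in predicates)]
    return ToyFrame(kept)


def install_toy(cls):
    """Attach the helper to an existing class, so every instance gains the method."""
    cls.toy_where = toy_where


install_toy(ToyFrame)

frame = ToyFrame([
    {"Age": 30, "Sex": "male"},
    {"Age": 8, "Sex": "male"},
    {"Age": 40, "Sex": "female"},
])
result = frame.toy_where(lambda r: r["Age"] > 10, lambda r: r["Sex"] == "male")
print(result.rows)
```

After install_toy runs, every ToyFrame has a toy_where method even though the class definition never mentioned it.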
Click here for detailed usage (English) http://pythonhosted.org/pandas-ply/
I will use the following data, the well-known Titanic dataset from Kaggle. titanic - train.csv https://www.kaggle.com/c/titanic/data
load.py
csv_path = "/files_dir/train.csv"  ## Specify the location of the csv file
data = pd.read_csv(csv_path, delimiter=",")
## Take a quick look at the data: a must for any new dataset
print(data.head(10))
print(data.shape)
print(data.describe())
print(data.columns)
With ply_select, you can refer to a column either by its name as a string ("column_name") or as X.column_name. You can also create a new column (similar to mutate in Dplyr).
select.py
data.ply_select("Name", "Age",
    gender = X.Sex,             ## You can rename a column
    is_adult = (X.Age >= 20)    ## You can also define a new column
)
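For comparison, here is a plain-Pandas sketch that builds the same kind of result described here: Name and Age, plus a renamed gender column and a new is_adult column. The small toy frame is hypothetical data made up for the example, not the Titanic file, and I am assuming this matches what ply_select returns rather than having checked it against the package.

```python
import pandas as pd

# Hypothetical hand-made data with the same column names as train.csv
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Age": [34, 12, 20],
    "Sex": ["male", "female", "female"],
})

# Pick the columns to keep, then add the renamed and the derived column
selected = toy[["Name", "Age"]].assign(
    gender=toy["Sex"],
    is_adult=toy["Age"] >= 20,
)
print(selected)
```

Note that the DataFrame name appears three times here, which is exactly the repetition the X notation avoids.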
Use ply_where when you want to keep only the rows that meet certain conditions.
where.py
data.ply_where(X.Age > 10,
    X.Sex == "male",
    X.Embarked == "S"
)  # Only rows that satisfy all of the conditions (combined with AND) are selected
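An expression such as X.Age > 10 can be written before any DataFrame is involved because it is only recorded at that point and evaluated later against the data. The following is a minimal sketch of that deferred-evaluation idea in plain Python. It is my own illustration, not how pandas_ply is actually implemented; Expr, Placeholder, and the eval method are hypothetical names.

```python
import operator


class Expr:
    """A deferred computation: nothing runs until .eval(row) is called."""

    def __init__(self, fn):
        self._fn = fn

    def eval(self, row):
        return self._fn(row)

    def _binary(self, op, other):
        # Build a new deferred expression from this one and the right-hand side
        def fn(row):
            right = other.eval(row) if isinstance(other, Expr) else other
            return op(self.eval(row), right)
        return Expr(fn)

    def __gt__(self, other):
        return self._binary(operator.gt, other)

    def __ge__(self, other):
        return self._binary(operator.ge, other)

    def __eq__(self, other):
        return self._binary(operator.eq, other)


class Placeholder:
    """Attribute access such as X.Age yields an Expr that looks up 'Age' later."""

    def __getattr__(self, name):
        return Expr(lambda row: row[name])


X = Placeholder()

older_than_10 = X.Age > 10      # an Expr object, not True or False
is_male = X.Sex == "male"

rows = [{"Age": 35, "Sex": "male"}, {"Age": 7, "Sex": "male"}]
flags = [older_than_10.eval(r) and is_male.eval(r) for r in rows]
print(flags)
```

The conditions are ordinary Python objects, so they can be stored in variables and passed to a function, which is what makes the call-site notation so compact.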
If you write the same thing in Pandas' native way, it looks like this:
where_equivalent.py
data.query(" Age>10 & Sex=='male' & Embarked == 'S' ")
## The condition is buried inside a string, which is a little hard to read
data.loc[(data.Age>10) & (data.Sex =='male') & (data.Embarked=='S')]
## You need to write the DataFrame name many times
It is partly a matter of taste, but I find pandas_ply relatively readable.
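If you want to convince yourself that the query string and the boolean mask select the same rows, a quick check on a tiny frame is enough. The data below is hypothetical, made up for the example; only the column names follow train.csv.

```python
import pandas as pd

# Hypothetical hand-made data, not the Titanic file
toy = pd.DataFrame({
    "Age": [22, 8, 35, 40],
    "Sex": ["male", "male", "female", "male"],
    "Embarked": ["S", "S", "S", "C"],
})

# Style 1: the condition written as a query string
by_query = toy.query("Age > 10 & Sex == 'male' & Embarked == 'S'")

# Style 2: the condition written as a boolean mask
mask = (toy.Age > 10) & (toy.Sex == "male") & (toy.Embarked == "S")
by_mask = toy.loc[mask]

print(by_query.equals(by_mask))
```

Only the first row satisfies all three conditions, so both styles return that single row.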
Above I showed how to add a new column with ply_select. If you want to generate a new column under more complicated conditions, however, it may be better to simply use Pandas' apply method.
For example, to add a new column called "Demographic" based on age and gender in the data above, write the following. Because the logic lives in a function, the conditions for generating the new column stay easy to read.
apply.py
def add_demographic(data_input):
    ## Called once per row; data_input is a single row of the DataFrame
    if data_input.Age >= 20:
        demo = "adult_m" if data_input.Sex == "male" else "adult_f"
    else:
        demo = "boy_and_girl"
    return demo
data["Demographic"] = data.apply(add_demographic, axis=1)
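One thing to watch with a function like this is missing values: the Age column in train.csv has gaps, and a comparison such as NaN >= 20 is False, so rows without an age fall into the else branch. Here is a small self-contained sketch on hypothetical hand-made data that shows this behavior. It is an illustration, not a recommendation on how missing ages should be labeled.

```python
import pandas as pd


def add_demographic(row):
    # Same rule as in the article; a missing Age fails the >= test
    if row.Age >= 20:
        return "adult_m" if row.Sex == "male" else "adult_f"
    return "boy_and_girl"


# Hypothetical data; the last row has no age recorded
toy = pd.DataFrame({
    "Age": [30.0, 25.0, 7.0, float("nan")],
    "Sex": ["male", "female", "male", "female"],
})

toy["Demographic"] = toy.apply(add_demographic, axis=1)
print(toy)
```

If that is not what you want, you could add an explicit pd.isna(row.Age) branch at the top of the function and return a separate label.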
pandas_ply still appears to be under development, so adopt it with care. There is not much information available about this package, so please leave a comment if you are familiar with it.