[PYTHON] The minimum methods to remember when aggregating data in Pandas

A memo of the pandas methods I use most often for basic aggregation. I plan to update it as needed.

Preparation

from sklearn import datasets
import pandas as pd
from collections import OrderedDict

iris = datasets.load_iris()

df = pd.concat([pd.DataFrame(iris.data, columns=iris.feature_names),
                pd.DataFrame(iris.target, columns=["species"])], axis=1)

A recent update made pandas output easier to read, which is nice.

Aggregation

describe

df.describe()

Outputs basic summary statistics (count, mean, standard deviation, minimum, quartiles, maximum) for every numeric column.

df["petal length (cm)"].describe()

describe also works on a single Series.
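
As a minimal sketch of the two calls above, the following uses a small hand-made DataFrame (the column names and values are made up for illustration and are not the iris data). It shows that describe returns a DataFrame when called on a DataFrame and a Series when called on a single column, and that the optional percentiles argument selects which quantile rows are reported.

```python
import pandas as pd

# Hypothetical toy data standing in for the iris measurements.
toy = pd.DataFrame({
    "length": [1.0, 2.0, 3.0, 4.0],
    "width": [0.5, 0.5, 1.5, 1.5],
})

# DataFrame.describe(): one column of statistics per numeric column.
summary = toy.describe()

# Series.describe(): the same statistics for a single column.
length_summary = toy["length"].describe()

# percentiles= requests specific quantiles (here the 10th and 90th).
custom = toy["length"].describe(percentiles=[0.1, 0.9])

print(summary)
print(length_summary["mean"])
print(list(custom.index))
```

Individual statistics can be picked out by label, for example summary.loc["mean", "length"].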

value_counts

df["species"].value_counts()

Counts how many times each value occurs.
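
A short sketch of some common variations, using a made-up label Series rather than the iris species column. It assumes you may want relative frequencies or a label-ordered result in addition to the raw counts.

```python
import pandas as pd

# Hypothetical labels standing in for the species column.
labels = pd.Series([0, 0, 0, 1, 1, 2], name="species")

# Absolute counts, most frequent value first.
counts = labels.value_counts()

# Relative frequencies that sum to 1.
shares = labels.value_counts(normalize=True)

# Same counts, ordered by label instead of by frequency.
by_label = labels.value_counts().sort_index()

print(counts)
print(shares)
print(by_label)
```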

get_dummies

pd.get_dummies(df["species"]).iloc[[0,1,2,50,51,52,100,101,102]]

Creates so-called dummy variables (one-hot encoding). The rows are selected by position only to make the output easier to view.
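
The sketch below shows one way the dummy columns might be joined back onto the other columns. The toy DataFrame is invented for illustration; the prefix and dtype arguments are optional choices made here, not something the original call uses.

```python
import pandas as pd

# Hypothetical toy data: a categorical label plus one numeric column.
toy = pd.DataFrame({
    "species": [0, 1, 2, 1],
    "length": [5.1, 7.0, 6.3, 6.4],
})

# prefix= gives readable column names; dtype=int forces 0/1 integers
# regardless of the default dtype of the installed pandas version.
dummies = pd.get_dummies(toy["species"], prefix="species", dtype=int)

# Replace the original label column with its dummy columns.
encoded = pd.concat([toy.drop(columns="species"), dummies], axis=1)

print(encoded)
```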

sort_values

df.sort_values("sepal length (cm)",ascending=False)

Sorts the DataFrame. The ascending argument selects ascending (True, the default) or descending (False) order.

df.sort_values(["sepal length (cm)","sepal width (cm)"],ascending=False)

Multiple columns can be specified. Columns earlier in the list take priority as sort keys.
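
As a small illustration of the key priority, the following sketch sorts a made-up DataFrame by two columns with a different direction for each. Passing a list to ascending is standard pandas; the data and variable names are assumptions for the example.

```python
import pandas as pd

# Hypothetical toy data with ties in the first sort key.
toy = pd.DataFrame({
    "length": [5.0, 6.0, 6.0, 5.0],
    "width": [3.0, 2.0, 4.0, 1.0],
})

# length descending first; ties are broken by width ascending.
ordered = toy.sort_values(["length", "width"], ascending=[False, True])

# The original index labels travel with the rows; reset them if needed.
renumbered = ordered.reset_index(drop=True)

print(ordered)
print(renumbered)
```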

groupby

df_groupby = df.groupby("species",as_index=False)
df_groupby.mean()

A groupby object can be reused, so when you want to apply several similar aggregations, store it in a variable rather than calling groupby again each time.
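
A minimal sketch of that reuse, with an invented toy DataFrame in place of iris. It only demonstrates that one stored groupby object can feed several aggregations; it makes no measurement of the speed difference.

```python
import pandas as pd

# Hypothetical toy data: two groups of two rows each.
toy = pd.DataFrame({
    "species": [0, 0, 1, 1],
    "length": [1.0, 3.0, 5.0, 9.0],
})

# Build the groupby object once. as_index=False keeps "species"
# as an ordinary column in the results.
grouped = toy.groupby("species", as_index=False)

# Apply several aggregations to the same object.
means = grouped.mean()
maxima = grouped.max()

print(means)
print(maxima)
```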

groupby.agg

df_groupby.agg({"sepal length (cm)": "mean",
                "sepal width (cm)": ["mean","count"],
                "petal length (cm)": ["max","min"],
                "petal width (cm)": ["sum","var","std"]})

Passing a dictionary lets you choose the aggregations for each column individually. Note that if you specify multiple aggregations for one column, the result has multi-level (MultiIndex) columns.

Also, a plain dict did not guarantee key order before Python 3.7, so on older versions use OrderedDict if you want to control the column order. From Python 3.7 on, a regular dict preserves insertion order.

df_groupby.agg(OrderedDict((["sepal length (cm)", "mean"],
                            ["sepal width (cm)", ["mean","count"]],
                            ["petal length (cm)", ["max","min"]],
                            ["petal width (cm)", ["sum","var","std"]])))

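
If the multi-level columns are inconvenient, two possible ways to get flat column names are sketched below on made-up data: joining the two levels by hand, or using named aggregation (available in pandas 0.25 and later). The output column names such as length_mean are chosen here for illustration. The sketch groups with the default as_index=True so that the group key stays in the index and does not appear among the column labels being joined.

```python
import pandas as pd

# Hypothetical toy data: two groups of two rows each.
toy = pd.DataFrame({
    "species": [0, 0, 1, 1],
    "length": [1.0, 3.0, 5.0, 9.0],
    "width": [2.0, 4.0, 6.0, 8.0],
})
grouped = toy.groupby("species")

# A list of functions per column produces two-level column labels.
multi = grouped.agg({"length": ["mean", "max"], "width": ["min"]})

# Option 1: join the two levels into a single string per column.
flat = multi.copy()
flat.columns = ["_".join(pair) for pair in multi.columns]

# Option 2: named aggregation, output_name=(column, function).
named = grouped.agg(
    length_mean=("length", "mean"),
    length_max=("length", "max"),
    width_min=("width", "min"),
)

print(flat)
print(named)
```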

to_csv

df.to_csv("test.csv",index=False,encoding="utf8")
pd.read_csv("test.csv")

With index=False the index is not written, which makes the file easier to read back. Also, some files cannot be read unless encoding is specified (especially on Windows).
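
To illustrate why index=False helps, the sketch below round-trips a made-up DataFrame through CSV twice, once with the index and once without. It writes to in-memory buffers (io.StringIO) instead of a file so that it runs anywhere; the encoding argument is therefore not exercised here.

```python
import io

import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({"length": [5.1, 4.9], "species": [0, 0]})

# Default: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reread_with_index = pd.read_csv(with_index)

# index=False: only the data columns are written.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reread_without_index = pd.read_csv(without_index)

print(list(reread_with_index.columns))
print(list(reread_without_index.columns))
```

When the index has been written, passing index_col=0 to read_csv is another way to avoid the extra column.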

Visualization

Preparation

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline is a Jupyter magic command that displays plots inline in the notebook.

Box plot

sns.boxplot(data=df, x="species", y="sepal length (cm)")


pairplot

sns.pairplot(data=df)

sns.pairplot(data=df, hue="species")

With hue, the points are colored separately for each species.

jointplot

sns.jointplot(data=df, x="sepal length (cm)", y="sepal width (cm)", kind="kde")

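distplot

sns.distplot(df["sepal length (cm)"], rug=True)

distplot has been deprecated since seaborn 0.11. In newer versions, sns.displot(df["sepal length (cm)"], kde=True, rug=True) draws a similar figure.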
