[PYTHON] The minimum methods to remember when aggregating data in Pandas

These are notes on the methods I use most often for basic tabulation. I plan to update them from time to time.

Preparation

from sklearn import datasets
import pandas as pd
from collections import OrderedDict

iris = datasets.load_iris()

df = pd.concat([pd.DataFrame(iris.data,columns=iris.feature_names),pd.DataFrame(iris.target,columns=["species"])],axis=1)

It is nice that recent pandas updates have made the DataFrame display easier to read.

Aggregate

describe

df.describe()

Outputs basic summary statistics for each numeric column.

df["petal length (cm)"].describe()

It also works on a single Series.
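
As a small illustration of the two forms above, here is a minimal sketch on hand-made data. The `toy` DataFrame and its values are hypothetical stand-ins for the iris columns, not the real dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for two iris columns
toy = pd.DataFrame({
    "petal length (cm)": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1],
    "species": [0, 0, 1, 1, 2, 2],
})

# On a DataFrame, describe returns one column of statistics per numeric column
summary = toy.describe()

# On a Series, describe returns a Series indexed by statistic name
length_summary = toy["petal length (cm)"].describe()

print(summary.loc["mean", "petal length (cm)"])
print(length_summary["max"])
```

Individual statistics can then be picked out by label, as the two print calls do.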

value_counts

df["species"].value_counts()

Counts the number of occurrences of each value.
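
When proportions are more useful than raw counts, `value_counts` also accepts `normalize=True`. A minimal sketch, assuming a hypothetical hand-made `species` Series:

```python
import pandas as pd

# Hypothetical labels: three of class 0, two of class 1, one of class 2
species = pd.Series([0, 0, 0, 1, 1, 2], name="species")

# Raw counts, sorted from most to least frequent
counts = species.value_counts()

# Relative frequencies that sum to 1
ratios = species.value_counts(normalize=True)

print(counts)
print(ratios)
```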

get_dummies

pd.get_dummies(df["species"]).iloc[[0,1,2,50,51,52,100,101,102]]

Creates so-called dummy variables (a few rows from each species are selected by position for easy viewing).
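
To see what the dummy columns look like without loading iris, here is a minimal sketch on a hypothetical four-row Series. The `prefix` argument is optional and only used here to make the column names readable.

```python
import pandas as pd

# Hypothetical class labels
species = pd.Series([0, 1, 2, 1], name="species")

# One indicator column per distinct value; prefix controls the column names
dummies = pd.get_dummies(species, prefix="species")

# Cast to int so the output is 0/1 regardless of the pandas version's default dtype
print(dummies.astype(int))
```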

sort_values

df.sort_values("sepal length (cm)",ascending=False)

Sorts the DataFrame (ascending switches between ascending and descending order).

df.sort_values(["sepal length (cm)","sepal width (cm)"],ascending=False)

Multiple columns can be specified (columns earlier in the list take priority).
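
The sort direction can also be set per column by passing a list to `ascending`. A minimal sketch on hypothetical data, sorting length in descending order and breaking ties by width in ascending order:

```python
import pandas as pd

# Hypothetical rows; two of them tie on sepal length
toy = pd.DataFrame({
    "sepal length (cm)": [5.1, 7.0, 5.1, 6.3],
    "sepal width (cm)": [3.5, 3.2, 3.8, 3.3],
})

# First key descending, second key ascending
ordered = toy.sort_values(
    ["sepal length (cm)", "sepal width (cm)"],
    ascending=[False, True],
)

print(ordered)
```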

groupby

df_groupby = df.groupby("species",as_index=False)
df_groupby.mean()

A groupby object can be reused, so when you want to apply several similar aggregations it is faster to store the groupby object in a variable than to regroup each time.
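
A minimal sketch of that reuse, on hypothetical data: the grouping is defined once and two different aggregations are run on the same object.

```python
import pandas as pd

# Hypothetical measurements for two species
toy = pd.DataFrame({
    "species": [0, 0, 1, 1],
    "sepal length (cm)": [5.0, 5.2, 6.0, 6.4],
})

# Define the grouping once; as_index=False keeps "species" as a regular column
grouped = toy.groupby("species", as_index=False)

# Reuse the same groupby object for several aggregations
means = grouped.mean()
maxes = grouped.max()

print(means)
print(maxes)
```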

groupby.agg

df_groupby.agg({"sepal length (cm)": "mean",
                "sepal width (cm)": ["mean","count"],
                "petal length (cm)": ["max","min"],
                "petal width (cm)": ["sum","var","std"]})

Passing a dictionary lets you apply a different aggregation to each column. (Note that specifying multiple aggregations for one column produces multi-level columns.)

Also, on Python versions before 3.7 a plain dict does not guarantee order, so use OrderedDict if you want to control the column order.

df_groupby.agg(OrderedDict((["sepal length (cm)", "mean"],
                            ["sepal width (cm)", ["mean","count"]],
                            ["petal length (cm)", ["max","min"]],
                            ["petal width (cm)", ["sum","var","std"]])))
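
The multi-level columns mentioned above can be awkward to work with afterwards. Below is a minimal sketch of two ways to get flat column names, using hypothetical toy data: joining the two levels by hand, and pandas' named aggregation (available since pandas 0.25). The output names such as `width_mean` are made up for illustration.

```python
import pandas as pd

# Hypothetical measurements for two species
toy = pd.DataFrame({
    "species": [0, 0, 1, 1],
    "sepal width (cm)": [3.0, 3.4, 2.8, 3.0],
    "petal length (cm)": [1.4, 1.5, 4.7, 4.5],
})

# A dict with lists of functions yields two-level (column, function) columns
result = toy.groupby("species").agg({
    "sepal width (cm)": ["mean", "count"],
    "petal length (cm)": ["max", "min"],
})

# Option 1: join the two levels into single strings, then restore "species" as a column
flat = result.copy()
flat.columns = ["_".join(col) for col in result.columns]
flat = flat.reset_index()

# Option 2: named aggregation, where each keyword is an output column name
named = toy.groupby("species", as_index=False).agg(
    width_mean=("sepal width (cm)", "mean"),
    width_count=("sepal width (cm)", "count"),
)

print(flat)
print(named)
```

Named aggregation also fixes the column order explicitly, since the output columns follow the order of the keyword arguments.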

to_csv

df.to_csv("test.csv",index=False,encoding="utf8")
pd.read_csv("test.csv")

With index=False, the file is easier to read back in later. Sometimes the file cannot be read unless encoding is specified (especially on Windows).
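
To illustrate why `index=False` makes the next read easier, here is a minimal sketch on hypothetical data. It writes to an in-memory buffer via `io.StringIO` instead of a file, which is an assumption made only to keep the example self-contained.

```python
import io

import pandas as pd

# Hypothetical two-row table
toy = pd.DataFrame({
    "species": [0, 1],
    "sepal length (cm)": [5.1, 7.0],
})

# Default: the index is written as an unnamed first column
with_index = pd.read_csv(io.StringIO(toy.to_csv()))

# index=False: only the data columns are written
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```

With the default setting, the round trip picks up an extra `Unnamed: 0` column that has to be dropped or passed to `index_col` when reading.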

Visualization

Preparation

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline is a magic command that displays plots inline in Jupyter.

Box plot

sns.boxplot(data=df, x="species", y="sepal length (cm)")

pairplot

sns.pairplot(data=df)

sns.pairplot(data=df, hue="species")

With hue, the points can be colored by group (here, by species).

jointplot

sns.jointplot(data=df, x="sepal length (cm)", y="sepal width (cm)", kind="kde")

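distplot

sns.distplot(df["sepal length (cm)"], rug=True)

Note that distplot has been deprecated since seaborn 0.11; histplot and displot are its replacements.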
