Functions that are unexpectedly necessary when creating tables and graphs. I tried to summarize such a pivot table function that there is a difference between heaven and earth if you do not know it. By the way, when I was a student, I didn't even know the pivot table function of Excel, and I remember crying and shaping the experimental data ... Since it was posted to Qiita, I tried it with the pandas data frame this time.

What is a pivot table?

One of the functions in Excel, it refers to a tool that extracts necessary data from a list registered in advance and aggregates and analyzes it from all directions. The "aggregate" function is especially convenient, and you can return the result shown on the right from the data format shown on the left in the figure below.

You will need it when you want to cross-tabulate or when you want to create a slightly complicated graph. This time I'll try something similar on a pandas dataframe instead of Excel.

Data used for practice

Borrowed train data from kaggle Titanic Competition. Please refer to here for the meaning of each column.

Main subject

It looks like this as it is ... Difficult to interpret!

Preparation

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Path
input_path = "../input_data/"

# Set Display Max Columns
pd.set_option("display.max_columns", 50)

train = pd.read_csv(input_path + "train.csv", sep=",", header=0, quotechar="\"")
# test = pd.read_csv(input_path + "bank/test.csv", sep=",", header=0, quotechar="\"")

#Missing value processing because it is difficult to handle
train["Embarked"] = train.Embarked.fillna("S") 
train["Age"]=train["Age"].fillna(train.Age.mean())

#Age by age
labels = [ "{0} - {1}".format(i, i + 9) for i in range(0, 100, 10) ]
train["Age"] = pd.cut(train["Age"], np.arange(0, 101, 10),
                      include_lowest=True, right=False,
                      labels=labels)

#Focus on what seems to be easy to handle for this purpose
df = train[["PassengerId", "Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]]

#Data confirmation
df.head()

The names of passengers who seem to be difficult to handle are deleted.

Use the pivot_table function

At a minimum, the following three arguments are required

-Data (first argument): Specify the pandas.DataFrame object of the original data. -Index: Specify the column name of the original data. It becomes the result line heading. -Columns: Specify the column name of the original data. The resulting column heading.

The average value of the columns not specified in the arguments index and columns is calculated as a result, but the columns whose type is not numeric are excluded.

Basic usage

df.pivot_table(index="Age", columns= "Sex", values="Fare")

By default, the average value of the aggregation is returned. (Other results can be calculated by adding aggfunc to the argument.)

Specifies how to calculate the resulting value: argument aggfunc

By default, the average value is calculated, but it is possible to calculate the value by other methods by specifying a function in the argument aggfunc. By default (when the argument aggfunc is omitted), numpy.mean () is specified.

df.pivot_table(index="Age", columns= "Sex", values="Fare", aggfunc='count')

By specifying aggfunc ='count', the number of applicable data is counted.

Multiple rows / columns can be selected: Arguments index, columns

df.pivot_table(index="Age", columns= ["Pclass","Sex"], values="Survived", aggfunc='mean')

Multiple calculation methods can be specified for aggfunc

df.pivot_table(index="Age", columns= ["Pclass","Sex"], values="Survived",  aggfunc=["mean","count"])

Add total result: arguments margins, margins_name

df.pivot_table(index="Age", columns= ["Pclass","Sex"], values="Survived", aggfunc=["mean","count"],margins=True, margins_name="total")

If you want to normalize the result, use the cross_tab function: normalize argument

By specifying the argument normalize of the cross_tab function, it can be standardized for the whole, for each row, and for each column.

pd.crosstab(index=df.Age, columns= df.Sex, values=df.Fare, aggfunc='count',  normalize=True)

-If normalize = True or normalize ='all', the total is standardized to 1. -If normalize ='index' or'columns', it is standardized so that the total is 1 for each row or column.

that's all.

After operating the pivot table, the columns are multi-layered, but I cannot extract them by specifying the columns. If anyone knows, please comment ...

Reference / Citation

・ Http://yaginogogo.hatenablog.jp/entry/2016/04/22/011327 ・ Https://note.nkmk.me/python-pandas-pivot-table/ ・ Https://deepage.net/features/pandas-pivot.html ・ Https://boxil.jp/mag/a2149/ ・ Https://deepage.net/features/pandas-pivot.html ・ Https://qiita.com/kshigeru/items/bfa8c11d1e6487c791d3 ・ Https://qiita.com/hoto17296/items/3442af64c7acb682de6a

[PYTHON] I tried the pivot table function of pandas