[PYTHON] I tried the pivot table function of pandas

Functions that are unexpectedly necessary when creating tables and graphs. I tried to summarize such a pivot table function that there is a difference between heaven and earth if you do not know it. By the way, when I was a student, I didn't even know the pivot table function of Excel, and I remember crying and shaping the experimental data ... Since it was posted to Qiita, I tried it with the pandas data frame this time.

What is a pivot table?

One of the functions in Excel, it refers to a tool that extracts necessary data from a list registered in advance and aggregates and analyzes it from all directions. The "aggregate" function is especially convenient, and you can return the result shown on the right from the data format shown on the left in the figure below.

image.png

You will need it when you want to cross-tabulate or when you want to create a slightly complicated graph. This time I'll try something similar on a pandas dataframe instead of Excel.

Data used for practice

Borrowed train data from kaggle Titanic Competition. Please refer to here for the meaning of each column.

Main subject

It looks like this as it is ... image.png Difficult to interpret!

Preparation

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Path
input_path = "../input_data/"

# Set Display Max Columns
pd.set_option("display.max_columns", 50)

train = pd.read_csv(input_path + "train.csv", sep=",", header=0, quotechar="\"")
# test = pd.read_csv(input_path + "bank/test.csv", sep=",", header=0, quotechar="\"")

#Missing value processing because it is difficult to handle
train["Embarked"] = train.Embarked.fillna("S") 
train["Age"]=train["Age"].fillna(train.Age.mean())

#Age by age
labels = [ "{0} - {1}".format(i, i + 9) for i in range(0, 100, 10) ]
train["Age"] = pd.cut(train["Age"], np.arange(0, 101, 10),
                      include_lowest=True, right=False,
                      labels=labels)

#Focus on what seems to be easy to handle for this purpose
df = train[["PassengerId", "Survived", "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]]

#Data confirmation
df.head()

image.png

The names of passengers who seem to be difficult to handle are deleted.

Use the pivot_table function

At a minimum, the following three arguments are required

-Data (first argument): Specify the pandas.DataFrame object of the original data. -Index: Specify the column name of the original data. It becomes the result line heading. -Columns: Specify the column name of the original data. The resulting column heading.

The average value of the columns not specified in the arguments index and columns is calculated as a result, but the columns whose type is not numeric are excluded.

Basic usage

df.pivot_table(index="Age", columns= "Sex", values="Fare")

image.png By default, the average value of the aggregation is returned. (Other results can be calculated by adding aggfunc to the argument.)

Specifies how to calculate the resulting value: argument aggfunc

By default, the average value is calculated, but it is possible to calculate the value by other methods by specifying a function in the argument aggfunc. By default (when the argument aggfunc is omitted), numpy.mean () is specified.

df.pivot_table(index="Age", columns= "Sex", values="Fare", aggfunc='count')

image.png

By specifying aggfunc ='count', the number of applicable data is counted.

Multiple rows / columns can be selected: Arguments index, columns

df.pivot_table(index="Age", columns= ["Pclass","Sex"], values="Survived", aggfunc='mean')

image.png

Multiple calculation methods can be specified for aggfunc

df.pivot_table(index="Age", columns= ["Pclass","Sex"], values="Survived",  aggfunc=["mean","count"])

image.png

Add total result: arguments margins, margins_name

df.pivot_table(index="Age", columns= ["Pclass","Sex"], values="Survived", aggfunc=["mean","count"],margins=True, margins_name="total")

image.png

If you want to normalize the result, use the cross_tab function: normalize argument

By specifying the argument normalize of the cross_tab function, it can be standardized for the whole, for each row, and for each column.

pd.crosstab(index=df.Age, columns= df.Sex, values=df.Fare, aggfunc='count',  normalize=True)

image.png

-If normalize = True or normalize ='all', the total is standardized to 1. -If normalize ='index' or'columns', it is standardized so that the total is 1 for each row or column.

that's all.

After operating the pivot table, the columns are multi-layered, but I cannot extract them by specifying the columns. If anyone knows, please comment ...

Reference / Citation

・ Http://yaginogogo.hatenablog.jp/entry/2016/04/22/011327 ・ Https://note.nkmk.me/python-pandas-pivot-table/ ・ Https://deepage.net/features/pandas-pivot.html ・ Https://boxil.jp/mag/a2149/ ・ Https://deepage.net/features/pandas-pivot.html ・ Https://qiita.com/kshigeru/items/bfa8c11d1e6487c791d3 ・ Https://qiita.com/hoto17296/items/3442af64c7acb682de6a

Recommended Posts

I tried the pivot table function of pandas
I tried a little bit of the behavior of the zip function
I tried to fight the Local Minimum of Goldstein-Price Function
I tried the asynchronous server of Django 3.0
I tried to get the index of the list using the enumerate function
[Python] I tried substituting the function name for the function name
I tried cluster analysis of the weather map
I tried to touch the API of ebay
[Memo] I tried a pivot table in Python
I tried to correct the keystone of the image
I tried using the image filter of OpenCV
I tried to predict the price of ETF
I tried to vectorize the lyrics of Hinatazaka46!
I tried to learn the sin function with chainer
I tried to summarize the basic form of GPLVM
I tried the MNIST tutorial for beginners of tensorflow.
Drawing on Jupyter using the plot function of pandas
I tried to approximate the sin function using chainer
I examined the argument class_weight of Chainer's softmax_cross_entropy function.
I tried using the API of the salmon data project
I tried to visualize the spacha information of VTuber
I tried to erase the negative part of Meros
I tried scraping the advertisement of the pirated cartoon site
I tried the simplest method of multi-label document classification
I tried to classify the voices of voice actors
I tried running the sample code of the Ansible module
I tried to summarize the string operations of Python
I tried the changefinder library!
I tried Pandas' Sql Upsert
The Power of Pandas: Python
I tried to compare the processing speed with dplyr of R and pandas of Python
I tried to find the entropy of the image with python
[Horse Racing] I tried to quantify the strength of racehorses
I tried "gamma correction" of the image with Python + OpenCV
I tried to get the location information of Odakyu Bus
I tried the accuracy of three Stirling's approximations in python
I tried to find the average of the sequence with TensorFlow
I tried refactoring the CNN model of TensorFlow using TF-Slim
I tried face recognition of the laughter problem using Keras.
I tried morphological analysis of the general review of Kusoge of the Year
I made a function to check the model of DCGAN
I tried to summarize the code often used in Pandas
[Python] I tried to visualize the follow relationship of Twitter
I tried to implement the mail sending function in Python
[Machine learning] I tried to summarize the theory of Adaboost
[Python] I tried collecting data using the API of wikipedia
I displayed the chat of YouTube Live and tried playing
I tried to approximate the sin function using chainer (re-challenge)
I tried AdaNet on table data
I tried the TensorFlow tutorial 1st
I tried the Naro novel API 2
I investigated the mechanism of flask-login!
I tried using GrabCut of OpenCV
I tried the TensorFlow tutorial 2nd
I tried the Naruro novel API
I tried to move the ball
I tried using the checkio API
I tried to estimate the interval.
I tried scraping the ranking of Qiita Advent Calendar with Python
[Linux] I tried to summarize the command of resource confirmation system
I tried to get a database of horse racing using Pandas