[PYTHON] Quickly visualize with Pandas

Python has several data visualization libraries, but Pandas alone goes a long way. Plotting with Pandas fits into a method chain, which helps avoid cluttering the code with temporary variables. This article collects visualization recipes, focusing on the ones I use most often in practice.

Preparation

Environment

Data

This article uses the following two datasets.

Load them into DataFrames named titanic and crime, respectively.

import pandas as pd
import zipfile

with zipfile.ZipFile('titanic.zip') as myzip:
    with myzip.open('train.csv') as myfile:
        titanic = pd.read_csv(myfile)

with zipfile.ZipFile('crimes-in-boston.zip') as myzip:
    with myzip.open('crime.csv') as myfile:
        crime = pd.read_csv(myfile, encoding='latin-1', parse_dates=['OCCURRED_ON_DATE'])

Visualization recipe

Histogram

This is the quickest way to see the distribution of numerical data. Bar charts may be more appropriate when there are few unique values.

titanic['Age'].plot.hist()

image.png
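For reference, the bar heights of a histogram are simply counts per equal-width bin. The sketch below reproduces that with NumPy on made-up values (not the Titanic data). plot.hist() uses 10 bins by default; 4 bins are used here only to keep the output short.

```python
import numpy as np
import pandas as pd

# Made-up values standing in for a numeric column such as Age.
ages = pd.Series([2, 18, 22, 25, 31, 35, 40, 58, 80])

# Equal-width bins between the minimum and the maximum,
# comparable to what plot.hist(bins=4) would draw.
counts, edges = np.histogram(ages, bins=4)

print(counts.tolist(), edges.tolist())
```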

Box Plot

Used to look at quartiles. Points that lie more than 1.5 times the box length (the interquartile range) beyond either edge of the box are drawn as outliers. Pandas cannot draw violin plots, so use Seaborn for those.

titanic['Age'].plot.box()

image.png
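To make the outlier rule concrete, here is a minimal sketch that computes the 1.5 x IQR fences by hand. The ages Series is made-up sample data, and the variable names are my own.

```python
import pandas as pd

# Hypothetical sample data standing in for a numeric column such as Age.
ages = pd.Series([22, 25, 27, 29, 30, 31, 33, 35, 38, 80])

q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1  # interquartile range, i.e. the length of the box

# Fences sit 1.5 * IQR beyond each edge of the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are the ones a box plot draws as separate points.
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]
print(lower_fence, upper_fence, outliers.tolist())
```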

Kernel Density Estimation

Kernel density estimation estimates a probability density function (PDF) from data, although for one-dimensional data a histogram is often enough. For more information on kernel density estimation in Python, see here. This plot uses SciPy; if it is not installed, install it with pip install scipy.

titanic['Age'].plot.kde()

image.png
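As a rough illustration of what kernel density estimation computes, the sketch below builds a Gaussian estimate by hand with NumPy. The sample values and the bandwidth are arbitrary choices for illustration. plot.kde() selects its bandwidth automatically through SciPy, so its curve will not match this one exactly.

```python
import numpy as np

# Hypothetical one-dimensional sample (not the Titanic data).
sample = np.array([21.0, 24.0, 25.0, 30.0, 31.0, 33.0, 45.0, 52.0])
bandwidth = 5.0  # assumed smoothing width, chosen by hand here


def gaussian_kde_manual(x, data, h):
    """Average of Gaussian bumps of width h centered on each data point."""
    z = (x[:, None] - data[None, :]) / h
    kernels = np.exp(-0.5 * z**2) / (h * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)


grid = np.linspace(-20, 100, 1201)
density = gaussian_kde_manual(grid, sample, bandwidth)

# A density should integrate to roughly 1 over a wide enough grid.
area = float(np.sum(density) * (grid[1] - grid[0]))
print(round(area, 3))
```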

Scatter Plot

Used to see the relationship between two real-valued variables. When many points overlap, the density becomes impossible to read, so making the points semi-transparent is standard practice. If either variable is categorical or has few unique values, the grouped histograms and box plots described below are a better choice.

titanic.plot.scatter(x='Age', y='Fare', alpha=0.3)

image.png

Hexagonal Binning Plot

I have never used this one in practice, but I include it for completeness.

titanic.plot.hexbin(x='Age', y='Fare', gridsize=30)

image.png

Bar Plot

It is often used to see aggregated values for each category.

titanic['Embarked'].value_counts(dropna=False).plot.bar()

image.png

Horizontal Bar Plot

The same bar chart, laid on its side.

titanic['Embarked'].value_counts(dropna=False).plot.barh()

image.png

Horizontal Bar Plot with DataFrame Styling

You can make the DataFrame itself look like a bar chart. I use this a lot because the values remain text and can be searched.

titanic['Embarked'].value_counts(dropna=False).to_frame().style.bar(vmin=0)

image.png

Line Plot

Often used to see how a series changes over time.

crime['OCCURRED_ON_DATE'].dt.date.value_counts().plot.line(figsize=(16, 4))

image.png

Area Plot

Like the line plot, this shows changes in a series, but it emphasizes the magnitude measured from zero. If the data is too fine-grained, the valleys become hard to see, so it is better to discretize it a little, for example by aggregating into coarser time bins.

crime['OCCURRED_ON_DATE'].dt.date.value_counts().plot.area(figsize=(16, 4), linewidth=0)

image.png
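One way to discretize is to count per week instead of per day. The sketch below uses synthetic timestamps; the events Series and the weekly frequency are assumptions for illustration, not part of the crime dataset.

```python
import pandas as pd

# Synthetic event timestamps: one event every 7 hours, 200 events in total.
offsets = pd.to_timedelta([7 * i for i in range(200)], unit='h')
events = pd.Series(pd.Timestamp('2020-01-01') + offsets)

# Daily counts, put back into chronological order
# (value_counts sorts by count, not by date).
daily = events.dt.date.value_counts().sort_index()

# Weekly counts: coarser bins, so valleys stay visible in an area plot.
weekly = events.dt.to_period('W').value_counts().sort_index()

print(len(daily), len(weekly))
# weekly.plot.area(linewidth=0) would then draw the coarser series.
```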

Pie Plot

I don't use pie charts because they are hard to read, but I include them for completeness. The reasons they are hard to read are summarized in the following article.

- Do you still use pie charts? - Data Visualization Ideabook

titanic['Embarked'].value_counts(dropna=False).plot.pie()

image.png

Grouped Histogram

Often used to compare distributions between two groups, though it works with any number of groups.

titanic.groupby('Survived')['Age'].plot.hist(alpha=0.5, legend=True)

Or

titanic['Age'].groupby(titanic['Survived']).plot.hist(alpha=0.5, legend=True)

The latter form also accepts an external Series as the grouping key.

image.png
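To illustrate the external Series form, here is a small sketch with made-up data. The key does not need to be a column of the same DataFrame; it is matched to the values by index label.

```python
import pandas as pd

# Made-up values and a separate Series of group labels on the same index labels.
ages = pd.Series([22.0, 38.0, 26.0, 35.0], index=[10, 11, 12, 13])
survived = pd.Series([1, 0, 0, 1], index=[13, 12, 11, 10])  # different order

# groupby aligns the key to the values by index label, not by position.
means = ages.groupby(survived).mean()
print(means.to_dict())
```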

Grouped Box Plot

groupby does not work here, so write it as follows, using the by argument.

titanic.boxplot(column='Age', by='Survived')

image.png

Grouped Kernel Density Estimation

Like the histogram, this can be used to compare distributions between groups.

titanic['Age'].groupby(titanic['Survived']).plot.kde(legend=True)

image.png

Grouped Scatter Plot

I use this often, but I have not found a neat way to write it. With groupby, each group is drawn as a separate plot, and the plots come back as a list, one per group.

titanic.groupby('Survived').plot.scatter(x='Age', y='Fare', alpha=0.3)

image.png image.png

The following only works when the key is numeric, but it draws a single scatter plot with a different color for each group.

titanic.plot.scatter(x='Age', y='Fare', c='Survived', cmap='viridis', alpha=0.3)

image.png
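When the grouping key is a string column, one workaround is to convert it to integer codes first. This is a minimal sketch on a made-up DataFrame; the Embarked_code column name is my own, and pd.factorize is only one of several ways to obtain the codes.

```python
import pandas as pd

# Made-up data with a non-numeric grouping key.
df = pd.DataFrame({
    'Age': [22, 38, 26, 35],
    'Fare': [7.25, 71.28, 7.92, 53.10],
    'Embarked': ['S', 'C', 'S', 'Q'],
})

# factorize assigns 0, 1, 2, ... in order of first appearance.
codes, labels = pd.factorize(df['Embarked'])
df['Embarked_code'] = codes

print(df['Embarked_code'].tolist(), list(labels))
# df.plot.scatter(x='Age', y='Fare', c='Embarked_code', cmap='viridis')
# would then color the points by the numeric code.
```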

The official Pandas documentation shows how to draw two plots on a shared Axes.

ax = titanic[titanic['Survived'] == 0].plot.scatter(x='Age', y='Fare', label=0, alpha=0.3)
titanic[titanic['Survived'] == 1].plot.scatter(x='Age', y='Fare', c='tab:orange', label=1, alpha=0.3, ax=ax)

image.png

Grouped Hexagonal Binning Plot

titanic.groupby('Survived').plot.hexbin(x='Age', y='Fare', gridsize=30)

image.png image.png

Grouped Bar Plot

titanic['Embarked'].groupby(titanic['Survived']).value_counts(dropna=False).unstack().plot.bar()

image.png

titanic['Embarked'].groupby(titanic['Survived']).value_counts(dropna=False).unstack(0).plot.bar()

image.png

Grouped Horizontal Bar Plot

titanic['Embarked'].groupby(titanic['Survived']).value_counts(dropna=False).unstack().plot.barh()

image.png

titanic['Embarked'].groupby(titanic['Survived']).value_counts(dropna=False).unstack(0).plot.barh()

image.png

Grouped Horizontal Bar Plot with DataFrame Styling

titanic['Embarked'].groupby(titanic['Survived']).value_counts(dropna=False).unstack(0).style.bar(vmin=0, axis=None)

image.png

Grouped Line Plot

crime['OCCURRED_ON_DATE'].dt.date.groupby(crime['DISTRICT']).value_counts().unstack(0).plot.line(figsize=(16, 4), alpha=0.5)

image.png

crime['OCCURRED_ON_DATE'].dt.date.groupby(crime['DISTRICT']).value_counts().unstack(0).iloc[:, :4].plot.line(figsize=(16, 4), alpha=0.5)

image.png

Stacked Area Plot

crime['OCCURRED_ON_DATE'].dt.date.groupby(crime['DISTRICT']).value_counts().unstack(0).plot.area(figsize=(16, 4), linewidth=0)

image.png

crime['OCCURRED_ON_DATE'].dt.date.groupby(crime['DISTRICT']).value_counts().unstack(0).iloc[:, :4].plot.area(figsize=(16, 4), linewidth=0)

image.png

Grouped Pie Plot

titanic['Embarked'].groupby(titanic['Survived']).value_counts(dropna=False).unstack(0).plot.pie(subplots=True)

image.png

Stacked Bar Plot

titanic['Embarked'].groupby(titanic['Survived']).value_counts(dropna=False).unstack().plot.bar(stacked=True)

image.png

titanic['Embarked'].groupby(titanic['Survived']).value_counts(dropna=False).unstack(0).plot.bar(stacked=True)

image.png

Stacked Horizontal Bar Plot

titanic['Embarked'].groupby(titanic['Survived']).value_counts(dropna=False).unstack().plot.barh(stacked=True)

image.png

titanic['Embarked'].groupby(titanic['Survived']).value_counts(dropna=False).unstack(0).plot.barh(stacked=True)

image.png

Percent Stacked Bar Plot

To draw a 100% stacked bar chart, you have to calculate the percentage.

(titanic['Embarked'].groupby(titanic['Survived']).value_counts(dropna=False).unstack()
 .div(titanic['Survived'].value_counts(dropna=False), axis=0)
 .plot.bar(stacked=True))

image.png

(titanic['Embarked'].groupby(titanic['Survived']).value_counts(dropna=False).unstack(0)
 .div(titanic['Embarked'].value_counts(dropna=False), axis=0)
 .plot.bar(stacked=True))

image.png
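The normalization step can be sketched on a small made-up table. Dividing each row by its row total makes every stacked bar sum to 1. pd.crosstab with normalize='index' is an alternative I am suggesting here, not something the recipes above use.

```python
import pandas as pd

# Made-up records: two groups, three categories.
df = pd.DataFrame({
    'Survived': [0, 0, 0, 1, 1, 1, 1, 0],
    'Embarked': ['S', 'S', 'C', 'S', 'C', 'Q', 'C', 'Q'],
})

counts = df.groupby('Survived')['Embarked'].value_counts().unstack(fill_value=0)

# Divide each row by its total so that every row sums to 1.
shares = counts.div(counts.sum(axis=1), axis=0)

# Alternative: crosstab can normalize the rows directly.
shares_crosstab = pd.crosstab(df['Survived'], df['Embarked'], normalize='index')

print(shares.round(2))
# shares.plot.bar(stacked=True) would draw the 100% stacked chart.
```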

Percent Stacked Horizontal Bar Plot

(titanic['Embarked'].groupby(titanic['Survived']).value_counts(dropna=False).unstack()
 .div(titanic['Survived'].value_counts(dropna=False), axis=0)
 .plot.barh(stacked=True))

image.png

(titanic['Embarked'].groupby(titanic['Survived']).value_counts(dropna=False).unstack(0)
 .div(titanic['Embarked'].value_counts(dropna=False), axis=0)
 .plot.barh(stacked=True))

image.png

Overlay Plots

Overlay the histogram and the kernel density estimation graph.

titanic['Age'].groupby(titanic['Survived']).plot.hist(alpha=0.5, legend=True)
titanic['Age'].groupby(titanic['Survived']).plot.kde(legend=True, secondary_y=True)

image.png

Grouped Bar Plot with Error Bars

Error bars require a measure of spread to be computed up front. Note that the code below passes the standard deviation (std()), not the standard error.

yerr = titanic.groupby(['Survived', 'Pclass'])['Fare'].std().unstack(0)
titanic.groupby(['Survived', 'Pclass'])['Fare'].mean().unstack(0).plot.bar(yerr=yerr)

image.png
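If the standard error of the mean is what you want, it is the standard deviation divided by the square root of the group size. Here is a minimal sketch on made-up numbers showing both the manual calculation and the sem() method available on grouped data.

```python
import numpy as np
import pandas as pd

# Made-up fares for two groups.
df = pd.DataFrame({
    'Pclass': [1, 1, 1, 1, 3, 3, 3, 3],
    'Fare': [50.0, 70.0, 90.0, 110.0, 7.0, 8.0, 9.0, 10.0],
})

grouped = df.groupby('Pclass')['Fare']
std = grouped.std()                          # sample standard deviation (ddof=1)
sem_manual = std / np.sqrt(grouped.count())  # standard error of the mean
sem_builtin = grouped.sem()                  # same quantity, computed by pandas

print(sem_builtin.round(3).to_dict())
# grouped.mean().plot.bar(yerr=sem_builtin) would use the standard error.
```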

Heat Map with DataFrame Styling

(pd.crosstab(crime['DAY_OF_WEEK'], crime['HOUR'].div(3).map(int).mul(3), normalize=True)
 .reindex(['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'])
 .style.background_gradient(axis=None).format('{:.3%}'))

image.png
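The expression div(3).map(int).mul(3) in the recipe above rounds each hour down to the start of its 3-hour block. A sketch on a few sample hours, alongside integer floor division, which is an equivalent form I find easier to read:

```python
import pandas as pd

hours = pd.Series([0, 1, 2, 3, 11, 12, 23])  # sample hours of the day

binned = hours.div(3).map(int).mul(3)  # form used in the recipe
binned_floor = hours // 3 * 3          # equivalent floor-division form

print(binned.tolist())
```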

If you change the color map to a green one, it looks like a lawn.

(pd.crosstab(crime['DAY_OF_WEEK'], crime['MONTH'], normalize=True)
 .reindex(['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'])
 .style.background_gradient(axis=None, cmap='YlGn').format('{:.3%}'))

image.png

Correlation Heat Map with DataFrame Styling

This is covered in the following article.

- [One line] Heatmap the correlation matrix with Pandas only

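# Keep numeric columns only; recent pandas versions raise an error on text columns.
corr = titanic.select_dtypes('number').corr()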
low = (1 + corr.values.min()) / (1 - corr.values.min())
corr.style.background_gradient(axis=None, cmap='viridis', low=low).format('{:.6f}')

image.png
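The low value is chosen so that the color scale starts at -1 rather than at the smallest correlation in the table. As I understand the pandas documentation, low extends the color range below the data minimum by a multiple of the data range. Under that assumption, the arithmetic works out as follows (the minimum of -0.5 is a made-up number).

```python
# Worked arithmetic for the `low` argument, assuming the color range is
# extended below the data minimum by low * (max - min).
corr_min = -0.5  # hypothetical smallest correlation in the matrix
corr_max = 1.0   # the diagonal of a correlation matrix is always 1

low = (1 + corr_min) / (1 - corr_min)
data_range = corr_max - corr_min
scale_start = corr_min - low * data_range  # lower end of the color scale

print(low, scale_start)
```

With these numbers the scale starts at -1 (up to floating-point rounding), so a correlation of 0 falls in the middle of the color map.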

Conclusion

I have introduced the recipes that seem relatively easy to use. If you know of others, please let me know. If you want to draw more elaborate graphs, the following page will be helpful.
