[PYTHON] About Boxplot and Violinplot that visualize the variability of independent data

Introduction

This article uses Python 2.7, numpy 1.11, scipy 0.17, scikit-learn 0.18, matplotlib 1.5, seaborn 0.7, and pandas 0.17. The code has been confirmed to work in Jupyter Notebook (adjust the %matplotlib inline line as appropriate for your environment). The plots are drawn with seaborn's boxplot and violinplot.

Table of contents

  1. Data generation
  2. Boxplot
  3. Violinplot
  4. Finally
  5. Reference

1. Data generation

If you have your own data, skip this section.

Use scikit-learn's make_classification to create 1000 samples of two-dimensional, two-class data. The two numerical features are named A and B, and the class label is named sex. In addition, numpy.random.binomial() randomly generates the values 0, 1, and 2, which are concatenated as one more column named types.

make_classification.py


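import numpy as np
from sklearn.datasets import make_classification
import pandas as pd

# x: 1000 samples with 2 informative features, y: binary class label
x, y = make_classification(n_samples=1000, n_features=2, n_redundant=0, n_informative=2, n_clusters_per_class=2, n_classes=2)
# Append y and a binomial(2, 0.5) column (values 0, 1, 2) as extra columns
data = np.c_[np.c_[x, y], np.random.binomial(2, .5, len(x))]
data = pd.DataFrame(data).rename(columns={0:'A', 1:'B', 2:'sex', 3:'types'})
# np.c_ promotes every column to float; cast the category columns back to int
data[['sex', 'types']] = data[['sex', 'types']].astype(int)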

The first 10 rows of data look like this:

          A         B  sex  types
0  2.131411 -1.754907    0      1
1 -0.046614 -1.009540    0      2
2  0.136387 -0.236662    1      1
3 -3.515190  2.117925    1      1
4 -2.099287  1.647548    1      1
5 -0.536360 -0.920529    0      0
6  0.281726 -0.572448    1      2
7  2.202351 -3.214435    0      1
8 -0.825666  0.847394    1      0
9 -1.602873  1.338847    1      2

With this, we have two numerical columns (A and B) together with two categorical columns (sex and types).
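One detail of the types column is worth checking: numpy.random.binomial(2, .5, ...) does produce the values 0, 1, and 2, but not with equal probability. The following is a small sketch of that check. The fixed seed and the sample size of 100000 are assumptions made here for repeatability and are not part of the original code.

```python
import numpy as np

# Fixed seed so the check is repeatable (an assumption, not from the article)
rng = np.random.RandomState(0)
types = rng.binomial(2, 0.5, 100000)

# Relative frequency of each of the values 0, 1, 2
freq = np.bincount(types, minlength=3) / float(len(types))
print(freq)  # close to [0.25, 0.5, 0.25]
```

So roughly half of the rows fall into types 1, which is why the groups for types 0 and 2 contain fewer samples in the plots.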

2. Boxplot

A box plot is suitable for visualizing the variation of numerical data grouped by two kinds of category. Here we use seaborn's boxplot.

boxplot.py


import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

sns.boxplot(x='types', y="A", hue='sex', data=data, palette="PRGn")
sns.despine(offset=10, trim=True)

[Figure: box plot of A for each value of types, colored by sex]

This draws a box plot for each combination of sex and types. As described on Wikipedia, the line in the middle of each box is the median, and the bottom and top of the box are the 1st and 3rd quartiles, respectively. With seaborn's default settings, the whiskers extend to the smallest and largest values that lie within 1.5 times the interquartile range from the box, and the points drawn beyond the whiskers are treated as "outliers". Only when there are no such outliers do the whisker ends coincide with the minimum and maximum values. Let's visualize the other numerical column, B, in the same way.
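Before moving on to B, the elements of a box plot can be reproduced numerically. The sketch below assumes the common rule of whiskers at 1.5 times the interquartile range (the default `whis=1.5` in matplotlib and seaborn). The helper name `box_stats` and the sample values are hypothetical and only serve as illustration.

```python
import numpy as np

def box_stats(values, whis=1.5):
    """Return the numbers a box plot draws for one group (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    low_fence = q1 - whis * iqr
    high_fence = q3 + whis * iqr
    inside = values[(values >= low_fence) & (values <= high_fence)]
    outliers = values[(values < low_fence) | (values > high_fence)]
    return {
        'q1': q1, 'median': median, 'q3': q3,
        # Whiskers stop at the most extreme points that are still inside the fences
        'whisker_low': inside.min(), 'whisker_high': inside.max(),
        'outliers': outliers.tolist(),
    }

# Made-up sample with one obviously extreme value
stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(stats)
```

In this sample the value 30 lies beyond the upper fence, so the upper whisker stops at 9 and 30 is reported as an outlier rather than as the maximum of the whisker.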

[Figure: box plot of B for each value of types, colored by sex]

In this data, B differs clearly between the two sex values (the data are generated randomly, so they differ on every run). For types, on the other hand, it looks difficult to separate the groups from this plot alone. In this way, boxplot expresses the variation of data with two kinds of category in an easy-to-understand manner.
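The visual impression can be backed up with a quick per-group summary. The sketch below uses only the 10 rows of the example table shown earlier, not the full 1000 samples, so the numbers are merely indicative. The variable names are chosen here for illustration.

```python
import pandas as pd

# B, sex and types from the first 10 rows of the example table
sample = pd.DataFrame({
    'B': [-1.754907, -1.009540, -0.236662, 2.117925, 1.647548,
          -0.920529, -0.572448, -3.214435, 0.847394, 1.338847],
    'sex': [0, 0, 1, 1, 1, 0, 1, 0, 1, 1],
    'types': [1, 2, 1, 1, 1, 0, 2, 1, 0, 2],
})

# Median of B within each group: the value drawn as the middle line of each box
median_b_by_sex = sample.groupby('sex')['B'].median()
median_b_by_types = sample.groupby('types')['B'].median()

print(median_b_by_sex)
print(median_b_by_types)
```

Even in this small excerpt the median of B is negative for sex 0 and positive for sex 1, which matches what the box plot shows.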

3. Violinplot

Like boxplot, violinplot visualizes differences in numerical data grouped by two kinds of category. Here, each group of values is drawn as a distribution.

Data preparation

Reshape the DataFrame using the melt function of pandas.

melt.py


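# Keep 'types' and 'sex' as identifiers and stack the remaining columns (A, B) into long format
data_batch = pd.melt(data, id_vars=['types', 'sex'], value_vars=data.columns[:-2].tolist())
# print() with a single argument works in both Python 2.7 and Python 3
print(data_batch[:10])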

This "unpivots" the DataFrame from wide format to long format. Here is the execution result.

   types  sex variable     value
0      1    0        A  2.131411
1      2    0        A -0.046614
2      1    1        A  0.136387
3      1    1        A -3.515190
4      1    1        A -2.099287
5      0    0        A -0.536360
6      2    1        A  0.281726
7      1    0        A  2.202351
8      0    1        A -0.825666
9      2    1        A -1.602873

The column names of the numerical data (A and B) go into the variable column, and the corresponding numbers go into the value column.
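To see exactly what melt does, it helps to run it on a tiny frame. The sketch below uses just the first two rows of the example table. The names `wide` and `long_form` are chosen here for illustration.

```python
import pandas as pd

# First two rows of the example table, in wide format
wide = pd.DataFrame({
    'A': [2.131411, -0.046614],
    'B': [-1.754907, -1.009540],
    'sex': [0, 0],
    'types': [1, 2],
}, columns=['A', 'B', 'sex', 'types'])

# Each (row, numerical column) pair becomes one row in long format
long_form = pd.melt(wide, id_vars=['types', 'sex'], value_vars=['A', 'B'])
print(long_form)
```

Two rows with two numerical columns become four rows: the identifier columns types and sex are repeated, and all rows for A come before all rows for B.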

Create violin plot

Visualize the unpivoted data using violinplot.

violinplot.py


data_batch_A = data_batch[data_batch.variable=='A']
sns.violinplot(x = 'types',  y = 'value', hue = 'sex', data = data_batch_A, split=True)
sns.despine(offset=10, trim=True)

[Figure: split violin plot of A for each value of types, with one half per sex]

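The result is a striking plot of left-right asymmetric shapes: because of split=True, the left and right halves of each violin show the distributions for the two sex values. In the box plot I only looked at the median and quartiles, so I had the impression that each group was roughly normally distributed. The violin plot, on the other hand, draws an estimate of the distribution itself (a kernel density estimate), so multiple peaks (multimodality) can be observed in the data for each type. Let's visualize the numerical data of B in the same way.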

[Figure: split violin plot of B for each value of types, with one half per sex]

As with the box plot, the distribution of B is clearly separated by sex. As for types, my impression is that the groups cannot be classified from the shape of the distribution either.
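The difference between the two plots, quartiles versus the shape of the distribution, can be illustrated with a hand-rolled Gaussian kernel density estimate. This is a minimal sketch and not seaborn's implementation: the helper `gaussian_kde_1d`, the bandwidth of 0.5, the seed, and the synthetic two-cluster sample are all assumptions made for the example.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernel.sum(axis=1) / (len(samples) * bandwidth)

# Hypothetical bimodal sample: two well-separated normal clusters
rng = np.random.RandomState(0)
bimodal = np.concatenate([rng.normal(-2, 0.5, 500), rng.normal(2, 0.5, 500)])

grid = np.linspace(-5, 5, 201)
density = gaussian_kde_1d(bimodal, grid, bandwidth=0.5)

# What a box plot reports: three numbers that say nothing about the two peaks
q1, median, q3 = np.percentile(bimodal, [25, 50, 75])

# What a violin plot draws: count the clear local maxima of the density curve
inner = density[1:-1]
is_peak = (inner > density[:-2]) & (inner > density[2:]) & (inner > 0.1 * density.max())
n_peaks = int(is_peak.sum())

print(q1, median, q3)
print(n_peaks)
```

The quartiles alone would be compatible with a single broad hump, while the density curve shows two separate peaks with a sparsely populated region around the median.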

4. Finally

I introduced boxplot and violinplot. Boxplot is useful when you want to focus on the quartiles and the median, and violinplot when you want to see the shape and multimodality of the distribution. Either way, both are convenient for visualizing data when each variable is viewed independently, without considering correlations between variables.

5. Reference

Summary of scikit-learn data sources that can be used when writing analysis articles
boxplot
violinplot
