Purpose

When you are learning Python and writing code, have you ever run into either of these problems?

I can't remember which seaborn method I want to use.
I don't know which arguments it needs.

I lacked this knowledge myself, so every time I processed data I had to look up the arguments of the seaborn method before setting them.

In this article, I explain the frequently used seaborn methods that even beginners need to understand, together with the minimum arguments each one requires, so that you can check them quickly.

Rough flow

The data used here comes from the "House Price" competition on Kaggle.
In a nutshell, the competition asks you to predict the selling price of a house from features such as its size and location.
We read the data and draw graphs to understand its features.
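
The code in the rest of this article refers to a DataFrame named house_price without showing how it is created. The following is a minimal sketch of the loading step, assuming the training data was downloaded from Kaggle and saved as "train.csv" (the file name, and the helper names load_house_prices and summarize, are assumptions for illustration and do not come from the original article).

```python
import pandas as pd


def load_house_prices(path="train.csv"):
    """Read the competition's training data into a DataFrame.

    "train.csv" is an assumed file name; point `path` at wherever
    you saved the file downloaded from Kaggle.
    """
    return pd.read_csv(path)


def summarize(df):
    """Return (number of rows, number of columns, numeric column names)."""
    numeric_cols = df.select_dtypes(include="number").columns.tolist()
    return len(df), df.shape[1], numeric_cols
```

With the file in place, house_price = load_house_prices() gives the DataFrame used below, and summarize(house_price) is a quick way to see how many items there are before drawing anything.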

Notes

The parameters are narrowed down as much as possible so that even beginners can easily see how to use each method. As a result, the graphs are rather rough. When I searched various sites, I kept wondering, "In the end, what is the minimum I need just to display something?", so I deliberately kept the parameters to a minimum.
Where an extra argument noticeably improves the display, it is mentioned in a comment in the code.

(If there are other parameters that you think should be set, I would appreciate a comment.)

Target person

This article is written for people who:

want to start Kaggle from now on
want to be able to visualize the characteristics of data
want to be able to implement seaborn plots from scratch

What you can understand in this article

When you start data analysis, for example with a Kaggle tutorial, the first question is usually "What kind of data is in this dataset?" After reading this article, you should have a grasp of the seaborn basics needed to visualize the data and answer that question yourself.

Environment

Windows 10 (version 1909)
Python 3.7.6
seaborn 0.10.0
Pandas 1.0.1
matplotlib 3.2.1

Prerequisites

You have some light experience with Python.
You have high-school-level math knowledge.

Graph drawing

Heat map

Official site: heatmap

A heat map shows the magnitude of numerical data as colors, laid out as a pairwise matrix (like a round-robin table).

Here, we draw the graph using only the items whose correlation coefficient (absolute value) with the selling price of the house (SalePrice) is greater than 0.5.

import seaborn as sns

# house_price is assumed to be a pandas DataFrame holding the competition's training data

# Get the correlation coefficients between the items
# (on pandas 2.0 or later, non-numeric columns must be excluded: house_price.corr(numeric_only=True))
corr_mat = house_price.corr()

# Select the items whose correlation coefficient (absolute value) with SalePrice is greater than 0.5
top_corr_features = corr_mat.index[abs(corr_mat["SalePrice"]) > 0.5]

# Optional: enlarging the figure first makes the heat map easier to read
# import matplotlib.pyplot as plt
# plt.figure(figsize=(11, 11))

# Check the correlation
sns.heatmap(data=house_price[top_corr_features].corr(), annot=True, cmap="RdYlGn")

The contents of the arguments are as follows.

argument	Contents
data	Target data
annot	Whether to display the value in the matrix
cmap	Color type

If the only goal is to get the plot to work, you can leave out the annot and cmap arguments. However, from the perspective of visualizing the features:

It is easier to understand when each cell of the matrix shows its numerical value (here, the correlation coefficient).
A gradient from red to green is more intuitive than the default palette, which stays within reddish tones.

For these reasons, I think it is best to set at least these parameters.

The results are as follows.

This is a pairwise correlation matrix restricted to the items whose correlation coefficient with SalePrice is greater than 0.5. The items are narrowed down because the dataset has about 80 of them, and a matrix of all of them would be too crowded to be useful as a visualization.

In this example, greener cells indicate a stronger correlation, and the following two items are particularly highly correlated with SalePrice.

Item	Correlation coefficient
OverallQual (overall quality of the house)	0.79
GrLivArea (living area)	0.71

In this way, heatmaps can be used to determine the correlation of the data being analyzed.

Supplement

The absolute value of the correlation coefficient is taken with Python's built-in abs function.
The correlation (absolute value) is stronger the closer it is to 1.0 and weaker the closer it is to 0.
Because the original data has many items, the graph is drawn after narrowing them down.
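
To make the role of the absolute value concrete, here is a small sketch in plain Python of how a correlation coefficient is computed and how items are filtered on it. It assumes the Pearson coefficient, which is the default of pandas' corr(). The pearson helper and the toy columns are hypothetical and are not taken from the competition data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy columns, not taken from the real dataset.
columns = {
    "SalePrice": [100, 150, 200, 250, 300],
    "Quality": [1, 2, 3, 4, 5],    # rises with the price
    "Age": [50, 40, 30, 20, 10],   # falls as the price rises
    "Noise": [3, 1, 2, 1, 3],      # unrelated to the price
}

target = columns["SalePrice"]
corr_with_target = {name: pearson(vals, target) for name, vals in columns.items()}

# Keep the columns whose |r| exceeds 0.5, as in the heat map example.
selected = [name for name, r in corr_with_target.items() if abs(r) > 0.5]
print(selected)
```

Because the filter uses abs, a strongly negative item such as the toy "Age" column is kept as well. Without abs it would be dropped even though it is closely related to the price.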

Bar plot

barplot draws a bar chart that shows the mean of each category together with error bars.

Official site: barplot (http://seaborn.pydata.org/generated/seaborn.barplot.html)

Here we display OverallQual, the item that had the highest correlation with SalePrice on the heat map.

import seaborn as sns
sns.set()

sns.barplot(x=house_price.OverallQual, y=house_price.SalePrice)

The results are as follows.

The horizontal axis is OverallQual (the overall quality of the house), and the vertical axis is SalePrice (the selling price).

You can see that the better the overall quality of a house, the higher its selling price.
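
As a rough illustration of what the bar heights represent, the sketch below groups prices by category and takes the mean of each group in plain Python. The rows are hypothetical values, not real competition data, and the error bars that seaborn estimates separately are not reproduced here.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (OverallQual, SalePrice) pairs, not real competition rows.
rows = [(5, 120), (5, 140), (6, 150), (6, 170), (6, 190), (7, 210), (7, 250)]

# Collect the prices that belong to each quality level.
prices_by_quality = defaultdict(list)
for quality, price in rows:
    prices_by_quality[quality].append(price)

# One bar per category; the bar height is the mean of that group.
bar_heights = {q: mean(p) for q, p in sorted(prices_by_quality.items())}
print(bar_heights)
```

In this toy data the mean rises from quality 5 to quality 7, which is the same kind of trend the bar plot makes visible.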

Bar graph (count plot)

countplot draws a bar chart that shows the number of records in each category of a categorical variable.

import seaborn as sns
sns.set()

sns.countplot(x='MSSubClass', data=house_price)

The results are as follows.

The horizontal axis shows MSSubClass, and the vertical axis shows the number of records in each class.
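
The counting behind this kind of plot can be sketched with the standard library alone. The codes below are hypothetical values chosen for illustration, not real rows from the dataset.

```python
from collections import Counter

# Hypothetical MSSubClass codes, not real competition rows.
ms_sub_class = [20, 60, 20, 50, 60, 20, 120, 60, 20]

# Each distinct code becomes one bar; the count is the bar height.
counts = Counter(ms_sub_class)
for code, n in sorted(counts.items()):
    print(code, n)
```

Note that the codes are treated as category labels here, not as quantities: only how often each one appears matters.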

Histogram

A histogram divides the data into intervals (bins) and shows how many values fall into each one.

import seaborn as sns
sns.set()

# Note: distplot is deprecated since seaborn 0.11; on newer versions histplot is the replacement
sns.distplot(house_price['SalePrice'])

The results are as follows.

Official site: distplot
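
The binning step itself can be sketched in a few lines of plain Python. This is only an illustration with equal-width bins and a bin count chosen by hand; the histogram helper and the prices are hypothetical, and seaborn chooses its own bin count.

```python
def histogram(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins.

    Assumes the values are not all identical, so the bin width is non-zero.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value belongs to the last bin rather than a new one.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


# Hypothetical prices, not real competition rows.
sale_prices = [100, 110, 120, 130, 150, 160, 200, 240, 300]
edges, counts = histogram(sale_prices, n_bins=4)
print(edges)
print(counts)
```

Most of the toy prices land in the lowest bin and only one in the highest, so the bars would be tall on the left with a tail to the right.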

Scatter plot

scatterplot draws a simple scatter plot of two variables.

import seaborn as sns
sns.set()

sns.scatterplot(data=house_price, x='GrLivArea', y='SalePrice')

The results are as follows.

Official site: scatterplot

Multiple display of graphs (pair plot)

You can draw the scatter plots for every pair of items at once instead of one by one. This is useful when you want a comprehensive view of the correlations.

import seaborn as sns
sns.set()

# Draw the graph using only the items that correlate strongly with SalePrice
# (top_corr_features comes from the heat map section)
sns.pairplot(data=house_price[top_corr_features])

The results are as follows.

Official site: pairplot
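
To see why the pair plot grows quickly with the number of items, the sketch below lists the panels of the grid for three items in plain Python. It assumes the usual layout in which the diagonal shows the distribution of a single item and every other panel is a scatter plot. The feature subset and the panels dictionary are for illustration only.

```python
from itertools import product

# A subset of items chosen for illustration.
features = ["SalePrice", "OverallQual", "GrLivArea"]

# One panel per (row item, column item) combination.
panels = {}
for row, col in product(features, repeat=2):
    if row == col:
        panels[(row, col)] = "distribution of " + row
    else:
        panels[(row, col)] = "scatter: " + col + " (x) vs " + row + " (y)"

for key, kind in panels.items():
    print(key, kind)
```

Three items already give 9 panels, and n items give n * n, which is one more reason to narrow the items down before calling pairplot.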

Finally

There are other plots such as box plots and violin plots, but they are left out of this explanation because they come up less often.
For now I have focused on the frequent methods and the minimum arguments. Some parts are still a bit hard to follow, so I plan to tidy them up.

Reference site

Based on the following sites, I proceeded with learning while checking the official site.

This article is really minimal, so if you want to learn more, please take a look.
