Have you ever had this experience while learning Python and writing code with it?
I did not know seaborn well, so every time I processed data I had to look up the arguments of each seaborn function before setting them.
In this article, I explain the frequently used seaborn functions that even beginners should know, together with the minimum arguments needed for a quick look at the data.
The data used here is the "House Prices" dataset from Kaggle.
In a nutshell, House Prices is a competition to predict a house's sale price from features such as its size and location.
We read the data and draw graphs to understand its features.
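As a minimal sketch of the reading step: the code below assumes the competition's training file has been downloaded as `train.csv` (the file name and path are assumptions), and uses a few made-up rows in its place so that it runs on its own. The name `house_price` matches the variable used in the rest of this article.

```python
import io
import pandas as pd

# Made-up rows standing in for the Kaggle file (illustrative values only).
sample_csv = io.StringIO(
    "Id,MSSubClass,OverallQual,GrLivArea,SalePrice\n"
    "1,20,5,1000,150000\n"
    "2,60,7,1800,230000\n"
    "3,50,6,1400,180000\n"
)

# With the real data, pass the path of the downloaded file instead,
# for example: house_price = pd.read_csv("train.csv")
house_price = pd.read_csv(sample_csv)

print(house_price.shape)   # (number of rows, number of columns)
print(house_price.dtypes)  # data type of each column
print(house_price.head())  # first rows of the table
```

Checking `shape` and `dtypes` first tells you how many items there are and which ones are numeric, which matters for the correlation step later.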
The arguments are kept to a minimum so that even beginners can easily see how to use each function. As a result, the graphs are not very polished. When I searched various sites, I kept wondering, "In the end, what is the minimum I need to get something displayed?", so I deliberately narrowed the arguments down.
The arguments needed to display each graph are explained with comments where they are used.
(If there are other arguments that you think should be set, I would appreciate a comment.)
This article is written for people like this:
people who have just started data analysis, for example with Kaggle tutorials, and wonder "What kind of data is in this dataset?" For that purpose, I think this article covers the basics of seaborn needed to visualize the data yourself.
A heatmap uses color to show the magnitude of numerical values in a matrix that covers every pair of items.
Here, we draw the graph using only the items whose correlation coefficient (absolute value) with the sale price of the house (SalePrice) is greater than 0.5.
import seaborn as sns
#Get the correlation coefficients (numeric columns only; recent pandas versions raise an error on text columns)
corr_mat = house_price.select_dtypes(include="number").corr()
#Get the names of the items whose correlation coefficient (absolute value) with SalePrice is greater than 0.5
top_corr_features = corr_mat.index[abs(corr_mat["SalePrice"])>0.5]
#Uncommenting the next two lines enlarges the figure and makes it easier to read
# import matplotlib.pyplot as plt
# plt.figure(figsize=(11,11))
#Check the correlation
sns.heatmap(data=house_price[top_corr_features].corr(),annot=True,cmap="RdYlGn")
The arguments are as follows.
Argument | Contents |
---|---|
data | Target data |
annot | Whether to write the value in each cell of the matrix |
cmap | Colormap used to color the cells |
Strictly speaking, the graph is drawn even if you omit annot and cmap. However, for understanding the features of the data, I think these two arguments are the minimum you should set.
The results are as follows.
It is a matrix of all pairs, limited to the items whose correlation coefficient with SalePrice is greater than 0.5. The data has about 80 items, and a matrix over all of them would be too large to read, so the number of items is reduced.
With this colormap, greener cells indicate a stronger positive correlation, and the following two items are particularly highly correlated with SalePrice.
Item | Correlation coefficient |
---|---|
OverallQual (overall evaluation of the house) | 0.79 |
GrLivArea (living area) | 0.71 |
In this way, heatmaps can be used to determine the correlation of the data being analyzed.
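The whole heatmap workflow can be tried without the Kaggle file. The following is only a sketch on synthetic data: the column names `Quality`, `Area`, `Unrelated`, and `Price` and all numbers are made up for illustration. `fmt`, `vmin`, and `vmax` are optional `heatmap` arguments that were not used above; they are added here to round the annotations and to pin the color scale to the range -1 to 1.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for house_price (made-up numbers, not Kaggle data).
rng = np.random.default_rng(0)
n = 200
quality = rng.integers(1, 11, size=n)
area = 800 + 150 * quality + rng.normal(0, 200, size=n)
df = pd.DataFrame({
    "Quality": quality,
    "Area": area,
    "Unrelated": rng.normal(0, 1, size=n),  # has no relation to Price
    "Price": 20000 * quality + 50 * area + rng.normal(0, 20000, size=n),
})

# Keep only the columns whose |correlation| with Price exceeds 0.5.
corr_mat = df.corr()
strong = list(corr_mat.index[corr_mat["Price"].abs() > 0.5])
print(strong)

# Draw the heatmap for the selected columns only.
plt.figure(figsize=(6, 5))
ax = sns.heatmap(df[strong].corr(), annot=True, fmt=".2f",
                 cmap="RdYlGn", vmin=-1, vmax=1)
```

Because `Unrelated` is pure noise, it drops out of the selection, which is the same effect the 0.5 threshold has on the roughly 80 items of the real data.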
barplot draws a bar chart that shows the mean of each group together with error bars.
[Official site: barplot](http://seaborn.pydata.org/generated/seaborn.barplot.html)
Here we plot OverallQual, the feature with the highest correlation on the heatmap.
import seaborn as sns
sns.set()
sns.barplot(x=house_price.OverallQual,y=house_price.SalePrice)
The results are as follows.
![distplot.png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/310367/f788ef75-d123-fb53-b721-ed7a329b2de6.png)
The horizontal axis is OverallQual (overall evaluation of the house), and the vertical axis is SalePrice (house price).
You can see that the higher the overall rating of a house, the higher its price.
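To make "the mean of each group" concrete, here is a small sketch with made-up rows (the values are illustrative, not taken from the Kaggle data). It computes the per-group mean with pandas and then draws the same data with `barplot`; by default the height of each bar is that mean.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import pandas as pd
import seaborn as sns

# Made-up rows for illustration only.
df = pd.DataFrame({
    "OverallQual": [5, 5, 6, 6, 6, 7, 7],
    "SalePrice": [120000, 140000, 150000, 170000, 160000, 200000, 240000],
})

# The value each bar represents: the mean SalePrice per OverallQual.
mean_by_quality = df.groupby("OverallQual")["SalePrice"].mean()
print(mean_by_quality)

# Same data as a bar chart, passing column names together with data=.
ax = sns.barplot(data=df, x="OverallQual", y="SalePrice")
```

For these rows the means are 130000, 160000, and 220000 for ratings 5, 6, and 7. Passing `data=` with column names is equivalent to passing the columns themselves, as in the example above.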
countplot draws a bar chart of the number of records in each category of a categorical variable.
import seaborn as sns
sns.set()
sns.countplot(x='MSSubClass', data=house_price)
The results are as follows.
The horizontal axis shows MSSubClass (the building class), and the vertical axis shows the number of records.
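The numbers behind a countplot can be checked with pandas. This is a sketch on made-up values: `value_counts` gives the count per category, and `countplot` draws one bar per category from the same column.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import pandas as pd
import seaborn as sns

# Made-up category codes for illustration only.
df = pd.DataFrame({"MSSubClass": [20, 60, 20, 50, 60, 20]})

# Number of records per category, sorted by category code.
counts = df["MSSubClass"].value_counts().sort_index()
print(counts)

# The same counts as a bar chart.
ax = sns.countplot(data=df, x="MSSubClass")
```

Here category 20 appears three times, 50 once, and 60 twice, so the bars have those heights.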
Divides the data into bins and shows how many values fall into each bin (a histogram).
import seaborn as sns
sns.set()
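#distplot is deprecated since seaborn 0.11; histplot with kde=True draws a similar histogram with a density curve
sns.histplot(house_price['SalePrice'], kde=True)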
The results are as follows.
It is a simple scatter plot.
import seaborn as sns
sns.set()
sns.scatterplot(data=house_price, x='GrLivArea', y='SalePrice')
The results are as follows.
You can draw scatter plots for every pair of items at once instead of one by one. This is useful when you want a comprehensive view of the correlations.
import seaborn as sns
sns.set()
#Display the graph by narrowing down the items that have a high correlation with SalePrice
sns.pairplot(data=house_price[top_corr_features])
The results are as follows.
There are other plots such as box plots and violin plots, but they are left out here because this article focuses on the most frequently used ones.
For now I have focused on the frequently used functions and the minimum arguments. The article is still a little rough, so I plan to tidy it up.
I learned from the following sites while checking the official documentation.
This article really covers only the minimum, so if you want to learn more, please take a look at them.