Theme

This is the sixth installment of my notes from a hands-on series in which we all take on Kaggle's well-known "House Prices" problem. It is more of a memo than a commentary, but I hope it helps someone somewhere. Preparation was finished last time, so we finally move on to the analysis stage.

Original theme: https://www.kaggle.com/c/house-prices-advanced-regression-techniques
Referenced article: https://yolo-kiyoshi.com/2018/12/17/post-1003/

Today's work

Distribution transformation of objective variable

Objective variable: "That's Y, right?" → Me: "..."
For what an objective variable is, see: http://www.gen-info.osaka-u.ac.jp/testdocs/tomocom/express/express8.html

Check the distribution of SalePrice (house price) in the training data. While filling in missing values, we found that most homes do not have a pool. Conversely, a few mansions do have pools, so the distribution of house prices is presumably quite skewed.

I recall being told that it is important to plot with a hypothesis like this in mind. For now, though, I simply output the graph as instructed.

sns.distplot(train['SalePrice'])
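
The hunch that a few expensive houses stretch the right tail can also be checked with numbers before looking at a plot. The following is a minimal sketch on synthetic, log-normally distributed prices, not the Kaggle data; the name fake_prices and the distribution parameters are assumptions made only for this illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for train['SalePrice'] (assumption: log-normal, so right-skewed).
rng = np.random.default_rng(0)
fake_prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=1000), name="SalePrice")

# A handful of very expensive houses pulls the mean above the median.
mean_price = fake_prices.mean()
median_price = fake_prices.median()

# Share of houses priced above the mean; below 50% when the tail is on the right.
share_above_mean = (fake_prices > mean_price).mean()

print(f"mean={mean_price:.0f}, median={median_price:.0f}, share above mean={share_above_mean:.2f}")
```

With the real data, the same three lines can be run on train['SalePrice'] directly.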

About seaborn

"What is sns?" I had forgotten since the beginning, but it is one of the libraries imported at the very start. This one:

import seaborn as sns

I see, seaborn

seaborn: Apparently a library for drawing graphs.
See seaborn: https://qiita.com/hik0107/items/3dc541158fceb3156ee0
distplot: A seaborn function that draws a histogram (with a density curve overlaid by default).

Check what was in train['SalePrice']

After that, just in case, check the contents of train['SalePrice']. I see, it is the column of prices, one value per row. [Screenshot: output of train['SalePrice']]

Output graph

And the output graph looks like this.

[Figure: histogram output by sns.distplot(train['SalePrice'])]

Logarithmic conversion

As expected, the distribution has a long tail stretching far to the right. Applying a logarithmic conversion brings it closer to a normal distribution.

First, though, a quick check of "What is logarithmic conversion?"

See logarithmic conversion: https://atarimae.biz/archives/13161

sns.distplot(np.log(train['SalePrice']))

Changes in the values before and after logarithmic conversion

Output just the converted values.

np.log(train['SalePrice'])

I see, the values are compressed into a narrow range. [Screenshot: output of np.log(train['SalePrice'])]
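
To see why the values get compressed, here is a small sketch with made-up prices (not taken from the data). It shows that each doubling of the price adds the same constant on the log scale, and that np.exp maps the converted values back to the original prices, which matters if a model is later trained on the log scale.

```python
import numpy as np

# Made-up prices, each twice the previous one.
prices = np.array([100_000.0, 200_000.0, 400_000.0, 800_000.0])

log_prices = np.log(prices)

# On the log scale, every doubling becomes the same step of log(2).
steps = np.diff(log_prices)

# np.exp is the inverse of np.log, so the original prices can be recovered.
restored = np.exp(log_prices)

print(log_prices)
print(steps)
print(restored)
```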

Output graph 2

[Figure: histogram output by sns.distplot(np.log(train['SalePrice']))]

It now looks fairly close to a normal distribution.
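
The impression of "fairly normal" can be backed up with skewness, which is close to 0 for a symmetric distribution. The sketch below uses synthetic log-normal values (an assumption for illustration), so the log-converted data is normal by construction; the real SalePrice will only be approximately so.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for train['SalePrice'] (assumption: log-normal).
rng = np.random.default_rng(0)
fake_prices = pd.Series(rng.lognormal(mean=12.0, sigma=0.4, size=5000))

# Positive skewness indicates a long right tail.
skew_raw = fake_prices.skew()
skew_log = np.log(fake_prices).skew()

print(f"skewness before log: {skew_raw:.2f}, after log: {skew_log:.2f}")
```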

Building a predictive model

I wanted to get into this, but it looks like time has run out, so that's it for today.

Since the number of variables is quite large this time, we want to impose a strong penalty on the coefficients, so we will build the prediction model using Lasso regression.

As preparation, I looked into Lasso regression and stopped there.

Lasso regression

See Lasso regression: https://aizine.ai/ridge-lasso-elasticnet/
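
To get a feel for what "a strong penalty on the coefficients" does, here is a minimal sketch assuming scikit-learn's Lasso class. The data is synthetic, only three of the twenty features actually matter, and alpha=0.1 is an arbitrary value chosen for illustration, not a setting from the hands-on.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_samples, n_features = 200, 20
X = rng.normal(size=(n_samples, n_features))

# Only the first three features influence the target; the rest are noise.
true_coef = np.zeros(n_features)
true_coef[:3] = [3.0, -2.0, 1.5]
y = X @ true_coef + rng.normal(scale=0.5, size=n_samples)

# alpha sets the strength of the L1 penalty (larger means more coefficients at zero).
model = Lasso(alpha=0.1)
model.fit(X, y)

# The L1 penalty drives the coefficients of unhelpful features to exactly zero.
n_zero = int(np.sum(model.coef_ == 0.0))
print(f"{n_zero} of {n_features} coefficients are exactly zero")
print(np.round(model.coef_[:3], 2))
```

This is the property that makes Lasso attractive when there are many variables: it performs a kind of variable selection while fitting.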

That's it.

Now that we have entered the analysis stage, I realized that I need to fill in some background knowledge, mainly about regression analysis.

[PYTHON] [Hands-on for beginners] Read kaggle's "Predicting House Prices" line by line (6th: Distribution conversion of objective variables)