[PYTHON] [Hands-on for beginners] Read kaggle's "Predicting Home Prices" line by line (7th: Preparing to build a prediction model)

Theme

This is the 7th installment in a series of notes on a hands-on session in which everyone takes on Kaggle's famous "House Prices" problem. It is more of a memo than a commentary, but I hope it helps someone somewhere. The preparation was completed last time, and we are finally at the analysis stage.

Today's work

Building a predictive model

# Split the merged data into training data and test data
train_ = all_data[all_data['WhatIsData']=='Train'].drop(['WhatIsData','Id'], axis=1).reset_index(drop=True)
test_ = all_data[all_data['WhatIsData']=='Test'].drop(['WhatIsData','SalePrice'], axis=1).reset_index(drop=True)
# Split the training data into explanatory variables and the objective variable
train_x = train_.drop('SalePrice',axis=1)
train_y = np.log(train_['SalePrice'])
# Split the test data into the Id column and the features
test_id = test_['Id']
test_data = test_.drop('Id',axis=1)

Divide the merged data into training data and test data

Let's look at the train side.

all_data[all_data['WhatIsData']=='Train'].drop(['WhatIsData','Id'], axis=1).reset_index(drop=True)

First, check the contents of `all_data[all_data['WhatIsData']=='Train']`. Only the Train rows of all_data are selected.

Next, check the contents of `all_data[all_data['WhatIsData']=='Train'].drop(['WhatIsData','Id'], axis=1)`. The WhatIsData and Id columns have been dropped.

Finally, check the contents of `all_data[all_data['WhatIsData']=='Train'].drop(['WhatIsData','Id'], axis=1).reset_index(drop=True)`. The index is reset. (In a screenshot the change is hard to see, though ...)

(By the way, train and test seem to have been deliberately combined into one table earlier ... I think I need to go back and review the overall picture of why.)
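To see what each link in the chain does, here is a minimal sketch on a toy table. The data below is made up for illustration (only the column names Id, SalePrice and WhatIsData come from the code above; LotArea and the interleaved row order are assumptions chosen so that the effect of reset_index is visible).

```python
import pandas as pd

# Toy stand-in for all_data; the values and row order are made up.
all_data = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "LotArea": [8450, 9600, 11622, 14267],
    "SalePrice": [None, 208500.0, None, 181500.0],
    "WhatIsData": ["Test", "Train", "Test", "Train"],
})

# Step 1: keep only the rows labelled Train (original index labels survive)
step1 = all_data[all_data["WhatIsData"] == "Train"]
# Step 2: remove the WhatIsData and Id columns
step2 = step1.drop(["WhatIsData", "Id"], axis=1)
# Step 3: renumber the rows from 0 and discard the old index
step3 = step2.reset_index(drop=True)

print(step1.index.tolist())    # index labels left over from all_data
print(step2.columns.tolist())  # remaining columns
print(step3.index.tolist())    # index after the reset
```

Without drop=True, reset_index would keep the old index as an extra column named index, which is not wanted as a feature here.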

Splitting the training data

train_x = train_.drop('SalePrice',axis=1)
train_y = np.log(train_['SalePrice'])

With train_x = train_.drop('SalePrice', axis=1), every column other than SalePrice is used as an explanatory variable.

train_y = np.log(train_['SalePrice']) prepares the objective variable. (Don't forget that the objective variable has been log-transformed.)
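Because the model will learn the logarithm of the price, its predictions come out on the log scale as well and have to be converted back with np.exp before they can be read as prices. A small sketch of that round trip, using made-up prices rather than the real data:

```python
import numpy as np
import pandas as pd

# Made-up prices standing in for train_['SalePrice']
sale_price = pd.Series([100000.0, 200000.0, 400000.0], name="SalePrice")

# Forward transform, as in train_y = np.log(train_['SalePrice'])
log_price = np.log(sale_price)

# Inverse transform: np.exp undoes np.log
restored = np.exp(log_price)

print(log_price.round(3).tolist())
print(np.allclose(restored, sale_price))
```

On the log scale each doubling of the price adds the same amount (log 2), which is why the transform pulls in the long right tail of the price distribution.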

Splitting the test data

test_id = test_['Id']
test_data = test_.drop('Id',axis=1)

Do we still need to look at these? ... As you would expect, I will skip checking test_id and test_data here.
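If you did want a quick check, one option is to confirm that the features for training and test ended up with the same columns, and that test_id lines up with test_data. This is only a sketch on a made-up table (LotArea and all values are assumptions), not something the original hands-on does.

```python
import pandas as pd

# Toy stand-in for all_data; the values are made up.
all_data = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "LotArea": [8450, 9600, 11622, 14267],
    "SalePrice": [208500.0, 181500.0, None, None],
    "WhatIsData": ["Train", "Train", "Test", "Test"],
})

# Same splitting steps as in the article
train_ = all_data[all_data["WhatIsData"] == "Train"].drop(["WhatIsData", "Id"], axis=1).reset_index(drop=True)
test_ = all_data[all_data["WhatIsData"] == "Test"].drop(["WhatIsData", "SalePrice"], axis=1).reset_index(drop=True)
train_x = train_.drop("SalePrice", axis=1)
test_id = test_["Id"]
test_data = test_.drop("Id", axis=1)

# The model is fitted on train_x and applied to test_data,
# so both should have the same columns in the same order.
same_columns = list(train_x.columns) == list(test_data.columns)

print(same_columns)
print(test_id.tolist())
print(test_data.shape)
```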

Building a predictive model

I meant to get into building the model here, but the pile of things I don't understand keeps growing, so for now I will do my best to prepare without starting. Mostly this means looking up terms.

StandardScaler()  # scaling

[0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]  # parameter grid

make_pipeline(scaler, ls)  # pipeline generation

That's it.
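As a way of looking these three terms up together, here is a rough sketch of how they might fit into one piece of code. Several things are assumptions, not taken from this article: that ls is a scikit-learn Lasso model, that the grid holds candidate values for its alpha parameter, and that GridSearchCV is what tries them. The data is random and only there to make the sketch runnable.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Random stand-in data (not the house price data)
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.5, 0.0, -2.0]) + rng.normal(scale=0.1, size=60)

scaler = StandardScaler()  # scaling: each column gets mean 0 and variance 1
ls = Lasso()               # ASSUMPTION: the article does not say what ls is

# Pipeline generation: scale first, then fit the model
pipeline = make_pipeline(scaler, ls)

# Parameter grid; make_pipeline names each step after its lowercased class,
# so the Lasso parameter alpha is addressed as "lasso__alpha".
param_grid = {"lasso__alpha": [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]}

# ASSUMPTION: cross-validated grid search picks the best alpha
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)
best_alpha = search.best_params_["lasso__alpha"]

print(best_alpha)
```

Putting the scaler inside the pipeline means it is refitted on the training part of each cross-validation fold, so the held-out part does not leak into the scaling.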

So is the first step to read up on all of this homework? May I say what I honestly thought? I thought we were already at the endgame, but it turns out that everything done so far was preprocessing.
