[PYTHON] House Prices: Advanced Regression Techniques

0. Intro

Getting started is always the hardest part, and getting something into "my own shape" that I could feel reasonably good about seemed like a mountain to climb. Looking back, I submitted to Titanic seven months ago, so more than half a year has passed since then. It's not that I did nothing in the meantime, but I couldn't find enough time. Fortunately or unfortunately, I had a somewhat longer vacation (about a month), so I took the opportunity to resume my efforts and tackle Kaggle in my own way. As a result, I managed to get one thing into shape, so I decided to write it up as an article.

1. Before you start

Machine learning is programming, to be sure, but it turns out to have quite a different character from so-called ordinary programming. As you know, machine learning programming puts an extremely large weight on staring at the data and processing it, rather than on the logic itself. Because of that, it took some time before the work settled into a shape that fit me well.

1.1. Development environment

This time I installed Ubuntu in WSL on Windows 10 and built an Anaconda environment there. Coding and code execution are done in VS Code, and checking the data and the like is done in a Jupyter notebook. If I get to the point of developing in earnest, I would like to move the development environment to the server machine with a graphics card that I happen to have.

1.2. About the code structure

If you can't picture the code as a whole, it means you can't picture what you are making, and you can't do good work that way. After consulting various Kaggle kernels, I arrived at the following structure:

def feature_create(train, test):
    # Data cleaning
    # Create data such as dummy variables
    ...

def model_create(train):
    # Split the created data into the objective variable and the rest,
    # standardize it, and create a model
    ...


if __name__ == "__main__":
    # Main logic: read the data here, call the functions above,
    # feature_create(train, test) and model_create(train),
    # and finally apply the model to the test data, get the predictions,
    # and create the CSV data for the submission
    ...

That is the outline, made up of these three stages. I ended up adding a few subroutines this time, but main plus these two functions seems to be enough.

1.3. Development style

Machine learning development is heavily weighted toward how you process the data, so you can't make progress by looking only at the editor. After much trial and error, I settled into the following style, which seemed to work well: check the data in a Jupyter notebook, try out the data-editing code there, and paste the code that works into the feature_create function in the structure above.

2. The whole program

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

def feature_create(train,test):
    df_train = train.copy()
    df_test = test.copy()

    # Flag each row's origin (0 = train, 1 = test) so the combined frame
    # can be split back apart later
    df_train["evel_set"] = 0
    df_test["evel_set"] = 1
    df_temp = pd.concat([df_train,df_test])
    del df_train,df_test
    df_temp.drop(["MiscFeature","MiscVal","PoolQC","PoolArea","Fireplaces","FireplaceQu","Alley","Fence"],axis=1,inplace=True)
    df_temp["MasVnrType"][df_temp["MasVnrType"].isnull()==True]="None"
    df_temp["MasVnrArea"][df_temp["MasVnrArea"].isnull()==True] = df_temp["MasVnrArea"].mean()
    df_temp["BsmtQual"][df_temp["BsmtQual"].isnull()==True] = "TA"
    df_temp["BsmtCond"][df_temp["BsmtCond"].isnull()==True] = "TA"
    df_temp["BsmtExposure"][df_temp["BsmtExposure"].isnull()==True] = "No"
    df_temp["BsmtFinType1"][df_temp["BsmtFinType1"].isnull()==True] = "Unf"
    df_temp["BsmtFinType2"][df_temp["BsmtFinType2"].isnull()==True] = "Unf"
    df_temp["Electrical"][df_temp["Electrical"].isnull()==True] = "SBrkr"
    df_temp["GarageType"][df_temp["GarageType"].isnull()==True] = "Attchd"
    df_temp["GarageYrBlt"][df_temp["GarageYrBlt"].isnull()==True] = df_temp["GarageYrBlt"][df_temp["GarageYrBlt"] > 2000].mean()
    df_temp["GarageFinish"][df_temp["GarageFinish"].isnull()==True] = "Unf"
    df_temp["GarageQual"][df_temp["GarageQual"].isnull()==True] = "TA"
    df_temp["GarageCond"][df_temp["GarageCond"].isnull()==True] = "TA"
    df_temp["BsmtFinSF1"][df_temp["BsmtFinSF1"].isnull()==True] = 0
    df_temp["BsmtFinSF2"][df_temp["BsmtFinSF2"].isnull()==True] = 0
    df_temp["BsmtFullBath"][df_temp["BsmtFullBath"].isnull()==True] = 0
    df_temp["BsmtHalfBath"][df_temp["BsmtHalfBath"].isnull()==True] = 0
    df_temp["BsmtUnfSF"][df_temp["BsmtUnfSF"].isnull()==True] = 0
    df_temp["Exterior1st"][df_temp["Exterior1st"].isnull()==True] = "VinylSd"
    df_temp["Exterior2nd"][df_temp["Exterior2nd"].isnull()==True] = "VinylSd"
    df_temp["Functional"][df_temp["Functional"].isnull()==True] = "Typ"
    df_temp["GarageArea"][df_temp["GarageArea"].isnull()==True] = 576
    df_temp["GarageCars"][df_temp["GarageCars"].isnull()==True] = 2
    df_temp["KitchenQual"][df_temp["KitchenQual"].isnull()==True] = "TA"
    df_temp["LotFrontage"][df_temp["LotFrontage"].isnull()==True] = 60
    df_temp["MSZoning"][df_temp["MSZoning"].isnull()==True] = "RL"
    df_temp["SaleType"][df_temp["SaleType"].isnull()==True] = "WD"
    df_temp["TotalBsmtSF"][df_temp["TotalBsmtSF"].isnull()==True] = 0
    df_temp["Utilities"][df_temp["Utilities"].isnull()==True] = "AllPub"

    #df_temp.drop(["MSSubClass","MSZoning","Street","LotShape","LandContour","Utilities","LotConfig","LandSlope","Neighborhood","Condition1","Condition2","BldgType","HouseStyle","OverallCond","RoofStyle","RoofMatl","Exterior1st","Exterior2nd","MasVnrType","ExterQual","ExterCond","Foundation","BsmtQual","BsmtCond","BsmtExposure","BsmtFinType1","BsmtFinType2","BsmtFinSF2","BsmtUnfSF","Heating","HeatingQC","CentralAir","Electrical","LowQualFinSF","BsmtFullBath","BsmtHalfBath","HalfBath","BedroomAbvGr","KitchenAbvGr","KitchenQual","Functional","GarageType","GarageYrBlt","GarageFinish","GarageQual","GarageCond","PavedDrive","EnclosedPorch","3SsnPorch","ScreenPorch","MoSold","YrSold","SaleType","SaleCondition"],axis=1,inplace=True)

    # One-hot encode the categorical columns
    df_temp = pd.get_dummies(df_temp)

    # Split back into train and test using the flag, then drop it
    df_train = df_temp[df_temp["evel_set"]==0].copy()
    df_test = df_temp[df_temp["evel_set"]==1].copy()
    df_train.drop("evel_set",axis=1,inplace=True)
    df_test.drop("evel_set",axis=1,inplace=True)
    del df_temp
    return df_train,df_test

def model_create(train):
    sc_x = StandardScaler()
    sc_y = StandardScaler()

    # Separate the objective variable, and drop columns that are not features
    y_train = train["SalePrice"]
    x_train = train.drop(["SalePrice","Id"],axis=1)

    # Standardize both the explanatory variables and the objective variable
    x_train_std = sc_x.fit_transform(x_train)
    y_train_std = sc_y.fit_transform(y_train.values.reshape(-1,1)).flatten()

    gbrt = GradientBoostingRegressor(n_estimators=1000, learning_rate=.03, max_depth=3, max_features=.04, min_samples_split=4,
                                    min_samples_leaf=3, loss='huber', subsample=1.0, random_state=0)
    cv_gbrt = rmse_cv(gbrt,x_train_std, y_train_std)
    gbrt.fit(x_train_std, y_train_std)
    print('GradientBoosting CV score min: ' + str(cv_gbrt.min()) + ' mean: ' + str(cv_gbrt.mean()) 
        + ' max: ' + str(cv_gbrt.max()) )

    return gbrt,sc_y


def rmse_cv(model,X,y):
    # 5-fold CV; sklearn returns negated MSE, so negate it and take the root
    rmse = np.sqrt(-cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=5))
    return rmse


if __name__ == "__main__":
    df_train_org = pd.read_csv("~/kaggle_train/House_Prices_Advanced/train.csv")
    df_test_org = pd.read_csv("~/kaggle_train/House_Prices_Advanced/test.csv")

    df_train,df_test = feature_create(df_train_org,df_test_org)
    del df_train_org,df_test_org
    
    model,scaler = model_create(df_train)

    # Save the Ids for the submission file, then drop the columns that are
    # not features ("SalePrice" exists here only as a by-product of the join)
    df_test_Id = df_test["Id"]
    df_test = df_test.drop(["Id","SalePrice"],axis=1)

    # Note: strictly speaking, the scaler fitted on the training features
    # should be reused here instead of fitting a fresh one on the test data
    sc_x = StandardScaler()
    df_test_std = sc_x.fit_transform(df_test)

    pred = model.predict(df_test_std)
    # Undo the standardization of the objective variable; inverse_transform
    # expects the same 2-D shape the scaler was fitted with
    pred = scaler.inverse_transform(pred.reshape(-1,1)).flatten()

    # Build the submission from the saved Ids and the predictions
    # (row order is preserved, so they line up)
    df_submit = pd.DataFrame({
        "Id": df_test_Id.values,
        "SalePrice": pred
    })
    df_submit.to_csv('submission.csv', index=False)

2.1. feature_create

As mentioned above, I worked by trying out code in a Jupyter notebook and pasting the code that worked into this function. The notebook I built up locally along the way is here: House_Prices_Advanced_Regression_Techniques

First comes combining the training data and the test data. Many people simply concatenate the two and later split them again by row count, relying on the order being preserved. Personally that made me a little uneasy, so I created a column called "evel_set", storing 0 for the training data and 1 for the test data; later I split the frame back apart by relying on that column and then dropped it.
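
A minimal sketch of that pattern, with made-up toy frames:

import pandas as pd

train = pd.DataFrame({"A": [1, 2], "SalePrice": [100, 200]})
test = pd.DataFrame({"A": [3, 4]})

# Flag each row's origin before concatenating
train["evel_set"] = 0
test["evel_set"] = 1
combined = pd.concat([train, test])

# ...shared cleaning and get_dummies would happen here...

# Split back using the flag instead of relying on row order
train = combined[combined["evel_set"] == 0].drop("evel_set", axis=1)
test = combined[combined["evel_set"] == 1].drop("evel_set", axis=1)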

The data cleaning this time amounts to nothing more than filling in missing values. For categorical values I repeated the following two steps (a small sketch follows this list):

  1. Check which values exist and how many times each appears.
  2. Fill the missing entries with the mode.

For numerical values, I sometimes filled in the average value instead.
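
A minimal sketch of those steps, assuming the combined df_temp from above:

# 1. Check which values exist and how often each appears
print(df_temp["MSZoning"].value_counts())

# 2. Fill the missing entries with the mode ("RL" here)
df_temp["MSZoning"] = df_temp["MSZoning"].fillna(df_temp["MSZoning"].mode()[0])

# For numerical columns, the mean is sometimes used instead
df_temp["MasVnrArea"] = df_temp["MasVnrArea"].fillna(df_temp["MasVnrArea"].mean())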

I know I ought to do more than this, but I'll leave that as my next task.

As for the heatmap at the end, I first passed the data prepared at the beginning into the heatmap function as it was. The correct answer, however, was to pass df_temp.corr(), a DataFrame containing only the correlation coefficients, to the heatmap function. This was another thing I got stuck on by myself.
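
In notebook terms the fix looks something like this (assuming seaborn imported as sns, the usual convention):

import seaborn as sns
import matplotlib.pyplot as plt

# Wrong: passing the raw DataFrame straight to heatmap
# sns.heatmap(df_temp)

# Right: pass the matrix of correlation coefficients
sns.heatmap(df_temp.corr())
plt.show()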

Also, I learned that calling info() on a DataFrame shows all the column names, the number of non-NaN values (and therefore how many NaNs there are), and each column's type. I found out about it for the first time on this occasion.
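
For example (the output below is illustrative and abbreviated):

df_temp.info()
# <class 'pandas.core.frame.DataFrame'>
#  #   Column       Non-Null Count  Dtype
#  0   Id           2919 non-null   int64
#  1   LotFrontage  2433 non-null   float64
# ...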

2.2. model_create

This function receives the prepared training data, first puts the objective variable "SalePrice" into y_train, and then drops the unneeded "Id" and the objective variable "SalePrice" from the explanatory variables. Next comes standardization; this time I used StandardScaler. At first I standardized only x_train and left y_train as it was, and the evaluation score refused to come out looking reasonable... Once I standardized both x_train and y_train, it became a decent value. That was a pitfall.
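
A self-contained sketch of that step (the toy values are made up for illustration):

import pandas as pd
from sklearn.preprocessing import StandardScaler

x_train = pd.DataFrame({"GrLivArea": [1710, 1262, 1786], "OverallQual": [7, 6, 7]})
y_train = pd.Series([208500, 181500, 223500], name="SalePrice")

sc_x = StandardScaler()
sc_y = StandardScaler()
x_train_std = sc_x.fit_transform(x_train)
y_train_std = sc_y.fit_transform(y_train.values.reshape(-1, 1)).flatten()

print(x_train_std.mean(axis=0))  # each feature column now has mean ~0
print(y_train_std.std())         # and the target has unit variance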

For the modeling this time, I used the hyperparameters from code I picked up, as they were. Properly I should search for the parameters myself with a grid search or the like, but this time I skipped that.
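
If I were to do it, it would look roughly like this (a sketch; the parameter grid below is made up for illustration, and x_train_std / y_train_std are as in model_create):

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [500, 1000],
    "learning_rate": [0.01, 0.03, 0.1],
    "max_depth": [2, 3, 4],
}
search = GridSearchCV(
    GradientBoostingRegressor(loss="huber", random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(x_train_std, y_train_std)
print(search.best_params_)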

2.3. if __name__ == "__main__":

Here the data is read in and passed to the feature_create function. The edited data that comes back is passed on to model_create, which returns the trained model and the StandardScaler instance that standardized the objective variable. Then the prediction data is prepared. The Id of the test data is needed for the submission but not for prediction, so it is saved off and then dropped. "SalePrice", which ended up as a column of the test data when the frames were joined even though it holds nothing real there, is dropped as well. My understanding is that StandardScaler's inverse_transform only accepts data in the same shape it was fitted on, which is why I took the slightly clunky route of holding on to the instance that fit_transformed y_train. Since model.predict returns a one-dimensional array, I wondered whether a separate instance fitted on a suitable one-dimensional array could do the inverse_transform; I'd like to try that sometime (or rather, I should learn how StandardScaler is really meant to be used). Finally, the DataFrame for the submission is built from the saved test Ids and the predicted values; the row order is preserved, so they line up. The rest is output with to_csv, and that's the end.
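
On that question: at least with recent scikit-learn versions, the shape is indeed what matters. fit expects a 2-D array, and inverse_transform expects the same (n, 1) shape, so a 1-D prediction array just needs to be reshaped back. A minimal sketch:

import numpy as np
from sklearn.preprocessing import StandardScaler

y = np.array([100.0, 200.0, 300.0])

sc_y = StandardScaler()
y_std = sc_y.fit_transform(y.reshape(-1, 1)).flatten()  # fitted on shape (n, 1)

# Reshape the 1-D array to (n, 1) before inverse_transform, then flatten back
y_back = sc_y.inverse_transform(y_std.reshape(-1, 1)).flatten()
print(y_back)  # [100. 200. 300.]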

3. Result

(leaderboard screenshot: EfS3QVZUEAAiGsD.jpg)

That was the result. At 1638th out of 5222 I feel like I'm properly taking part, so I'll call it a participation prize (lol). My result in the [Housing Prices Competition for Kaggle Learn Users], submitted 8 months ago, was around 43000th out of 43230, which was barely better than doing nothing, so I'd say this is real progress from there.

3.1. Future issues

Having touched on classification with Titanic and regression here, I would like to work on time-series data next. After that, I want to do EDA in earnest and become able to tune the data and add new features. Then I would like to move on to deep learning. Though in terms of work it might be better to head toward data analysis instead (lol).
