[PYTHON] Kaggle ~ House Price Forecast ② ~

Introduction

In the previous post I used linear regression; this time I also implemented non-linear models.

As before, I followed the data preprocessing described in this article: "Data Preprocessing" - Kaggle Popular Tutorial.

Create a model

I built models with the following four: ① linear regression (LinearRegression), ② ridge regression (Ridge), ③ support vector machine regression (SVR), and ④ random forest regression (RandomForestRegressor).

#Explanatory variables and objective variable
x = df_train[['OverallQual', 'YearBuilt', 'TotalBsmtSF', 'GrLivArea']]
y = df_train['SalePrice']

#Import module
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
#Separate training data and test data
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

Create a function to calculate the mean square error

def calc_model(model):
    #Train the model
    model.fit(X_train, y_train)
    #Predicted values for X_test
    pred_y = model.predict(X_test)
    #Get mean square error
    score = mean_squared_error(y_test, pred_y)
    return score

Linear regression

#For linear regression
from sklearn.linear_model import LinearRegression
#Build a model
lr = LinearRegression()
#Calculate mean square error
lr_score = calc_model(lr)
lr_score
# >>>output
0.02824050462867693
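
As an aside that is not in the original article: the np.exp() applied at submission time below suggests SalePrice was log-transformed during preprocessing, so the square root of this mean square error should roughly correspond to the RMSLE metric Kaggle uses for this competition.

import numpy as np
#Square root of the MSE on the log-scale target, roughly comparable to Kaggle's RMSLE
np.sqrt(lr_score)  #about 0.17 for the value above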

Ridge regression

#Ridge regression
from sklearn.linear_model import Ridge
#Build a model
ridge = Ridge()
#Calculate mean square error
ridge_score = calc_model(ridge)
ridge_score
# >>>output
0.028202963714955512
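
Ridge() above uses the default regularization strength (alpha=1.0). As a side note, not part of the original, the strength could also be chosen by cross-validation, for example with RidgeCV; a minimal sketch:

#Hypothetical tuning: pick alpha from a few candidates by cross-validation
from sklearn.linear_model import RidgeCV
ridge_cv = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0])
ridge_cv_score = calc_model(ridge_cv)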

Support vector machine regression

#Support vector machine regression
from sklearn.svm import SVR
#Build a model
svr = SVR()
#Calculate mean square error
svr_score = calc_model(svr)
svr_score
# >>>output
0.08767857928794534
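
SVR is sensitive to feature scaling, and the features here live on very different scales (OverallQual is a 1-10 grade while GrLivArea is in square feet), which probably contributes to the larger error. A minimal sketch, my own addition, of standardizing the features before fitting:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
#Standardize each feature, then fit SVR on the scaled values
svr_scaled = make_pipeline(StandardScaler(), SVR())
svr_scaled_score = calc_model(svr_scaled)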

Random forest regression

#Random forest regression
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor()
#Calculate mean square error
forest_score = calc_model(forest)
forest_score
# >>>output
0.03268455739481754
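
As a small extra that is not in the original: after calc_model has fitted the forest, its feature_importances_ attribute shows which of the four explanatory variables drives the predictions.

#Per-feature importances of the fitted forest (order matches the columns of x)
for name, importance in zip(x.columns, forest.feature_importances_):
    print(name, importance)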

Comparing the results, the non-linear models (SVR and random forest regression) had larger mean square errors than the linear models, and ridge regression gave the smallest error.
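
For readability, the four scores can also be printed side by side (a minimal sketch using the variables defined above):

#Collect the scores and sort by mean square error, smallest first
scores = {'LinearRegression': lr_score, 'Ridge': ridge_score,
          'SVR': svr_score, 'RandomForestRegressor': forest_score}
for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(name, score)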

Test data preprocessing

Checking for missing values

#Test data preprocessing
#Extract the value of Id
df_test_index = df_test['Id']
#Confirmation of missing values
df_test = df_test[['OverallQual', 'YearBuilt', 'TotalBsmtSF', 'GrLivArea']]
df_test.isnull().sum()
# >>>output
OverallQual    0
YearBuilt      0
TotalBsmtSF    1
GrLivArea      0
dtype: int64

Fill the missing value of TotalBsmtSF with the mean.

#Fill the missing value with the mean
df_test['TotalBsmtSF'] = df_test['TotalBsmtSF'].fillna(df_test['TotalBsmtSF'].mean())
#Check for missing values
df_test.isnull().sum()
# >>>output
OverallQual    0
YearBuilt      0
TotalBsmtSF    0
GrLivArea      0
dtype: int64

There are no missing values.
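
One caveat, my own remark rather than the article's: filling with the test set's own mean leaks test statistics into the preprocessing; computing the mean on the training data is the safer convention.

#Alternative (hypothetical): fill with the mean of the training data instead
df_test['TotalBsmtSF'] = df_test['TotalBsmtSF'].fillna(df_train['TotalBsmtSF'].mean())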

Output to CSV file

#Predict SalePrice for the test data with the ridge model
pred_y = ridge.predict(df_test)

#Create the submission data frame
#np.exp() converts the predictions back from the log scale applied to SalePrice in preprocessing
submission = pd.DataFrame({'Id': df_test_index,
                           'SalePrice': np.exp(pred_y)})
#Output to CSV file
submission.to_csv('submission.csv', index=False)
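
Before uploading, a quick sanity check that the file has the expected Id and SalePrice columns (a minimal sketch):

#Read the file back and inspect the first few rows
print(pd.read_csv('submission.csv').head())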

The Kaggle submission scored 0.17184, which did not improve on the previous result.
