[PYTHON] Predicting Home Prices (Regression by Linear Regression (kaggle)) ver1.0

1 Introduction

This time, we will tackle the SalePrice prediction problem. To begin with, I would like to make a prediction using a very simple first-order (linear) regression equation. The real fun of this competition lies in processing and optimizing the large number of features, but I would like to start with a simple prediction.

Reference URL https://www.kaggle.com/katotaka/kaggle-prediction-house-prices

The versions used are as follows.

Python 3.7.6, numpy 1.18.1, pandas 1.0.1, matplotlib 3.1.3, scikit-learn 0.22.1

2 About the program

Libraries, etc.


import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
import seaborn as sns

#Settings for inline display in Jupyter Notebook (without this, the graph will open in a separate window)
%matplotlib inline 

I imported pandas for loading the CSV files, numpy for array processing, matplotlib and seaborn for plotting, and sklearn.linear_model for the regression.

Reading training data


df = pd.read_csv("train.csv")
df

(001.png: preview of the training data)

With so many features, they cannot all be displayed at once, but many housing attributes are listed, such as lot area, whether the lot faces a road, and whether there is a pool. We evaluate how these attributes affect the sale price and use them to make predictions.
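As a quick sanity check, the sketch below (reusing the DataFrame df loaded above) prints the size of the training data and the basic statistics of the target, which makes it easier to see how much data we are working with.

#Overview of the training data: size and target distribution
print(df.shape)                     #(rows, columns) of the training set
print(df["SalePrice"].describe())   #count, mean, std, min, quartiles, max of the target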

Identification of features with high correlation coefficient


corrmat = df.corr()
k = 10 #number of variables for heatmap
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(df[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()

(002.png: correlation heatmap of the top 10 features)

I would like to find features that have a high correlation coefficient with the house price. Looking at the seaborn heatmap, we can see that the feature most strongly correlated with SalePrice is OverallQual (overall quality). It is easy to understand that the higher the quality, the higher the price.
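If you prefer plain numbers to a heatmap, a minimal sketch like the one below (reusing the corrmat and k defined above) lists the same top-k correlations with SalePrice in descending order.

#Top-k features by correlation with SalePrice (SalePrice itself comes first with 1.0)
print(corrmat["SalePrice"].nlargest(k))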

Regression analysis and scatter plot display


X = df[["OverallQual"]].values
y = df["SalePrice"].values
slr = LinearRegression()
slr.fit(X,y)

#Scatter plot creation
plt.scatter(X,y)
plt.xlabel('OverallQual')
plt.ylabel('House Price($)')

#Plot the fitted regression line
plt.plot(X, slr.predict(X), color='red')

#graph display
plt.show()

(003.png: scatter plot of OverallQual vs. SalePrice with the regression line)

I plotted the relationship between OverallQual and SalePrice. The overall trend is captured correctly. However, where OverallQual is low, the prices are underestimated, and where OverallQual is high, there is a large amount of scatter. These cases could likely be predicted more precisely with other features, but this time we will make predictions as is.
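To put a number on how well this single feature explains the price, a small sketch like the one below (using the slr model fitted above) prints the fitted slope, the intercept, and the coefficient of determination R² on the training data.

#Inspect the fitted line and how much variance it explains
print("slope:", slr.coef_[0])        #change in predicted price per unit of OverallQual
print("intercept:", slr.intercept_)  #predicted price when OverallQual is 0
print("R^2 on training data:", slr.score(X, y))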

Forecast


#Read test data
df_test = pd.read_csv('test.csv')

#Use the OverallQual column of the test data as X_test
X_test = df_test[["OverallQual"]].values

#Predict SalePrice for the test data and write the submission file
y_test_pred = slr.predict(X_test)
df_test["SalePrice"] = y_test_pred
df_test[["Id", "SalePrice"]].to_csv("submission.csv", index=False)

The score when this was submitted to Kaggle was 0.84342 (4,563rd place out of 4,720 teams). From the next article, I would like to analyze the data in more detail and improve the score.
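This competition is scored by the root mean squared error between the logarithm of the predicted and actual sale prices. Before submitting, one way to get a rough local estimate of that score is a hold-out split, sketched below (reusing the X and y defined above; the clipping is only there so the logarithm of a low prediction is still defined).

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

#Hold out part of the training data to approximate the leaderboard metric
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression()
model.fit(X_tr, y_tr)

#RMSE on log(SalePrice); clip predictions so the log is defined
pred_val = np.clip(model.predict(X_val), 1, None)
rmsle = np.sqrt(mean_squared_error(np.log(y_val), np.log(pred_val)))
print("hold-out RMSLE:", rmsle)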
