[PYTHON] Kaggle ~ House Price Forecast ~

Introduction

As a next step after Titanic, I tried the House Prices competition, which is another introductory Kaggle competition. There are plenty of articles about Titanic, but fewer about House Prices, so I decided to post this one. Since I am a beginner, my score was low, so I would appreciate any advice.

Data preprocessing was performed with reference to this article: "Data Preprocessing" - Kaggle Popular Tutorial.
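The code below assumes the objects prepared in that article: pandas imported as pd, numpy imported as np, and the preprocessed df_train. As a rough sketch of what is assumed (the linked article may include more steps), the setup would look something like the following, with the log transforms implied by the np.exp() applied to the predictions and the GrLivArea transform applied to the test data later in this post.

# Rough sketch of the setup assumed from the linked preprocessing article
import numpy as np
import pandas as pd

# Read the training data
df_train = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/train.csv')
# Log transforms implied by the rest of this post
df_train['SalePrice'] = np.log(df_train['SalePrice'])
df_train['GrLivArea'] = np.log(df_train['GrLivArea'])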

Model building

Since this is a regression problem, I will try linear regression, Lasso regression, and Ridge regression.

# Prepare the training data (features and target)
X_train = df_train[['OverallQual', 'YearBuilt', 'TotalBsmtSF', 'GrLivArea']]
y_train = df_train['SalePrice']

# Split the training data into training and validation sets
from sklearn.model_selection import train_test_split
train_X, test_X, train_y, test_y = train_test_split(
    X_train, y_train, random_state=42)

__ Model building __

# Import the regression models
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge

#Linear regression
lr = LinearRegression()
lr.fit(train_X, train_y)
print("Linear regression:{}".format(lr.score(test_X, test_y)))

#Lasso regression
lasso = Lasso()
lasso.fit(train_X, train_y)
print("Lasso regression:{}".format(lasso.score(test_X, test_y)))

#Ridge regression
ridge = Ridge()
ridge.fit(train_X, train_y)
print("Ridge regression:{}".format(ridge.score(test_X, test_y)))

The results (R² on the validation data) are as follows.

__ Linear regression: 0.8320945695605152 __
__ Lasso regression: 0.5197737962239536 __
__ Ridge regression: 0.8324316647361567 __
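Note that score() for these regressors returns the coefficient of determination (R²) on the validation data, while the competition leaderboard is scored by RMSE on the logarithm of the sale price. The following is a minimal sketch of checking that metric on the same validation split, assuming (as implied by the np.exp() applied to the predictions later) that SalePrice is already log-transformed.

# Sketch: RMSE on log(SalePrice), the metric used by the competition
# Assumes train_y / test_y already hold log-transformed prices
import numpy as np
from sklearn.metrics import mean_squared_error

for name, model in [('Linear regression', lr),
                    ('Lasso regression', lasso),
                    ('Ridge regression', ridge)]:
    rmse = np.sqrt(mean_squared_error(test_y, model.predict(test_X)))
    print("{}: RMSE = {:.5f}".format(name, rmse))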

Test data preprocessing

Data reading

#Read test data
df_test = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/test.csv')

output


Id	MSSubClass	MSZoning	LotFrontage	LotArea	Street	Alley	LotShape	LandContour	Utilities	...	ScreenPorch	PoolArea	PoolQC	Fence	MiscFeature	MiscVal	MoSold	YrSold	SaleType	SaleCondition
0	1461	20	RH	80.0	11622	Pave	NaN	Reg	Lvl	AllPub	...	120	0	NaN	MnPrv	NaN	0	6	2010	WD	Normal
1	1462	20	RL	81.0	14267	Pave	NaN	IR1	Lvl	AllPub	...	0	0	NaN	NaN	Gar2	12500	6	2010	WD	Normal
2	1463	60	RL	74.0	13830	Pave	NaN	IR1	Lvl	AllPub	...	0	0	NaN	MnPrv	NaN	0	3	2010	WD	Normal
3	1464	60	RL	78.0	9978	Pave	NaN	IR1	Lvl	AllPub	...	0	0	NaN	NaN	NaN	0	6	2010	WD	Normal
4	1465	120	RL	43.0	5005	Pave	NaN	IR1	HLS	AllPub	...	144	0	NaN	NaN	NaN	0	1	2010	WD	Normal
5 rows × 80 columns

__ Check for missing values __

#Check for missing values
df_test[['OverallQual', 'YearBuilt', 'TotalBsmtSF', 'GrLivArea']].isnull().sum()

output

OverallQual    0
YearBuilt      0
TotalBsmtSF    1
GrLivArea      0
dtype: int64

There is one missing value in TotalBsmtSF (total basement area). This time, it is filled with the mean value.

# Fill the missing value with the mean
df_test['TotalBsmtSF'] = df_test['TotalBsmtSF'].fillna(df_test['TotalBsmtSF'].mean())

__ Perform the remaining preprocessing __

# Keep the Id column for the submission file
df_test_index = df_test['Id']

# Log-transform GrLivArea (the same transform applied to the training data)
df_test['GrLivArea'] = np.log(df_test['GrLivArea'])
# Convert categorical variables to dummy variables
df_test = pd.get_dummies(df_test)
# Confirm the missing value has been filled (this should return an empty DataFrame)
df_test[df_test['TotalBsmtSF'].isnull()]

# Select the same features used for training
X_test = df_test[['OverallQual', 'YearBuilt', 'TotalBsmtSF', 'GrLivArea']]

Apply the models to the test data

__ Linear regression __

# Linear regression
# Predict on the test data
pred_y = lr.predict(X_test)
# Create the submission DataFrame (np.exp reverses the log transform applied to SalePrice)
submission = pd.DataFrame({'Id': df_test_index,
                           'SalePrice': np.exp(pred_y)})
# Output to a CSV file
submission.to_csv('submission_lr.csv', index=False)

__ Lasso regression __

# Lasso regression
# Predict on the test data
pred_y = lasso.predict(X_test)
# Create the submission DataFrame
submission = pd.DataFrame({'Id': df_test_index,
                           'SalePrice': np.exp(pred_y)})
# Output to a CSV file
submission.to_csv('submission_lasso.csv', index=False)

__ Ridge regression __

# Ridge regression
# Predict on the test data
pred_y = ridge.predict(X_test)
# Create the submission DataFrame
submission = pd.DataFrame({'Id': df_test_index,
                           'SalePrice': np.exp(pred_y)})
# Output to a CSV file
submission.to_csv('submission_ridge.csv', index=False)

Submitting the ridge regression predictions gave a score of 0.16450 (lower is better).

So how do you improve your score?
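One idea is to tune the regularization strength alpha of Lasso and Ridge with cross-validation instead of keeping the default alpha=1.0, which likely explains the weak Lasso score. Below is a minimal sketch, assuming a recent scikit-learn that provides the 'neg_root_mean_squared_error' scorer; the candidate alpha values are only an illustration.

# Sketch: tune alpha with cross-validation
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV

param_grid = {'alpha': [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0]}  # illustrative values

for name, estimator in [('Lasso', Lasso(max_iter=10000)), ('Ridge', Ridge())]:
    search = GridSearchCV(estimator, param_grid, cv=5,
                          scoring='neg_root_mean_squared_error')
    search.fit(train_X, train_y)
    print("{}: best alpha = {}, CV RMSE = {:.5f}".format(
        name, search.best_params_['alpha'], -search.best_score_))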

Next time I will try another tutorial.
