To determine an appropriate rent for a studio apartment in a certain area, let's build a model in Python that predicts rent from other variables with high accuracy.
The dataset (souba.csv) contains 41 studio apartments in the area, with the following columns:
・ id: Property id
・ rent: Rent (yen)
・ area: Area (m²)
・ age: Age (years)
・ minutes: Walking time to the station (minutes)
First, import the libraries, load the data, and check the first few rows.
import numpy as np
import pandas as pd
import seaborn as sns
df = pd.read_csv("souba.csv")
df.head()
   id  rent   area  age  minutes
0   1   4.3  13.00   26        6
1   2   4.2  14.35   35        5
2   3   5.0  15.60   33        9
3   4   4.5  13.09   32        5
4   5   4.6  12.92   38        7
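Beyond looking at the head, it can help to check the shape, column types, and missing values before modeling. The following is a minimal sketch that rebuilds only the five rows printed above (the real file has 41 rows), so the summary statistics it prints are illustrative and not those of the full dataset.

```python
import pandas as pd

# Only the five rows shown by df.head() above; the real souba.csv has 41 rows.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "rent": [4.3, 4.2, 5.0, 4.5, 4.6],
    "area": [13.00, 14.35, 15.60, 13.09, 12.92],
    "age": [26, 35, 33, 32, 38],
    "minutes": [6, 5, 9, 5, 7],
})

print(df.shape)            # (number of rows, number of columns)
print(df.dtypes)           # every column should be numeric

# Count missing values per column; a linear regression cannot handle NaN.
missing = df.isnull().sum()
print(missing)

# Basic summary statistics (min, max, mean, quartiles) per column.
print(df.describe())
```

With the real data, the same calls would be run directly on the DataFrame returned by pd.read_csv.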
Drop the id column, which carries no information relevant to the analysis.
data = df.drop(["id"], axis = 1)
data.head()
   rent   area  age  minutes
0   4.3  13.00   26        6
1   4.2  14.35   35        5
2   5.0  15.60   33        9
3   4.5  13.09   32        5
4   4.6  12.92   38        7
Split the data into training data and test data at a ratio of 7:3.
# Split into training data and test data
X = data[["area", "age", "minutes"]].values
y = data["rent"].values
from sklearn.model_selection import train_test_split, cross_validate, KFold
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)
X_train.shape, X_test.shape
((28, 3), (13, 3))
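The 28/13 split follows from the sample count: with 41 rows and test_size = 0.3, the test size is rounded up to 13 and the rest go to training. The call above has no fixed seed, so the rows that land in each set change on every run. A minimal sketch of a reproducible split, assuming placeholder arrays of the same shape as the real data and an arbitrarily chosen random_state:

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 41
# Placeholder arrays with the same shape as X and y in the article.
X_demo = np.zeros((n_samples, 3))
y_demo = np.zeros(n_samples)

# random_state=0 is an arbitrary choice; any fixed integer makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# 41 * 0.3 = 12.3, rounded up to 13 test rows, leaving 28 training rows.
n_test = math.ceil(n_samples * 0.3)
print(n_test, X_tr.shape, X_te.shape)
```

Fixing the seed would change which rows are selected, so scores obtained this way would not match the outputs printed in this article.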
Create a DataFrame for each of the training data and the test data.
# Create DataFrames
data_train = pd.DataFrame(X_train,columns = ["area", "age", "minutes"])
data_train["rent"] = y_train
data_test = pd.DataFrame(X_test,columns = ["area", "age", "minutes"])
data_test["rent"] = y_test
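As an alternative to converting to arrays and rebuilding DataFrames, train_test_split can also be applied to a DataFrame directly, which keeps the column names. This is only a sketch of that variant, using the five rows shown earlier and an arbitrary random_state; it is not the procedure the article itself follows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Only the five rows shown by data.head() above, for illustration.
data = pd.DataFrame({
    "rent": [4.3, 4.2, 5.0, 4.5, 4.6],
    "area": [13.00, 14.35, 15.60, 13.09, 12.92],
    "age": [26, 35, 33, 32, 38],
    "minutes": [6, 5, 9, 5, 7],
})

# Passing the DataFrame itself returns two DataFrames with the same columns.
data_train, data_test = train_test_split(data, test_size=0.3, random_state=0)

print(data_train)
print(data_test)
```

Both results keep the original row index, which makes it easy to trace a row back to the source file.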
First, let's visualize the training data. The purposes of visualization are:
① To roughly grasp the relationships across the whole dataset
② To find outliers
sns.pairplot(data_train)
Looking at the relationship between rent and the other variables, the following can be read from the plots:
・ area appears to be positively correlated with rent
・ minutes (walking time to the station) appears to have a weak correlation, not as strong as area
・ age appears to be close to a negative correlation
There is a single outlier in area (a property of 40 m² or more), so exclude it from the training data.
# Handle outliers
data_train = data_train.query("area < 40")
sns.pairplot(data_train)
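The threshold of 40 above was chosen by eye from the pair plot. A rule-based check can support that choice; the sketch below uses the common 1.5 × IQR rule, which is a swapped-in technique and not what the article does. The area values are the five from the head of the data plus one hypothetical large value (45.0) added purely to illustrate the rule.

```python
import pandas as pd

# Five areas from the head of the data, plus a hypothetical outlier of 45.0.
demo = pd.DataFrame({"area": [13.00, 14.35, 15.60, 13.09, 12.92, 45.0]})

q1 = demo["area"].quantile(0.25)
q3 = demo["area"].quantile(0.75)
iqr = q3 - q1

# Values outside [q1 - 1.5 * IQR, q3 + 1.5 * IQR] are flagged as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (demo["area"] < lower) | (demo["area"] > upper)

outliers = demo[is_outlier]
cleaned = demo[~is_outlier]

print(outliers)
print(len(cleaned))
```

As in the article, any such filter should be derived from the training data only.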
Next, let's calculate the correlation coefficient.
data_train.corr()
             area       age   minutes      rent
area     1.000000 -0.148793  0.207465  0.925012
age     -0.148793  1.000000 -0.088959 -0.315480
minutes  0.207465 -0.088959  1.000000  0.154623
rent     0.925012 -0.315480  0.154623  1.000000
Looking at the correlation between rent and each variable, area has a very strong positive correlation (0.93), age has a weak negative correlation (-0.32), and minutes has almost no correlation (0.15).
From the above, it seems reasonable to prioritize the features in the order area > age > minutes.
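The ordering can be derived mechanically by sorting the absolute correlations with rent. A small sketch, using the three coefficients copied from the rent row of the table above (with the real data, the equivalent starting point would be the rent column of data_train.corr()):

```python
import pandas as pd

# Correlations with rent, copied from the correlation table above.
corr_with_rent = pd.Series(
    {"area": 0.925012, "age": -0.315480, "minutes": 0.154623}
)

# Rank by strength of the relationship, ignoring the sign.
ranking = corr_with_rent.abs().sort_values(ascending=False)
print(ranking)
```

The absolute value is used because a strong negative correlation is as informative for a linear model as a strong positive one.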
【Point】
・ Be sure to select the model using only the training data, so that the test data stays unseen until the final evaluation.
First, let's create a linear regression model that uses all of the variables as explanatory variables. Here we use 3-fold cross-validation.
X_train = data_train[["area", "age","minutes"]].values
y_train = data_train["rent"].values
from sklearn.linear_model import LinearRegression
LR = LinearRegression()
res_1 = cross_validate(LR, X = X_train, y = y_train, scoring = "r2", cv = KFold(n_splits = 3, shuffle = True), return_train_score = True)
res_1
#Score
{'fit_time': array([0. , 0.00254989, 0.00099754]),
'score_time': array([0.00401378, 0. , 0. ]),
'test_score': array([-0.11587073, 0.67717773, 0.65301948]),
'train_score': array([0.96132849, 0.66332986, 0.90608152])}
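The first fold above has a negative test score. That is possible because the coefficient of determination is defined as R² = 1 - SS_res / SS_tot, which drops below zero whenever the model predicts worse than simply using the mean of the true values. The sketch below computes it by hand and compares it with r2_score; the true values are the five rents from the head of the data, and both prediction vectors are hypothetical, made up only to show a good and a bad case.

```python
import numpy as np
from sklearn.metrics import r2_score


def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot


y_true = [4.3, 4.2, 5.0, 4.5, 4.6]      # rents from the head of the data
good_pred = [4.4, 4.2, 4.9, 4.5, 4.5]   # hypothetical predictions close to the truth
bad_pred = [5.0, 4.9, 4.2, 5.1, 4.0]    # hypothetical predictions worse than the mean

print(r2_manual(y_true, good_pred), r2_score(y_true, good_pred))
print(r2_manual(y_true, bad_pred), r2_score(y_true, bad_pred))
```

With only about nine rows per test fold, a single poorly predicted property can be enough to push a fold's score below zero.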
Next, let's create a linear regression model that excludes minutes (walking time to the station) from the explanatory variables.
X_train = data_train[["area", "age"]].values
y_train = data_train["rent"].values
LR = LinearRegression()
res_2 = cross_validate(LR, X = X_train, y = y_train, scoring = "r2", cv = KFold(n_splits = 3, shuffle = True), return_train_score = True)
res_2
#Score
{'fit_time': array([0.00318336, 0.0009973 , 0.0010891 ]),
'score_time': array([0. , 0.00185704, 0. ]),
'test_score': array([0.32138226, 0.87606109, 0.83912145]),
'train_score': array([0.94496232, 0.88506422, 0.80168397])}
Comparing the test_score values, the latter model, which excludes minutes from the explanatory variables, has the higher coefficient of determination (a mean of about 0.68 versus about 0.40 across the three folds), so we adopt it as the model.
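One caveat about this comparison: each cross_validate call above builds its own shuffled KFold without a seed, so the two models are scored on different folds. A fairer comparison reuses the same folds for both feature sets. The following sketch shows one way to do that; the data is synthetic and hypothetical (generated with made-up coefficients), and random_state=0 is an arbitrary choice, so the printed scores say nothing about the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Hypothetical synthetic data standing in for the training set.
rng = np.random.default_rng(0)
n = 27
area = rng.uniform(12, 25, n)
age = rng.uniform(5, 40, n)
minutes = rng.uniform(1, 15, n)
rent = 0.2 * area - 0.02 * age + rng.normal(0, 0.2, n)  # made-up relationship

X_all = np.column_stack([area, age, minutes])  # area, age, minutes
X_sub = X_all[:, :2]                           # area, age only

# A fixed random_state makes every split() call yield the same folds.
cv = KFold(n_splits=3, shuffle=True, random_state=0)


def mean_test_r2(X, y):
    """Mean R^2 over the shared folds for a plain linear regression."""
    res = cross_validate(LinearRegression(), X, y, scoring="r2", cv=cv)
    return float(res["test_score"].mean())


score_all = mean_test_r2(X_all, rent)
score_sub = mean_test_r2(X_sub, rent)
print(score_all, score_sub)
```

Comparing the mean over folds, instead of individual fold scores, also reduces the influence of one unlucky split.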
Train the final model on all of the training data, using area and age as features.
X_train = data_train[["area", "age"]].values
y_train = data_train["rent"].values
LR = LinearRegression()
LR.fit(X_train, y_train)
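After fitting, the learned coefficients and intercept can be inspected to see how each feature moves the predicted rent. A minimal sketch, fitted on only the five rows from the head of the data, so the numbers it prints are illustrative and are not the coefficients of the article's model:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# area and age for the five rows shown at the top of the article.
X_demo = np.array([
    [13.00, 26],
    [14.35, 35],
    [15.60, 33],
    [13.09, 32],
    [12.92, 38],
])
y_demo = np.array([4.3, 4.2, 5.0, 4.5, 4.6])  # rent for the same rows

model = LinearRegression().fit(X_demo, y_demo)

# One coefficient per feature, in the same order as the columns of X_demo.
for name, coef in zip(["area", "age"], model.coef_):
    print(name, coef)
print("intercept", model.intercept_)

# A prediction is the intercept plus the weighted sum of the features.
manual = X_demo @ model.coef_ + model.intercept_
print(np.allclose(manual, model.predict(X_demo)))
```

On the real model, the same attributes are available as LR.coef_ and LR.intercept_.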
Feed the test data into the trained model and verify its accuracy.
X_test = data_test[["area", "age"]].values
y_test = data_test["rent"].values
y_pred = LR.predict(X_test)
from sklearn.metrics import r2_score
r2_score(y_true = y_test, y_pred = y_pred)
#Model evaluation
0.8623439786705054
The coefficient of determination on the test data is about 0.86, which is above 0.8, so the model appears to have a reasonable degree of accuracy.
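The coefficient of determination is unitless, so it can be useful to also report an error in the same unit as rent. The sketch below computes the mean absolute error and the root mean squared error; the true values are the five rents from the head of the data and the predictions are hypothetical, so the numbers do not describe the article's model.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([4.3, 4.2, 5.0, 4.5, 4.6])  # rents from the head of the data
y_hat = np.array([4.4, 4.2, 4.9, 4.5, 4.5])   # hypothetical predictions

# Average absolute deviation between prediction and truth.
mae = mean_absolute_error(y_true, y_hat)

# Square root of the mean squared error; penalizes large misses more than MAE.
rmse = np.sqrt(mean_squared_error(y_true, y_hat))

print(mae, rmse)
```

With the real model, y_test and y_pred from the evaluation step would take the place of the two arrays.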