To determine the appropriate rent for a studio apartment in a certain area, let's build a model in Python that predicts rent from other variables with high accuracy.

Data to use

41 records (souba.csv) on studio apartments in a certain area

Data set contents

id: Property ID
rent: Rent (yen)
area: Area (m²)
age: Age (years)
minutes: Walk from the station (minutes)

Overall flow

Data preparation
Grasp the data from various axes
Feature selection and model comparison
Learning and evaluating the model

Practice

1) Data preparation

First, import and confirm the data.

import numpy  as np
import pandas as pd
import seaborn as sns

df = pd.read_csv("souba.csv")
df.head()

	id	rent	area	age	minutes
0	1	4.3	13.00	26	6
1	2	4.2	14.35	35	5
2	3	5.0	15.60	33	9
3	4	4.5	13.09	32	5
4	5	4.6	12.92	38	7

Drop the id variable, which carries no information for the analysis.

data = df.drop(["id"], axis = 1)
data.head()

	rent	area	age	minutes
0	4.3	13.00	26	6
1	4.2	14.35	35	5
2	5.0	15.60	33	9
3	4.5	13.09	32	5
4	4.6	12.92	38	7

Split the data into training data and test data at a ratio of 7:3.

# Split into training data and test data
X = data[["area", "age", "minutes"]].values
y = data["rent"].values

from sklearn.model_selection import train_test_split, cross_validate, KFold
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)
X_train.shape, X_test.shape

((28, 3), (13, 3))
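Because train_test_split shuffles the rows randomly, the split (and every score that follows) changes from run to run. The following is a minimal sketch of pinning the split with random_state. It assumes made-up data in place of souba.csv, and the seed 0 is an arbitrary choice, not something the article uses.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 41 rows of souba.csv (values are made up).
rng = np.random.default_rng(0)
X = rng.uniform(10, 30, size=(41, 3))
y = rng.uniform(3, 8, size=41)

# random_state=0 is an arbitrary seed chosen for this sketch.
split_a = train_test_split(X, y, test_size=0.3, random_state=0)
split_b = train_test_split(X, y, test_size=0.3, random_state=0)

X_train, X_test, y_train, y_test = split_a
print(X_train.shape, X_test.shape)

# The same seed reproduces exactly the same training rows.
same_split = np.array_equal(split_a[0], split_b[0])
print(same_split)
```

With 41 rows and test_size = 0.3, the test set is rounded up to 13 rows, which is why the shapes are (28, 3) and (13, 3).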

Create a DataFrame for each of the training data and the test data.

# Create the DataFrames
data_train = pd.DataFrame(X_train,columns = ["area", "age", "minutes"])
data_train["rent"] = y_train

data_test = pd.DataFrame(X_test,columns = ["area", "age", "minutes"])
data_test["rent"] = y_test

2) Grasp the data from various axes

First, let's visualize the training data. The purpose of visualization is to:
① Roughly grasp the relationships across the entire data
② Find outliers

sns.pairplot(data_train)

Looking at the relationship between rent and the other variables, we can read the following:
・ area appears to be positively correlated with rent
・ minutes (walk from the station) appears to be correlated with rent, though more weakly than area
・ age appears to be negatively correlated with rent

The area column contains a single outlier, so we exclude it.

# Remove the outlier
data_train = data_train.query("area < 40")
sns.pairplot(data_train)
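The threshold of 40 above was chosen by eye from the pair plot. As an alternative that the article does not use, the common 1.5 × IQR rule can flag such points automatically. The sketch below runs on made-up area values; the name demo and the numbers are illustrative only.

```python
import pandas as pd

# Made-up areas with one obviously large value, for illustration only.
demo = pd.DataFrame({"area": [13.0, 14.35, 15.6, 13.09, 12.92, 16.2, 45.0]})

# Interquartile range and the usual upper fence at Q3 + 1.5 * IQR.
q1 = demo["area"].quantile(0.25)
q3 = demo["area"].quantile(0.75)
upper = q3 + 1.5 * (q3 - q1)

outliers = demo[demo["area"] > upper]
kept = demo[demo["area"] <= upper]
print(outliers)
```

On a sample this small, a rule like this is best treated as a check on the visual judgment rather than a replacement for it.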

Next, let's calculate the correlation coefficient.

data_train.corr()

	area	age	minutes	rent
area	1.000000	-0.148793	0.207465	0.925012
age	-0.148793	1.000000	-0.088959	-0.315480
minutes	0.207465	-0.088959	1.000000	0.154623
rent	0.925012	-0.315480	0.154623	1.000000

Looking at the correlation between rent and each variable, area has a very strong positive correlation (0.93), age has a weak negative correlation (-0.32), and minutes has almost no correlation (0.15).

From the above, it seems reasonable to adopt the features in the priority order area > age > minutes.
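The ordering can be reproduced by sorting the correlations with rent by absolute value, since a negative correlation is as informative as a positive one of the same size. A small sketch using the three values copied from the correlation table (the names corr_with_rent and ranking are illustrative):

```python
import pandas as pd

# Correlations with rent, copied from the correlation table.
corr_with_rent = pd.Series(
    {"area": 0.925012, "age": -0.315480, "minutes": 0.154623}
)

# Rank by magnitude so the sign of the correlation does not affect the order.
ranking = corr_with_rent.abs().sort_values(ascending=False)
print(ranking.index.tolist())
```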

3) Feature selection and model comparison

[Point] Be sure to select the model using only the training data.

First, let's build a linear regression model that uses all variables as explanatory variables. We evaluate it with 3-fold cross-validation.

from sklearn.linear_model import LinearRegression

X_train = data_train[["area", "age","minutes"]].values
y_train = data_train["rent"].values

LR = LinearRegression()
res_1 = cross_validate(LR, X = X_train, y = y_train, scoring = "r2", cv = KFold(n_splits = 3, shuffle = True), return_train_score = True)
res_1

#Score
{'fit_time': array([0.        , 0.00254989, 0.00099754]),
 'score_time': array([0.00401378, 0.        , 0.        ]),
 'test_score': array([-0.11587073,  0.67717773,  0.65301948]),
 'train_score': array([0.96132849, 0.66332986, 0.90608152])}

Next, let's build a linear regression model that excludes minutes (the walk from the station) from the explanatory variables.

X_train = data_train[["area", "age"]].values
y_train = data_train["rent"].values

LR = LinearRegression()
res_2 = cross_validate(LR, X = X_train, y = y_train, scoring = "r2", cv = KFold(n_splits = 3, shuffle = True), return_train_score = True)
res_2

#Score
{'fit_time': array([0.00318336, 0.0009973 , 0.0010891 ]),
 'score_time': array([0.        , 0.00185704, 0.        ]),
 'test_score': array([0.32138226, 0.87606109, 0.83912145]),
 'train_score': array([0.94496232, 0.88506422, 0.80168397])}

Comparing the test_score values, the latter model, which excludes the walk from the station from the explanatory variables, has the higher coefficients of determination, so we adopt it as our model.
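To make the comparison concrete, the per-fold scores can be summarized by their mean and standard deviation. The sketch below copies the two test_score arrays from the outputs above; the variable names are illustrative.

```python
import numpy as np

# test_score arrays copied from the two cross_validate outputs.
scores_all = np.array([-0.11587073, 0.67717773, 0.65301948])        # area, age, minutes
scores_no_minutes = np.array([0.32138226, 0.87606109, 0.83912145])  # area, age

mean_all = scores_all.mean()
mean_no_minutes = scores_no_minutes.mean()

print(f"all features: mean={mean_all:.3f}, std={scores_all.std():.3f}")
print(f"area and age: mean={mean_no_minutes:.3f}, std={scores_no_minutes.std():.3f}")
```

The spread between folds is large in both cases, which is expected with fewer than 30 training rows. Since KFold is called with shuffle = True and no fixed seed, the two models were also scored on different folds, so the gap is suggestive rather than conclusive.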

4) Learning and evaluating the model

Train the model with area and age as the features.

X_train = data_train[["area", "age"]].values
y_train = data_train["rent"].values

LR = LinearRegression()
LR.fit(X_train, y_train)
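After fitting, the coef_ and intercept_ attributes of LinearRegression show how each feature moves the predicted rent. The sketch below uses made-up, noise-free data generated from a known formula (the coefficients 0.2, -0.03 and 2.0 are assumptions for illustration, not values from souba.csv), so the fitted values can be checked against it.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data following rent = 0.2 * area - 0.03 * age + 2.0 exactly.
rng = np.random.default_rng(0)
area = rng.uniform(10, 30, size=20)
age = rng.uniform(0, 40, size=20)
X_demo = np.column_stack([area, age])
y_demo = 0.2 * area - 0.03 * age + 2.0

model = LinearRegression().fit(X_demo, y_demo)

# One coefficient per feature, in the column order of X_demo.
for name, coef in zip(["area", "age"], model.coef_):
    print(f"{name}: {coef:+.3f}")
print(f"intercept: {model.intercept_:.3f}")
```

Applied to the real model, a positive coefficient for area and a negative one for age would agree with the signs seen in the correlation table.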

Feed the test data to the trained model and check its accuracy.

from sklearn.metrics import r2_score

X_test = data_test[["area", "age"]].values
y_test = data_test["rent"].values

y_pred = LR.predict(X_test)
r2_score(y_true = y_test, y_pred = y_pred)

#Model evaluation
0.8623439786705054

The coefficient of determination on the test data is above 0.8 (about 0.86), so the model appears to have a reasonable degree of accuracy.
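For reference, the coefficient of determination is defined as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on made-up values and compares the result with r2_score. The arrays y_true and y_hat are illustrative and are not the article's test data.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up actual and predicted rents, for illustration only.
y_true = np.array([4.3, 4.2, 5.0, 4.5, 4.6])
y_hat = np.array([4.4, 4.1, 4.8, 4.6, 4.5])

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_hat) ** 2)          # error left after the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of always predicting the mean
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_hat))
```

A value of 1 means perfect predictions, 0 means no better than predicting the mean rent, and negative values (as in one of the cross-validation folds) mean worse than that baseline.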

[Python] Predict the appropriate rent for an apartment