[Python] Predict the appropriate rent for an apartment

To determine the appropriate rent for a studio apartment in a certain area, let's build a model in Python that predicts rent from the other variables with high accuracy.

Data to use

41 records on studio apartments in a certain area (souba.csv)

Data set contents

id: Property id
rent: Rent (yen)
area: Area (m²)
age: Age of the building (years)
minutes: Walking time to the station (minutes)

Overall flow

  1. Data preparation
  2. Grasp the data from various axes
  3. Feature selection and model comparison
  4. Learning and evaluating the model


1) Data preparation

First, import and confirm the data.

import numpy  as np
import pandas as pd
import seaborn as sns

df = pd.read_csv("souba.csv")
df.head()  # show the first five rows

	id	rent	area	age	minutes
0	1	4.3	13.00	26	6
1	2	4.2	14.35	35	5
2	3	5.0	15.60	33	9
3	4	4.5	13.09	32	5
4	5	4.6	12.92	38	7

Drop the id column, which carries no information relevant to the analysis.

data = df.drop(["id"], axis=1)
data.head()  # show the first five rows

	rent	area	age	minutes
0	4.3	13.00	26	6
1	4.2	14.35	35	5
2	5.0	15.60	33	9
3	4.5	13.09	32	5
4	4.6	12.92	38	7

Split the data into training data and test data at a ratio of 7:3.

# Separate the explanatory variables (X) from the target (y)
X = data[["area", "age", "minutes"]].values
y = data["rent"].values

# Split into training data and test data at 7:3
from sklearn.model_selection import train_test_split, cross_validate, KFold
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
X_train.shape, X_test.shape

((28, 3), (13, 3))
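
Because train_test_split shuffles the rows randomly, each run produces a different split unless a seed is fixed. The following is a minimal sketch, assuming a synthetic stand-in for souba.csv (all values are made up) and an arbitrary seed of 0. It only illustrates that passing `random_state` makes the 28/13 split repeatable; it is not the split used for the results in this article.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for souba.csv: 41 rows of made-up values, for illustration only
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "area": rng.uniform(12, 25, size=41),
    "age": rng.integers(1, 40, size=41),
    "minutes": rng.integers(1, 15, size=41),
})
data["rent"] = 0.25 * data["area"] - 0.02 * data["age"] + 1.5

X = data[["area", "age", "minutes"]].values
y = data["rent"].values

# random_state=0 is an arbitrary choice; any fixed integer gives a repeatable split
split_a = train_test_split(X, y, test_size=0.3, random_state=0)
split_b = train_test_split(X, y, test_size=0.3, random_state=0)

X_train, X_test, y_train, y_test = split_a
print(X_train.shape, X_test.shape)

# The two calls return identical training rows because the seed is the same
same_split = np.array_equal(split_a[0], split_b[0])
print(same_split)
```

Fixing the seed is mainly useful when the numbers in a write-up need to be reproduced later.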

Create a DataFrame for the training data and another for the test data.

#Create a Dataframe
data_train = pd.DataFrame(X_train,columns = ["area", "age", "minutes"])
data_train["rent"] = y_train

data_test = pd.DataFrame(X_test,columns = ["area", "age", "minutes"])
data_test["rent"] = y_test

2) Grasp the data from various axes

First, let's visualize the training data. Visualization has two purposes: (1) to get a rough grasp of the relationships across the whole dataset, and (2) to find outliers.



Looking at the relationship between rent and the other variables, we can read the following:
・ area appears to be correlated with rent
・ minutes (walking time from the station) appears to have a weak correlation, not as strong as area
・ age (building age) appears to be close to a negative correlation

There is exactly one outlier, in area, so exclude it.

# Handle the outlier
data_train = data_train.query("area < 40")
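
Before dropping rows, it can help to look at exactly which rows the filter removes. The sketch below uses a tiny made-up DataFrame (the last row is an invented outlier) together with the same threshold of 40 as above.

```python
import pandas as pd

# Made-up rows for illustration; the last one has an unusually large area
data_train = pd.DataFrame({
    "area": [13.0, 14.35, 15.6, 13.09, 45.0],
    "age": [26, 35, 33, 32, 10],
    "minutes": [6, 5, 9, 5, 3],
    "rent": [4.3, 4.2, 5.0, 4.5, 12.0],
})

# Rows that the filter would remove
outliers = data_train.query("area >= 40")
print(outliers)

# Keep everything else
cleaned = data_train.query("area < 40")
print(len(data_train), "->", len(cleaned))
```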


Next, let's calculate the correlation coefficients between the variables.


	area	age	minutes	rent
area	1.000000	-0.148793	0.207465	0.925012
age	-0.148793	1.000000	-0.088959	-0.315480
minutes	0.207465	-0.088959	1.000000	0.154623
rent	0.925012	-0.315480	0.154623	1.000000
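
A matrix like the one above is what pandas returns from `DataFrame.corr()`, which computes Pearson correlation by default. The sketch below assumes that is how it was produced and uses a synthetic stand-in for data_train with made-up values, so the numbers it prints will differ from the table.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for data_train: 27 rows of made-up values, for illustration only
rng = np.random.default_rng(0)
n = 27
area = rng.uniform(12, 25, size=n)
age = rng.integers(1, 40, size=n).astype(float)
minutes = rng.integers(1, 15, size=n).astype(float)
rent = 0.25 * area - 0.02 * age + rng.normal(0, 0.2, size=n)
data_train = pd.DataFrame({"area": area, "age": age, "minutes": minutes, "rent": rent})

# Pairwise Pearson correlation coefficients between all columns
corr = data_train.corr()
print(corr)

# Rank the explanatory variables by the absolute strength of their correlation with rent
ranking = corr["rent"].drop("rent").abs().sort_values(ascending=False)
print(ranking)
```

Sorting by absolute value gives a first candidate order for adding features.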

Looking at the correlation between rent and each variable, area has a very strong positive correlation (0.93), age has a weak negative correlation (-0.32), and minutes has almost no correlation (0.15).

From the above, it seems reasonable to adopt the variables as features in the order area > age > minutes.

3) Feature selection and model comparison

【Point】 Be sure to select the model using only the training data, so that the test data remains unseen until the final evaluation.

First, let's create a linear regression model that uses all of the variables as explanatory variables. Here we use 3-fold cross-validation.

from sklearn.linear_model import LinearRegression

X_train = data_train[["area", "age", "minutes"]].values
y_train = data_train["rent"].values

# 3-fold cross-validation, scored by the coefficient of determination (R^2)
LR = LinearRegression()
res_1 = cross_validate(LR, X=X_train, y=y_train, scoring="r2", cv=KFold(n_splits=3, shuffle=True), return_train_score=True)

{'fit_time': array([0.        , 0.00254989, 0.00099754]),
 'score_time': array([0.00401378, 0.        , 0.        ]),
 'test_score': array([-0.11587073,  0.67717773,  0.65301948]),
 'train_score': array([0.96132849, 0.66332986, 0.90608152])}

Next, let's create a linear regression model that excludes minutes (walking time to the station) from the explanatory variables.

X_train = data_train[["area", "age"]].values
y_train = data_train["rent"].values

LR = LinearRegression()
res_2 = cross_validate(LR, X = X_train, y = y_train, scoring = "r2", cv = KFold(n_splits = 3, shuffle = True), return_train_score = True)

{'fit_time': array([0.00318336, 0.0009973 , 0.0010891 ]),
 'score_time': array([0.        , 0.00185704, 0.        ]),
 'test_score': array([0.32138226, 0.87606109, 0.83912145]),
 'train_score': array([0.94496232, 0.88506422, 0.80168397])}

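Comparing the test_score values, the latter model, which excludes the station walk from the explanatory variables, has the higher coefficient of determination (a mean of about 0.68 across the three folds, versus about 0.40 for the model with all variables), so we will use it as the model.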
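
Each cross_validate call above builds its own shuffled KFold, so the two models are scored on different folds. The sketch below shows one way to compare them on identical folds by reusing a single KFold with a fixed seed. It assumes a synthetic stand-in for data_train (made-up values) and an arbitrary seed of 0, and is not the procedure that produced the scores above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in for data_train: 27 rows of made-up values, for illustration only
rng = np.random.default_rng(0)
n = 27
area = rng.uniform(12, 25, size=n)
age = rng.integers(1, 40, size=n).astype(float)
minutes = rng.integers(1, 15, size=n).astype(float)
rent = 0.25 * area - 0.02 * age + rng.normal(0, 0.2, size=n)
data_train = pd.DataFrame({"area": area, "age": age, "minutes": minutes, "rent": rent})

# A fixed random_state yields the same folds for every candidate (the seed 0 is arbitrary)
cv = KFold(n_splits=3, shuffle=True, random_state=0)
y_train = data_train["rent"].values

mean_scores = {}
for features in (["area", "age", "minutes"], ["area", "age"]):
    X_train = data_train[features].values
    res = cross_validate(LinearRegression(), X=X_train, y=y_train, scoring="r2", cv=cv)
    # Average the per-fold R^2 so each candidate is summarized by one number
    mean_scores[tuple(features)] = res["test_score"].mean()

for features, score in mean_scores.items():
    print(features, round(score, 3))
```

With so few rows per fold, individual fold scores swing widely, so the mean is a steadier basis for comparison than any single fold.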

4) Learning and evaluating the model

Train the model using area and age as the features.

X_train = data_train[["area", "age"]].values
y_train = data_train["rent"].values

LR = LinearRegression()
LR.fit(X_train, y_train)
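
After fitting, a LinearRegression object exposes the learned weights through `coef_` and `intercept_`, which is a quick sanity check on the direction of each effect. The sketch below uses made-up, noise-free data generated from known coefficients (0.25, -0.02 and 1.5 are arbitrary choices) so that the recovered values can be checked against them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up, noise-free training data generated from known coefficients
rng = np.random.default_rng(0)
area = rng.uniform(12, 25, size=27)
age = rng.uniform(1, 40, size=27)
X_train = np.column_stack([area, age])
y_train = 0.25 * area - 0.02 * age + 1.5

LR = LinearRegression()
LR.fit(X_train, y_train)

# coef_ follows the column order of X_train: area first, then age
coef_area, coef_age = LR.coef_
intercept = LR.intercept_
print("area:", coef_area, "age:", coef_age, "intercept:", intercept)
```

On the real data, a positive weight for area and a negative weight for age would agree with the correlations seen earlier.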

Feed the test data into the trained model and check its accuracy.

from sklearn.metrics import r2_score

# Model evaluation
X_test = data_test[["area", "age"]].values
y_test = data_test["rent"].values

y_pred = LR.predict(X_test)
r2_score(y_true=y_test, y_pred=y_pred)

As a result, the coefficient of determination on the test data was 0.8 or higher, so the model appears to have a reasonable degree of accuracy. Because the train/test split is random, the exact value varies from run to run.
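
For reference, the coefficient of determination is R^2 = 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the true values from their mean. It becomes negative when the model predicts worse than simply using the mean. The sketch below checks the formula against r2_score. The y_test values are the first five rent values from the table at the top, and the predictions are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

y_test = np.array([4.3, 4.2, 5.0, 4.5, 4.6])  # first five rent values from the sample table
y_pred = np.array([4.4, 4.1, 4.9, 4.6, 4.5])  # made-up predictions, for illustration only

# Residual sum of squares: how far the predictions are from the true values
ss_res = np.sum((y_test - y_pred) ** 2)
# Total sum of squares: how far the true values are from their own mean
ss_tot = np.sum((y_test - y_test.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
r2_sklearn = r2_score(y_true=y_test, y_pred=y_pred)
print(r2_manual, r2_sklearn)
```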
