[For super beginners] Python environment setup, scraping, and machine learning in practice, all enjoyable with copy-paste [Let's find a great rental property with SUUMO!]

Introduction

Do you like **data analysis**?

Nice to meet you! My name is @haraso_1130, and I am a mentor at DMM WEBCAMP.

First, take a look at the image below.

スクリーンショット 2019-12-15 18.13.43.png

A 5DK for 80,000 yen a month in Tokyo's 23 wards!?

In the 23 wards, 80,000 yen a month is what you would expect to pay for a mere studio...

This is a property I discovered myself by collecting data with **Python** through **scraping** and analyzing it with **machine learning**.

The aim of this article is to let you actually analyze data (basically) just by copy-pasting, and to make you like data analysis. In other words, the goal is to leave you with the impression **"I can do something amazing!"**

So if you hit a passage and think "I don't understand this", just assume "the author is bad at explaining" and keep going.

The intended readers are:
**・People who are interested in data analysis
・People who have been avoiding data analysis
・Everyone else**

Accordingly, I will not explain the code in depth, but will instead focus on **"what are we doing now, and why?"** Also, although the title says "it works with copy-paste", I have tried to write this so that you can feel the fun of data analysis just by **reading**!

This is just a plug, but if this article makes you think "I want to analyze data!", I wrote about self-study methods on my blog, so please have a look.

What to do in this article

**・From building a Python environment to implementing scraping and machine learning**
**・Creating a "bargain property" search table like the image below**: we use the trained model to discover underpriced properties. スクリーンショット 2019-12-17 13.30.58.png

What not to do in this article

・**Detailed explanation of the code**
・**Explanation of the machine learning algorithms**

This article exists purely to make people like data analysis, so I will omit fiddly and difficult topics as much as possible! (I feel that for learning programming, "practice → theory/basics" is overwhelmingly more efficient than "theory/basics → practice"...)

My environment

macOS High Sierra ver 10.13.6 (~~no comments on how old it is, please~~)
Python 3.7.3
JupyterLab

Caution

The title says **"it works with copy-paste"**, and when I had a few acquaintances try it, it worked almost flawlessly in **Jupyter Notebook or JupyterLab** (as of December 17, 2019). If it doesn't work, please let me know... The most likely cause of failure is a missing library; install whatever is missing with conda install.
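For reference, here is how the libraries used in this article could be installed up front (the package names are my assumption of the usual conda-forge / PyPI names; they are not from the original article):

```
conda install -c conda-forge beautifulsoup4 requests pandas lightgbm pandas-profiling
pip install japanize-matplotlib
```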

Table of contents

Table of contents
Overall flow
Environment
Scraping
Data analysis
Discovering great deals
Actually...
At the end

Overall flow

First, let's check the overall flow. In this article, the steps are:

  1. **Build an environment** for running the code
  2. Automatically collect a large amount of listing data with **scraping**
  3. Build a **machine learning model** from the collected data
  4. **Predict rents** using the created model
  5. Compare the predictions with the actual prices to **find the bargains**


Being told "a value predicted by a model" may not mean much to you yet, but don't worry: it's explained in the machine learning part below!

Environment

As mentioned earlier, all the code in this article is intended to run in Jupyter. Here we introduce how to install **JupyterLab**.

First, install Anaconda from the link below. 【https://www.anaconda.com/distribution/】

スクリーンショット 2019-12-18 3.48.04.png

Afterwards, follow the guide for your OS to finish installing Anaconda:
for Windows【https://www.python.jp/install/anaconda/windows/install.html】
for Mac【https://www.python.jp/install/anaconda/macos/install.html】

Recent versions of Anaconda ship with JupyterLab from the start, so just run:

$ jupyter lab

Run the above from Command Prompt on Windows, or from the terminal on Mac.

If you already had an older Anaconda installed, the following works too:

$ conda install -c conda-forge jupyterlab
$ jupyter lab


This is the end of environment construction. It's too easy ...!

Create a folder on your desktop and create a notebook there. スクリーンショット 2019-12-18 3.52.26.png
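If you want to make sure the environment is ready, a first cell like this (my suggestion, not part of the original article) confirms that the basics import cleanly:

```python
# Sanity check for the new environment: show the Python version
# and make sure the main libraries import without errors
import sys
print(sys.version)

import pandas
import requests
import bs4  # BeautifulSoup lives in here
print('OK')
```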

Scraping

By the way, what is "scraping" in the first place? According to Wikipedia:

Web scraping is nothing more than the process of automatically collecting information from the WWW.

In other words, **"automatically collecting information from the internet"**. (~~which is pretty much what the name says~~)

This analysis uses data on thousands, and in some cases tens of thousands, of rental listings. For each property we collect:

・Property name
・Rent
・Area
・Floor plan
・Location (nearest station, distance to it, detailed address)
etc...

Typing all of this into Excel by hand, thousands or tens of thousands of times... just thinking about it makes me sick. So instead, we collect the data in one go with a program.

There is one important note here. **Always check a site's terms of service before scraping it.** Some sites prohibit scraping because of copyright issues or the load it places on their servers. Fortunately, SUUMO's terms of service say only:

"The user shall not use any content provided through this site beyond the scope of personal private use stipulated by copyright law without our prior consent."

so we are fine this time, since this is for private use.

Open Jupyter, and in the cell below:

#url (please enter the URL here)
url = ''

enter the URL of your favorite ward among Tokyo's 23 wards between the single quotes, then run the code. You can reach the ward-selection screen from this link:
【https://suumo.jp/chintai/tokyo/city/】
For Bunkyo Ward, for example, the URL is:
https://suumo.jp/jj/chintai/ichiran/FR301FC001/?ar=030&bs=040&ta=13&sc=13101&cb=0.0&ct=9999999&mb=0&mt=9999999&et=9999999&cn=9999999&shkr1=03&shkr2=03&shkr3=03&shkr4=03&sngz=&po1=09&pc=50
Collecting the data takes a long time, so I recommend running it before going to bed.

__* Addendum (2019/12/29): we are aware of cases where the scrape fails when listings are sorted in "recommended order". As a stopgap, sorting by "newest first" seems to work. I will update this again once the root cause is known.__

This code can also collect listings outside Tokyo's 23 wards (when I tried it, I was able to pull data for Okayama Prefecture). However, the data analysis in the second half of this article assumes 23-ward data, so outside Tokyo it often will not work with copy-paste alone.


from bs4 import BeautifulSoup
import re
import requests
import time
import pandas as pd
from pandas import Series, DataFrame

#URL (please enter the URL here)
url = ''

result = requests.get(url)
c = result.content

soup = BeautifulSoup(c, 'html.parser')  # specify the parser explicitly

summary = soup.find("div",{'id':'js-bukkenList'})
body = soup.find("body")
pages = body.find_all("div",{'class':'pagination pagination_set-nav'})
pages_text = str(pages)
pages_split = pages_text.split('</a></li>\n</ol>')
pages_split0 = pages_split[0]
pages_split1 = pages_split0[-3:]
pages_split2 = pages_split1.replace('>','')
pages_split3 = int(pages_split2)

urls = []

urls.append(url)

for i in range(pages_split3-1):
    pg = str(i+2)
    url_page = url + '&page=' + pg
    urls.append(url_page)

names = [] 
addresses = [] 
locations0 = [] 
locations1 = [] 
locations2 = [] 
ages = [] 
heights = [] 
floors = []
rent = [] 
admin = []
others = [] 
floor_plans = [] 
areas = []
detail_urls = [] 

for url in urls:
    result = requests.get(url)
    c = result.content
    soup = BeautifulSoup(c, 'html.parser')
    summary = soup.find("div",{'id':'js-bukkenList'})
    apartments = summary.find_all("div",{'class':'cassetteitem'})

    for apartment in apartments:

        room_number = len(apartment.find_all('tbody'))

        name = apartment.find('div', class_='cassetteitem_content-title').text
        address = apartment.find('li', class_='cassetteitem_detail-col1').text

        for i in range(room_number):
            names.append(name)
            addresses.append(address)

        sublocation = apartment.find('li', class_='cassetteitem_detail-col2')
        cols = sublocation.find_all('div')
        for i in range(len(cols)):
            text = cols[i].find(text=True)
            for j in range(room_number):
                if i == 0:
                    locations0.append(text)
                elif i == 1:
                    locations1.append(text)
                elif i == 2:
                    locations2.append(text)

        age_and_height = apartment.find('li', class_='cassetteitem_detail-col3')
        age = age_and_height('div')[0].text
        height = age_and_height('div')[1].text

        for i in range(room_number):
            ages.append(age)
            heights.append(height)

        table = apartment.find('table')
        rows = []
        rows.append(table.find_all('tr'))

        data = []
        for row in rows:
            for tr in row:
                cols = tr.find_all('td')
                if len(cols) != 0:
                    _floor = cols[2].text
                    _floor = re.sub('[\r\n\t]', '', _floor)

                    _rent_cell = cols[3].find('ul').find_all('li')
                    _rent = _rent_cell[0].find('span').text
                    _admin = _rent_cell[1].find('span').text

                    _deposit_cell = cols[4].find('ul').find_all('li')
                    _deposit = _deposit_cell[0].find('span').text
                    _reikin = _deposit_cell[1].find('span').text
                    _others = _deposit + '/' + _reikin

                    _floor_cell = cols[5].find('ul').find_all('li')
                    _floor_plan = _floor_cell[0].find('span').text
                    _area = _floor_cell[1].find('span').text

                    _detail_url = cols[8].find('a')['href']
                    _detail_url = 'https://suumo.jp' + _detail_url

                    text = [_floor, _rent, _admin, _others, _floor_plan, _area, _detail_url]
                    data.append(text)

        for row in data:
            floors.append(row[0])
            rent.append(row[1])
            admin.append(row[2])
            others.append(row[3])
            floor_plans.append(row[4])
            areas.append(row[5])
            detail_urls.append(row[6])


        time.sleep(3)

names = Series(names)
addresses = Series(addresses)
locations0 = Series(locations0)
locations1 = Series(locations1)
locations2 = Series(locations2)
ages = Series(ages)
heights = Series(heights)
floors = Series(floors)
rent = Series(rent)
admin = Series(admin)
others = Series(others)
floor_plans = Series(floor_plans)
areas = Series(areas)
detail_urls = Series(detail_urls)

suumo_df = pd.concat([names, addresses, locations0, locations1, locations2, ages, heights, floors, rent, admin, others, floor_plans, areas, detail_urls], axis=1)

suumo_df.columns=['Apartment name','Street address','Location 1','Location 2','Location 3','Age','Building height','hierarchy','Rent','Management fee','Deposit/Key money/Guarantee/Amortization','Floor plan','Occupied area','Detailed URL']

suumo_df.to_csv('suumo.csv', sep = '\t', encoding='utf-16', header=True, index=False)

If no error appears within the first ten seconds or so, it is working. Now wait patiently.

Let me explain what this code is doing. Most web pages are written in a language called HTML. Let's check the structure of SUUMO's site using Chrome's developer tools. スクリーンショット 2019-12-17 18.47.06.png

Looking at the above, you can see that the name of each rental is marked with the class cassetteitem_content-title. Scraping works by using marks like this in the HTML to pull out the data you want.
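As a minimal sketch of that idea (the HTML below is invented for illustration; only the class name is taken from the screenshot above):

```python
from bs4 import BeautifulSoup

# Toy HTML imitating one listing block (made up for illustration)
html = '''
<div class="cassetteitem">
  <div class="cassetteitem_content-title">Sample Mansion Hongo</div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
# The class attribute is the "mark" we search for
title = soup.find('div', class_='cassetteitem_content-title')
print(title.text)  # -> Sample Mansion Hongo
```

The full code above does exactly this, just against SUUMO's much larger pages and in a loop over every listing.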

Ideally I would collect every rental listing in Tokyo, but that takes a tremendous amount of time, so this time we focus on a single ward.

Data analysis

Good morning. If the scrape succeeded, you should have a file called suumo.csv in the same directory as the notebook you ran. スクリーンショット 2019-12-17 18.55.05.png The data looks like the above. I still remember the thrill ("Whoaaa!!! Amazing!!!") of the first time I collected a large dataset on my own. How about you?

We will now analyze this data, but before that, let me explain a few basic concepts behind machine learning models.

Supervised learning

What we do this time is a learning method (a data analysis method) called **supervised learning**. According to Wikipedia:

The name comes from the fact that the data given in advance is regarded as an "example (= advice from the teacher)" and learning (= some fitting to the data) is performed using it as a guide.

...apparently. That's not very enlightening, so let's look at a concrete example.

This time, we create a model that predicts the rent from the listing information. We all share some knowledge about rent, such as:

・The larger the area, the higher the rent
・The closer to the station, the higher the rent
・The newer the building, the higher the rent

We have this knowledge for no reason other than having seen many such examples, and with it you can already predict rents to some extent yourself.

Machine learning is about letting machines do this.

We have the machine learn from a large amount of data: it outputs a predicted rent from the area, the distance to the station, the building age, and so on, and it is fit so that the difference between the predicted value and the actual rent (the teacher data) becomes small.

Some terminology: the value to be predicted, such as the rent, is called the **objective variable**, and the pieces of information that characterize it, such as the area and the building age, are called **features**.
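As a toy illustration of these two terms (the numbers and the use of scikit-learn's LinearRegression are my own, not part of this article's pipeline):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy listings (numbers invented for illustration)
data = pd.DataFrame({
    'area': [20, 35, 50, 65],                 # feature
    'age':  [10, 5, 15, 2],                   # feature
    'rent': [70000, 110000, 120000, 180000],  # objective variable (teacher data)
})

model = LinearRegression()
model.fit(data[['area', 'age']], data['rent'])  # learn from the examples

# Predict the rent of an unseen 40 m2, 8-year-old room
print(model.predict(pd.DataFrame({'area': [40], 'age': [8]})))
```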

Overfitting

The biggest difficulty in machine learning is **overfitting**: a model **fitting certain data too closely**.

Suppose we built a model that predicts rent from area; the image below shows the idea. スクリーンショット 2019-12-19 9.15.36.png

Now let's make the model more complex to shrink the error even further. スクリーンショット 2019-12-19 9.15.42.png

Wow! It now predicts the rent perfectly, with 100% accuracy!!!! ...and yet **this is nothing to be happy about**. Take this model and apply it to other data with a similar distribution. スクリーンショット 2019-12-19 9.18.26.png

The simple model on the left performs about as well as before, while the complex model on the right clearly **fails to predict**.

The purpose of machine learning is generally **to create a model that performs as well as possible on unknown data**. The performance of a model on such unknown data is called its **generalization performance**.

So how do we build a model with high generalization performance? The simplest way is to **split the data**, as in the image below. スクリーンショット 2019-12-18 4.53.29.png Generalization performance is then measured as follows:

  1. Divide the data into training data and test data.
  2. Create a model using **only the training data**.
  3. Apply the created model to the **test data** to calculate predicted values.
  4. Measure the model's performance by comparing those predictions with the actual test-data values (the teacher data).

In the machine learning part of this article, 67% of the collected data is used as training data and the remaining 33% as test data.
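Here is a small self-contained sketch of both ideas (the data is synthetic and invented for illustration): a high-degree polynomial fits the training data almost perfectly but will typically score much worse on the held-out test data than a simple straight line.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic "area vs rent" data with noise (values invented for illustration)
rng = np.random.RandomState(0)
x = rng.uniform(0.2, 1.0, 30)                 # area, rescaled to keep the fit numerically stable
rent = 160000 * x + rng.normal(0, 20000, 30)

# The same 67% / 33% split used later in the article
x_train, x_test, y_train, y_test = train_test_split(x, rent, test_size=0.33, random_state=0)

for degree in [1, 12]:
    coefs = np.polyfit(x_train, y_train, degree)              # fit a polynomial of this degree
    train_r2 = r2_score(y_train, np.polyval(coefs, x_train))  # score on data the model has seen
    test_r2 = r2_score(y_test, np.polyval(coefs, x_test))     # score on unseen data
    print(degree, round(train_r2, 3), round(test_r2, 3))
```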

Preprocessing

This is actually where most of the work in data analysis goes: the data must be processed into a form a machine can read.

For example, look at the Age column. スクリーンショット 2019-12-17 19.15.41.png A human sees this and understands that **"new construction < 7 years old < 21 years old"**. It may seem obvious, but a machine cannot read that from these strings. So we need to convert **new construction → 0, 7 years old → 7, 21 years old → 21**, so that the machine can recognize 0 < 7 < 21.
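As a tiny sketch of that conversion (toy values; the real preprocessing code below does the same thing to the scraped columns):

```python
import pandas as pd

# A toy "Age" column in the same style as the scraped data (values invented)
age = pd.Series(['新築', '築7年', '築21年'])

age = age.str.replace('新築', '0')  # "new construction" -> 0 years
age = age.str.replace('築', '')     # strip "built"
age = age.str.replace('年', '')     # strip "years"
print(pd.to_numeric(age).tolist())  # -> [0, 7, 21]
```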

Floor plans such as 2LDK are processed with **one-hot encoding (dummy variables)**, and features such as the nearest station with **label encoding**. (Look these up if you are interested; a toy example follows below.)
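A minimal sketch of the two encodings on an invented floor-plan column (the article's own code later uses scikit-learn's OrdinalEncoder in the same way):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

plans = pd.DataFrame({'Floor plan': ['2LDK', '1K', '2LDK', '3DK']})

# One-hot encoding: one 0/1 column per category
print(pd.get_dummies(plans['Floor plan']))

# Label encoding: each category becomes an integer
oe = OrdinalEncoder()
print(oe.fit_transform(plans[['Floor plan']]))
```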

You may also need to handle missing values. (Missing-value imputation is a swamp, so I won't go into it here...)

The processing applied to the data before training is collectively called **preprocessing**.

Now let's actually execute the code.

import pandas as pd
import numpy as np
from sklearn.preprocessing import OrdinalEncoder
from sklearn import preprocessing
import pandas_profiling as pdp 

df = pd.read_csv('suumo.csv', sep='\t', encoding='utf-16')

# Location strings look like "JR山手線/駒込駅 歩5分", so split on 歩 ("walk")
splitted1 = df['Location 1'].str.split('歩', expand=True)
splitted1.columns = ['Location 11', 'Location 12']
splitted2 = df['Location 2'].str.split('歩', expand=True)
splitted2.columns = ['Location 21', 'Location 22']
splitted3 = df['Location 3'].str.split('歩', expand=True)
splitted3.columns = ['Location 31', 'Location 32']

splitted4 = df['Deposit/Key money/Guarantee/Amortization'].str.split('/', expand=True)
splitted4.columns = ['Security deposit', 'key money']

df = pd.concat([df, splitted1, splitted2, splitted3, splitted4], axis=1)

df.drop(['Location 1','Location 2','Location 3','Deposit/Key money/Guarantee/Amortization'], axis=1, inplace=True)

df = df.dropna(subset=['Rent'])

df['Rent'] = df['Rent'].str.replace(u'万円', u'')            # strip the "10,000 yen" unit
df['Security deposit'] = df['Security deposit'].str.replace(u'万円', u'')
df['key money'] = df['key money'].str.replace(u'万円', u'')
df['Management fee'] = df['Management fee'].str.replace(u'円', u'')  # strip "yen"
df['Age'] = df['Age'].str.replace(u'新築', u'0')              # new construction -> 0 years
df['Age'] = df['Age'].str.replace(u'99年以上', u'0')          # "over 99 years" -> 0, as in the original code
df['Age'] = df['Age'].str.replace(u'築', u'')
df['Age'] = df['Age'].str.replace(u'年', u'')
df['Occupied area'] = df['Occupied area'].str.replace(r'm.*', '', regex=True)  # "25.3m2" -> "25.3"
df['Location 12'] = df['Location 12'].str.replace(u'分', u'')  # strip "minutes"
df['Location 22'] = df['Location 22'].str.replace(u'分', u'')
df['Location 32'] = df['Location 32'].str.replace(u'分', u'')

df['Management fee'] = df['Management fee'].replace('-',0)
df['Security deposit'] = df['Security deposit'].replace('-',0)
df['key money'] = df['key money'].replace('-',0)

splitted5 = df['Location 11'].str.split('/', expand=True)
splitted5.columns = ['Route 1', 'Station 1']
splitted5['1 on foot'] = df['Location 12']
splitted6 = df['Location 21'].str.split('/', expand=True)
splitted6.columns = ['Route 2', 'Station 2']
splitted6['2 on foot'] = df['Location 22']
splitted7 = df['Location 31'].str.split('/', expand=True)
splitted7.columns = ['Route 3', 'Station 3']
splitted7['3 on foot'] = df['Location 32']

df = pd.concat([df, splitted5, splitted6, splitted7], axis=1)

df.drop(['Location 11','Location 12','Location 21','Location 22','Location 31','Location 32'], axis=1, inplace=True)

df['Rent'] = pd.to_numeric(df['Rent'])
df['Management fee'] = pd.to_numeric(df['Management fee'])
df['Security deposit'] = pd.to_numeric(df['Security deposit'])
df['key money'] = pd.to_numeric(df['key money'])
df['Age'] = pd.to_numeric(df['Age'])
df['Occupied area'] = pd.to_numeric(df['Occupied area'])

df['Rent'] = df['Rent'] * 10000
df['Security deposit'] = df['Security deposit'] * 10000
df['key money'] = df['key money'] * 10000

df['1 on foot'] = pd.to_numeric(df['1 on foot'])
df['2 on foot'] = pd.to_numeric(df['2 on foot'])
df['3 on foot'] = pd.to_numeric(df['3 on foot'])

splitted8 = df['hierarchy'].str.split('-', expand=True)       # e.g. "2階" or "1-2階"
splitted8.columns = ['Floor 1', 'Floor 2']
splitted8['Floor 1'] = splitted8['Floor 1'].str.replace(u'階', u'')   # strip "floor"
splitted8['Floor 1'] = splitted8['Floor 1'].str.replace(u'B', u'-')   # basement floors become negative
splitted8['Floor 1'] = splitted8['Floor 1'].str.replace(u'M', u'')    # drop the mezzanine marker
splitted8['Floor 1'] = pd.to_numeric(splitted8['Floor 1'])
df = pd.concat([df, splitted8], axis=1)

# Building heights look like "地下1地上10階建" or "平屋"; keep only the above-ground floor count
for i in range(1, 10):
    df['Building height'] = df['Building height'].str.replace(u'地下{}地上'.format(i), u'')
df['Building height'] = df['Building height'].str.replace(u'平屋', u'1')   # single-story
df['Building height'] = df['Building height'].str.replace(u'階建', u'')
df['Building height'] = pd.to_numeric(df['Building height'])

df = df.reset_index(drop=True)
df['Floor plan DK'] = 0
df['Floor plan K'] = 0
df['Floor plan L'] = 0
df['Floor plan S'] = 0
df['Floor plan'] = df['Floor plan'].str.replace(u'ワンルーム', u'1')  # "one-room" (studio) counts as 1 room

for x in range(len(df)):
    if 'DK' in df['Floor plan'][x]:
        df.loc[x,'Floor plan DK'] = 1
df['Floor plan'] = df['Floor plan'].str.replace(u'DK',u'')

for x in range(len(df)):
    if 'K' in df['Floor plan'][x]:
        df.loc[x,'Floor plan K'] = 1        
df['Floor plan'] = df['Floor plan'].str.replace(u'K',u'')

for x in range(len(df)):
    if 'L' in df['Floor plan'][x]:
        df.loc[x,'Floor plan L'] = 1        
df['Floor plan'] = df['Floor plan'].str.replace(u'L',u'')

for x in range(len(df)):
    if 'S' in df['Floor plan'][x]:
        df.loc[x,'Floor plan S'] = 1        
df['Floor plan'] = df['Floor plan'].str.replace(u'S',u'')

df['Floor plan'] = pd.to_numeric(df['Floor plan'])

# Addresses look like "東京都文京区本郷…"; split off the ward (区)
splitted9 = df['Street address'].str.split(u'区', expand=True)
splitted9.columns = ['Ward', 'Municipalities']
splitted9['Ward'] = splitted9['Ward'] + u'区'
splitted9['Ward'] = splitted9['Ward'].str.replace(u'東京都', u'')
df = pd.concat([df, splitted9], axis=1)

df['Rent+Management fee'] = df['Rent'] + df['Management fee']

# Keep an un-encoded copy (with the total rent column) for the search table at the end
df_for_search = df.copy()

df[['Route 1','Route 2','Route 3', 'Station 1', 'Station 2','Station 3','Municipalities']] = df[['Route 1','Route 2','Route 3', 'Station 1', 'Station 2','Station 3','Municipalities']].fillna("NAN")

oe = preprocessing.OrdinalEncoder()
df[['Route 1','Route 2','Route 3', 'Station 1', 'Station 2','Station 3','Municipalities']] = oe.fit_transform(df[['Route 1','Route 2','Route 3', 'Station 1', 'Station 2','Station 3','Municipalities']].values)

#Set maximum price
df = df[df['Rent+Management fee'] < 300000]

df = df[["Apartment name",'Rent+Management fee', 'Age', 'Building height', 'Floor 1',
       'Occupied area','Route 1','Route 2','Route 3', 'Station 1', 'Station 2','Station 3','1 on foot', '2 on foot','3 on foot','Floor plan', 'Floor planDK', 'Floor planK', 'Floor planL', 'Floor planS',
       'Municipalities']]

df.columns = ['name','real_rent','age', 'height', 'level','area', 'route_1','route_2','route_3','station_1','station_2','station_3','distance_1','distance_2','distance_3','room_number','DK','K','L','S','address']


pdp.ProfileReport(df)

Let me explain four points about this code.

・Information such as the "ward" and the "detail URL" is dropped because it is not needed for this analysis. For example, I ran this on Bunkyo Ward, so every row comes from Bunkyo Ward and the ward column carries no information. Note, however, that "ward" becomes very important if you analyze data for all 23 wards at once.

・ Information on "security deposit" and "key money" is not included. This is because, contrary to the previous reason, it is too easy to predict (it has no meaning). For most rentals, the security deposit and key money are set at the same level as or two to three times the rent. Even if you are told that the rent for a rent of 70,000 yen will be 70,000 yen, you won't get anything ...

・The objective variable is "rent + management fee" rather than the rent alone, because the actual monthly payment is rent plus the management fee. Furthermore, properties whose rent + management fee exceeds 300,000 yen are excluded by the line below.


df = df[df['Rent+Management fee'] < 300000]

When I ran the analysis with no upper limit and went bargain-hunting, I got results like "this property is predicted at 2 million yen but actually costs 1.5 million: a full 500,000 yen a month off!", which only made me sad. (I can't live in a 300,000-yen place either...) Feel free to change the threshold yourself.

・pandas-profiling is used for **exploratory data analysis (EDA)**. It is a library I am personally very fond of.

It shows an overview of the whole dataset... スクリーンショット 2019-12-17 20.06.56.png basic statistics for each variable, スクリーンショット 2019-12-17 20.07.09.png and even a correlation-coefficient matrix... Almost too convenient...

スクリーンショット 2019-12-17 20.13.33.png

**Addendum (2020/1/7): currently, it seems lightgbm cannot handle Japanese feature names.**

Feature creation

Next is feature creation, which also counts as preprocessing.

Here we create new features that look promising (that seem likely to explain the objective variable well) from the existing ones. For example, the code below creates and adds features such as:
・the area per room (area divided by the number of rooms)
・the product of the nearest-station label and the distance to that station (if one place is a 5-minute walk from a minor station and another a 5-minute walk from a major station, the latter should be pricier)

If you have any experience with Python, please try to create your own features.


df["per_area"] = df["area"]/df["room_number"]
df["height_level"] = df["height"]*df["level"]
df["area_height_level"] = df["area"]*df["height_level"]
df["distance_staion_1"] = df["station_1"]*df["distance_1"]

Machine learning

It's finally time to train!!! We build a rent-prediction model from the cleaned data and the features we created.

The machine learning algorithm used here is **lightgbm**. It is:
・highly accurate
・fast to train
and it is arguably the most popular method in competitions that are judged on predictive accuracy. (In an offline competition with a tight time limit that I once entered, all of the top 10 used this algorithm...!)

First, run the following from the terminal (Mac) or Command Prompt (Windows):

conda install -c conda-forge lightgbm

Let's install lightgbm.

Now let's run the code.


import matplotlib.pyplot as plt
import japanize_matplotlib
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

y = df["real_rent"]
X = df.drop(['real_rent',"name"], axis=1) 

X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.33, random_state=0)

lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

lgbm_params = {
        'objective': 'regression',
        'metric': 'rmse',
        'num_leaves':80
}

model = lgb.train(lgbm_params, lgb_train, valid_sets=lgb_eval, verbose_eval=-1)
    
y_pred = model.predict(X_test, num_iteration=model.best_iteration)

print(r2_score(y_test, y_pred))
lgb.plot_importance(model, figsize=(12, 6))
plt.show()


Result
スクリーンショット 2019-12-17 20.44.49.png

I will add some explanations here as well.

・The number printed as the result is the model's **coefficient of determination (R²)**, an index of its accuracy. It ranges from 0 to 1, and the **closer to 1, the better**. The value of "0.945" from my run shows that this model performs well; as a rough mental image, it predicts with about 94% accuracy. (A sketch for putting the error in yen follows after these notes.)

・feature_importance shows **how important each feature was to the model**. As you might expect, the occupied area and the building age matter most. ~~I'm glad the features I created worked quite well too.~~
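Since R² alone can be hard to picture, here is an optional extra cell (my addition, reusing y_test and y_pred from the block above) that expresses the error on a yen scale:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Root mean squared error: a rough "average miss" of the model, in yen
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f'RMSE: {rmse:,.0f} yen')
```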

Value-for-money property search

pred = pd.Series(model.predict(X, num_iteration=model.best_iteration),
                 index=df.index, name="Predicted value")  # keep df's index so rows line up
diff = pd.Series(df["real_rent"] - pred, name="Difference from the predicted value")
df_search = pd.concat([df_for_search, diff, pred], axis=1)
df_search = df_search.sort_values("Difference from the predicted value")
df_search = df_search[["Apartment name", 'Rent+Management fee', 'Predicted value', 'Difference from the predicted value', 'Detailed URL']]
df_search.to_csv('otoku.csv', sep='\t', encoding='utf-16')

Running the code above creates a csv file like the one below. It uses the trained model to predict the rent of every property and sorts the rows by the gap between the prediction and reality, that is, by how much of a bargain each listing is. In other words, the **higher a property appears in the csv file, the better the deal**.

スクリーンショット 2019-12-17 21.26.35.png For example, let's look into "A-standard Hongo 3-chome", which this model rates as the best bargain. You can jump to the listing from the detail URL column on the right. スクリーンショット 2019-12-17 21.27.47.png スクリーンショット 2019-12-17 21.35.16.png

・3 minutes' walk to the nearest station
・2LDK, 54㎡
・7 years old
・9th floor

For all that, 130,000 yen a month is certainly a very good deal... **Honestly, I badly want to live there...** The predicted value was 230,000 yen, and when I searched SUUMO with the conditions
・Bunkyo Ward
・built within the last 10 years
・within a 10-minute walk of the nearest station
・area of 45㎡ or more
there really were plenty of properties in the 200,000 to 230,000 range, confirming that this one is a great deal. (And no, it was not a stigmatized property.)

Please use this table to search for properties!

Actually ...

In this chapter I want to talk about how **easily you can be fooled if you have not studied machine learning**. The table we just made looks extremely effective and wonderful. **In fact, though, the numbers for 67% of the rows in that table are almost meaningless.** There may well be a property in that data that is an even better bargain than the current apparent winner.

The reason is "Because the model created using the training data is applied to the training data, it is not possible to recognize the profit as a profit."

Remember that a bargain is a property for which (predicted value) − (actual price, i.e. the teacher data) is large: predicted at 230,000 yen but actually renting for 130,000 means a 100,000-yen-a-month steal. But if that property was included in the training data, the model has effectively memorized it ("oh, I already know this one") and outputs a prediction almost equal to the actual price. In short, a kind of "cheating" occurs: no matter how difficult or special a problem is, you can solve it if you already know the answer. Since our goal was "let's find the bargains", this is obviously quite bad.

In other words, **the data you want to predict must not be included in the training data**.

One solution is to train and predict in several stages, as in the image below; a sketch of the idea follows. スクリーンショット 2019-12-18 1.27.04.png
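For the curious, here is a minimal sketch of that multi-stage idea, usually called out-of-fold prediction. It reuses X, y, and lgbm_params from the machine learning section; this is my illustration of the general technique, not code from the original article:

```python
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import KFold

# Out-of-fold prediction: every row is predicted by a model
# that never saw that row during training
oof_pred = np.zeros(len(X))
for train_idx, pred_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    fold_train = lgb.Dataset(X.iloc[train_idx], y.iloc[train_idx])
    fold_model = lgb.train(lgbm_params, fold_train)
    oof_pred[pred_idx] = fold_model.predict(X.iloc[pred_idx])

# Now (actual rent - oof_pred) measures "bargain-ness" without the cheating
```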

It's fine if your reaction is "I have no idea what that means": this part genuinely is a bit advanced. I wrote this chapter to show that **you can easily be fooled without knowledge of data analysis...** and that **data analysis runs deep**.

At the end

That's it for this article. Thank you so much for staying with me this far! I'll be very happy if you felt the fun and power of data analysis, as well as its pitfalls and depth. Data analysis matters so that you can make yourself happy, and avoid being made unhappy: humans are weaker against numbers than they realize. And if this made you want to study data analysis, a different article introduces self-study methods. Please have a look.

Reference

"I used machine learning to find bargain rental properties in Tokyo's 23 wards": most of the scraping code is based on this author's code. Some parts no longer worked as a straight copy-paste, so I have fixed them.
