[PYTHON] I was frustrated by Kaggle, so I tried to find a good rental property by scraping & machine learning

Introduction

Since I started studying machine learning I had taken in a lot of knowledge, so I figured I would put it to use on Kaggle. I tried, and got frustrated: I had no idea how to approach anything, and my motivation for learning dropped.

Thinking it would be useless to stay stuck, I went looking for something more interesting and found this article: [For super beginners] Python environment construction & scraping & machine learning & practical application that you can enjoy by moving with copy [Let's find a good rental property with SUUMO!]

I happened to be looking for a rental property myself, so the timing was perfect and I gave it a try.

I implemented it with the article as a reference, and here I introduce the various improvements I made along the way.

Who is this for?

For self-proclaimed machine-learning beginners like me: people who have absorbed a lot of input but don't know what to do next. I don't explain basic machine-learning terms and methods, so please bear with that.

My environment

Windows 10 Home, Python 3.7.3, Jupyter Notebook (for testing)

Source code: https://github.com/pattatto/scraping

Improvements

Let me give an overview first. The directory structure is as follows.
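(The original post shows this as a screenshot; the tree below is reconstructed from the paths the scripts use, and exact script locations in the repository may differ.)

otokuSearch/
├── data/            # suumo.csv (raw scraped data)
├── Preprocessing/   # Preprocessing.csv, df_for_search.csv
├── Featurevalue/    # Fettur_evalue.csv
├── model/           # model.pickle
└── Otoku_data/      # otoku.csv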

The overall flow is as follows:

  1. Data acquisition by scraping (suumo_getdata.py)
  2. Data preprocessing (Preprocessing.py)
  3. Feature creation (Feature_value.py)
  4. Model training (model_lightgbm.py)
  5. Output the prediction result from the trained model (Create_Otoku_data.py)

Originally everything was in a single script, but I split it into these modules. Each step writes its result to a CSV file, which is read back in whenever it is needed. A minimal sketch of a runner for the whole pipeline follows.
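This hypothetical run_all.py is my addition, not part of the repository (which runs each script by hand); it just executes the five scripts in order:

# run_all.py: hypothetical convenience runner; the repository runs each script by hand
import subprocess

scripts = [
    'suumo_getdata.py',      # 1. scraping
    'Preprocessing.py',      # 2. preprocessing
    'Feature_value.py',      # 3. feature creation
    'model_lightgbm.py',     # 4. model training
    'Create_Otoku_data.py',  # 5. prediction output
]

for script in scripts:
    print('running', script)
    subprocess.run(['python', script], check=True)  # stop the pipeline if a step fails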

Scraping

suumo_getdata.py


from bs4 import BeautifulSoup
import re
import requests
import time
import pandas as pd
from pandas import Series, DataFrame

url = input()

result = requests.get(url)
c = result.content

soup = BeautifulSoup(c, 'html.parser')#specify the parser explicitly

#Get the total number of result pages
summary = soup.find("div",{'id':'js-bukkenList'})
body = soup.find("body")
pages = body.find_all("div",{'class':'pagination pagination_set-nav'})
pages_text = str(pages)
pages_split = pages_text.split('</a></li>\n</ol>')
pages_split0 = pages_split[0]
pages_split1 = pages_split0[-3:]#the last page number sits in the final characters
pages_split2 = pages_split1.replace('>','')#with a 2-digit page count the slice picks up a '>', so strip it
pages_split3 = int(pages_split2)

urls = []

urls.append(url)

#From the second page on, '&page=N' is appended to the URL
for i in range(pages_split3-1):
    pg = str(i+2)
    url_page = url + '&page=' + pg
    urls.append(url_page)

names = []
addresses = []
buildings = []
locations0 = []
locations1 = []
locations2 = []
ages = []
heights = []
floors = []
rent = []
admin = []
others = []
floor_plans = []
areas = []
detail_urls = []


for url in urls:
    result = requests.get(url)
    c = result.content
    soup = BeautifulSoup(c, 'html.parser')
    summary = soup.find("div",{'id':'js-bukkenList'})
    apartments = summary.find_all("div",{'class':'cassetteitem'})

    for apartment in apartments:

        room_number = len(apartment.find_all('tbody'))

        name = apartment.find('div', class_='cassetteitem_content-title').text
        address = apartment.find('li', class_='cassetteitem_detail-col1').text
        building = apartment.find('span', class_='ui-pct ui-pct--util1').text
        #Append the building-level info once per room in the building
        for i in range(room_number):
            names.append(name)
            addresses.append(address)
            buildings.append(building)

        sublocation = apartment.find('li', class_='cassetteitem_detail-col2')
        cols = sublocation.find_all('div')
        for i in range(len(cols)):
            text = cols[i].find(text=True)
            #Append to each list once per room
            for j in range(room_number):
                if i == 0:
                    locations0.append(text)
                elif i == 1:
                    locations1.append(text)
                elif i == 2:
                    locations2.append(text)

        age_and_height = apartment.find('li', class_='cassetteitem_detail-col3')
        age = age_and_height('div')[0].text
        height = age_and_height('div')[1].text

        for i in range(room_number):
            ages.append(age)
            heights.append(height)

        table = apartment.find('table')
        rows = []
        rows.append(table.find_all('tr'))#Information for each room

        data = []
        for row in rows:
            for tr in row:
                cols = tr.find_all('td')#the td cells hold each room's details
                if len(cols) != 0:
                    _floor = cols[2].text
                    _floor = re.sub('[\r\n\t]', '', _floor)

                    _rent_cell = cols[3].find('ul').find_all('li')
                    _rent = _rent_cell[0].find('span').text#rent
                    _admin = _rent_cell[1].find('span').text#Management fee

                    _deposit_cell = cols[4].find('ul').find_all('li')
                    _deposit = _deposit_cell[0].find('span').text
                    _reikin = _deposit_cell[1].find('span').text
                    _others = _deposit + '/' + _reikin

                    _floor_cell = cols[5].find('ul').find_all('li')
                    _floor_plan = _floor_cell[0].find('span').text
                    _area = _floor_cell[1].find('span').text

                    _detail_url = cols[8].find('a')['href']
                    _detail_url = 'https://suumo.jp' + _detail_url

                    text = [_floor, _rent, _admin, _others, _floor_plan, _area, _detail_url]
                    data.append(text)

        for row in data:
            floors.append(row[0])
            rent.append(row[1])
            admin.append(row[2])
            others.append(row[3])
            floor_plans.append(row[4])
            areas.append(row[5])
            detail_urls.append(row[6])


        time.sleep(3)

names = Series(names)
addresses = Series(addresses)
buildings = Series(buildings)
locations0 = Series(locations0)
locations1 = Series(locations1)
locations2 = Series(locations2)
ages = Series(ages)
heights = Series(heights)
floors = Series(floors)
rent = Series(rent)
admin = Series(admin)
others = Series(others)
floor_plans = Series(floor_plans)
areas = Series(areas)
detail_urls = Series(detail_urls)

suumo_df = pd.concat([names, addresses, buildings, locations0, locations1, locations2, ages, heights, floors, rent, admin, others, floor_plans, areas, detail_urls], axis=1)

suumo_df.columns=['Apartment name','Street address', 'Building type', 'Location 1','Location 2','Location 3','Age','Building height','hierarchy','Rent','Management fee', 'Deposit/Key money/Guarantee/Amortization','Floor plan','Occupied area', 'Detailed URL']

suumo_df.to_csv('suumo.csv', sep = '\t', encoding='utf-16', header=True, index=False)

I additionally scrape the building type (condominium, apartment, and so on). When I ran the original code, quite a few apartments sat at the top of the bargain list; under otherwise equal conditions an apartment is naturally cheaper, so the type needs to be a feature.

Preprocessing

Preprocessing.py


import pandas as pd
import numpy as np
from sklearn.preprocessing import OrdinalEncoder
from sklearn import preprocessing
import pandas_profiling as pdp

df = pd.read_csv('otokuSearch/data/suumo.csv', sep='\t', encoding='utf-16')

splitted1 = df['Location 1'].str.split('Walk', expand=True)
splitted1.columns = ['Location 11', 'Location 12']
splitted2 = df['Location 2'].str.split('Walk', expand=True)
splitted2.columns = ['Location 21', 'Location 22']
splitted3 = df['Location 3'].str.split('Walk', expand=True)
splitted3.columns = ['Location 31', 'Location 32']

splitted4 = df['Deposit/Key money/Guarantee/Amortization'].str.split('/', expand=True)
splitted4.columns = ['Security deposit', 'key money']

df = pd.concat([df, splitted1, splitted2, splitted3, splitted4], axis=1)

df.drop(['Location 1','Location 2','Location 3','Deposit/Key money/Guarantee/Amortization'], axis=1, inplace=True)

df = df.dropna(subset=['Rent'])

df['Rent'] = df['Rent'].str.replace(u'Ten thousand yen', u'')
df['Security deposit'] = df['Security deposit'].str.replace(u'Ten thousand yen', u'')
df['key money'] = df['key money'].str.replace(u'Ten thousand yen', u'')
df['Management fee'] = df['Management fee'].str.replace(u'Yen', u'')
df['Age'] = df['Age'].str.replace(u'New construction', u'0')
df['Age'] = df['Age'].str.replace(u'Over 99 years', u'0')
df['Age'] = df['Age'].str.replace(u'Built', u'')
df['Age'] = df['Age'].str.replace(u'Year', u'')
df['Occupied area'] = df['Occupied area'].str.replace(u'm', u'')
df['Location 12'] = df['Location 12'].str.replace(u'Minutes', u'')
df['Location 22'] = df['Location 22'].str.replace(u'Minutes', u'')
df['Location 32'] = df['Location 32'].str.replace(u'Minutes', u'')

df['Management fee'] = df['Management fee'].replace('-',0)
df['Security deposit'] = df['Security deposit'].replace('-',0)
df['key money'] = df['key money'].replace('-',0)

splitted5 = df['Location 11'].str.split('/', expand=True)
splitted5.columns = ['Route 1', 'Station 1']
splitted5['1 walk from the station'] = df['Location 12']
splitted6 = df['Location 21'].str.split('/', expand=True)
splitted6.columns = ['Route 2', 'Station 2']
splitted6['2 on foot from the station'] = df['Location 22']
splitted7 = df['Location 31'].str.split('/', expand=True)
splitted7.columns = ['Route 3', 'Station 3']
splitted7['3 on foot from the station'] = df['Location 32']

df = pd.concat([df, splitted5, splitted6, splitted7], axis=1)

df.drop(['Location 11','Location 12','Location 21','Location 22','Location 31','Location 32'], axis=1, inplace=True)

df['Rent'] = pd.to_numeric(df['Rent'])
df['Management fee'] = pd.to_numeric(df['Management fee'])
df['Security deposit'] = pd.to_numeric(df['Security deposit'])
df['key money'] = pd.to_numeric(df['key money'])
df['Age'] = pd.to_numeric(df['Age'])
df['Occupied area'] = pd.to_numeric(df['Occupied area'])

df['Rent'] = df['Rent'] * 10000
df['Security deposit'] = df['Security deposit'] * 10000
df['key money'] = df['key money'] * 10000

df['1 walk from the station'] = pd.to_numeric(df['1 walk from the station'])
df['2 on foot from the station'] = pd.to_numeric(df['2 on foot from the station'])
df['3 on foot from the station'] = pd.to_numeric(df['3 on foot from the station'])

splitted8 = df['hierarchy'].str.split('-', expand=True)
splitted8.columns = ['Floor 1', 'Floor 2']
splitted8['Floor 1'] = splitted8['Floor 1'].str.replace(u'Floor', u'')
splitted8['Floor 1'] = splitted8['Floor 1'].str.replace(u'B', u'-')
splitted8['Floor 1'] = splitted8['Floor 1'].str.replace(u'M', u'')
splitted8['Floor 1'] = pd.to_numeric(splitted8['Floor 1'])
df = pd.concat([df, splitted8], axis=1)

df['Building height'] = df['Building height'].str.replace(u'Underground 1 above ground', u'')
df['Building height'] = df['Building height'].str.replace(u'Underground 2 above ground', u'')
df['Building height'] = df['Building height'].str.replace(u'Underground 3 above ground', u'')
df['Building height'] = df['Building height'].str.replace(u'Underground 4 above ground', u'')
df['Building height'] = df['Building height'].str.replace(u'Underground 5 above ground', u'')
df['Building height'] = df['Building height'].str.replace(u'Underground 6 above ground', u'')
df['Building height'] = df['Building height'].str.replace(u'Underground 7 above ground', u'')
df['Building height'] = df['Building height'].str.replace(u'Underground 8 above ground', u'')
df['Building height'] = df['Building height'].str.replace(u'Underground 9 above ground', u'')
df['Building height'] = df['Building height'].str.replace(u'One-story', u'1')
df['Building height'] = df['Building height'].str.replace(u'Floor', u'')
df['Building height'] = pd.to_numeric(df['Building height'])

df = df.reset_index(drop=True)
df['Floor plan DK'] = 0
df['Floor plan K'] = 0
df['Floor plan L'] = 0
df['Floor plan S'] = 0
df['Floor plan'] = df['Floor plan'].str.replace(u'Studio', u'1')

for x in range(len(df)):
    if 'DK' in df['Floor plan'][x]:
        df.loc[x,'Floor plan DK'] = 1
df['Floor plan'] = df['Floor plan'].str.replace(u'DK',u'')

for x in range(len(df)):
    if 'K' in df['Floor plan'][x]:
        df.loc[x,'Floor plan K'] = 1
df['Floor plan'] = df['Floor plan'].str.replace(u'K',u'')

for x in range(len(df)):
    if 'L' in df['Floor plan'][x]:
        df.loc[x,'Floor plan L'] = 1
df['Floor plan'] = df['Floor plan'].str.replace(u'L',u'')

for x in range(len(df)):
    if 'S' in df['Floor plan'][x]:
        df.loc[x,'Floor plan S'] = 1
df['Floor plan'] = df['Floor plan'].str.replace(u'S',u'')

df['Floor plan'] = pd.to_numeric(df['Floor plan'])

splitted9 = df['Street address'].str.split('Ward', expand=True)[[0]]#keep only the part before 'Ward'
splitted9.columns = ['Municipalities']
#splitted9['Ward'] = splitted9['Ward'] + 'Ward'
#splitted9['Ward'] = splitted9['Ward'].str.replace('Tokyo','')
df = pd.concat([df, splitted9], axis=1)

splitted10 = df['Station 1'].str.split('bus', expand=True)
splitted10.columns = ['Station 1', 'Bus 1']
splitted11 = df['Station 2'].str.split('bus', expand=True)
splitted11.columns = ['Station 2', 'Bus 2']
splitted12 = df['Station 3'].str.split('bus', expand=True)
splitted12.columns = ['Station 3', 'Bus 3']

splitted13 = splitted10['Bus 1'].str.split('Minutes\(bus stop\)', expand=True)
splitted13.columns = ['Bus time 1', 'Bus stop 1']
splitted14 = splitted11['Bus 2'].str.split('Minutes\(bus stop\)', expand=True)
splitted14.columns = ['Bus time 2', 'Bus stop 2']
splitted15 = splitted12['Bus 3'].str.split('Minutes\(bus stop\)', expand=True)
splitted15.columns = ['Bus time 3', 'Bus stop 3']

splitted16 = pd.concat([splitted10, splitted11, splitted12, splitted13, splitted14, splitted15], axis=1)
splitted16.drop(['Bus 1','Bus 2','Bus 3'], axis=1, inplace=True)

df.drop(['Station 1','Station 2','Station 3'], axis=1, inplace=True)
df = pd.concat([df, splitted16], axis=1)

splitted17 = df['Station 1'].str.split('car', expand=True)
splitted17.columns = ['Station 1', 'Car 1']
splitted18 = df['Station 2'].str.split('car', expand=True)
splitted18.columns = ['Station 2', 'Car 2']
splitted19 = df['Station 3'].str.split('car', expand=True)
splitted19.columns = ['Station 3', 'Car 3']

splitted20 = splitted17['Car 1'].str.split('Minutes', expand=True)
splitted20.columns = ['Car time 1', 'Vehicle distance 1']
splitted21 = splitted18['Car 2'].str.split('Minutes', expand=True)
splitted21.columns = ['Car time 2', 'Vehicle distance 2']
splitted22 = splitted19['Car 3'].str.split('Minutes', expand=True)
splitted22.columns = ['Car time 3', 'Vehicle distance 3']

splitted23 = pd.concat([splitted17, splitted18, splitted19, splitted20, splitted21, splitted22], axis=1)
splitted23.drop(['Car 1','Car 2','Car 3'], axis=1, inplace=True)

df.drop(['Station 1','Station 2','Station 3'], axis=1, inplace=True)
df = pd.concat([df, splitted23], axis=1)

df['Vehicle distance 1'] = df['Vehicle distance 1'].str.replace(u'\(', u'')
df['Vehicle distance 1'] = df['Vehicle distance 1'].str.replace(u'km\)', u'')
df['Vehicle distance 2'] = df['Vehicle distance 2'].str.replace(u'\(', u'')
df['Vehicle distance 2'] = df['Vehicle distance 2'].str.replace(u'km\)', u'')
df['Vehicle distance 3'] = df['Vehicle distance 3'].str.replace(u'\(', u'')
df['Vehicle distance 3'] = df['Vehicle distance 3'].str.replace(u'km\)', u'')

df[['Route 1','Route 2','Route 3', 'Station 1', 'Station 2','Station 3','Municipalities', 'Bus stop 1', 'Bus stop 2', 'Bus stop 3']] = df[['Route 1','Route 2','Route 3', 'Station 1', 'Station 2','Station 3','Municipalities', 'Bus stop 1', 'Bus stop 2', 'Bus stop 3']].fillna("NAN")
df[['Bus time 1','Bus time 2','Bus time 3']] = df[['Bus time 1','Bus time 2','Bus time 3']].fillna(0)#missing values would break the feature calculations later, so fill with 0
df['Bus time 1'] = df['Bus time 1'].astype(float)
df['Bus time 2'] = df['Bus time 2'].astype(float)
df['Bus time 3'] = df['Bus time 3'].astype(float)

oe = preprocessing.OrdinalEncoder()
df[['Building type', 'Route 1','Route 2','Route 3', 'Station 1', 'Station 2','Station 3','Municipalities', 'Bus stop 1', 'Bus stop 2', 'Bus stop 3']] = oe.fit_transform(df[['Building type', 'Route 1','Route 2','Route 3', 'Station 1', 'Station 2','Station 3','Municipalities', 'Bus stop 1', 'Bus stop 2', 'Bus stop 3']].values)

df['Rent+Management fee'] = df['Rent'] + df['Management fee']

df_for_search = df.copy()

#Set maximum price
df = df[df['Rent+Management fee'] < 300000]

df = df[["Apartment name",'Building type', 'Rent+Management fee', 'Age', 'Building height', 'Floor 1',
       'Occupied area','Route 1','Route 2','Route 3', 'Station 1', 'Station 2','Station 3','1 walk from the station', '2 on foot from the station','3 on foot from the station','Floor plan', 'Floor planDK', 'Floor planK', 'Floor planL', 'Floor planS',
       'Municipalities', 'Bus stop 1', 'Bus stop 2', 'Bus stop 3', 'Bus time 1','Bus time 2','Bus time 3']]

df.columns = ['name', 'building', 'real_rent','age', 'hight', 'level','area', 'route_1','route_2','route_3','station_1','station_2','station_3','station_wolk_1','station_wolk_2','station_wolk_3','room_number','DK','K','L','S','adress', 'bus_stop1', 'bus_stop2', 'bus_stop3', 'bus_time1', 'bus_time2', 'bus_time3']#note: the 'hight'/'wolk'/'adress' spellings are kept as-is because the later scripts reference them


#pdp.ProfileReport(df)
df.to_csv('otokuSearch/Preprocessing/Preprocessing.csv', sep = '\t', encoding='utf-16', header=True, index=False)
df_for_search.to_csv('otokuSearch/Preprocessing/df_for_search.csv', sep = '\t', encoding='utf-16', header=True, index=False)

This step took the most work. The improvement concerns the station columns. Originally the data held only the station and the walking distance to it, but the raw station strings can also contain a bus ride (time and bus stop) or a drive time by car to the nearest station.

For example, take the string "Kawaguchi Station Bus 9 minutes (bus stop) Motogo Junior High School 1 minute walk". After the original preprocessing it ends up like this:

Station 1: Kawaguchi Station Bus 9 minutes (bus stop) Motogo Junior High School
1 walk from the station: 1

That is, the 'Station 1' column holds "Kawaguchi Station Bus 9 minutes (bus stop) Motogo Junior High School", and the walk-time column just gets the "1" from "1 minute walk", which makes the listing look like a 1-minute walk from the station. Worse, because the bus information is embedded in the station name, label encoding later treats it as a different station from plain "Kawaguchi Station". So I split this into bus stop, bus ride time, and car travel time, as the toy example below shows.
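Here is a toy version of that split, using the translated markers from the script above (the actual SUUMO strings, and the script's literals, are Japanese):

# Toy illustration of the bus-info split (translated markers; the real data is Japanese)
import pandas as pd

s = pd.Series(['Kawaguchi Station bus9Minutes(bus stop)Motogo Junior High School'])

station_bus = s.str.split('bus', expand=True)   # cut the string at the bus marker
station_bus.columns = ['Station 1', 'Bus 1']
print(station_bus['Station 1'][0])              # 'Kawaguchi Station '

bus = station_bus['Bus 1'].str.split(r'Minutes\(bus stop\)', expand=True)
bus.columns = ['Bus time 1', 'Bus stop 1']
print(bus['Bus time 1'][0])                     # '9'
print(bus['Bus stop 1'][0])                     # 'Motogo Junior High School'

After this, the station column holds only the clean station name, so label encoding maps every Kawaguchi Station row to the same code.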

The newly added building type and the bus stops are also label encoded.
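For reference, OrdinalEncoder simply maps each distinct string to an integer. A minimal example with made-up categories:

# Minimal OrdinalEncoder example (category values are made up)
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

oe = OrdinalEncoder()
X = np.array([['Apartment'], ['Mansion'], ['Apartment'], ['Terrace house']])
print(oe.fit_transform(X).ravel())  # [0. 1. 0. 2.], integers in sorted category order
print(oe.categories_)               # the learned string-to-integer mapping

The integers carry no real order, but tree models such as LightGBM only split on thresholds, so they can still make use of such a column.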

Feature creation

Feature_value.py


import pandas as pd
import numpy as np

df = pd.read_csv('otokuSearch/Preprocessing/Preprocessing.csv', sep='\t', encoding='utf-16')

df["per_area"] = df["area"]/df["room_number"]
df["hight_level"] = df["hight"]*df["level"]
df["area_hight_level"] = df["area"]*df["hight_level"]
df["distance_staion_1"] = df["station_1"]*df["station_wolk_1"]+df["bus_stop1"]*df["bus_time1"]
df["distance_staion_2"] = df["station_2"]*df["station_wolk_2"]+df["bus_stop2"]*df["bus_time2"]
df["distance_staion_3"] = df["station_3"]*df["station_wolk_3"]+df["bus_stop3"]*df["bus_time3"]

df.to_csv('otokuSearch/Featurevalue/Fettur_evalue.csv', sep = '\t', encoding='utf-16', header=True, index=False)

Using the newly created bus columns, I added features for the effective distance to each station.

At first I had also created a feature like this: df["per_real_rent"] = df["real_rent"]/df["area"]. On closer inspection I deleted it, because it is derived from the objective variable (the very rent we are trying to predict). When I later visualized feature importance during training, its importance was outstanding, and at first I was delighted to have found such a strong feature...
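To see why such a feature dominates, here is a small synthetic sketch (made-up data, not this article's dataset). The leaky column is computed from the target, so the model leans on it and the importance looks spectacular:

# Synthetic sketch of target leakage
import numpy as np
import pandas as pd
import lightgbm as lgb

rng = np.random.default_rng(0)
n = 1000
area = rng.uniform(15, 80, n)
age = rng.uniform(0, 40, n)
rent = area * 3000 - age * 500 + rng.normal(0, 5000, n)  # fake rent

X = pd.DataFrame({'area': area, 'age': age})
X['per_real_rent'] = rent / area  # leaky: derived from the target itself

model = lgb.LGBMRegressor(n_estimators=100)
model.fit(X, rent)
print(dict(zip(X.columns, model.feature_importances_)))
# expect the leaky column to dominate: together with 'area' it reconstructs the target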

Model training

model_lightgbm.py


#Data analysis library
import pandas as pd
import numpy as np

#Data visualization library
import matplotlib.pyplot as plt
import seaborn as sns

#Gradient boosting library (LightGBM)
import lightgbm as lgb

#Library to separate training data and model evaluation data for cross-validation
from sklearn.model_selection import KFold

#Evaluation metrics
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
#Library required to save the model
import pickle

#A function that plots predicted values against true values
def True_Pred_map(pred_df):
    RMSE = np.sqrt(mean_squared_error(pred_df['true'], pred_df['pred']))
    R2 = r2_score(pred_df['true'], pred_df['pred'])
    plt.figure(figsize=(8,8))
    ax = plt.subplot(111)
    ax.scatter('true', 'pred', data=pred_df)
    ax.set_xlabel('True Value', fontsize=15)
    ax.set_ylabel('Pred Value', fontsize=15)
    ax.set_xlim(pred_df.min().min()-0.1 , pred_df.max().max()+0.1)
    ax.set_ylim(pred_df.min().min()-0.1 , pred_df.max().max()+0.1)
    x = np.linspace(pred_df.min().min()-0.1, pred_df.max().max()+0.1, 2)
    y = x
    ax.plot(x,y,'r-')
    plt.text(0.1, 0.9, 'RMSE = {}'.format(str(round(RMSE, 5))), transform=ax.transAxes, fontsize=15)
    plt.text(0.1, 0.8, 'R^2 = {}'.format(str(round(R2, 5))), transform=ax.transAxes, fontsize=15)


df = pd.read_csv('otokuSearch/Featurevalue/Fettur_evalue.csv', sep='\t', encoding='utf-16')

#kf :defines how the data is split; this time 10 folds with shuffling
kf = KFold(n_splits=10, shuffle=True, random_state=1)

#predicted_df :a starter frame that makes concatenating each fold's predictions easier (its dummy row is dropped later)
predicted_df = pd.DataFrame({'index':0, 'pred':0}, index=[1])

#The parameters have not been adjusted
lgbm_params = {
        'objective': 'regression',
        'metric': 'rmse',
        'num_leaves':80
}

#10-fold cross-validation, so the loop below runs 10 times
#kf.split yields, for each fold, the row indices of the training data (train_index) and of the validation data (val_index)
for train_index, val_index in kf.split(df.index):

    #Use the fold indices to split into training/validation sets and explanatory/objective variables
    X_train = df.drop(['real_rent','name'], axis=1).iloc[train_index]
    y_train = df['real_rent'].iloc[train_index]
    X_test = df.drop(['real_rent','name'], axis=1).iloc[val_index]
    y_test = df['real_rent'].iloc[val_index]

    #Process into a dataset for speeding up LightGBM
    lgb_train = lgb.Dataset(X_train, y_train)
    lgb_eval = lgb.Dataset(X_test, y_test)

    #LightGBM model building
    gbm = lgb.train(lgbm_params,
                lgb_train,
                valid_sets=(lgb_train, lgb_eval),
                num_boost_round=10000,
                early_stopping_rounds=100,
                verbose_eval=50)

    #Feed the validation features to the model and output predictions
    predicted = gbm.predict(X_test)

    #temp_df :pair each prediction with its original row index so it can be matched to the true value
    temp_df = pd.DataFrame({'index':X_test.index, 'pred':predicted})

    #predicted_df :append this fold's temp_df; after the first lap this accumulates all folds' predictions (the dummy row is dropped later)
    predicted_df = pd.concat([predicted_df, temp_df], axis=0)

predicted_df = predicted_df.sort_values('index').reset_index(drop=True).drop(index=[0]).set_index('index')
predicted_df = pd.concat([predicted_df, df['real_rent']], axis=1).rename(columns={'real_rent' : 'true'})

True_Pred_map(predicted_df)

print(r2_score(y_test, predicted))  # R^2 of the last fold only
lgb.plot_importance(gbm, figsize=(12, 6))
plt.show()

#Save model
with open('otokuSearch/model/model.pickle', mode='wb') as fp:
    pickle.dump(gbm, fp)

As the original article admits, the training data and the data to be predicted are almost identical, which is effectively a "cheat" state.

So I implemented cross-validation. I borrowed the code from this article, which explains it very clearly: https://rin-effort.com/2019/12/31/machine-learning-8/

To briefly explain cross-validation (a minimal code sketch follows the list):

  1. Split the training data (into 10 parts this time)
  2. Train with one part held out as validation data and the rest as training data
  3. Evaluate on the held-out validation data
  4. Repeat training → evaluation, rotating the validation part, once for each split
  5. Average those scores to evaluate the model
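A minimal sketch of the same procedure using scikit-learn's cross_val_score (the script above loops manually instead, so that it can also collect per-fold predictions for the scatter plot):

import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import KFold, cross_val_score

df = pd.read_csv('otokuSearch/Featurevalue/Fettur_evalue.csv', sep='\t', encoding='utf-16')
X = df.drop(['real_rent', 'name'], axis=1)
y = df['real_rent']

kf = KFold(n_splits=10, shuffle=True, random_state=1)
model = lgb.LGBMRegressor(objective='regression', num_leaves=80)
print(cross_val_score(model, X, y, cv=kf, scoring='r2').mean())  # average R^2 over the 10 folds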

I also made it save the trained model, since retraining every time is a hassle.

When executed, it outputs a true-vs-predicted map and a feature-importance graph, just like the original code (images omitted here). The correlation is quite good! The newly added building type, however, does not contribute much.

Good-deal property data creation

Create_Otoku_data.py


import pandas as pd
import numpy as np
import lightgbm as lgb
import pickle

#Data reading
df = pd.read_csv('otokuSearch/Featurevalue/Fettur_evalue.csv', sep='\t', encoding='utf-16')

#Loading trained model
with open('otokuSearch/model/model.pickle', mode='rb') as fp:
    gbm = pickle.load(fp)


#Create the good-deal property data
y = df["real_rent"]
X = df.drop(['real_rent',"name"], axis=1)
pred = list(gbm.predict(X, num_iteration=gbm.best_iteration))
pred = pd.Series(pred, name="Predicted value")
diff = pd.Series(df["real_rent"]-pred,name="Difference from the predicted value")
df_for_search = pd.read_csv('otokuSearch/Preprocessing/df_for_search.csv', sep='\t', encoding='utf-16')
df_for_search['Rent+Management fee'] = df_for_search['Rent'] + df_for_search['Management fee']
df_search = pd.concat([df_for_search,diff,pred], axis=1)
df_search = df_search.sort_values("Difference from the predicted value")
df_search = df_search[["Apartment name",'Rent+Management fee', 'Predicted value',  'Predicted valueとの差', 'Detailed URL', 'Floor plan', 'Occupied area', 'hierarchy', 'Station 1', '1 on foot', 'Floor planDK', 'Floor planK', 'Floor planL']]
df_search.to_csv('otokuSearch/Otoku_data/otoku.csv', sep = '\t',encoding='utf-16')

No major changes here: we just load the trained model and add columns to the output data.

Here is the property that shines in first place in the output file (screenshot omitted): a 2-minute walk from the station, 3LDK, and rent of 82,000 yen is wonderful! The line is a little rural, though...

Result

It is now possible to produce reasonable good-deal property data. However, I did no EDA (exploratory data analysis) at all, so more analysis is needed to improve accuracy; a quick profiling sketch follows. Beyond that, it would be good to scrape more fields, since listings also note things like city gas and air conditioning, and there is no end to work like hyperparameter tuning. Still, I did what I wanted to do, so: good enough!
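As a cheap first step, Preprocessing.py already imports pandas_profiling; generating a report is essentially one line (a sketch assuming a v2-era pandas_profiling; the output file name is my choice):

import pandas as pd
import pandas_profiling as pdp

df = pd.read_csv('otokuSearch/Preprocessing/Preprocessing.csv', sep='\t', encoding='utf-16')
pdp.ProfileReport(df).to_file('report.html')  # browseable EDA report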

Managing all this code finally made me understand why git is necessary. Between picking up Atom and learning how to use git, this project brought many side benefits, and I realized again that the best way to learn programming is through practice.

From now on

Next I would like to turn this model into a system using Django: enter the URL of a listing page and get back the predicted rent and how much of a bargain it is. If I manage to build it, I will write it up.
