[Python] Find your ideal property by scraping! How many minutes' walk from each property to your destination

Introduction

Nice to meet you. I'm due to move next year. "I don't do well on trains, so it would be great to live within walking distance of work." With that in mind I started looking for a property, but... **there is no feature that shows the distance from a destination you specify to each property!** SUUMO does have a map search, but it's hard to use because you can't see property details in the list view. So, in the spirit of "if it doesn't exist, build it", I gave it a try. This is my first time doing any scraping and I'm a complete beginner, so please bear with me. The code is horribly messy, but forgive me > < If you spot something I should have done differently, I'd love to hear it!

Ultimate goal

For every property within x minutes' walk of a specified address, get the walking time, property name, rent (including management and common-service fees), floor, floor plan, floor area, the URL of the listing on the rental site, and a Google Maps URL. Also, since you risk being overcharged if you don't compare quotes, gather information on the same property from multiple rental sites.

Today's goal point

For every property within x minutes' walk of a specified address, get the walking time, property name, rent (including management and common-service fees), floor, floor plan, floor area, the URL of the listing on the rental site, and a Google Maps URL.

(Screenshot: example of the final output table)

Please note that I don't explain the code in much detail here; I plan to write that up properly soon.

Premise

environment

Windows 10 version 20H2
Python 3.7.4
Jupyter Notebook

Required library

pip install beautifulsoup4
pip install selenium
pip install openpyxl
pip install xlwt
pip install pandas

Download Chrome driver

You need it to drive Chrome for scraping. You can find it by searching for "ChromeDriver download". Make sure the driver you download matches your installed version of Chrome.

Site to use

This time I'll use SUUMO as the rental site; in the future I plan to include other sites as well. Travel times are obtained from Google Maps.

Assumption

Work location: Tokyo Skytree
Desired area: Sumida Ward
Commute: within 15 minutes on foot
Floor plan: 1K
Rent: 80,000 yen or less

I will write the code assuming we search for properties that meet these conditions.

code

Importing modules and setting the destination

First, import the required modules.

import time
import pandas as pd

from bs4 import BeautifulSoup
from selenium import webdriver

Next, define the destination and the maximum walking time you'll allow.

#Destination: Tokyo Skytree
DESTINATION = '1-1-2 Oshiage, Sumida-ku, Tokyo'
#Maximum walking time allowed, in minutes
DURATION = 15

SUUMO site scraping

Then access the SUUMO site.

#SUUMO scraping
suumo_br = webdriver.Chrome('C:\\Users\\hogehoge\\chromedriver') #For Windows, pass the path to chromedriver
# suumo_br = webdriver.Chrome() #For Mac
suumo_br.implicitly_wait(3)
#URL of suumo property search results
url_suumo = "https://suumo.jp/jj/chintai/ichiran/FR301FC001/?ar=030&bs=040&ta=13&sc=13107&cb=0.0&ct=8.0&co=1&et=9999999&md=02&cn=9999999&mb=0&mt=9999999&shkr1=03&shkr2=03&shkr3=03&shkr4=03&fw2="
suumo_br.get(url_suumo)
time.sleep(5)
print('I visited SUUMO')

For the variable url_suumo, use the URL you get after narrowing the search down with your preferred conditions. In this article, we narrow it down to 1K properties in Sumida-ku, Tokyo, with rent of 80,000 yen or less.
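Incidentally, a long query string like this is easier to tweak if you rebuild it from a parameter dict with the standard library. In the sketch below, the comments on what each parameter means are my guesses from eyeballing the URL, not documented SUUMO behavior:

```python
from urllib.parse import urlencode

# Base of the SUUMO search-results endpoint (taken from the URL above)
BASE = 'https://suumo.jp/jj/chintai/ichiran/FR301FC001/'

# A subset of the query parameters; the meanings in the comments are guesses
params = {
    'ar': '030',   # region?
    'ta': '13',    # prefecture (Tokyo)?
    'sc': '13107', # municipality code (Sumida-ku)?
    'cb': '0.0',   # rent lower bound, in units of 10,000 yen?
    'ct': '8.0',   # rent upper bound (8.0 -> 80,000 yen)?
    'md': '02',    # floor plan (1K)?
}

url = BASE + '?' + urlencode(params)
print(url)
```

This keeps the conditions editable in one place instead of inside an opaque string.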

The SUUMO results page is now open. Next, parse its HTML to get the list of property addresses.

soup = BeautifulSoup(suumo_br.page_source, 'html.parser')
#List of property addresses
addresses = [c.get_text() for c in soup.find_all('li', class_='cassetteitem_detail-col1')]
print(addresses)

Output result

['Ryogoku 2 in Sumida-ku, Tokyo', 'Midori 4 Sumida-ku, Tokyo', '2 Higashimukojima, Sumida-ku, Tokyo', '2 Higashimukojima, Sumida-ku, Tokyo', '5 Yahiro, Sumida-ku, Tokyo', 'Midori 4 Sumida-ku, Tokyo', '3 Chitose, Sumida-ku, Tokyo', '1 Kyojima, Sumida-ku, Tokyo', '6 Higashimukojima, Sumida-ku, Tokyo', '6 Higashimukojima, Sumida-ku, Tokyo', '4 Tachibana, Sumida-ku, Tokyo', '5 Higashimukojima, Sumida-ku, Tokyo', '6 Yahiro, Sumida-ku, Tokyo', '4 Tatekawa, Sumida-ku, Tokyo', 'Ryogoku 2 in Sumida-ku, Tokyo', '2 Yahiro, Sumida-ku, Tokyo', '2 Sumida, Sumida-ku, Tokyo', '4 Tachibana, Sumida-ku, Tokyo', '4 Tachibana, Sumida-ku, Tokyo', '1 Tachibana, Sumida-ku, Tokyo', '1 Tachibana, Sumida-ku, Tokyo', '5 Mukojima, Sumida-ku, Tokyo', '1 Kikukawa, Sumida-ku, Tokyo', '6 Higashimukojima, Sumida-ku, Tokyo', '5 Yahiro, Sumida-ku, Tokyo', '1 Higashimukojima, Sumida-ku, Tokyo', '1 Higashimukojima, Sumida-ku, Tokyo', '2 Bunka, Sumida-ku, Tokyo', '5 Mukojima, Sumida-ku, Tokyo', '5 Yahiro, Sumida-ku, Tokyo']

Next, get the property names. On SUUMO, a station name sometimes appears in place of the building name, but I'll let that slide this time.

#Listing tables (used below to count listings per building)
properties = soup.find_all('table', class_='cassetteitem_other')

#Get the building names
buildings = [c.get_text() for c in soup.find_all('div', class_='cassetteitem_content-title')]
print(buildings)

Output result


['Exclusive ID Ryogoku', 'Soara Plaza Kinshicho', 'Tokyo Mito Street', 'Tobu Isesaki Line Hikifune Station 11 stories 16 years old', 'Dolce Forest', 'Katsu Palace', 'Higuchi Heights', 'Keisei Oshiage Line Keisei Hikifune Station 5 stories 30 years old', 'Graceful Place', 'Keisei Oshiage Line Yahiro Station 2 stories 8 years old', 'Lyric Court Hiraibashi', 'Crayno Bonnur II', 'Prosperity Sky Tree', 'Like Kikukawa East', 'JR Sobu Line Ryogoku Station 7 stories 12 years old', 'Rigale Sumida Levante', 'Tobu Isesaki Line Kanegafuchi Station 3 stories new construction', 'Tobu Kamedo Line Higashi Azuma Station 3 stories 13 years old', 'Rilassante Tachibana', 'Stall house', 'Tobu-Kameido Line Omurai Station 4 stories 2 years old', 'Live City Mukojima', 'Bonnard', 'Mallage Nine', 'Beakasa Hikifune', 'Belfort', 'Tobu Isesaki Line Hikifune Station 3 stories 3 years old', 'El Viento Earth Sumida Azuma', 'Tobu Isesaki Line Hikifune Station 3 stories 6 years old', 'Keisei Oshiage Line Yahiro Station 3 stories 15 years old']

The XPath differs depending on whether a building has multiple listings or just one. To handle both cases, first get the number of listings each building holds.

#Count the number of listings per building (each listing sits in its own <tbody>)
properties_num_list = []
for prop in properties:
    properties_num_list.append(str(prop).count('<tbody>'))
print(properties_num_list)
# [1, 12, 8, 8, 1, 2, 3, 1, 3, 3, 1, 1, 5, 4, 1, 4, 1, 1, 1, 1, 1, 5, 1, 1, 1, 2, 2, 3, 2, 1]
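The `<tbody>`-counting trick can be sanity-checked on a toy snippet. The HTML below is made up, but mirrors the structure where each listing sits in its own `<tbody>`:

```python
# Toy example: one building "table" holding three listing rows (made-up HTML)
sample_table = """
<table class="cassetteitem_other">
  <tbody><tr><td>listing 1</td></tr></tbody>
  <tbody><tr><td>listing 2</td></tr></tbody>
  <tbody><tr><td>listing 3</td></tr></tbody>
</table>
"""

# Same trick as above: count the opening <tbody> tags in the serialized table
num_listings = sample_table.count('<tbody>')
print(num_listings)  # 3
```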

Get travel times with Google Maps

Open Google Maps.

browser = webdriver.Chrome('C:\\Users\\hogehoge\\chromedriver')
# browser = webdriver.Chrome() #For Mac
browser.implicitly_wait(3)

#googlemap url
url_map = "https://www.google.co.jp/maps/dir///@35.7130112,139.8029662,14.95z?hl=ja"
browser.get(url_map)
time.sleep(3)
print('I visited Google Map')

Google Maps returns several route candidates with different travel times, so define a function that picks the shortest one.

#Function that returns the shortest travel time among the Google Maps route candidates
def shortest_path(travel_times):
    #Initialize above the allowed time, so routes that are all too long get rejected by the later filter
    min_travel_time = DURATION + 1
    for travel_time in travel_times:
        #With hl=ja, Google Maps renders durations like "15分"; strip the suffix
        travel_time = int(travel_time.replace('分', ''))
        if travel_time < min_travel_time:
            min_travel_time = travel_time
    return min_travel_time
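As a sanity check, here is a standalone variant of the same idea run on made-up duration strings (with `hl=ja`, Google Maps renders durations like `15分`); it initializes the minimum above `DURATION` so routes that are all too long get rejected by the later `> DURATION` filter:

```python
DURATION = 15  # allowed walking time, as above

def shortest_path(travel_times):
    # Start above the allowed time: if no route is short enough, the
    # returned value fails the "within DURATION" check downstream
    min_travel_time = DURATION + 1
    for travel_time in travel_times:
        travel_time = int(travel_time.replace('分', ''))  # "12分" -> 12
        if travel_time < min_travel_time:
            min_travel_time = travel_time
    return min_travel_time

print(shortest_path(['18分', '12分', '14分']))  # 12
print(shortest_path(['20分', '25分']))          # 16, later rejected as > DURATION
```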

Now get the travel time to the Skytree for each address in the list obtained earlier. The browser is driven automatically to retrieve each travel time.

element = browser.find_element_by_xpath('/html/body/jsl/div[3]/div[9]/div[3]/div[1]/div[2]/div/div[3]/div[1]/div[1]/div[2]/div/div/input')
element.clear()
element.send_keys(DESTINATION)

#Calculate the distance from the destination

min_travel_times = []
map_url = []

for i, address in enumerate(addresses):
    element = browser.find_element_by_xpath('/html/body/jsl/div[3]/div[9]/div[3]/div[1]/div[2]/div/div[3]/div[1]/div[2]/div[2]/div/div/input')
    element.clear()
    element.send_keys(address)
    search_button = browser.find_element_by_xpath('/html/body/jsl/div[3]/div[9]/div[3]/div[1]/div[2]/div/div[3]/div[1]/div[2]/div[2]/button[1]')
    search_button.click()
    time.sleep(3)
    soup = BeautifulSoup(browser.page_source, 'html.parser')
    #List of distances to destination
    travel_times = [c.get_text() for c in soup.find_all('div', class_='section-directions-trip-duration')]
    #Output the shortest distance to the target
    min_travel_times.append(shortest_path(travel_times))
    #Save google map url
    map_url.append(browser.current_url)  

Get the required property information via XPath

Set up the XPath fragments used to extract each piece of property information.

#Listing URL
#When a building has multiple listings
path_1 = '//*[@id="js-bukkenList"]/ul['
path_2 = ']/li['
path_3 = ']/div/div[2]/table/tbody['
path_4 = ']/tr/td[9]/a'
#When a building has a single listing
path_mono_1 = '//*[@id="js-bukkenList"]/ul['
path_mono_2 = ']/li['
path_mono_3 = ']/div/div[2]/table/tbody/tr/td[9]/a'

#Floor
#When a building has multiple listings
path_floor = ']/tr/td[3]'
#When a building has a single listing
path_mono_floor = ']/div/div[2]/table/tbody[1]/tr/td[3]'

#Rent
#When a building has multiple listings
path_rent = ']/tr/td[4]/ul/li[1]/span/span'
#When a building has a single listing
path_mono_rent = ']/div/div[2]/table/tbody/tr/td[4]/ul/li[1]/span/span'

#Management fee
#When a building has multiple listings
path_fee = ']/tr/td[4]/ul/li[2]/span'
#When a building has a single listing
path_mono_fee = ']/div/div[2]/table/tbody[1]/tr/td[4]/ul/li[2]/span'

#Floor plan
#When a building has multiple listings
path_plan = ']/tr/td[6]/ul/li[1]/span'
#When a building has a single listing
path_mono_plan = ']/div/div[2]/table/tbody[1]/tr/td[6]/ul/li[1]/span'

#Occupied area
#When a building has multiple listings
path_area = ']/tr/td[6]/ul/li[2]/span'
#When a building has a single listing
path_mono_area = ']/div/div[2]/table/tbody[1]/tr/td[6]/ul/li[2]/span'
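To make the fragment naming concrete, this is how the pieces are meant to be concatenated; the indices `i`, `j`, `k` below are just sample values:

```python
# The same fragments as above
path_1 = '//*[@id="js-bukkenList"]/ul['
path_2 = ']/li['
path_3 = ']/div/div[2]/table/tbody['
path_4 = ']/tr/td[9]/a'

# ul index i (page block), li index j (building), tbody index k (listing)
i, j, k = 1, 2, 3
full_xpath = path_1 + str(i) + path_2 + str(j) + path_3 + str(k) + path_4
print(full_xpath)
# //*[@id="js-bukkenList"]/ul[1]/li[2]/div/div[2]/table/tbody[3]/tr/td[9]/a
```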

Write a function that adds the rent and the management fee together.

#A function that adds rent and management fee together, in yen
def calc_rent(rent, fee):
    #SUUMO shows rent like "8.5万円" (units of 10,000 yen) and fees like "3000円"
    float_rent = float(rent.replace('万円', '')) * 10000
    float_fee = float(fee.replace('円', ''))
    return float_rent + float_fee
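As a quick standalone check of the same arithmetic (assuming the scraped strings keep SUUMO's Japanese suffixes, e.g. `8.5万円` for rent and `3000円` for the fee):

```python
def calc_rent(rent, fee):
    # "8.5万円" -> 85000.0 yen, "3000円" -> 3000.0 yen
    float_rent = float(rent.replace('万円', '')) * 10000
    float_fee = float(fee.replace('円', ''))
    return float_rent + float_fee

print(calc_rent('8.5万円', '3000円'))  # 88000.0
```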

Next, walk the XPaths to collect the information we want and build a DataFrame. (Sorry the code is messy > <)

df = pd.DataFrame(columns=['Building name', 'Commuting time', 'rent', 'Number of floors', 'Floor plan', 'Occupied area', 'map', 'url'])
i, j  = 1, 1
for prop_info in zip(min_travel_times, properties_num_list, buildings, map_url):
    if prop_info[0] > DURATION:
        #Continue if longer than allowed walking time
        print('Out of Duration')
        j += 1
        if j % 6 == 0:
            i += 1
            j = 1
        continue
    if prop_info[1] == 1:
        # url
        path = path_mono_1 + str(i) + path_mono_2 + str(j) + path_mono_3
        prop_url = suumo_br.find_element_by_xpath(path).get_attribute('href')
        #Number of floors
        path =  path_mono_1 + str(i) + path_mono_2 + str(j) + path_mono_floor
        prop_floor = suumo_br.find_element_by_xpath(path).text
        #rent
        path = path_mono_1 + str(i) + path_mono_2 + str(j) + path_mono_rent
        temp_rent = suumo_br.find_element_by_xpath(path).text
        #Management fee
        path = path_mono_1 + str(i) + path_mono_2 + str(j) + path_mono_fee
        temp_fee = suumo_br.find_element_by_xpath(path).text
        #Floor plan
        path = path_mono_1 + str(i) + path_mono_2 + str(j) + path_mono_plan
        prop_plan = suumo_br.find_element_by_xpath(path).text
        #Occupied area
        path = path_mono_1 + str(i) + path_mono_2 + str(j) + path_mono_area
        prop_area = suumo_br.find_element_by_xpath(path).text
        prop_rent = calc_rent(temp_rent, temp_fee)
        print(prop_url)
        df = df.append({'Building name': prop_info[2], 'Commuting time': prop_info[0], 'rent': prop_rent, 'Number of floors': prop_floor, 'Floor plan': prop_plan, 'Occupied area': prop_area, 'map': prop_info[3], 'url': prop_url}, ignore_index=True)
    else:
        for k in range(1, prop_info[1] + 1):
            path = path_1 + str(i) + path_2 + str(j) + path_3 + str(k) + path_4
            prop_url = suumo_br.find_element_by_xpath(path).get_attribute('href')
            #Number of floors
            path =  path_1 + str(i) + path_2 + str(j) + path_3 + str(k) + path_floor
            prop_floor = suumo_br.find_element_by_xpath(path).text
            #rent
            path = path_1 + str(i) + path_2 + str(j) + path_3 + str(k) + path_rent
            temp_rent = suumo_br.find_element_by_xpath(path).text
            #Management fee
            path = path_1 + str(i) + path_2 + str(j) + path_3 + str(k) + path_fee
            temp_fee = suumo_br.find_element_by_xpath(path).text
            #Floor plan
            path = path_1 + str(i) + path_2 + str(j) + path_3 + str(k) + path_plan
            prop_plan = suumo_br.find_element_by_xpath(path).text
            #Occupied area
            path = path_1 + str(i) + path_2 + str(j) + path_3 + str(k) + path_area
            prop_area = suumo_br.find_element_by_xpath(path).text
            prop_rent = calc_rent(temp_rent, temp_fee)
            print(prop_url)
            df = df.append({'Building name': prop_info[2], 'Commuting time': prop_info[0], 'rent': prop_rent, 'Number of floors': prop_floor, 'Floor plan': prop_plan, 'Occupied area': prop_area, 'map': prop_info[3], 'url': prop_url}, ignore_index=True)
    j += 1
    if j % 6 == 0:
        i += 1
        j = 1
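The `i`/`j` bookkeeping in the loop above walks the `ul[i]/li[j]` grid of the results page; the `j % 6` reset implies five `li` items per `ul` (an assumption about SUUMO's page layout). Extracting just that bookkeeping shows how the indices advance:

```python
# Reproduce only the index bookkeeping from the loop above
pairs = []
i, j = 1, 1
for _ in range(7):       # seven buildings processed
    pairs.append((i, j))
    j += 1
    if j % 6 == 0:       # after li[5], move on to the next ul
        i += 1
        j = 1
print(pairs)
# [(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (2, 1), (2, 2)]
```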

Let's check the contents.

df.head()

(Screenshot: preview of the resulting DataFrame)

Export to an Excel file

Finally, write the DataFrame out to an Excel file and we're done.

df.to_excel('sample.xlsx', encoding='utf_8_sig', index=False)

Future tasks

Currently only the first results page is loaded, so I'll improve the script to load all pages. I also plan to scrape other rental sites. I'd appreciate hearing about any other features or property information that would be useful.
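For the pagination task, one plausible approach is to append a page number to the search URL; note that the `page` parameter name is my assumption about SUUMO's query string, not something verified here:

```python
# Sketch: build per-page URLs from a search URL
# (the 'page' parameter name is an assumption, not verified against SUUMO)
def page_urls(base, n_pages):
    return [base + '&page=' + str(n) for n in range(1, n_pages + 1)]

base_url = 'https://suumo.jp/jj/chintai/ichiran/FR301FC001/?ar=030&ta=13&sc=13107&ct=8.0&md=02'
for u in page_urls(base_url, 3):
    print(u)
```

Each generated URL would then go through the same `get` / parse / extract steps as the first page.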
