Hello, this is @0yan. Today, as hands-on practice in data analysis, I tried combining web scraping with data analysis. Since I would like to buy a second-hand condominium someday, I chose listings of second-hand condominiums along the train lines I am interested in.
The packages used are as follows.
import datetime
import re
import time
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import pandas as pd
import requests
The page I scraped listed the properties one after another, about 30 per page. In the end, I wanted to end up with a list like this:
[{'Property Name': spam, 'Selling price': spam, 'location': spam, 'Along the line / station': spam,
'Occupied area': spam, 'Floor plan': spam, 'balcony': spam, 'Date of construction': spam},
{'Property Name': spam, 'Selling price': spam, 'location': spam, 'Along the line / station': spam,
'Occupied area': spam, 'Floor plan': spam, 'balcony': spam, 'Date of construction': spam},
...,
{'Property Name': spam, 'Selling price': spam, 'location': spam, 'Along the line / station': spam,
'Occupied area': spam, 'Floor plan': spam, 'balcony': spam, 'Date of construction': spam}]
Since I wanted a list whose elements are dictionaries of property information, which I could then pass to pandas.DataFrame, I scraped the site with the following procedure (see the comments in the code).
# Initialize the dictionary that holds one property's data and the list that collects them
property_dict = {}
properties_list = []
# key_list: the list of keys needed to build each property_dict
key_list = ['Property Name', 'Selling price', 'location', 'Along the line / station', 'Occupied area', 'Floor plan', 'balcony', 'Date of construction']
key_list *= 30 #30 properties/page
#Beautiful Soup instance generation on page 1
first_page_url = 'URL of page 1 of the second-hand condominium search results on a certain real estate site'
res = requests.get(first_page_url)
soup = BeautifulSoup(res.text, 'html.parser')
# Repeat for 93 pages (one-off script, so fetching the maximum page count is omitted;
# a sketch of how to obtain it follows this block)
for page in range(93):
    # Build dd_list, a list of the text of the dd tags that contain the property data
    dd_list = [re.sub('[\n\r\t\xa0]', '', x.get_text()) for x in soup.select('dd')]  # strip line breaks, tabs, etc.
    dd_list = dd_list[8:]  # drop the unrelated data at the top of the page

    # Build properties_list, whose elements are dictionaries of property data
    zipped = zip(key_list, dd_list)
    for i, z in enumerate(zipped, start=1):
        property_dict.update({z[0]: z[1]})  # store this key/value pair
        if i % 8 == 0:  # 8 fields per property, so the dictionary is complete
            properties_list.append(property_dict)
            property_dict = {}

    # Get the URL of the next page (the part that follows the base URL)
    next_page = soup.select('p.pagination-parts>a')[-1]

    # Create a BeautifulSoup instance for the next page
    base_url = 'https://xxx.co.jp/'  # URL of the real estate site
    dynamic_url = urljoin(base_url, next_page.get("href"))
    time.sleep(3)
    res = requests.get(dynamic_url)
    soup = BeautifulSoup(res.text, 'html.parser')
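The number of pages (93) is hard-coded above. If you do want to pick up the maximum page count automatically, a minimal sketch, assuming the numbered pagination links live under the same 'p.pagination-parts>a' selector used above, could look like this:
# Sketch (assumption: the numbered pagination links sit under 'p.pagination-parts>a')
page_links = soup.select('p.pagination-parts>a')
page_numbers = [int(a.get_text().strip()) for a in page_links if a.get_text().strip().isdigit()]
max_page = max(page_numbers) if page_numbers else 1  # fall back to 1 if no numbered links are found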
Finally, I passed properties_list to pandas.DataFrame and generated a DataFrame.
df = pd.DataFrame(properties_list)
Scraping on every run is tedious and puts load on the site, so I decided to write the data to a CSV file once and read it back from there afterwards.
csv_file = f'{datetime.date.today()}_Second-hand condominium purchase information.csv'
df.to_csv(csv_file, encoding='cp932', index=False)
df = pd.read_csv(csv_file, encoding='cp932')
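If you run the notebook more than once, a small guard avoids scraping again when the CSV already exists. This is only a sketch: scrape_all_pages is a hypothetical wrapper around the scraping loop above, not a function from this post.
import os

csv_file = f'{datetime.date.today()}_Second-hand condominium purchase information.csv'
if os.path.exists(csv_file):
    # A CSV from an earlier run exists: reuse it instead of hitting the site again
    df = pd.read_csv(csv_file, encoding='cp932')
else:
    # scrape_all_pages() is a hypothetical wrapper around the scraping loop shown earlier
    df = pd.DataFrame(scrape_all_pages())
    df.to_csv(csv_file, encoding='cp932', index=False)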
I often hear that "80% of data analysis time goes to preprocessing", and this exercise made me feel it first-hand. I performed the following preprocessing.
import re
# Keep only records whose selling price is an actual amount (万円 / 億円 notation)
df = df[df['Selling price'].str.match('[0-9]*万円') | df['Selling price'].str.match('[0-9]億円') | df['Selling price'].str.match('[0-9]*億[0-9]*万円')]
# Add the 'Selling price[Ten thousand yen]' column
price = df['Selling price'].apply(lambda x: x.replace('* Including rights', ''))  # strip the "including rights" note
price = price.apply(lambda x: re.sub('([0-9]*)億([0-9]*)万円', r'\1\2', x))  # e.g. 1億2000万円 → 12000
price = price.apply(lambda x: re.sub('([0-9]*)億円', r'\10000', x))  # caution: r'\10000' is parsed as the octal escape \100 ('@') plus '00', so 1億円 becomes '@00'
price = price.apply(lambda x: re.sub('([0-9]*)万円', r'\1', x))  # e.g. 9000万円 → 9000
price = price.apply(lambda x: x.replace('@00', '0'))  # '@00' from the quirk above is not an amount → convert to 0
price = price.apply(lambda x: x.replace('21900~31800', '0'))  # a price range, not a single amount → convert to 0
df['Selling price[Ten thousand yen]'] = price.astype('int')
df = df[df['Selling price[Ten thousand yen]'] > 0]  # exclude the records converted to 0
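As an aside, if the intention is for 1億円 to become 10000 rather than being zeroed out, the explicit group reference \g<1> avoids the octal-escape reading of \10000; a minimal demonstration:
re.sub('([0-9]*)億円', r'\g<1>0000', '1億円')  # → '10000' (explicit group reference)
re.sub('([0-9]*)億円', r'\10000', '1億円')      # → '@00'   (\100 is the octal escape for '@')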
# Add the 'Occupied area[m2]' column
df['Occupied area[m2]'] = df['Occupied area'].apply(lambda x: re.sub('(.*)m2.*', r'\1', x))
df['Occupied area[m2]'] = df['Occupied area[m2]'].apply(lambda x: re.sub('-', '0', x)).astype('float')  # '-' (not listed) → convert to 0
df = df[df['Occupied area[m2]'] > 0]  # exclude the records converted to 0
# Add the 'balcony[m2]' column
df['balcony[m2]'] = df['balcony'].apply(lambda x: re.sub('(.*)m2.*', r'\1', x))
df['balcony[m2]'] = df['balcony[m2]'].apply(lambda x: re.sub('-', '0', x)).astype('float')  # '-' (not listed) → convert to 0
df = df[df['balcony[m2]'] > 0]  # exclude the records converted to 0
# Add the 'Route' column
df['Route'] = df['Along the line / station'].apply(lambda x: re.sub('(.*線).*', r'\1', x, count=5))
# Add the 'Nearest station' column
df['Nearest station'] = df['Along the line / station'].apply(lambda x: re.sub('.*「(.*)」.*', r'\1', x, count=5))
# Add the 'On foot[Minutes]' column
df['On foot[Minutes]'] = df['Along the line / station'].apply(lambda x: re.sub('.*歩([0-9]*)分.*', r'\1', x)).astype('int')
# Add the 'Prefectures' column
df['Prefectures'] = df['location'].apply(lambda x: re.sub('(.*?[都道府県]).*', r'\1', x))
# Add the 'Municipality' column
df['Municipality'] = df['location'].apply(lambda x: re.sub('.*[都道府県](.*?[市区町村]).*', r'\1', x))
# Overwrite df with only the required columns
df = df[['Property Name', 'Selling price[Ten thousand yen]', 'Route', 'Nearest station', 'On foot[Minutes]',
'Floor plan', 'Occupied area[m2]', 'balcony[m2]',
'Prefectures', 'Municipality', 'location']]
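Because every one of these columns comes out of regular-expression substitutions, it is worth a quick sanity check before analysing them; for example, using the columns created above:
# Quick sanity checks on the regex-extracted columns
print(df['Route'].value_counts().head(10))
print(df['Nearest station'].value_counts().head(10))
print(df[['Selling price[Ten thousand yen]', 'Occupied area[m2]', 'On foot[Minutes]']].describe())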
I analyzed properties of 2LDK or larger, both along the whole set of lines I would like to live on and around the three stations I am particularly interested in.
route = ['A line', 'B line', 'C line', 'D line', 'E line', 'F line']
floor_plan = ['2LDK', '2LDK+S (storage room)',
'3DK', '3DK+S (storage room)', '3LDK', '3LDK+S (storage room)', '3LDK+2S (storage room)',
'4DK', '4DK+S (storage room)', '4LDK', '4LDK+S (storage room)']
filtered = df[df['Route'].isin(route) & df['Floor plan'].isin(floor_plan)]
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.pairplot(filtered)
plt.show()
There seems to be some correlation between the selling price and the occupied area, but the other variables do not appear to affect the selling price very much.
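To put a number on that impression, the pairwise correlations of the numeric columns can be checked directly (using the columns built during preprocessing):
# Correlation matrix of the numeric columns
filtered[['Selling price[Ten thousand yen]', 'Occupied area[m2]',
          'balcony[m2]', 'On foot[Minutes]']].corr()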
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.boxplot(x='Selling price[Ten thousand yen]', data=filtered)
plt.show()
Across the whole set of lines, the middle 50% of properties were priced at roughly 40 to 75 million yen. Tokyo really is expensive after all...
filtered.describe()
That was all for the lines as a whole.
From here, I looked at the three stations I am particularly interested in.
station = ['A station', 'B station', 'C station']
grouped = filtered[filtered['Nearest station'].isin(station)]
I looked at the distribution of selling prices at the three stations.
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.violinplot(x='Selling price[Ten thousand yen]', y='Nearest station', data=grouped)
plt.show()
The station I am most interested in is the bottom one (green), and its distribution is bimodal. Perhaps prices are polarized between tower condominiums ("tawaman") and everything else.
grouped.describe()
I wondered whether there were any properties under 50 million yen at the station I am most interested in, so I analyzed it a bit further.
c = filtered[filtered['Nearest station'] == 'C station']
c.groupby(by='Floor plan')['Floor plan'].count()
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.swarmplot(x='Selling price[Ten thousand yen]', y='Floor plan', data=c)
plt.show()
There were only 7 properties under 50 million yen...
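That count can be confirmed directly from the filtered DataFrame:
# Number of properties at C station priced under 50 million yen (5000万円)
(c['Selling price[Ten thousand yen]'] < 5000).sum()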
I looked into these properties under 50 million yen in more detail.
c_u5k = c[c['Selling price[Ten thousand yen]'] < 5000]
c_u5k = c_u5k[['Property Name', 'Floor plan', 'Occupied area[m2]', 'balcony[m2]',
'Selling price[Ten thousand yen]', 'location', 'On foot[Minutes]']].sort_values(['Property Name', 'Floor plan'])
c_u5k
Since there were only 7 of them, I checked each location on Google Maps, and one property (index 1471) caught my eye: a 3LDK of 69.5㎡ at 42.8 million yen, which looks cheap for this station. But is it really cheap compared with the market price of 3LDK properties at the three stations I am interested in? I checked.
grouped[grouped['Floor plan'] == '3LDK'].describe()
As a result, it turned out to be quite cheap. When I looked into why, I found that the property was built in 1985 and is fairly old. There are no interior photos and the listing says "Can be remodeled!", so it is probably showing its age.
Still, even if renovation costs 10 million yen, it is cheap. "I wish I had the money..." is what I find myself thinking these days.
This was my first time doing web scraping, but it was very useful practice, since I am likely to need it again in the future. They say you get good at the things you like, and this day reminded me that doing what you actually want to do, such as analyzing something you are personally interested in, is the fastest way to improve.
I studied with Kame (@US Data Scientist)'s site "Introduction to Python for Data Science". The important points are summarized in an easy-to-understand way, and I recommend it.
I learned from the following articles when doing web scraping.
@tomson784's article "Scraping while repeating page transitions in Python"
@Chanmoro's article "Beautiful Soup in 10 minutes"
Thank you very much!