[Python] Flow from web scraping to data analysis

Hello, this is @0yan. Today, as hands-on practice in data analysis, I went through the whole flow from web scraping to analysis. I would like to buy a second-hand condominium someday, so I chose listings of second-hand condominiums along a railway line I am interested in.

Environment

What I did

  1. Web scraping
  2. CSV write / read
  3. Data preprocessing
  4. Analysis

1. Web scraping

The packages used are as follows.

Packages used

import datetime
import re
import time
from urllib.parse import urljoin

from bs4 import BeautifulSoup
import pandas as pd
import requests

Flow of scraping

The page I scraped lists 30 properties per page, with each property's details contained in a series of dd tags. In the end, I wanted a list of the following form:

[{'Property Name': spam, 'Selling price': spam, 'location': spam, 'Along the line / station': spam,
  'Occupied area': spam, 'Floor plan': spam, 'balcony': spam, 'Date of construction': spam},
 {'Property Name': spam, 'Selling price': spam, 'location': spam, 'Along the line / station': spam,
  'Occupied area': spam, 'Floor plan': spam, 'balcony': spam, 'Date of construction': spam},
 ...,
 {'Property Name': spam, 'Selling price': spam, 'location': spam, 'Along the line / station': spam,
  'Occupied area': spam, 'Floor plan': spam, 'balcony': spam, 'Date of construction': spam}]

That is, a list whose elements are dictionaries of property information, which can be passed straight to pandas.DataFrame. I scraped the data with the following procedure (see the comments).

# Initialize the dictionary that holds one property's data and the list that collects all of them
property_dict = {}
properties_list = []


# Build key_list, the list of keys needed to create each property_dict
key_list = ['Property Name', 'Selling price', 'location', 'Along the line / station', 'Occupied area', 'Floor plan', 'balcony', 'Date of construction']
key_list *= 30  # 30 properties per page


# Create a BeautifulSoup instance for the first page
first_page_url = 'URL of the first page of the used-condominium search results on a certain real estate site'
res = requests.get(first_page_url)
soup = BeautifulSoup(res.text, 'html.parser')


# Repeat for all 93 pages (this is a one-off script, so fetching the maximum page count is omitted)
for page in range(93):
    # Build dd_list from the dd tags that hold the property data
    dd_list = [re.sub('[\n\r\t\xa0]', '', x.get_text()) for x in soup.select('dd')]  # strip unneeded line breaks etc.
    dd_list = dd_list[8:]  # drop the unrelated items at the top of the page

    # Build properties_list, whose elements are the per-property dictionaries
    zipped = zip(key_list, dd_list)
    for i, z in enumerate(zipped, start=1):
        property_dict.update({z[0]: z[1]})
        if i % 8 == 0:  # every 8 key-value pairs completes one property
            properties_list.append(property_dict)
            property_dict = {}

    # Get the URL of the next page (the part after the base URL)
    next_page = soup.select('p.pagination-parts>a')[-1]

    # Create a BeautifulSoup instance for the next page
    base_url = 'https://xxx.co.jp/'  # URL of the real estate site
    dynamic_url = urljoin(base_url, next_page.get("href"))
    time.sleep(3)
    res = requests.get(dynamic_url)
    soup = BeautifulSoup(res.text, 'html.parser')
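
As an aside, the number of pages is hard-coded to 93 above. Here is a minimal sketch of deriving it from the pagination links instead; it assumes the p.pagination-parts > a links carry the page numbers as their text, which I have not verified against the actual site.

# Hypothetical: collect the numeric pagination links and take the largest one
page_links = soup.select('p.pagination-parts > a')
page_numbers = [int(a.get_text(strip=True)) for a in page_links if a.get_text(strip=True).isdigit()]
max_page = max(page_numbers) if page_numbers else 1
# then: for page in range(max_page): ...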

Finally, I passed the property list properties_list to pandas.DataFrame and generated a DataFrame.

df = pd.DataFrame(properties_list)

2. CSV write / read

Scraping the site every time is tedious and puts load on the site, so I decided to write the data to CSV once and read it back from there for the analysis.

csv_file = f'{datetime.date.today()}_Second-hand condominium purchase information.csv'
df.to_csv(csv_file, encoding='cp932', index=False)
df = pd.read_csv(csv_file, encoding='cp932')
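
To avoid re-scraping when today's data already exists, a simple caching pattern could be used. This is only a sketch: scrape_properties() is a hypothetical wrapper around the scraping code in section 1.

import datetime
import os

import pandas as pd

csv_file = f'{datetime.date.today()}_Second-hand condominium purchase information.csv'
if os.path.exists(csv_file):
    # Today's CSV already exists, so just read it
    df = pd.read_csv(csv_file, encoding='cp932')
else:
    df = scrape_properties()  # hypothetical function wrapping the scraping in section 1
    df.to_csv(csv_file, encoding='cp932', index=False)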

3. Data preprocessing

I often hear that "preprocessing takes 80% of the time in data analysis", and this step made me feel exactly that. I performed the following preprocessing.

import re

# Exclude records whose selling price is not a plain amount
df = df[df['Selling price'].str.match('[0-9]*万円') | df['Selling price'].str.match('[0-9]億円') | df['Selling price'].str.match('[0-9]*億[0-9]*万円')]

# Add 'Selling price[Ten thousand yen]'
price = df['Selling price'].apply(lambda x: x.replace('* Including rights', ''))  # strip the note appended to some prices
price = price.apply(lambda x: re.sub('([0-9]*)億([0-9]*)万円', r'\1\2', x))  # 1億2000万円 → 12000 (assumes the 万 part has 4 digits)
price = price.apply(lambda x: re.sub('([0-9]*)億円', r'\g<1>0000', x))  # 1億円 → 10000 (\g<1> avoids the octal-escape ambiguity of \10000)
price = price.apply(lambda x: x.replace('21900~31800', '0'))  # price shown as a range → convert to 0 and exclude below
df['Selling price[Ten thousand yen]'] = price.astype('int')
df = df[df['Selling price[Ten thousand yen]'] > 0]  # exclude 0 records

# Add 'Occupied area[m2]'
df['Occupied area[m2]'] = df['Occupied area'].apply(lambda x: re.sub('(.*)m2.*', r'\1', x))
df['Occupied area[m2]'] = df['Occupied area[m2]'].apply(lambda x: re.sub('-', '0', x)).astype('float')  # not listed ('-') → convert to 0
df = df[df['Occupied area[m2]'] > 0]  # exclude 0 records

# Add 'balcony[m2]'
df['balcony[m2]'] = df['balcony'].apply(lambda x: re.sub('(.*)m2.*', r'\1', x))
df['balcony[m2]'] = df['balcony[m2]'].apply(lambda x: re.sub('-', '0', x)).astype('float')  # not listed ('-') → convert to 0
df = df[df['balcony[m2]'] > 0]  # exclude 0 records

# Add 'Route'
df['Route'] = df['Along the line / station'].apply(lambda x: re.sub('(.*線).*', r'\1', x, count=5))

# Add 'Nearest station'
df['Nearest station'] = df['Along the line / station'].apply(lambda x: re.sub('.*「(.*)」.*', r'\1', x, count=5))

# Add 'On foot[Minutes]'
df['On foot[Minutes]'] = df['Along the line / station'].apply(lambda x: re.sub('.*歩([0-9]*)分.*', r'\1', x)).astype('int')

# Add 'Prefectures'
df['Prefectures'] = df['location'].apply(lambda x: re.sub('(.*?[都道府県]).*', r'\1', x))

# Add 'Municipality'
df['Municipality'] = df['location'].apply(lambda x: re.sub('.*[都道府県](.*?[市区町村]).*', r'\1', x))

# Keep only the required columns (overwrite df)
df = df[['Property Name', 'Selling price[Ten thousand yen]', 'Route', 'Nearest station', 'On foot[Minutes]',
         'Floor plan', 'Occupied area[m2]', 'balcony[m2]',
         'Prefectures', 'Municipality', 'location']]
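
As an aside, the repeated apply + re.sub calls for the route, nearest station and walking time could also be written with pandas' vectorized string methods. A minimal sketch, assuming the 'Along the line / station' values look like '○○線「△△」歩5分' as the regexes above imply; it would have to run before the final column selection, while that column is still in df, and the resulting columns take their names from the named groups rather than the column names used above.

# Sketch: named groups in str.extract become the columns of the result
station_info = df['Along the line / station'].str.extract(r'(?P<Route>.*線).*「(?P<Station>.*)」.*歩(?P<Walk>[0-9]+)分')
station_info['Walk'] = station_info['Walk'].astype('float')  # float, since rows that fail to match become NaN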

4. Analysis

I analyzed properties of 2LDK or larger, both along the entire line I want to live on and around the three stations I am particularly interested in.

The entire route

Filtering

route = ['Line A', 'B line', 'C line', 'D line', 'E line', 'F line']
floor_plan = ['2LDK', '2LDK+S (storage room)',
              '3DK', '3DK+S (storage room)', '3LDK', '3LDK+S (storage room)', '3LDK+2S (storage room)',
              '4DK', '4DK+S (storage room)', '4LDK', '4LDK+S (storage room)']
filtered = df[df['Route'].isin(route) & df['Floor plan'].isin(floor_plan)]

Correlation analysis

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

sns.pairplot(filtered)
plt.show()

(image: pair plot of the filtered data)

There seems to be a mild correlation between selling price and occupied area, but the other variables do not appear to affect the selling price much.
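
To put a number on that impression, the correlation coefficients can be checked directly (a quick check, not part of the original workflow):

# Pearson correlation between the numeric columns
filtered[['Selling price[Ten thousand yen]', 'Occupied area[m2]', 'balcony[m2]', 'On foot[Minutes]']].corr()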

Distribution of selling prices

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

sns.boxplot(x='Selling price[Ten thousand yen]', data=filtered)
plt.show()

(image: box plot of selling prices)

Half of the properties along the entire line fell roughly between 40 and 75 million yen. Tokyo really is expensive...
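
That 40-75 million yen range is the interquartile range, which can also be read off directly:

# 25th, 50th and 75th percentiles of the selling price (in units of 10,000 yen)
filtered['Selling price[Ten thousand yen]'].quantile([0.25, 0.5, 0.75])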

Statistics

filtered.describe()

(image: output of describe() for the filtered data)

That was all for the entire route.

Stations of particular interest on the line (3 stations)

From here, I looked at the three stations I am interested in.

station = ['A station', 'B station', 'C station']
grouped = filtered[filtered['Nearest station'].isin(station)]

Distribution of selling prices

I looked at the distribution of selling prices at the three stations.

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

sns.violinplot(x='Selling price[Ten thousand yen]', y='Nearest station', data=grouped)
plt.show()

(image: violin plots of selling price by nearest station)

The station I am most interested in is the bottom one (green), and its distribution is bimodal. It may well be polarized between tower condominiums and everything else.
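
One way to probe that hunch (a sketch only; the data has no explicit tower-condominium flag) would be to check whether the expensive mode corresponds to particular properties, for example by plotting price against occupied area for that station. 'C station' here stands for the third (green) station, as in the next section.

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Does the expensive mode correspond to larger units at this station?
sns.scatterplot(x='Occupied area[m2]', y='Selling price[Ten thousand yen]',
                data=grouped[grouped['Nearest station'] == 'C station'])
plt.show()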

Statistics

grouped.describe()

(image: output of describe() for the three stations)

The station I am most interested in

I wondered whether there were any properties under 50 million yen near the station I am most interested in, so I dug in further.

Analysis for each floor plan

Number of properties
c = filtered[filtered['Nearest station'] == 'C station']
c.groupby(by='Floor plan')['Floor plan'].count()

(image: number of properties by floor plan)

Selling price
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

sns.swarmplot(x='Selling price[Ten thousand yen]', y='Floor plan', data=c)
plt.show()

(image: swarm plot of selling price by floor plan)

There were only 7 properties under 50 million yen...

Properties less than 50 million yen

I looked more closely at the properties under 50 million yen.

c_u5k = c[c['Selling price[Ten thousand yen]'] < 5000]
c_u5k = c_u5k[['Property Name', 'Floor plan', 'Occupied area[m2]', 'balcony[m2]',
               'Selling price[Ten thousand yen]', 'location', 'On foot[Minutes]']].sort_values(['Property Name', 'Floor plan'])
c_u5k

(image: the 7 properties under 50 million yen)

Since there were only 7, I looked up each location on Google Maps and became interested in the property at index 1471: a 3LDK of 69.5㎡ for 42.8 million yen, which seems cheap for this station. But is it really cheap compared with the market price of 3LDK properties around the three stations I am interested in? I checked.

grouped[grouped['Floor plan'] == '3LDK'].describe()

(image: describe() of the 3LDK properties at the three stations)

As a result, it turned out to be quite cheap. When I investigated why, I found that the property was built in 1985 and is fairly old. Since there are no interior photos and the listing says "It can be remodeled!", it is safe to assume the interior has deteriorated considerably.

Still, even if renovation cost 10 million yen, it would be a bargain. If only I had the money... is what I find myself thinking these days.
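
As a further idea, the effect of building age on price could also be checked if the 'Date of construction' column scraped in section 1 is kept instead of being dropped in the final column selection. A rough sketch, assuming the raw values look like '1985年3月':

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Hypothetical: extract the 4-digit construction year and plot it against price
# (assumes 'Date of construction' is still present and formatted like '1985年3月')
df['Built year'] = df['Date of construction'].str.extract(r'([0-9]{4})', expand=False).astype('float')
sns.scatterplot(x='Built year', y='Selling price[Ten thousand yen]', data=df)
plt.show()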

Impressions

This was my first time doing web scraping, and it was very useful practice since I am likely to need it again. They say "you become good at what you love", and this was a day that reminded me that doing what I actually want to do, such as analyzing something I am interested in, is the fastest way to improve.

Site where I learned data analysis

I studied with "Introduction to Python for Data Science", the site by Kame (@US Data Scientist). The important points are summarized in an easy-to-understand way, and I recommend it.

References

I learned from the following articles when doing web scraping.

@tomson784's article: "Scraping while repeating page transitions in Python"

@Chanmoro's article: "Beautiful Soup in 10 minutes"

Thank you very much!
