Hello, this is @0yan. Today, as hands-on practice in data analysis, I tried combining web scraping with data analysis. Since I would like to buy a second-hand condominium someday, I chose listings of second-hand condominiums along the train lines I am interested in.
The packages used are as follows.
import datetime
import re
import time
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import pandas as pd
import requests
The page I scraped listed the properties one after another, about 30 per page. In the end, I wanted to end up with a list like this:
[{'Property Name': spam, 'Selling price': spam, 'location': spam, 'Along the line / station': spam,
'Occupied area': spam, 'Floor plan': spam, 'balcony': spam, 'Date of construction': spam},
{'Property Name': spam, 'Selling price': spam, 'location': spam, 'Along the line / station': spam,
'Occupied area': spam, 'Floor plan': spam, 'balcony': spam, 'Date of construction': spam},
...,
{'Property Name': spam, 'Selling price': spam, 'location': spam, 'Along the line / station': spam,
'Occupied area': spam, 'Floor plan': spam, 'balcony': spam, 'Date of construction': spam}]
Since I wanted a list whose elements are dictionaries of property information, which I could then pass to pandas.DataFrame, I scraped the site with the following procedure (see the comments in the code).
# Initialize the dictionary that holds one property's data and the list that collects them
property_dict = {}
properties_list = []
# key_list: the list of keys needed to build each property_dict
key_list = ['Property Name', 'Selling price', 'location', 'Along the line / station', 'Occupied area', 'Floor plan', 'balcony', 'Date of construction']
key_list *= 30 #30 properties/page
#Beautiful Soup instance generation on page 1
first_page_url = 'URL of page 1 of the second-hand condominium search results on a certain real estate site'
res = requests.get(first_page_url)
soup = BeautifulSoup(res.text, 'html.parser')
# Repeat for 93 pages (one-off script, so fetching the maximum page count is omitted;
# a sketch of how to obtain it follows this block)
for page in range(93):
    # Build dd_list, a list of the text of the dd tags that contain the property data
    dd_list = [re.sub('[\n\r\t\xa0]', '', x.get_text()) for x in soup.select('dd')]  # strip line breaks, tabs, etc.
    dd_list = dd_list[8:]  # drop the unrelated data at the top of the page

    # Build properties_list, whose elements are dictionaries of property data
    zipped = zip(key_list, dd_list)
    for i, z in enumerate(zipped, start=1):
        property_dict.update({z[0]: z[1]})  # store this key/value pair
        if i % 8 == 0:  # 8 fields per property, so the dictionary is complete
            properties_list.append(property_dict)
            property_dict = {}

    # Get the URL of the next page (the part that follows the base URL)
    next_page = soup.select('p.pagination-parts>a')[-1]

    # Create a BeautifulSoup instance for the next page
    base_url = 'https://xxx.co.jp/'  # URL of the real estate site
    dynamic_url = urljoin(base_url, next_page.get("href"))
    time.sleep(3)
    res = requests.get(dynamic_url)
    soup = BeautifulSoup(res.text, 'html.parser')
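The number of pages (93) is hard-coded above. If you do want to pick up the maximum page count automatically, a minimal sketch, assuming the numbered pagination links live under the same 'p.pagination-parts>a' selector used above, could look like this:
# Sketch (assumption: the numbered pagination links sit under 'p.pagination-parts>a')
page_links = soup.select('p.pagination-parts>a')
page_numbers = [int(a.get_text().strip()) for a in page_links if a.get_text().strip().isdigit()]
max_page = max(page_numbers) if page_numbers else 1  # fall back to 1 if no numbered links are found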
Finally, I passed properties_list to pandas.DataFrame and generated a DataFrame.
df = pd.DataFrame(properties_list)
Scraping on every run is tedious and puts load on the site, so I decided to write the data to a CSV file once and read it back from there afterwards.
csv_file = f'{datetime.date.today()}_Second-hand condominium purchase information.csv'
df.to_csv(csv_file, encoding='cp932', index=False)
df = pd.read_csv(csv_file, encoding='cp932')
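If you run the notebook more than once, a small guard avoids scraping again when the CSV already exists. This is only a sketch: scrape_all_pages is a hypothetical wrapper around the scraping loop above, not a function from this post.
import os

csv_file = f'{datetime.date.today()}_Second-hand condominium purchase information.csv'
if os.path.exists(csv_file):
    # A CSV from an earlier run exists: reuse it instead of hitting the site again
    df = pd.read_csv(csv_file, encoding='cp932')
else:
    # scrape_all_pages() is a hypothetical wrapper around the scraping loop shown earlier
    df = pd.DataFrame(scrape_all_pages())
    df.to_csv(csv_file, encoding='cp932', index=False)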
I often hear that "80% of data analysis time goes to preprocessing", and this exercise made me feel it first-hand. I performed the following preprocessing.
import re
# Keep only records whose selling price is an actual amount (万円 / 億円 notation)
df = df[df['Selling price'].str.match('[0-9]*万円') | df['Selling price'].str.match('[0-9]億円') | df['Selling price'].str.match('[0-9]*億[0-9]*万円')]
# Add the 'Selling price[Ten thousand yen]' column
price = df['Selling price'].apply(lambda x: x.replace('* Including rights', ''))  # strip the "including rights" note
price = price.apply(lambda x: re.sub('([0-9]*)億([0-9]*)万円', r'\1\2', x))  # e.g. 1億2000万円 → 12000
price = price.apply(lambda x: re.sub('([0-9]*)億円', r'\10000', x))  # caution: r'\10000' is parsed as the octal escape \100 ('@') plus '00', so 1億円 becomes '@00'
price = price.apply(lambda x: re.sub('([0-9]*)万円', r'\1', x))  # e.g. 9000万円 → 9000
price = price.apply(lambda x: x.replace('@00', '0'))  # '@00' from the quirk above is not an amount → convert to 0
price = price.apply(lambda x: x.replace('21900~31800', '0'))  # a price range, not a single amount → convert to 0
df['Selling price[Ten thousand yen]'] = price.astype('int')
df = df[df['Selling price[Ten thousand yen]'] > 0]  # exclude the records converted to 0
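As an aside, if the intention is for 1億円 to become 10000 rather than being zeroed out, the explicit group reference \g<1> avoids the octal-escape reading of \10000; a minimal demonstration:
re.sub('([0-9]*)億円', r'\g<1>0000', '1億円')  # → '10000' (explicit group reference)
re.sub('([0-9]*)億円', r'\10000', '1億円')      # → '@00'   (\100 is the octal escape for '@')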
# Add the 'Occupied area[m2]' column
df['Occupied area[m2]'] = df['Occupied area'].apply(lambda x: re.sub('(.*)m2.*', r'\1', x))
df['Occupied area[m2]'] = df['Occupied area[m2]'].apply(lambda x: re.sub('-', '0', x)).astype('float')  # '-' (not listed) → convert to 0
df = df[df['Occupied area[m2]'] > 0]  # exclude the records converted to 0
# Add the 'balcony[m2]' column
df['balcony[m2]'] = df['balcony'].apply(lambda x: re.sub('(.*)m2.*', r'\1', x))
df['balcony[m2]'] = df['balcony[m2]'].apply(lambda x: re.sub('-', '0', x)).astype('float')  # '-' (not listed) → convert to 0
df = df[df['balcony[m2]'] > 0]  # exclude the records converted to 0
# Add the 'Route' column
df['Route'] = df['Along the line / station'].apply(lambda x: re.sub('(.*線).*', r'\1', x, count=5))
# Add the 'Nearest station' column
df['Nearest station'] = df['Along the line / station'].apply(lambda x: re.sub('.*「(.*)」.*', r'\1', x, count=5))
# Add the 'On foot[Minutes]' column
df['On foot[Minutes]'] = df['Along the line / station'].apply(lambda x: re.sub('.*歩([0-9]*)分.*', r'\1', x)).astype('int')
# Add the 'Prefectures' column
df['Prefectures'] = df['location'].apply(lambda x: re.sub('(.*?[都道府県]).*', r'\1', x))
# Add the 'Municipality' column
df['Municipality'] = df['location'].apply(lambda x: re.sub('.*[都道府県](.*?[市区町村]).*', r'\1', x))
# Overwrite df with only the required columns
df = df[['Property Name', 'Selling price[Ten thousand yen]', 'Route', 'Nearest station', 'On foot[Minutes]',
'Floor plan', 'Occupied area[m2]', 'balcony[m2]',
'Prefectures', 'Municipality', 'location']]
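Because every one of these columns comes out of regular-expression substitutions, it is worth a quick sanity check before analysing them; for example, using the columns created above:
# Quick sanity checks on the regex-extracted columns
print(df['Route'].value_counts().head(10))
print(df['Nearest station'].value_counts().head(10))
print(df[['Selling price[Ten thousand yen]', 'Occupied area[m2]', 'On foot[Minutes]']].describe())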
I analyzed properties of 2LDK or larger, both along the whole set of lines I would like to live on and around the three stations I am particularly interested in.
route = ['A line', 'B line', 'C line', 'D line', 'E line', 'F line']
floor_plan = ['2LDK', '2LDK+S (storage room)',
'3DK', '3DK+S (storage room)', '3LDK', '3LDK+S (storage room)', '3LDK+2S (storage room)',
'4DK', '4DK+S (storage room)', '4LDK', '4LDK+S (storage room)']
filtered = df[df['Route'].isin(route) & df['Floor plan'].isin(floor_plan)]
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.pairplot(filtered)
plt.show()
There seems to be some correlation between the selling price and the occupied area, but the other variables do not appear to affect the selling price very much.
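To put a number on that impression, the pairwise correlations of the numeric columns can be checked directly (using the columns built during preprocessing):
# Correlation matrix of the numeric columns
filtered[['Selling price[Ten thousand yen]', 'Occupied area[m2]',
          'balcony[m2]', 'On foot[Minutes]']].corr()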
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.boxplot(x='Selling price[Ten thousand yen]', data=filtered)
plt.show()
Across the whole set of lines, the middle 50% of properties were priced at roughly 40 to 75 million yen. Tokyo really is expensive after all...
filtered.describe()
That was all for the lines as a whole.
From here, I looked at the three stations I am particularly interested in.
station = ['A station', 'B station', 'C station']
grouped = filtered[filtered['Nearest station'].isin(station)]
I looked at the distribution of selling prices at the three stations.
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.violinplot(x='Selling price[Ten thousand yen]', y='Nearest station', data=grouped)
plt.show()
The station I am most interested in is the bottom one (green), and its distribution is bimodal. Perhaps prices are polarized between tower condominiums ("tawaman") and everything else.
grouped.describe()
I wondered whether there were any properties under 50 million yen at the station I am most interested in, so I analyzed it a bit further.
c = filtered[filtered['Nearest station'] == 'C station']
c.groupby(by='Floor plan')['Floor plan'].count()
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.swarmplot(x='Selling price[Ten thousand yen]', y='Floor plan', data=c)
plt.show()
There were only 7 properties under 50 million yen...
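That count can be confirmed directly from the filtered DataFrame:
# Number of properties at C station priced under 50 million yen (5000万円)
(c['Selling price[Ten thousand yen]'] < 5000).sum()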
I looked into these properties under 50 million yen in more detail.
c_u5k = c[c['Selling price[Ten thousand yen]'] < 5000]
c_u5k = c_u5k[['Property Name', 'Floor plan', 'Occupied area[m2]', 'balcony[m2]',
'Selling price[Ten thousand yen]', 'location', 'On foot[Minutes]']].sort_values(['Property Name', 'Floor plan'])
c_u5k
Since there were only 7 of them, I checked each location on Google Maps, and one property (index 1471) caught my eye: a 3LDK of 69.5㎡ at 42.8 million yen, which looks cheap for this station. But is it really cheap compared with the market price of 3LDK properties at the three stations I am interested in? I checked.
grouped[grouped['Floor plan'] == '3LDK'].describe()
As a result, it turned out to be quite cheap. When I looked into why, I found that the property was built in 1985 and is fairly old. There are no interior photos and the listing says "Can be remodeled!", so it is probably showing its age.
Still, even if renovation costs 10 million yen, it is cheap. "I wish I had the money..." is what I find myself thinking these days.
This was my first time doing web scraping, but it was very useful practice, since I am likely to need it again in the future. They say you get good at the things you like, and this day reminded me that doing what you actually want to do, such as analyzing something you are personally interested in, is the fastest way to improve.
I studied with Kame (@US Data Scientist)'s site "Introduction to Python for Data Science". The important points are summarized in an easy-to-understand way, and I recommend it.
I learned from the following articles when doing web scraping.
@tomson784's article "Scraping while repeating page transitions in Python"
@Chanmoro's article "Beautiful Soup in 10 minutes"
Thank you very much!