[Python] i-Townpage scraping with Selenium

Preface

Around 2020 the specifications of i-Townpage changed, so I wrote a script that works with the new site. The script does the following:

Enter a keyword and an area and search on i-Townpage → get the store names and addresses from the search results and output them in CSV format.

Note

The i-Townpage terms of use prohibit, among other things:

- acts that have a significant impact on the i-Townpage service
- repeatedly accessing i-Townpage with a program that accesses it automatically
- putting load on the server with malicious programs or scripts

The program introduced in this article does not access the site continuously at a rate far beyond what a normal user would, so it (probably) does not fall under these prohibitions.

Also, since copying the site for use in an environment viewable by third parties is prohibited, no screenshots of the site are included in this explanation.

i-Townpage

The site was updated around 2020: as you scroll down the search results, a **Show more** button appears, and you can no longer get all of the results (up to 1000 are displayed) without pressing it many times.

Program overview

For now, here is a brief outline and the complete program. (A detailed explanation of each block follows later.)

1. Create an input interface with PySimpleGUI (optional)
2. Launch Chrome (or Firefox) with the Selenium webdriver, open the search results page, and press all of the Show more buttons
3. Extract the required elements with BeautifulSoup (two items this time: store name and address)
4. Shape the data with pandas

main.py


#This is a Python 3 app
#install selenium, beautifulsoup4, pandas and PySimpleGUI with pip3
#download chrome + chromedriver (or firefox + geckodriver)
from selenium import webdriver
#from selenium.webdriver.firefox.options import Options
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException
from bs4 import BeautifulSoup
import pandas as pd
import time
import re
import csv
import PySimpleGUI as sg

#please download a browser driver and write down the driver's path
#bdriverpath='./chromedriver'
bdriverpath = r"C:\chromedriver.exe"

#make popup window
layout= [
    [sg.Text('Area >> ', size=(15,1)), sg.InputText('Machida')],
    [sg.Text('Keyword >> ', size=(15,1)), sg.InputText('convenience store')],
    [sg.Submit(button_text='OK')]
]
window = sg.Window('Area and Keyword', layout)

#popup
while True:
    event, values = window.read()

    if event is None:
        print('exit')
        break

    if event == 'OK':
        show_message = "Area is " + values[0] + "\n"
        show_message += "Keyword is " + values[1] + "\n"
        print(show_message)
        sg.popup(show_message)
        break

window.close()
area = values[0]
keyword = values[1]

#initialize webdriver
options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options, executable_path=bdriverpath)

#search page with keyword and area
driver.get('https://itp.ne.jp')
driver.find_element_by_id('keyword-suggest').find_element_by_class_name('a-text-input').send_keys(keyword)
driver.find_element_by_id('area-suggest').find_element_by_class_name('a-text-input').send_keys(area)
driver.find_element_by_class_name('m-keyword-form__button').click()
time.sleep(5)

#find & click readmore button
try:
    while driver.find_element_by_class_name('m-read-more'):
        button = driver.find_element_by_class_name('m-read-more')
        button.click()
        time.sleep(1)
except NoSuchElementException:
    pass
res = driver.page_source
driver.quit()

#output with html
with open(area + '_' + keyword + '.html', 'w', encoding='utf-8') as f:
    f.write(res)

#parse with beautifulsoup
soup = BeautifulSoup(res, "html.parser")
shop_names = [n.get_text(strip=True) for n in soup.select('.m-article-card__header__title')]
shop_locates = [n.get_text(strip=True) for n in soup.find_all(class_='m-article-card__lead__caption', text=re.compile("Street address"))]

#incorporation lists with pandas
df = pd.DataFrame([shop_names, shop_locates])
df = df.transpose()

#output with csv
df.to_csv(area + '_' + keyword + '.csv', quoting=csv.QUOTE_NONE, index=False, encoding='utf_8_sig')

sg.popup("finished")

Explanation for each block

Environment

The libraries imported this time are listed below; all of them can be installed with pip3. The commented-out line is for choosing between Chrome and Firefox, so rewrite it to suit your preference and environment.

import.py


from selenium import webdriver
#from selenium.webdriver.firefox.options import Options
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException
from bs4 import BeautifulSoup
import pandas as pd
import time
import re
import csv
import PySimpleGUI as sg

driver: To use the webdriver explained later, you need chromedriver for Chrome or geckodriver for Firefox. Download the appropriate one from the following sites. https://github.com/mozilla/geckodriver/releases https://chromedriver.chromium.org/downloads Also, note that it will not work unless the versions of the **browser, Python, and driver** you are using are all compatible.

- First of all, use the latest browser.
- Download the driver to match it.
- With geckodriver, the latest version is (probably) fine.
- With chromedriver, the driver version is tied to the Chrome version, so pick the one that matches your Chrome.
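
If you are not sure whether the versions line up, one way to check (a sketch of my own, not from the original article; the capability keys below are what ChromeDriver typically reports) is to print what Selenium sees:

versioncheck.py


#sketch: print the browser / driver versions that selenium reports
#the capability keys are ChromeDriver-specific (assumption); geckodriver reports different keys
from selenium import webdriver

driver = webdriver.Chrome(executable_path='./chromedriver')  #example path
caps = driver.capabilities
print('browser:', caps.get('browserVersion') or caps.get('version'))
print('driver :', caps.get('chrome', {}).get('chromedriverVersion'))
driver.quit()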


After downloading the driver, either add its location to the PATH environment variable, or put it somewhere easy to find and write the path into the program. In my Windows environment it sits directly under the C drive. The commented-out line is the path I used on Linux (Mac), where I kept the driver in the same directory as the program.

driver.py


#please download a browser driver and write down the driver's path
#bdriverpath='./chromedriver'
bdriverpath = r"C:\chromedriver.exe"

PySimpleGUI reference: "If you use Tkinter, try using PySimpleGUI"

Define the layout and set the default inputs (Machida, convenience store).

layout.py


#make popup window
layout= [
    [sg.Text('Area >> ', size=(15,1)), sg.InputText('Machida')],
    [sg.Text('Keyword >> ', size=(15,1)), sg.InputText('convenience store')],
    [sg.Submit(button_text='OK')]
]

Create the window and keep reading it in a loop. When the OK button is pressed, the input contents are read into values[]. After processing finishes, close the window with window.close() and pass the inputs to the program's variables.

window.py


window = sg.Window('Area and Keyword', layout)

#popup
while True:
    event, values = window.read()

    if event is None:
        print('exit')
        break

    if event == 'OK':
        show_message = "Area is " + values[0] + "\n"
        show_message += "Keyword is " + values[1] + "\n"
        print(show_message)
        sg.popup(show_message)
        break

window.close()
area = values[0]
keyword = values[1]
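
As a small variant (my own addition, not in the original script), the inputs can be given explicit keys so the values are read by name instead of by position:

keys_variant.py


#variant: name the inputs with keys instead of relying on values[0] / values[1]
import PySimpleGUI as sg

layout = [
    [sg.Text('Area >> ', size=(15,1)), sg.InputText('Machida', key='area')],
    [sg.Text('Keyword >> ', size=(15,1)), sg.InputText('convenience store', key='keyword')],
    [sg.Submit(button_text='OK')]
]
window = sg.Window('Area and Keyword', layout)
event, values = window.read()   #simplified: no loop / close-button handling here
window.close()
area = values['area']
keyword = values['keyword']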

Start webdriver

webdriver (Selenium) is a library for driving an ordinary browser (Firefox, Chrome, etc.) programmatically.

First, add --headless to the startup options; this makes the browser run in the background. If you want to watch the browser operate automatically, comment out options.add_argument('--headless'). Then launch Chrome with driver = webdriver.Chrome(), passing the options and the driver path at the same time: options=options, executable_path=bdriverpath.

init.py


#initialize webdriver
options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options, executable_path=bdriverpath)
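
For reference, the Firefox setup mentioned in the commented-out import would look roughly like this (a sketch assuming geckodriver sits next to the script):

init_firefox.py


#initialize webdriver (firefox variant)
from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument('--headless')
driver = webdriver.Firefox(options=options, executable_path='./geckodriver')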

Search with webdriver

Go to the top page of i-Townpage with driver.get. Find the input boxes for the keyword and the area with driver.find_element_by_..., and type into them with .send_keys(). Find the search button in the same way and press it with .click().

search.py


#search page with keyword and area
driver.get('https://itp.ne.jp')
driver.find_element_by_id('keyword-suggest').find_element_by_class_name('a-text-input').send_keys(keyword)
driver.find_element_by_id('area-suggest').find_element_by_class_name('a-text-input').send_keys(area)
driver.find_element_by_class_name('m-keyword-form__button').click()
time.sleep(5)

html example

For example, in the HTML below, the keyword input box sits inside an element with the id keyword-suggest, and the input itself has the class a-text-input.

keyword.html


<div data-v-1wadada="" id="keyword-suggest" class="m-suggest" data-v-1dadas14="">
<input data-v-dsadwa3="" type="text" autocomplete="off" class="a-text-input" placeholder="Enter a keyword" data-v-1bbdb50e=""> 
<!---->
</div>
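
Incidentally, the two chained find_element calls used in search.py could be written as a single CSS selector instead; a sketch of the equivalent lookup:

css_variant.py


#equivalent lookup with one css selector per box
driver.find_element_by_css_selector('#keyword-suggest .a-text-input').send_keys(keyword)
driver.find_element_by_css_selector('#area-suggest .a-text-input').send_keys(area)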

Pressing the Show more button repeatedly

Use a loop to keep pressing the Show more button (class_name = m-read-more) for as long as it can be found. If you look for the next button immediately after clicking, the new button may not have loaded yet and the loop would stop partway through, so add a wait with time.sleep(1). Once the button can no longer be found, the webdriver raises an error and the program would end, so catch that error in advance with except. After the except, simply carry on: put the obtained html (with everything expanded) into res and shut down the webdriver with driver.quit().

button.py


from selenium.common.exceptions import NoSuchElementException

#find & click readmore button
try:
    while driver.find_element_by_class_name('m-read-more'):
        button = driver.find_element_by_class_name('m-read-more')
        button.click()
        time.sleep(1)
except NoSuchElementException:
    pass
res = driver.page_source
driver.quit()

Output html

Just in case, write out the HTML that was obtained. This step is optional.

html.py


#output with html
with open(area + '_' + keyword + '.html', 'w', encoding='utf-8') as f:
    f.write(res)

html analysis

Pass the html obtained earlier to BeautifulSoup. Find elements with soup.select and take just the text (store name or address) with .get_text(). Plain get_text() includes line breaks and spaces, but with the strip=True option you get only the characters you want. As for the address, on the i-Townpage site the class m-article-card__lead__caption is attached not only to the address but also to the telephone number and the nearest station, so only the address is extracted by filtering on the element text with text=re.compile("Street address").

parse.py


#parse with beautifulsoup
soup = BeautifulSoup(res, "html.parser")
shop_names = [n.get_text(strip=True) for n in soup.select('.m-article-card__header__title')]
shop_locates = [n.get_text(strip=True) for n in soup.find_all(class_='m-article-card__lead__caption', text=re.compile("Street address"))]
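
To make the effect of strip=True concrete, here is a minimal sketch using hypothetical markup (not copied from the actual site):

strip_example.py


#hypothetical fragment, only to show what strip=True does
from bs4 import BeautifulSoup

html = '<p class="m-article-card__lead__caption">\n  Street address Tokyo Machida 1-2-3\n</p>'
node = BeautifulSoup(html, 'html.parser').p
print(repr(node.get_text()))            #'\n  Street address Tokyo Machida 1-2-3\n'
print(repr(node.get_text(strip=True)))  #'Street address Tokyo Machida 1-2-3'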

Shaping the data

pandas is used to organize the data. The data from BeautifulSoup comes as two lists, so combine them into one DataFrame. That alone produces the data in landscape orientation (one row per list), so use transpose() to make it portrait (one row per shop).

pandas.py


#incorporation lists with pandas
df = pd.DataFrame([shop_names, shop_locates])
df = df.transpose()
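
Optionally (my own addition, not in the original), the two columns can be given headers before writing the file, which makes the CSV easier to read:

columns.py


#optional: label the columns so the csv gets a readable header row
df.columns = ['name', 'address']
print(df.head())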

Data output

This time the data is output in CSV format, using the user-entered area and keyword for the file name. When pandas writes the data it adds a numeric row index, which just gets in the way, so it is removed with index=False. The output is also garbled when opened in Excel, which is avoided with encoding='utf_8_sig'.

csv.py


#output with csv
df.to_csv(area + '_' + keyword + '.csv', quoting=csv.QUOTE_NONE, index=False, encoding='utf_8_sig')
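
One caveat of my own (not from the original article): with quoting=csv.QUOTE_NONE, pandas raises an error if any field happens to contain a comma, so minimal quoting is the safer default:

csv_quoting.py


#safer variant if a shop name or address can contain a comma
df.to_csv(area + '_' + keyword + '.csv', quoting=csv.QUOTE_MINIMAL, index=False, encoding='utf_8_sig')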

At the end

I tried web scraping with Selenium, and my impression is that its behavior is not very stable. Since an actual browser is running, what happens after a page load or a button press is not guaranteed; this time I worked around it with time.sleep. (I originally tried Selenium's implicit/explicit waits, but they did not work for me.) Also, the webdriver I had downloaded turned out for some reason to be an old version, and I struggled with an error for about two days without noticing, so I was very angry (at myself).
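
For reference, the explicit-wait approach mentioned above would look roughly like this (a sketch I have not verified against the current site):

wait_example.py


#sketch: wait until the read-more button is clickable instead of sleeping a fixed time
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 10)
button = wait.until(EC.element_to_be_clickable((By.CLASS_NAME, 'm-read-more')))
button.click()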
