[PYTHON] i-Town Page Scraping: I Wanted To Replace Wise-kun

background

An Excel macro called Kenshakun ("Wise-kun") used to be distributed for comprehensively pulling specific information out of i-Town Page, but it can no longer be used because of a specification change on the i-Town Page side in November 2019. So, as scraping practice, this Python beginner (who had only touched the language briefly in a university lecture) dove into the programming with great enthusiasm. The ultimate goal is to "get the store name, address, and category for a specific small category."

Confirmation of terms

The i-Town Page terms (https://itp.ne.jp/guide/web/notice/) prohibit the following two acts:
・Acts that have a significant impact on the i-Town Page service
・Repeatedly accessing i-Town Page with a program that accesses it automatically
This time the code is just a crude script that keeps pressing the button that loads the rest of the page until it reaches the bottom, so I assume it falls under normal use rather than repeated access (if even that is not allowed, then the bottom of the page cannot be reached by normal use in the first place...).

environment

code

Install Selenium

In the Anaconda Prompt:

pip install selenium
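
If you want to double-check that the install landed in the environment your script actually uses, a quick version check from Python works (this is just a sanity check, not part of the scraper):

# Confirm Selenium is importable and see which version was installed
import selenium
print(selenium.__version__)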

Specify the driver and start Chrome

from selenium import webdriver

# Point Selenium at the local ChromeDriver binary, then open the search URL
driver = webdriver.Chrome(executable_path='/Users/*****/*****/Selenium/chromedriver/chromedriver.exe')
driver.get('https://itp.ne.jp/genre/?area=13&genre=13&subgenre=177&sort=01&sbmap=false')

After specifying the driver, the specified URL is opened in the browser. Since I was searching for pachinko parlors this time, this is the URL for a search with the area set to "Tokyo" and the category set to "indoor amusement".
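
One caveat: executable_path is the Selenium 3 style of pointing at the driver. If you happen to be on Selenium 4, that argument is gone and the driver path goes through a Service object instead; a minimal sketch of the equivalent (not the version I actually ran):

# Selenium 4 style: the ChromeDriver path is passed via a Service object
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

service = Service('/Users/*****/*****/Selenium/chromedriver/chromedriver.exe')
driver = webdriver.Chrome(service=service)
driver.get('https://itp.ne.jp/genre/?area=13&genre=13&subgenre=177&sort=01&sbmap=false')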

Load the results down to the last row

while True:
    try:
        # Keep clicking the "Show more" button to load the next batch of results
        driver.find_element_by_class_name('m-read-more__text').click()
    except:
        # The button is gone once everything is loaded, so the click fails here
        print('☆ Done repeatedly hitting "Show more" ☆')
        break

The "Show more" button is pressed repeatedly until it throws an error (= until the last row). This loads all of the matching store information into the HTML.
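
To stay on the safe side of the terms above, you could also catch only the "button not found" error and put a short pause between clicks so they are not fired back to back. A small variation I did not actually run, but it would look like this:

import time
from selenium.common.exceptions import NoSuchElementException

while True:
    try:
        driver.find_element_by_class_name('m-read-more__text').click()
        time.sleep(1)  # brief pause so the clicks are spaced out
    except NoSuchElementException:
        # The button disappears once every result has been loaded
        print('☆ Done repeatedly hitting "Show more" ☆')
        break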

Collect category names

elist = []
# Collect the category shown in each result card's header
elems = driver.find_elements_by_class_name("m-article-card__header__category")
for e in elems:
    elist.append(e.text)
print(elist)

# Join the list into one string, one category per line, and write it to a file
str_ = '\n'.join(elist)
print(str_)
with open("str_.txt", 'w') as f:
    f.write(str_)

Create an empty list and append the innerText of every element whose class name is m-article-card__header__category. The list is then joined into a string with one element per line and written out as a text file.
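
Since exactly the same pattern is repeated below for the title and the caption, it could also be wrapped in a small helper; dump_texts is just a name I made up, and I added encoding='utf-8' because the text is Japanese:

def dump_texts(driver, class_name, filename):
    # Grab the text of every element with the given class and write one per line
    texts = [e.text for e in driver.find_elements_by_class_name(class_name)]
    with open(filename, 'w', encoding='utf-8') as f:
        f.write('\n'.join(texts))
    return texts

elist = dump_texts(driver, "m-article-card__header__category", "str_.txt")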

The rest

flist = []
# Collect the store name (title link) from each result card
elems2 = driver.find_elements_by_class_name("m-article-card__header__title__link")
for e in elems2:
    flist.append(e.text)
print(flist)

str2_ = '\n'.join(flist)
print(str2_)
with open("str2_.txt", 'w') as f:
    f.write(str2_)


glist = []
# Collect the caption (address, phone number, nearest station) from each card
elems3 = driver.find_elements_by_class_name("m-article-card__lead__caption")
for g in elems3:
    glist.append(g.text)
print(glist)

str3_ = '\n'.join(glist)
print(str3_)
with open("str3_.txt", 'w') as f:
    f.write(str3_)


print('success')
driver.quit()  # close the browser when done

The title and caption (address, phone number, nearest station) are also output in the same way.
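
If you would rather keep the three columns together from the start, the lists can be zipped into one tab-separated file. This assumes the three lists line up one store per row, which, as noted under "Unresolved issues" below, is not actually guaranteed:

# Write category, store name, and caption side by side, one store per line
with open("stores.tsv", 'w', encoding='utf-8') as f:
    for category, title, caption in zip(elist, flist, glist):
        f.write(f"{category}\t{title}\t{caption}\n")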

Where I got stuck

Resolved issues

・Forgetting to put a colon (:) after for
・Not knowing how to loop forever → the answer was while True:. I also got stuck on True having to start with a capital letter
・Not knowing how to write to a file → the cause was that the file name was not enclosed in ""
・Not knowing how to specify the location of ChromeDriver → I had only specified the folder that contains ChromeDriver; of course, you have to point to chromedriver.exe itself
・Running pip install selenium at the command prompt, but it still could not be used from Spyder → it has to be run on the Anaconda Prompt side
...and many others

Unresolved issues

・I don't know what pip is
・I don't understand the structure of HTML → I never did figure it out, which is why I decided to drive the browser with Selenium instead
・The caption contains the phone number and nearest station in addition to the address → this is fairly fatal, and it would probably be better to write out each store as one group (this time I cleaned it up in Excel). I plan to rewrite it properly if I ever need to

Other

・The URL that displays all stores nationwide is https://itp.ne.jp/genre/
・The URL that displays stores in Tokyo is https://itp.ne.jp/genre/?area=13

2019/12/20 postscript

I noticed that it is easier to format the output if you collect each store as a unit with class="o-result-article-list__item", rather than collecting the categories, titles, and addresses separately.
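
A rough sketch of that per-store approach, reusing the same inner class names as above (I have not fully verified this version, so treat it as a starting point):

# Collect each result card first, then pull category / title / caption out of it
rows = []
for item in driver.find_elements_by_class_name("o-result-article-list__item"):
    category = item.find_element_by_class_name("m-article-card__header__category").text
    title = item.find_element_by_class_name("m-article-card__header__title__link").text
    caption = item.find_element_by_class_name("m-article-card__lead__caption").text
    rows.append(f"{category}\t{title}\t{caption}")

with open("stores_by_item.txt", 'w', encoding='utf-8') as f:
    f.write('\n'.join(rows))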
