[Part.2] Crawling with Python! Click a link on the web page to move to the next page!

Introduction

Hi! I'm Jesse (I'm from Human Wolf J ~ ♪). The continuation of my last post is finished, so I'm making it public! If you like, start with the previous post (https://qiita.com/Jessica_nao_/items/b9f38a4413e424e3e585)!

What I wanted to do

The table I wanted lives at a single URL, but it is split into 20 pages of 100 items each, and I wanted to extract all of the data in one go. So I had Selenium WebDriver do the hard work for me! Oh, and as the tag says, I'm using Python 3!

Crawling x scraping with Python and Selenium!

--Part2: This page! I used Chrome WebDriver to click an element on the page and move to the next page! The data was saved as CSV ~.

Contents

--First, I prepared a function mkFN that creates file names.
--Save the web page source to a text file. The first page behaves differently from the second and later pages, so I wrote this step, including the loop, as a single function gettxt2! On the page I wanted to gather information from, pressing the link labeled "Next" moves to the next page!
--Next, trimming.
--Finally, I saved everything to a CSV file! I had to be careful to start a new line for each item and for each loaded file!
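The last step, collecting every page into one CSV file, can be sketched roughly as follows. This is a minimal sketch rather than the code from this post: `append_page` and the sample rows are hypothetical names made up for illustration. The first page opens the file in write mode so an old file is overwritten, and later pages append to it.

```python
import os
import tempfile

def append_page(csv_path, page_index, rows_text):
    """Write one page of comma-separated rows to a shared CSV file.

    Page 0 overwrites any old file ("w"); later pages append ("a").
    A trailing newline is added if missing so the next page starts
    on its own line.
    """
    mode = "w" if page_index == 0 else "a"
    if not rows_text.endswith("\n"):
        rows_text += "\n"
    with open(csv_path, mode, encoding="utf8") as f:
        f.write(rows_text)

# Hypothetical data standing in for two scraped pages.
pages = ["a,1\nb,2", "c,3\nd,4"]
csv_path = os.path.join(tempfile.mkdtemp(), "fin_all.csv")
for i, text in enumerate(pages):
    append_page(csv_path, i, text)

with open(csv_path, encoding="utf8") as f:
    combined = f.read()
print(combined)
```

Without the newline check, the last row of one page and the first row of the next would end up on the same line.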

Reflection

I think it was pretty smart to have a function to create a file name! Lol
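To show what such a file-name function produces, here is a small sketch in the spirit of `mkFN`. The name `make_filename` and the string keys "raw", "trimmed", and "final" are assumptions for this example; the script in this post uses numeric indexes instead.

```python
def make_filename(cnt, kind):
    """Build a path like data/sample_03.txt for a page number and a file kind."""
    # (prefix, extension) for each processing stage.
    types = {
        "raw": ("sample_", ".txt"),
        "trimmed": ("trimmed_", ".txt"),
        "final": ("fin_", ".csv"),
    }
    prefix, ext = types[kind]
    # zfill pads short numbers to two digits and leaves longer strings such as "all" alone.
    return "data/" + prefix + str(cnt).zfill(2) + ext

print(make_filename(3, "raw"))
print(make_filename(12, "trimmed"))
print(make_filename("all", "final"))
```

Keeping the naming rule in one place means every step reads and writes files with matching names.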

The code that writes a file also shows up many times, so I thought it would have been better to put that process into a function like this one, too, and split the script into functions.

sample.py


 def mkFile():
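One way the `mkFile` idea could be filled in is sketched here. The parameters `filename`, `contents`, `mode`, and `encoding` are assumptions, since the post only gives the function name; the body mirrors the open / write / print pattern that the script repeats.

```python
import os
import tempfile

def mkFile(filename, contents, mode="w", encoding="utf8"):
    """Write contents to filename and report completion."""
    with open(filename, mode, encoding=encoding) as f:
        f.write(contents)
    print(filename + ": done.")

# Demo in a temporary directory: write once, then append.
demo_path = os.path.join(tempfile.mkdtemp(), "sample_00.txt")
mkFile(demo_path, "<html>page 1</html>")
mkFile(demo_path, "\n<html>page 2</html>", mode="a")

with open(demo_path, encoding="utf8") as f:
    saved = f.read()
```

With a helper like this, each of the save steps would shrink to a single call.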

Also, I think this is probably the biggest stumbling block: the page takes some time to load, so be sure to give it a break! That is this ↓↓

sample.py


 sleep(3)

Since you are taking a break with sleep, don't forget to import sleep from time first!
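A fixed `sleep(3)` waits the full three seconds even when the page is ready sooner, and it is not enough when loading takes longer. An alternative is to poll until a condition holds. Selenium ships `WebDriverWait` in `selenium.webdriver.support.ui` for this purpose; the sketch below shows the same idea in plain Python so that it runs without a browser. `wait_until` and `fake_page_ready` are hypothetical names, not part of Selenium or of the script in this post.

```python
import time

def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition was not met in time")
        time.sleep(interval)

# Stand-in for "the table has finished loading": ready on the third check.
checks = {"count": 0}

def fake_page_ready():
    checks["count"] += 1
    return checks["count"] >= 3

ready = wait_until(fake_page_ready, timeout=5.0, interval=0.01)
print(ready, checks["count"])
```

In a real crawl the condition would check something on the page, for example whether the table rows are present.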

Deliverables

sample.py


import re
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from time import sleep

# I thought it would be nice to open the result in Excel, so I tried Shift_JIS once,
# but I got an error (maybe some characters can't be represented?) and gave up.
mojicode = "utf8"

def mkFN(cnt,typeindex):
    # Build a file name such as data/sample_03.txt from a page number and a file type.
    types = [
        ["sample_", ".txt"],
        ["trimmed_", ".txt"],
        ["fin_", ".csv"],
    ]
    # Zero-pad single digits so the files sort in page order.
    cntstr = str(cnt).zfill(2)
    ans = "data/"
    ans += types[typeindex][0] + cntstr + types[typeindex][1]
    return ans

def gettxt2(cnt):
    # Save the page source of each of the cnt pages to data/sample_NN.txt.
    # Note: the data/ directory must already exist.
    url = "https://www.sample.com"
    path = "/Users/sample/Downloads/chromedriver"

    # Selenium 4 takes the chromedriver path through a Service object.
    driver = webdriver.Chrome(service=Service(path))
    driver.get(url)
    # Give the page time to load.
    sleep(3)
    output = driver.page_source
    filename = mkFN(0,0)
    with open(filename,"w",encoding=mojicode) as f:
        f.write(output)
    print(filename + ": done.")

    # It seemed like output had to be initialized again before the loop?
    output = driver.page_source
    sleep(3)

    for i in range(1,cnt):
        # find_element_by_link_text was removed in Selenium 4.3; use By.LINK_TEXT instead.
        element = driver.find_element(By.LINK_TEXT, "Next")
        element.click()
        sleep(3)
        output = driver.page_source

        filename = mkFN(i,0)
        with open(filename,"w",encoding=mojicode) as f:
            f.write(output)
        print(filename + ": done.")

    # Close the browser when all pages are saved.
    driver.quit()


def trimming(cnt):
    filename = mkFN(cnt,0)
    filename2 = mkFN(cnt,1)
    # Read with the same encoding the file was written in.
    with open(filename,encoding=mojicode) as f:
        contents = f.read()
    regexen = [
        r'<tbody><tr class="jsgrid-row">',
        r'</td></tr></tbody></table></div><div class="jsgrid-pager-container"',
    ]
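    #The plural of index is indices
    # indices[0] is where the table body starts, indices[1] is where it ends.
    indices = [0,0]

    for i in range(0,2):
        matchObj = re.search(regexen[i],contents)
        if matchObj is None:
            raise ValueError(filename + ": table marker not found")
        indices[i] = matchObj.start()
    rslt = contents[indices[0]:indices[1]]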

    with open(filename2,"w",encoding=mojicode) as f2:
        f2.write(rslt)

def removeTag(cnt):

    beforeAfter = [
        [r'</tr>', "\n"],
        [r'</td>', ","],
        [r'<.*?>', ""],
    ]

    with open(mkFN(cnt,1),encoding=mojicode) as f:
        contents = f.read()

    for i in range(0,3):
        contents = re.sub(beforeAfter[i][0],beforeAfter[i][1],contents)

    #Add commas and line breaks at the end of the file!
    contents += ",\n"
    
    option = "a"
    if cnt == 0:
        option = "w"
    with open(mkFN("all",2),option,encoding=mojicode) as f:
        f.write(contents)


cnt = 20
gettxt2(cnt)

print("gettxt: done!")
sleep(1)


for i in range(0,cnt):
    trimming(i)
print("trimming: done!")

sleep(1)

for i in range(0,cnt):
    removeTag(i)
print("removeTag: done!")
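To see what the three substitutions in `removeTag` do, here is a minimal sketch that runs them on a tiny HTML fragment. The fragment, its class names, and the variable names are made up for illustration; the real page source will look different.

```python
import re

# Hypothetical fragment shaped like trimmed table rows.
# Like the trimmed files, it stops before the final </td></tr>.
sample = (
    '<tr class="jsgrid-row"><td class="c">apple</td><td class="c">100</td></tr>'
    '<tr class="jsgrid-alt-row"><td class="c">pear</td><td class="c">250'
)

# Same order as removeTag: row ends first, then cell ends, then all remaining tags.
before_after = [
    (r'</tr>', "\n"),
    (r'</td>', ","),
    (r'<.*?>', ""),
]

csv_text = sample
for pattern, replacement in before_after:
    csv_text = re.sub(pattern, replacement, csv_text)

# Close the last row by hand with a comma and a line break.
csv_text += ",\n"
print(csv_text)
```

The order matters: if the catch-all `<.*?>` pattern ran first, the `</tr>` and `</td>` markers would be gone before they could become line breaks and commas. Note that every row ends with a trailing comma, because each cell's closing tag turns into one.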

In closing

After wrestling with this for a few hours, I couldn't stop watching the data pile up in just a few minutes at the end! Lol I hope I get a little more used to it and can put something like this together in about 30 minutes.

References

--About writing files: https://www.javadrive.jp/python/file/index3.html#section3
--Find an element on the page and click it!: https://www.seleniumqref.com/api/python/element_get/Python_find_element_by_link_text.html
