[Part.3] Crawling with Python! JSON rather than CSV!

Introduction

Oops! The bite has been good and it seems two wolves survive, so I'll CO the black cat pillar!

When you save data as CSV, commas inside the values are a headache to handle, so JSON feels more stable than CSV! (Probably ...) And you don't really need spreadsheets or Excel at all, right? ... Or will I take that back someday?
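As a tiny illustration of the comma problem (made-up values, not data from this crawl): a value that contains a comma breaks a naive split-on-comma CSV reader, while JSON keeps the field boundaries explicit.

import json

row = ["Tanaka", "1,200 yen", "Tokyo"]  #made-up values; the price contains a comma

#Naive CSV: join with commas, then split on commas -> field boundaries are lost
line = ",".join(row)
print(line.split(","))  #['Tanaka', '1', '200 yen', 'Tokyo'] -- 4 fields, not 3

#JSON: the comma inside the string survives the round trip
encoded = json.dumps(row, ensure_ascii=False)
print(json.loads(encoded))  #['Tanaka', '1,200 yen', 'Tokyo']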

So, here's the continuation of last time's article!

Crawling x scraping with Python and Selenium!

--Part 3: This page! Since the data can contain ":" or ",", I figured JSON was the way to go, and saved everything as a list of JSON lists! (A small sketch of that structure is just below.) I've also tidied the code a little, so it feels pretty good now.
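For reference, a "list of JSON lists" here means the whole file is one JSON array whose elements are the rows, each itself an array of cell strings. A minimal sketch (the file name matches what the code below produces; the row values are hypothetical):

import json

#Hypothetical contents of the finished file (data/JSON_fin_all2.json):
#[
#    ["A-1", "Tanaka", "1,200"],
#    ["B-2", "Sato:memo", "300"]
#]
with open("data/JSON_fin_all2.json", encoding="utf8") as f:
    rows = json.load(f)

for row in rows:
    print(row[0], row[1])  #commas and colons inside the strings are safe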

Overview

I will explain the functions in order from the top!

--`mkFN`: A function that builds a file name!
--`gettxt2`: Saves each page's source to a text file while navigating the site! Once you kick it off, it walks through all the pages you want in order. This time there is an element labeled "Next", so it repeats "click it → grab the source of the page that appears"!
--`trimming`: Cuts the part I want out of each saved file and writes it to another txt file!
--`removeTagForJSON`: Formats the trimmed text for JSON! It strips the tags and inserts brackets so the last list doesn't end with a comma; the fine adjustments were a pain ~ crying
--`addBlackets`: Adds the outer [] brackets at the very beginning and end! I redid the fine adjustments above quite a few times! Lol

Deliverables

sample.py


import re
from selenium import webdriver
from time import sleep

#I thought it would be better to open the output in Excel, so I tried Shift_JIS at first, but maybe some characters couldn't be encoded? Not sure, but I kept getting an error, so I gave up.
mojicode = "utf8"

def mkFN(cnt,typeindex):
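    #typeindex picks a [prefix, extension] pair; cnt is zero-padded to two digits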
    types = [
        ["sample_", ".txt"],
        ["trimmed_", ".txt"],
        ["fin_", ".csv"],
        ["JSON_fin_",".json"],
    ]
    cntstr = str(cnt)
    if len(cntstr) == 1:
        cntstr = "0" + cntstr
    ans = "data/"
    ans += types[typeindex][0] + cntstr + types[typeindex][1]
    return ans

def gettxt2(cnt):
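    #Opens the top page, saves its source, then clicks "Next" (cnt - 1) times,
    #saving each page's source as data/sample_XX.txt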
    url = "https://www.sample.com"
    path = "/Users/sample/Downloads/chromedriver"
    
    driver = webdriver.Chrome(path)
    driver.get(url)
    sleep(3)
    output = driver.page_source
    filename = mkFN(0,0)
    with open(filename,"w",encoding=mojicode) as f:
        f.write(output)
    print(filename + ": done.")

    #Not sure why, but it seems the page source needs to be read again here?
    output = driver.page_source
    sleep(3)

    for i in range(1,cnt):
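        #Note: find_element_by_link_text is the Selenium 3 API;
        #in Selenium 4 this becomes driver.find_element(By.LINK_TEXT, "Next")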
        element = driver.find_element_by_link_text("Next")
        element.click()
        sleep(3)
        output = driver.page_source

        filename = mkFN(i,0)
        with open(filename,"w",encoding=mojicode) as f:
            f.write(output)
        print(filename + ": done.")

    #Quit the browser once all pages have been saved
    driver.quit()


def trimming(cnt):
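    #Keeps only the HTML between the first table row and the end-of-table marker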
    filename = mkFN(cnt,0)
    filename2 = mkFN(cnt,1)
    with open(filename,encoding=mojicode) as f:
        contents = f.read()
    regexen = [
        r'<tbody><tr class="jsgrid-row">',
        r'</table></div><div class="sample"',
    ]
    #The plural of index is indices
    indices = [0,0]

    for i in range(0,2):
        matchObj = re.search(regexen[i],contents)
        indices[i] = matchObj.start()
    rslt = contents[indices[0]:indices[1]]

    with open(filename2,"w",encoding=mojicode) as f2:
        f2.write(rslt)

def removeTagForJSON(cnt):
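    #[pattern, replacement] pairs: row start -> tab + '["', cell boundary -> '","',
    #row end -> '"],' + newline, then strip any leftover tags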
    beforeAfter = [
        [r'<tr.*?><td.*?>','\t["'],
        [r'</td><td.*?>','","'],
        [r'</td></tr>','"],\n'],
        [r'<.*?>', ""],
    ]

    with open(mkFN(cnt,1),encoding=mojicode) as f:
        contents = f.read()

    for i in range(0,4):
        contents = re.sub(beforeAfter[i][0],beforeAfter[i][1],contents)

    option = "a"
    if cnt == 0:
        option = "w"
    with open(mkFN("all",1),option,encoding=mojicode) as f:
        f.write(contents)

def addBlackets():
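    #Wrap everything in the outer []; contents[0:-2] drops the trailing ",\n"
    #left by the last row so the JSON stays valid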
    with open(mkFN("all",1),encoding=mojicode) as f:
        contents = f.read()
    contents = "[\n" + contents[0:-2] + '\n]' 
    option = "w"
    with open(mkFN("all2",3),option,encoding=mojicode) as f:
        f.write(contents)
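
#Main routine: crawl the pages, then trim each file, convert to JSON rows, and add the brackets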

cnt = 20
gettxt2(cnt)

sleep(2)

for i in range(0,cnt):
    trimming(i)
print("trimming: done!")

for i in range(0,cnt):
    removeTagForJSON(i)
sleep(1)
addBlackets()

In closing

This time it's only about 20 pages, so I made a file for each one, but once I move on to 8,000 pages it'll get painful unless I shape the data properly before writing it out! Lol But as long as I don't go past 100 pages or so, it's probably fine to leave it as it is?
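If I do end up scaling it, one option is to collect every row in one Python list and write a single JSON file with json.dump. This is just a rough sketch reusing the data/sample_XX.txt files from above; extract_rows is a simplified, hypothetical stand-in for the trimming + removeTagForJSON steps, and the regexes may need adjusting for a real page.

import json
import re

def extract_rows(source):
    #Simplified stand-in for trimming + removeTagForJSON: pull the cell text
    #straight into Python lists instead of patching text files with regexes
    rows = []
    for row_html in re.findall(r'<tr.*?>(.*?)</tr>', source, re.DOTALL):
        cells = re.findall(r'<td.*?>(.*?)</td>', row_html, re.DOTALL)
        rows.append([re.sub(r'<.*?>', '', c) for c in cells])
    return rows

all_rows = []
for i in range(20):
    with open("data/sample_" + str(i).zfill(2) + ".txt", encoding="utf8") as f:
        all_rows.extend(extract_rows(f.read()))

#One json.dump instead of regex-built text fragments and a bracket pass
with open("data/JSON_fin_all.json", "w", encoding="utf8") as f:
    json.dump(all_rows, f, ensure_ascii=False)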
