[Part1] Scraping with Python → Organize to csv!

Introduction

Hi! This is Jesse. I've only just made it out of the mansion alive, and I'm still a total beginner. I'm writing this up so I can refer back to it later, and in case there are other people at about the same level! Oh, and I'm also posting it in the hope that someone more experienced happens to open this and give me some advice! Lol

Oh, and I'm using Python 3!

Crawling x scraping with Python and Selenium!

- Part 1 (this page): practice fetching the source of a single page and cleaning it up!

Contents

I'll write down what I did, in order.

Get the source from the URL

I wondered whether to do this with Requests, but is that only for sites with a public API? I wasn't sure, so I imported Selenium's webdriver and used that.
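(For what it's worth, Requests isn't limited to public APIs. For a plain static page that doesn't need JavaScript to render, a fetch like the sketch below would also work; the URL here is just the same placeholder used in the code further down.)

import requests

# Fetch the raw HTML of a static page (no API needed).
html = requests.get("https://www.sample.com").text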

If you don't have Selenium installed yet:
Mac: try running "python3 -m pip install selenium" in your terminal! (I don't know why, but that was all it took for me!)
Windows: this page should help! → https://www.seleniumqref.com/introduction/python/Python_Sele_Ins.html

sample.py


import re
from selenium import webdriver
from time import sleep

def gettxt():
    #Specify the URL of the web page you want to extract the source from, the path of the chromedriver, and the file name to save the contents!
    url = "https://www.sample.com"
    path = "/Users/sample/Downloads/chromedriver"
    filename = "data/sample.txt"

    driver = webdriver.Chrome(path)
    driver.get(url)
    sleep(5)  # give the page a few seconds to finish rendering
    output = driver.page_source

    with open(filename, "w", encoding="utf8") as f:
        f.write(output)
    sleep(3)
    driver.quit()
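By the way, if webdriver.Chrome(path) complains about the path argument on your machine, you may be on a newer Selenium (4.x), where the chromedriver path is passed through a Service object instead. A minimal sketch, assuming the same placeholder path as above:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Selenium 4 style: wrap the driver path in a Service object.
driver = webdriver.Chrome(service=Service("/Users/sample/Downloads/chromedriver"))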

Cut out the unneeded parts in one go!

sample.py


def trimming():
    #Specify the name of the file to read and the file to write!
    filename = "data/sample.txt"
    filename2 = "data/sample2.txt"
    with open(filename) as f:
        cntnt = f.read()

    #Read the source of the place you want to trim and enter the start and end strings!
    regexen = [
        r'<tbody><tr class="The beginning class of the information you want">',
        r'</td></tr></tbody></table></div><div class="Next class with the information you want"',
    ]
    #The plural of index is indices
    indices = [0,0]

    for i in range(0, 2):
        matchObj = re.search(regexen[i], cntnt)
        indices[i] = matchObj.start()
    rslt = cntnt[indices[0]:indices[1]]

    with open(filename2,"w",encoding="utf8") as f2:
        f2.write(rslt)
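Just in case (this check is my own addition, not in the original code): re.search returns None when a pattern isn't found, and calling .start() on None raises an AttributeError. A small guard inside the loop makes that failure easier to read:

    for i in range(0, 2):
        matchObj = re.search(regexen[i], cntnt)
        if matchObj is None:
            # The placeholder class names above have to be replaced with real ones,
            # otherwise nothing will match.
            raise ValueError("pattern not found: " + regexen[i])
        indices[i] = matchObj.start()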

Troubleshooting

I got this error: TypeError: expected string or bytes-like object. The painful part was that I had simply forgotten the trailing () shown below! Lol. Without the parentheses you get the method object itself rather than the value it returns; parentheses are required even for methods that take no arguments, and I'm not used to that yet.

sample.py


    matchObj.start()

Remove the tags

First, convert the </tr> tags into line breaks. Next, convert the </td> tags into commas. Finally, erase all the remaining tags!
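To make the order concrete, here is a tiny made-up row (not from any real page) pushed through the same three substitutions:

import re

row = '<tr><td>A</td><td>B</td></tr>'  # a made-up example row
row = re.sub(r'</tr>', '\n', row)      # '<tr><td>A</td><td>B</td>\n'
row = re.sub(r'</td>', ',', row)       # '<tr><td>A,<td>B,\n'
row = re.sub(r'<.*?>', '', row)        # 'A,B,\n', i.e. one clean CSV row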

I first wrote it with a numeric loop over two parallel lists, but to keep each pattern next to its replacement, a dictionary (map) would probably have been better! Lol (There's a sketch of that version after the code below.)

I wrote the commented part again below!

sample.py


def removeTag():
    filename2 = "data/sample2.txt"
    filename3 = "data/sample3.csv"

    # . matches any single character and * repeats it zero or more times,
    # so this matches anything between a < and a >!
    regex0 = r'<.*>'
    # But the pattern above is greedy: it grabs everything from the very first < to the very last >.
    # Adding ? after the asterisk makes it non-greedy, so it stops at the first > it can.
    regex = r'<.*?>'

    regexen = [
        r'</tr>',
        r'</td>',
        r'<.*?>',
    ]
    after = [
        "\n",
        ",",
        "",
    ]

    with open(filename2) as f:
        contents = f.read()

    for i in range(0, 3):
        contents = re.sub(regexen[i], after[i], contents)

    contents += ","

    with open(filename3,"w") as f:
        f.write(contents)
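As mentioned above, the loop over two parallel lists works, but a dict keeps each pattern right next to its replacement. Here is a sketch of a hypothetical removeTagDict() using the same patterns (dicts keep insertion order in Python 3.7+, so the </tr> and </td> replacements still run before the catch-all tag removal):

def removeTagDict():
    filename2 = "data/sample2.txt"
    filename3 = "data/sample3.csv"

    # pattern -> replacement, so the correspondence is visible at a glance
    replacements = {
        r'</tr>': "\n",
        r'</td>': ",",
        r'<.*?>': "",
    }

    with open(filename2) as f:
        contents = f.read()

    for pattern, repl in replacements.items():
        contents = re.sub(pattern, repl, contents)

    contents += ","

    with open(filename3, "w") as f:
        f.write(contents)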

Regular expression shortest match

I had no idea about this, but when a regular expression searches, it doesn't just stop at the first possible match from the front! By default it grabs the longest match it can (greedy matching). The "default" behavior is tricky ...

sample.py


    # . matches any single character and * repeats it zero or more times,
    # so this matches anything between a < and a >!
    regex0 = r'<.*>'
    # But the pattern above is greedy: it grabs everything from the very first < to the very last >.
    # Adding ? after the asterisk makes it non-greedy, so it stops at the first > it can.
    regex = r'<.*?>'
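A quick way to see the difference on a throwaway string (not from the scraped page):

import re

s = '<td>A</td><td>B</td>'
print(re.search(r'<.*>', s).group())   # greedy: matches the whole string '<td>A</td><td>B</td>'
print(re.search(r'<.*?>', s).group())  # non-greedy: matches just '<td>'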

Reference sites

- Where I checked the plural form of "regex": https://ejje.weblio.jp/content/regexen
- Basic usage of Python regular expressions: https://uxmilk.jp/41416
- Regular expressions: matching with the shortest match: http://www-creators.com/archives/1804
- Selenium API (reverse lookup): https://www.seleniumqref.com/api/webdriver_gyaku.html

In closing

I'll update this if I manage to do more in the future!
