Python beginners get stuck with their first web scraping

Introduction

When I tried web scraping before I even knew Python's syntax, I got stuck in various ways, so I am summarizing them here as a memo. The implementation is a program that fetches drink data from a certain website and writes it out to a CSV file.

Environment

Fetch data from multiple pages

I quickly found a way to get data from one page, but how do I get it from multiple pages?

import requests
from bs4 import BeautifulSoup

import re

#Array to put URLs of multiple pages
urls_ary = []

#Search all a tags from the top page, get their href attributes, and add them to the array
url = 'http://hoge/top'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
for a in soup.find_all('a', href=re.compile('^huga')):
  urls_ary.append(a.get('href'))

#Array to put drink data
drinks_ary = []

#Loop over all the pages
for page in urls_ary:
  url = 'http://hoge/top/'
  r = requests.get(url + str(page))
  soup = BeautifulSoup(r.text, 'lxml')
  #The drink name is inside a span tag, so get that span tag
  tag = soup.find('span')
  drinks_ary.append(tag)

- By writing find_all('a', href=re.compile('^huga')), only the a tags whose href starts with huga (that is, <a href="huga...">...</a>) are collected.
- If you want every a tag, use find_all('a') without the href option; in that case import re is not needed.
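
As an aside, the same filtering can also be done with a CSS attribute selector instead of a regular expression. A minimal sketch, assuming the same http://hoge/top page as above:

import requests
from bs4 import BeautifulSoup

url = 'http://hoge/top'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')

# [href^="huga"] matches a tags whose href starts with "huga"; no import re needed
urls_ary = [a.get('href') for a in soup.select('a[href^="huga"]')]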

I don't want to stop the program even in the event of an unexpected error

As mentioned above, it was sad when the process stopped with an unexpected error partway through the loop and the program had to be run again from the beginning. Even if an error occurs, I want to ignore it for the time being and let the processing run to the end. Exception handling can be done with try and except.

for page in urls_ary:
  url = 'http://hoge/top/'
  r = requests.get(url + str(page))
  soup = BeautifulSoup(r.text, 'lxml')
  #Put the code that can raise an error inside try, and catch it in except
  try:
    tag = soup.find('span')
    drinks_ary.append(tag)
  #If an error occurs, skip this page and move on to the next iteration
  except:
    continue
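
A bare except also swallows bugs in your own code, so as a variation (just a sketch; failed_pages is a hypothetical addition, and urls_ary / drinks_ary are the lists from the snippet above), you can catch only the request-related errors and remember which pages failed so they can be retried later:

import requests
from bs4 import BeautifulSoup

failed_pages = []  # hypothetical list of pages that could not be fetched

for page in urls_ary:
  url = 'http://hoge/top/'
  try:
    r = requests.get(url + str(page), timeout=10)
    r.raise_for_status()  # turn 4xx/5xx responses into exceptions
    soup = BeautifulSoup(r.text, 'lxml')
    drinks_ary.append(soup.find('span'))
  # Only network/HTTP errors are skipped; other bugs still stop the program
  except requests.exceptions.RequestException as e:
    print('skipped', page, e)
    failed_pages.append(page)
    continue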

Unable to get text with .string

If you use .text instead, you can get the text for the time being. For details on the difference between .string and .text, the article "Clear differences in string and text behavior in BeautifulSoup – Python" (/languages/python/bs4-text-or-string/) was, personally, the easiest to understand.
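
As a small self-contained illustration of the difference (the HTML here is made up, not taken from the actual site): .string returns None as soon as a tag has more than one child, while .text joins all of the text inside the tag.

from bs4 import BeautifulSoup

html = '<div><span>Beer</span> 500 yen</div>'
soup = BeautifulSoup(html, 'lxml')

div = soup.find('div')
print(div.string)               # None, because <div> has more than one child
print(div.text)                 # 'Beer 500 yen', all the text concatenated
print(div.find('span').string)  # 'Beer', a tag with a single text child works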

I want to specify the nth tag

# <html>
# <body>
#   <ul>
#     <li>Not specified</li>
#     <li>Not specified</li>
#     <li>It is specified</li>
#     <li>Not specified</li>
#   </ul>
# </body>
# </html>

import requests
from bs4 import BeautifulSoup

url = 'http://hoge/top'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')

#Get the third <li> tag (select() returns a list)
li_3rd = soup.select('body > ul > li:nth-of-type(3)')

By the way, it can also be specified with nth-child(), but its behavior is a little more confusing than nth-of-type().

# <html>
# <body>
#   <div>
#     <h1>Not specified</h1>
#     <p>Not specified</p>
#     <h2>Not specified</h2>
#     <p>specify</p>
#   </div>
# </body>
# </html>

#I want to get the second <p> tag
# nth-of-type
soup.select('body > div > p:nth-of-type(2)')
# nth-child
soup.select('body > div > p:nth-child(4)')

nth-of-type is the easier one to reason about here: since I want the second <p> tag inside the <div> tag, I can simply write p:nth-of-type(2). On the other hand, that same <p> tag is also the fourth child of the <div>, so it can be specified with nth-child(4) as well.

CSS selector doesn't work as expected

If you inspect the site's HTML with Chrome's developer tools and write a CSS selector based on what you see there, you will occasionally get a different element from the one you intended. This is often caused by two things: on the website's side, an opening tag that has no matching closing tag, and on Chrome's side, a <tbody> tag that the browser inserts on its own inside <table> tags. Neither can be spotted from the developer tools, so you need to view the page source and check whether a closing tag is missing.
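
One way to see what BeautifulSoup actually parsed, rather than what Chrome shows, is to dump the parsed tree and compare (a minimal sketch, again assuming the http://hoge/top URL). If devtools shows a <tbody> that does not appear here, deleting the tbody part from the copied selector is usually enough. Different parsers ('html.parser', 'lxml', 'html5lib') also repair broken HTML in slightly different ways, so switching the parser is another thing worth trying.

import requests
from bs4 import BeautifulSoup

url = 'http://hoge/top'
r = requests.get(url)

# Print the HTML as the parser repaired it and compare it with Chrome's DOM
soup = BeautifulSoup(r.text, 'lxml')
print(soup.prettify())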

By the way, if a closing tag is missing, you can deal with it as follows.

#HTML where the closing </div> tag is missing
# <html>
# <body>
#   <div>
#     <h1>Not specified</h1>
#     <p>Not specified</p>
#     <h2>Not specified</h2>
#     <p>specify</p>
# </body>
# </html>

#Remove the <div> tags; unwrap() removes the tag itself but keeps its contents
for tag_div in soup.find_all('div'):
  tag_div.unwrap()

#The <p> tags are now direct children of <body>, so the selector works as expected
tag_p = soup.select('body > p:nth-of-type(2)')

Barriers with Python 3, Windows, and character encoding

When I tried to save the retrieved data to a file in CSV format, I was troubled by UnicodeEncodeError. The article "Causes and workarounds for UnicodeEncodeError on (Windows) Python 3" was especially helpful. There were many other articles I found by googling and referred to, but there are too many to list here.

By the way, I was able to save it successfully with the following code.

import csv
import codecs

#A list of rows to save (each row is itself a list, as csv.writer expects)
drinks_data = [['hoge'], ['hogehoge'], ['hugahuga']]

#Save as CSV ('ignore' drops characters that cannot be encoded in cp932)
f = codecs.open('data/sample.csv', 'wb', 'cp932', 'ignore')
writer = csv.writer(f)
writer.writerows(drinks_data)
f.close()
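
As an alternative that should also work (a sketch, not the code I actually ran), Python 3's built-in open() accepts the encoding and the error handling directly, and newline='' is the setting the csv module recommends for file objects:

import csv

drinks_data = [['hoge'], ['hogehoge'], ['hugahuga']]

# errors='ignore' silently drops characters that cp932 cannot represent
with open('data/sample.csv', 'w', encoding='cp932', errors='ignore', newline='') as f:
  writer = csv.writer(f)
  writer.writerows(drinks_data)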

In conclusion

The above is a summary of what this Python beginner got stuck on while scraping for the first time. I will not forget the sadness of a program I believed was finished stopping with an error an hour in and having to start all over again... Exception handling... always be aware of it...

Reference site
