When I tried web scraping before I even knew Python's grammar, I got stuck in various ways, so I'm summarizing them here as a memorandum. The implementation is a program that collects drink data from a certain website and writes it out to a CSV file.
I quickly found out how to get data from a single page, but how do I get it from multiple pages?
import requests
from bs4 import BeautifulSoup
import re

# List to hold the URLs of the pages to scrape
urls_ary = []

# Find all <a> tags on the top page, get their href attributes, and add them to the list
url = 'http://hoge/top'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
for a in soup.find_all('a', href=re.compile('^huga')):
    urls_ary.append(a.get('href'))
# List to hold the drink data
drinks_ary = []

# Loop over the collected URLs to visit every page
for page in urls_ary:
    url = 'http://hoge/top/'
    r = requests.get(url + str(page))
    soup = BeautifulSoup(r.text, 'lxml')
    # The drink name is inside a <span> tag, so get that tag
    tag = soup.find('span')
    drinks_ary.append(tag)
- By passing `find_all('a', href=re.compile('^huga'))`, I get only the `<a>` tags whose link starts with `huga` (that is, `<a href="huga...">...</a>`).
- If you want every `<a>` tag, use `find_all('a')` without this option. In that case `import re` is not needed.
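As an aside, if the collected `href` values are relative paths, joining them onto the base URL by plain string concatenation can be fragile. A minimal sketch using the standard library's `urllib.parse.urljoin` (the `hoge`/`huga` URLs are placeholders, as above):

from urllib.parse import urljoin

base_url = 'http://hoge/top/'

# urljoin resolves relative hrefs against the base URL,
# so 'huga/1' becomes 'http://hoge/top/huga/1'
full_urls = [urljoin(base_url, href) for href in urls_ary]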
As mentioned above, it was sad that the process stopped because of an unexpected error partway through the loop and I had to run the program again from the start. Even if an error occurs, I want to ignore it for the time being and let the process run to the end. Exception handling can be done with `try` and `except`.
for page in urls_ary:
    url = 'http://hoge/top/'
    r = requests.get(url + str(page))
    soup = BeautifulSoup(r.text, 'lxml')
    # Wrap the code that can raise an error in try/except
    try:
        tag = soup.find('span')
        drinks_ary.append(tag)
    # If an error occurs, skip this page and move on to the next loop
    except:
        continue
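A bare `except:` also swallows things like `KeyboardInterrupt`, so catching a concrete exception class is a little safer. Here is a sketch of the same loop, assuming the network request is the likely failure point (the URLs are placeholders as before):

import requests
from bs4 import BeautifulSoup

for page in urls_ary:
    url = 'http://hoge/top/'
    try:
        r = requests.get(url + str(page), timeout=10)
        r.raise_for_status()  # raise an HTTPError for 4xx/5xx responses
    except requests.exceptions.RequestException as e:
        # Log the failed page and keep going instead of dying mid-loop
        print(f'skipped {page}: {e}')
        continue
    soup = BeautifulSoup(r.text, 'lxml')
    drinks_ary.append(soup.find('span'))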
If you can't get the text with `.string`, using `.text` will get it for the time being. For the details of the differences between `.string` and `.text`, this article (Clear differences in string and text behavior in BeautifulSoup – Python, /languages/python/bs4-text-or-string/) was personally easy to understand.
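For reference, `.string` returns `None` as soon as a tag contains more than one child, while `.text` concatenates all the text in the subtree. A small self-contained check:

from bs4 import BeautifulSoup

html = '<span>Cola<b> 500ml</b></span>'
soup = BeautifulSoup(html, 'lxml')
tag = soup.find('span')

print(tag.string)  # None, because <span> has two children (text + <b>)
print(tag.text)    # 'Cola 500ml', all descendant text joined together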
# The target HTML
# <html>
# <body>
# <ul>
# <li>not this one</li>
# <li>not this one</li>
# <li>get this one</li>
# <li>not this one</li>
# </ul>
# </body>
# </html>
import requests
from bs4 import BeautifulSoup

url = 'http://hoge/top'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')

# Get the third <li> tag
li_3rd = soup.select('body > ul > li:nth-of-type(3)')
By the way, the element can also be specified with `nth-child()`, but its behavior is a little more complicated than `nth-of-type`.
# The target HTML
# <html>
# <body>
# <div>
# <h1>not this one</h1>
# <p>not this one</p>
# <h2>not this one</h2>
# <p>get this one</p>
# </div>
# </body>
# </html>
# I want to get the second <p> tag

# nth-of-type
soup.select('body > div > p:nth-of-type(2)')

# nth-child
soup.select('body > div > :nth-child(4)')
Since what I want is the second `<p>` tag inside the `<div>` tag, it's easiest to use `nth-of-type` and write `p:nth-of-type(2)`. On the other hand, the second `<p>` tag can also be seen as the fourth of all the tags inside the `<div>`, so it can be specified with `:nth-child(4)` as well.
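To double-check, here is a self-contained run of both selectors against the HTML above, confirming they land on the same tag:

from bs4 import BeautifulSoup

html = '''
<html><body><div>
<h1>not this one</h1>
<p>not this one</p>
<h2>not this one</h2>
<p>get this one</p>
</div></body></html>
'''
soup = BeautifulSoup(html, 'lxml')

by_type = soup.select('body > div > p:nth-of-type(2)')
by_child = soup.select('body > div > :nth-child(4)')
print(by_type[0].text)      # 'get this one'
print(by_type == by_child)  # True, both selectors match the same <p>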
If you check the site's HTML with Chrome's developer tools and write a CSS selector based on what you see there, in rare cases you will get an element different from the one you specified. This is often caused by the website's HTML having opening tags without their closing tags, or by Chrome inserting a `<tbody>` tag inside `<table>` tags on its own. Since neither can be spotted from the developer tools, you need to view the page source and check whether closing tags are missing.
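To avoid being misled by the `<tbody>` that Chrome adds, one option is to check what the parser actually produced before hard-coding it into the selector. A sketch (the table HTML is a made-up example):

from bs4 import BeautifulSoup

# Raw HTML as actually served: no <tbody>, even though Chrome shows one
html = '<table><tr><td>hoge</td></tr><tr><td>huga</td></tr></table>'
soup = BeautifulSoup(html, 'lxml')

# Build the selector depending on whether the parse tree contains <tbody>
if soup.find('tbody'):
    rows = soup.select('table > tbody > tr')
else:
    rows = soup.select('table > tr')
print(len(rows))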
By the way, if a closing tag is missing, you can deal with it as follows.
# HTML whose <div> closing tag is missing
# <html>
# <body>
# <div>
# <h1>not this one</h1>
# <p>not this one</p>
# <h2>not this one</h2>
# <p>get this one</p>
# </body>
# </html>

# Remove the <div> tags so the selector doesn't depend on where the parser auto-closed them
for tag_div in soup.find_all('div'):
    tag_div.unwrap()

tag_p = soup.select('body > p:nth-of-type(2)')
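`unwrap()` removes the tag itself but keeps its children in place, which is why the `<p>` tags become direct children of `<body>` afterwards. A small before/after check:

from bs4 import BeautifulSoup

# lxml auto-closes the unclosed <div>, swallowing the following tags into it
html = '<html><body><div><h1>a</h1><p>b</p><h2>c</h2><p>get this one</p></body></html>'
soup = BeautifulSoup(html, 'lxml')
print(soup.select('body > p'))  # [] -- the <p> tags sit inside the auto-closed <div>

for tag_div in soup.find_all('div'):
    tag_div.unwrap()
print(soup.select('body > p:nth-of-type(2)')[0].text)  # 'get this one'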
When I tried to save the retrieved data to a file in CSV format, I was troubled by `UnicodeEncodeError`. This article (Causes and workarounds for UnicodeEncodeError on (Windows) Python 3) was especially helpful. There were many other articles I found by googling and referred to, but there are too many to list here.
By the way, I was able to save it successfully with the following code.
import csv
import codecs

# Rows of data to save (one inner list per CSV row; a flat list of strings
# would make writerows() split each string into one character per column)
drinks_data = [['hoge'], ['hogehoge'], ['hugahuga']]

# Save as CSV, ignoring characters that cannot be encoded in cp932
f = codecs.open('data/sample.csv', 'wb', 'cp932', 'ignore')
writer = csv.writer(f)
writer.writerows(drinks_data)
f.close()
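If you don't need `codecs`, the built-in `open()` can do the same thing in Python 3; `errors='ignore'` plays the same role as the `'ignore'` argument above, and `newline=''` is what the `csv` module's documentation recommends. A sketch:

import csv

drinks_data = [['hoge'], ['hogehoge'], ['hugahuga']]

# errors='ignore' silently drops characters that cp932 cannot represent
with open('data/sample.csv', 'w', encoding='cp932', errors='ignore', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(drinks_data)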
The above is a summary of what this Python beginner got stuck on while scraping for the first time. I won't forget the sadness of watching a program I had launched, believing it was finished, stop with an error an hour later and having to start all over... Exception handling... always be aware of it...