Web scraping with Python for beginners (1)

I started studying Python + GCP a month ago, knowing nothing at all about either the cloud or Python. I have become interested in web scraping with Python, so while learning how to use requests, the various attributes of the response object, and HTML parsing with Beautiful Soup, I will first try scraping Yahoo News.

Roadmap for learning web scraping in Python

(1) Succeed in scraping the desired content locally. ← Now here
(2) Send the locally scraped results to a Google Spreadsheet.
(3) Run the script automatically with cron locally.
(4) Try free automatic execution on a cloud server (Google Compute Engine).
(5) Try free serverless automatic execution in the cloud. (Probably Cloud Functions + Cloud Scheduler)

What sample program (1) does

・Fetch website information with requests
・Parse the HTML with Beautiful Soup
・Use the re library, which can search strings, to look for a specific string (to identify the headline news)
・Print the titles and links of all news items in the result list to the console

What is requests?

requests is a third-party library for HTTP communication in Python. It lets you collect information from websites simply. You can also fetch URLs with urllib from the Python standard library, but requests needs less code and reads more simply. Since it is a third-party library, however, it has to be installed.
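To illustrate the difference, here is a minimal, offline comparison. The server address and page content are made up for the demo: a tiny page is served from a local http.server thread so nothing depends on the real internet.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

import requests

# Serve a tiny made-up page locally so the comparison works offline.
class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b'<html><title>hello</title></html>'
        self.send_response(200)
        self.send_header('Content-Type', 'text/html; charset=UTF-8')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(('127.0.0.1', 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f'http://127.0.0.1:{server.server_address[1]}/'

# urllib (standard library): you read bytes and decode them yourself
with urlopen(url) as resp:
    html_urllib = resp.read().decode('utf-8')

# requests (third party): .text decodes for you
html_requests = requests.get(url).text

print(html_urllib == html_requests)  # True

server.shutdown()
```

Both fetch the same page, but the requests version is a single expression while urllib makes you manage the response object and decoding by hand.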

Install requests

It can be installed with pip. Here is the clean state of a virtual environment just created with virtualenv.

bash


% virtualenv -p python3.7 env3
% source env3/bin/activate
(env3) % pip list
Package    Version
---------- -------
pip        20.2.3
setuptools 49.2.1
wheel      0.34.2

Install it with pip, then check pip list to confirm it is there (and its version). Several dependencies are pulled in along with it.

bash


(env3) % pip install requests
Collecting requests
  Using cached requests-2.24.0-py2.py3-none-any.whl (61 kB)
Collecting idna<3,>=2.5
  Using cached idna-2.10-py2.py3-none-any.whl (58 kB)
Collecting chardet<4,>=3.0.2
  Using cached chardet-3.0.4-py2.py3-none-any.whl (133 kB)
Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1
  Using cached urllib3-1.25.10-py2.py3-none-any.whl (127 kB)
Collecting certifi>=2017.4.17
  Using cached certifi-2020.6.20-py2.py3-none-any.whl (156 kB)
Installing collected packages: idna, chardet, urllib3, certifi, requests
Successfully installed certifi-2020.6.20 chardet-3.0.4 idna-2.10 requests-2.24.0 urllib3-1.25.10
(env3) % pip list
Package    Version
---------- ---------
certifi    2020.6.20
chardet    3.0.4
idna       2.10
pip        20.2.3
requests   2.24.0
setuptools 49.2.1
urllib3    1.25.10
wheel      0.34.2

requests methods

requests supports the common HTTP request methods: get, post, put, delete, and so on. This time we will use get.
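For example (no network I/O happens here; preparing a request only builds it, so we can inspect what would be sent):

```python
import requests

# requests has one function per HTTP method: requests.get(),
# requests.post(), requests.put(), requests.delete(), and so on.
# A prepared request lets us inspect the method and URL that
# would go out, without actually sending anything.
req = requests.Request('GET', 'https://news.yahoo.co.jp/').prepare()
print(req.method)  # GET
print(req.url)     # https://news.yahoo.co.jp/
```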

Attributes of the requests response object

The response object returned by requests.get has various attributes. The sample program prints the following attributes for confirmation.

| Attribute | What you can check |
| --- | --- |
| url | The URL that was actually accessed. |
| status_code | The HTTP status code. |
| headers | The response headers. |
| encoding | The encoding that requests guessed. |

There are also the text attribute (the response body decoded to a str) and the content attribute (the raw bytes).

The headers attribute is a dict (dictionary). For Yahoo News it contains many keys, as shown below, so the sample program extracts only the 'Content-Type' key from headers and prints it.

bash


{'Cache-Control': 'private, no-cache, no-store, must-revalidate', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html;charset=UTF-8', 'Date': 'Wed, 09 Sep 2020 02:24:04 GMT', 'Set-Cookie': 'B=6rffcc5flgf64&b=3&s=sv; expires=Sat, 10-Sep-2022 02:24:04 GMT; path=/; domain=.yahoo.co.jp, XB=6rffcc5flgf64&b=3&s=sv; expires=Sat, 10-Sep-2022 02:24:04 GMT; path=/; domain=.yahoo.co.jp; secure; samesite=none', 'Vary': 'Accept-Encoding', 'X-Content-Type-Options': 'nosniff', 'X-Download-Options': 'noopen', 'X-Frame-Options': 'DENY', 'X-Vcap-Request-Id': 'd130bb1e-4e53-4738-4b02-8419633dd825', 'X-Xss-Protection': '1; mode=block', 'Age': '0', 'Server': 'ATS', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Via': 'http/1.1 edge2821.img.kth.yahoo.co.jp (ApacheTrafficServer [c sSf ])'}

Source of requests.get part

Here is an excerpt showing the requests.get call and the code that prints each attribute of the returned response object.

python


url = 'https://news.yahoo.co.jp/'
response = requests.get(url)
#print(response.text)
print('url: ',response.url)
print('status-code:',response.status_code) # HTTP status code, normally 200 (OK)
print('headers[Content-Type]:',response.headers['Content-Type']) # headers is a dictionary, so we can print just the Content-Type key
print('encoding: ',response.encoding) #encoding

Here are the results.

bash


(env3) % python requests-test.py
url:  https://news.yahoo.co.jp/
status-code: 200
headers[Content-Type]: text/html;charset=UTF-8
encoding:  UTF-8

What is Beautiful Soup?

Beautiful Soup is a Python library for web scraping. It retrieves and parses data from HTML and XML, and makes it easy to extract specific HTML tags.
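A minimal sketch of what that looks like, using a made-up HTML string rather than a real page:

```python
from bs4 import BeautifulSoup

# Parse a literal HTML string (made up for this demo) and pull
# out one tag's text and one of its attributes.
html = '<html><body><h1 id="title">Hello soup</h1></body></html>'
soup = BeautifulSoup(html, 'html.parser')

h1 = soup.find('h1')
print(h1.get_text())   # Hello soup
print(h1.attrs['id'])  # title
```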

Installing beautifulsoup4

Same as requests: it can be installed with pip.

bash


(env3) % pip install beautifulsoup4
Collecting beautifulsoup4
  Using cached beautifulsoup4-4.9.1-py3-none-any.whl (115 kB)
Collecting soupsieve>1.2
  Using cached soupsieve-2.0.1-py3-none-any.whl (32 kB)
Installing collected packages: soupsieve, beautifulsoup4
Successfully installed beautifulsoup4-4.9.1 soupsieve-2.0.1
(env3) % pip list                  
Package        Version
-------------- ---------
beautifulsoup4 4.9.1
certifi        2020.6.20
chardet        3.0.4
idna           2.10
pip            20.2.3
requests       2.24.0
setuptools     49.2.1
soupsieve      2.0.1
urllib3        1.25.10
wheel          0.34.2

Beautiful Soup arguments

Beautiful Soup takes the object to parse (HTML or XML) as its first argument (in the sample, the text of the response object obtained with requests), and the parser to use as its second argument.

| Parser | Example of use | Strengths | Weaknesses |
| --- | --- | --- | --- |
| Python's html.parser | BeautifulSoup(response.text, "html.parser") | Standard library | Not supported on the Python 2 series / below 3.2.2 |
| lxml's HTML parser | BeautifulSoup(response.text, "lxml") | Blazing fast | Requires installation |
| lxml's XML parser | BeautifulSoup(response.text, "xml") | Blazing fast; the only XML parser here | Requires installation |
| html5lib | BeautifulSoup(response.text, "html5lib") | Handles HTML5 correctly | Requires installation; very slow |

python


soup = BeautifulSoup(response.text, "html.parser")

Beautiful Soup has various methods, but this time we will use the find_all method. find_all in turn accepts several kinds of arguments; this time we will use keyword arguments.

find_all: keyword argument

You can pass a tag attribute as a keyword argument and get back all the tags whose attribute matches.

The keyword argument's value can be a string, a regular expression, a list, a function, or True, and you can pass multiple keyword arguments at once.

For example, if you pass a value for href as a keyword argument, Beautiful Soup filters on each tag's href attribute.

Quote: https://ai-inter1.com/beautifulsoup_1/#find_all_detail

In other words, by calling find_all on the soup object with a condition meaning "the value of the href attribute matches the given regular expression", the example below extracts all and only the elements whose href attribute contains "news.yahoo.co.jp/pickup".

elems = soup.find_all(href = re.compile("news.yahoo.co.jp/pickup"))
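To see the different kinds of values a keyword argument accepts, here is an offline sketch on a made-up snippet that mimics the article's links:

```python
import re
from bs4 import BeautifulSoup

# A made-up snippet that mimics the article's links.
html = '''
<a href="https://news.yahoo.co.jp/pickup/1">Pickup 1</a>
<a href="https://news.yahoo.co.jp/pickup/2">Pickup 2</a>
<a href="https://news.yahoo.co.jp/other/3">Other</a>
<a>No href</a>
'''
soup = BeautifulSoup(html, 'html.parser')

# Regular expression: href contains "news.yahoo.co.jp/pickup"
print(len(soup.find_all(href=re.compile("news.yahoo.co.jp/pickup"))))  # 2

# True: any tag that has an href attribute at all
print(len(soup.find_all(href=True)))  # 3

# Plain string: href must match exactly
print(len(soup.find_all(href="https://news.yahoo.co.jp/other/3")))  # 1
```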

Final sample source

Finally, loop over the results with a for statement and print the title and link of each extracted news item to the console. Here is the final sample source.

requests-test.py


import requests
from bs4 import BeautifulSoup
import re

# Fetch the website with requests
url = 'https://news.yahoo.co.jp/'
response = requests.get(url)
#print(response.text)
print('url: ',response.url)
print('status-code:',response.status_code) # HTTP status code, normally 200 (OK)
print('headers[Content-Type]:',response.headers['Content-Type']) # headers is a dictionary, so we can print just the Content-Type key
print('encoding: ',response.encoding) #encoding

# Give BeautifulSoup() the fetched page text and the "html.parser" parser
soup = BeautifulSoup(response.text, "html.parser")

# Extract only elements whose href attribute contains "news.yahoo.co.jp/pickup"
elems = soup.find_all(href = re.compile("news.yahoo.co.jp/pickup"))

# Print the title and link of each extracted news item to the console
for elem in elems:
    print(elem.contents[0])
    print(elem.attrs['href'])

Most of the program closely follows the code posted on the reference sites below; they were a great help.

Afterword

Leaving aside the imports and the prints of the response attributes (which are only for confirmation), you can do web scraping in just 7 lines. Python, and the libraries our predecessors built, are terrifyingly powerful.

Here are the results. I managed to scrape the page! The last item, a news entry with a photo, is superfluous, but I don't yet know how to deal with it, so I'll leave it as it is.

bash


% python requests-test.py
url:  https://news.yahoo.co.jp/
status-code: 200
headers[Content-Type]: text/html;charset=UTF-8
encoding:  UTF-8
Docomo account cooperation silver majority suspension
https://news.yahoo.co.jp/pickup/6370639
Mr. Suga Corrected remarks about the Self-Defense Forces
https://news.yahoo.co.jp/pickup/6370647
Flooded strawberry farmer suffering for 3 consecutive years
https://news.yahoo.co.jp/pickup/6370631
Two people died when four people got on the sea
https://news.yahoo.co.jp/pickup/6370633
Mulan shooting in Xinjiang Repulsion again
https://news.yahoo.co.jp/pickup/6370640
Parents suffer from prejudice panic disorder
https://news.yahoo.co.jp/pickup/6370643
Taku Hiraoka Defendant imprisonment for 2 years and 6 months
https://news.yahoo.co.jp/pickup/6370646
Iseya suspect seized 500 rolls
https://news.yahoo.co.jp/pickup/6370638
<span class="topics_photo_img" style="background-image:url(https://lpt.c.yimg.jp/amd/20200909-00000031-asahi-000-view.jpg)"></span>
https://news.yahoo.co.jp/pickup/6370647
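One possible way to drop that photo entry (not part of the article's program, just a sketch on made-up HTML) is to print only the elements whose first child is plain text; the photo link's first child is a span tag instead.

```python
import re
from bs4 import BeautifulSoup, NavigableString

# Made-up HTML mimicking the Yahoo News markup: two text links and
# one "photo" link whose first child is a <span>, not text.
html = '''
<a href="https://news.yahoo.co.jp/pickup/1">Headline one</a>
<a href="https://news.yahoo.co.jp/pickup/2">Headline two</a>
<a href="https://news.yahoo.co.jp/pickup/3"><span class="topics_photo_img"></span></a>
'''
soup = BeautifulSoup(html, 'html.parser')
elems = soup.find_all(href=re.compile("news.yahoo.co.jp/pickup"))

for elem in elems:
    # Skip entries whose first child is a tag rather than plain text
    if isinstance(elem.contents[0], NavigableString):
        print(elem.contents[0])
        print(elem.attrs['href'])
```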

Reference sites:
https://requests-docs-ja.readthedocs.io/en/latest/
https://ai-inter1.com/beautifulsoup_1/
http://kondou.com/BS4/
