[PYTHON] Maybe it works! Let's try scraping!

Introduction

I had wanted to try web scraping for a long time but kept putting it off. I finally sat down and tried it, and this article summarizes what I looked into along the way. I hope it will be helpful for anyone who wants to start web scraping.

Environment


$ python --version
Python 3.7.4
$ conda --version
conda 4.7.12
$ pip -V
pip 19.2.3 from ■■■■■■■■■■■■■■■■■■■ (python 3.7)

Installation

This article uses Anaconda for the scraping environment. Install it if you need it; installers are available for Windows and Mac.

If you do not want to install Anaconda, install the following libraries with pip. (They are included in Anaconda by default.)


$ pip install requests
$ pip install beautifulsoup4
$ pip install lxml
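
To confirm that the libraries are available before going further, a quick check like the following can help. This is a minimal sketch using only the standard library; the helper name `check_installed` is made up for illustration. Note that beautifulsoup4 is imported under the name `bs4`.

```python
import importlib.util


def check_installed(module_names):
    """Return a dict mapping each module name to True if it can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in module_names}


# beautifulsoup4 is imported as "bs4", so that is the name to look up.
status = check_installed(["requests", "bs4", "lxml"])
for name, found in status.items():
    print(f"{name}: {'installed' if found else 'missing'}")
```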

About the library

Requests

An HTTP communication library that makes it easy to fetch a website's HTML / XML and images. Python's standard library includes an HTTP library called urllib, but according to the Requests project its API is hard to use properly.
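
For comparison, here is a rough sketch of fetching a page with only the standard library's urllib. The function name `fetch_with_urllib` and the User-Agent string are placeholders chosen for this illustration. The last two lines only build a request object and inspect it, without sending anything.

```python
import urllib.request


def fetch_with_urllib(url, timeout=10.0):
    """Fetch a page with the standard library and return it as text."""
    request = urllib.request.Request(url, headers={"User-Agent": "scraping-example/0.1"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # The caller has to pick the charset and decode the bytes by hand.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Building the request does not touch the network, so it is safe to inspect.
request = urllib.request.Request("https://news.yahoo.co.jp/")
print(request.get_method(), request.full_url)
```

With Requests, the equivalent fetch is `requests.get(url).text`, which handles the decoding step for you.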

Beautiful Soup 4

An HTML parser library that extracts specific text from the acquired HTML / XML.

lxml

An HTML / XML parser. Python ships html.parser in its standard library, but lxml seems to be easier to use.
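
The parser is selected by the second argument passed to BeautifulSoup. As a sketch, the following falls back to the standard html.parser when lxml is not installed. The helper name `make_soup` and the sample HTML are invented for this example.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html):
    """Parse with lxml when available, otherwise use the built-in parser."""
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is not installed.
        return BeautifulSoup(html, "html.parser")


soup = make_soup("<html><head><title>Example page</title></head><body></body></html>")
print(soup.title.get_text())
```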

Get started with web scraping

As an example, this article shows how to get the headlines of the top news stories on Yahoo! News. The complete sample code is at the end of the article.

1 Create a .py file and import requests and BeautifulSoup as follows.


    import requests
    from bs4 import BeautifulSoup

2 Set the URL and get the HTML with Requests.


    url = 'https://news.yahoo.co.jp/'
    rq = requests.get(url)
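
In practice it can help to add a timeout and check the HTTP status before parsing. The following is one possible sketch, not the code used in this article: the function name `fetch_html` and the 10-second timeout are arbitrary choices.

```python
import requests


def fetch_html(url, timeout=10.0):
    """Download a page and return its HTML text, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx / 5xx responses.
    response.raise_for_status()
    return response.text
```

Calling `fetch_html(url)` would then take the place of `requests.get(url)` followed by `rq.text`.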

3 Create an object for HTML parsing.


    bs = BeautifulSoup(rq.text, 'lxml')

4 Use Chrome's developer tools to find what you need.

Each top news headline is wrapped in an li tag, and the class name is "topicsListItem".

5 Extract the necessary part from the HTML based on the conditions found in step 4.


    newsList = bs.find_all("li", class_="topicsListItem")

At this point, newsList holds each matching li tag, markup included.
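
To see what find_all returns without hitting the network, the same call can be run against a small hand-written HTML string. The markup and headlines below are made up and only imitate the structure described in step 4. html.parser is used so the sketch runs even without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the structure found in step 4.
html = """
<ul>
  <li class="topicsListItem"><a href="/pickup/1">First headline</a></li>
  <li class="topicsListItem"><a href="/pickup/2">Second headline</a></li>
  <li class="other">Not a headline</li>
</ul>
"""

bs = BeautifulSoup(html, "html.parser")

# find_all returns a list-like ResultSet of Tag objects, not strings.
newsList = bs.find_all("li", class_="topicsListItem")
print(len(newsList))
print(newsList[0])

# The CSS selector form matches the same elements.
same = bs.select("li.topicsListItem")
print(list(same) == list(newsList))
```

Because each item is a Tag, its children and attributes are still available, for example `newsList[0].a["href"]`.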

6 Only the text is needed from the extracted list, so call get_text() on each item and print it.


    for news in newsList:
        print(news.get_text())

Result

Benefit fraud A series of self-consultations
GoTo Inns where reservations for Tokyo residents are made one after another
U.S. President Yemen sentenced to death
Violent coverage Tokyo Shimbun apologizes
Recover from docomo system failure
Last Yuru Chara GP Tohoku First V
Sprint GI Gran Alegria V
Mikako Tabe I don't expect anyone

The headlines were retrieved correctly!
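
The class name reflects the page at the time of writing, so the search may stop matching if the site's markup changes. A sketch of a slightly more defensive version follows. The function name `extract_headlines` and the sample HTML are invented for illustration.

```python
from bs4 import BeautifulSoup


def extract_headlines(html, class_name="topicsListItem"):
    """Return the stripped text of every li tag with the given class."""
    soup = BeautifulSoup(html, "html.parser")
    items = soup.find_all("li", class_=class_name)
    # strip=True trims the whitespace around each piece of text.
    return [item.get_text(strip=True) for item in items]


# Made-up HTML standing in for the downloaded page.
sample = '<ul><li class="topicsListItem"> <a href="#">Sample headline</a> </li></ul>'

headlines = extract_headlines(sample)
if headlines:
    for headline in headlines:
        print(headline)
else:
    print("No headlines found; the class name may have changed.")
```

Returning a plain list of strings keeps the parsing step separate from the download, so it can be checked against saved HTML without contacting the site.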

Sample code


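import requests
from bs4 import BeautifulSoup

# Fetch the Yahoo! News top page
url = 'https://news.yahoo.co.jp/'
rq = requests.get(url)

# Parse the HTML and collect the headline list items
bs = BeautifulSoup(rq.text, 'lxml')
newsList = bs.find_all("li", class_="topicsListItem")

# Print the text of each headline
for news in newsList:
    print(news.get_text())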
