I previously posted code on Qiita for scraping websites in Java. Looking back now, it meets the requirements, but it is hard to call the code clean. Rereading it was embarrassing, so I decided to rewrite it in Python and leave a note here.
There are many similar articles on Qiita, but this is a personal memorandum.
When I scraped with Java, I used a library called jsoup. This time I will use ** Beautiful Soup **.
BeautifulSoup is a scraping library for Python. Since it lets you extract elements from a page with CSS selectors, it is handy for pulling out only the data you want. Official: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Since it is a Python library, it is installed with pip. The sample code below also uses requests, which is installed the same way.
pip install beautifulsoup4
pip install requests
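As a quick illustration of the CSS-selector extraction mentioned above, here is a minimal sketch; the HTML string is made up for this example:

```python
from bs4 import BeautifulSoup

# A made-up HTML fragment for illustration
html = '<ul><li class="item">foo</li><li class="item">bar</li></ul>'
soup = BeautifulSoup(html, "html.parser")

# select() takes a CSS selector and returns a list of matching elements
items = [li.text for li in soup.select("li.item")]
print(items)  # → ['foo', 'bar']
```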
As in my previous article, I want to extract the date, title, and URL of each "Notice" from the following page.
<body> 
 <div class="section"> 
  <div class="block"> 
   <dl>
    <dt>2019.08.04</dt> 
    <dd>
     <a href="http://www.example.com/notice/0003.html">Notice 3</a>
    </dd> 
    <dt>2019.08.03</dt> 
    <dd>
     <a href="http://www.example.com/notice/0002.html">Notice 2</a>
    </dd> 
    <dt>2019.08.02</dt> 
    <dd>
     <a href="http://www.example.com/notice/0001.html">Notice 1</a>
    </dd> 
   </dl>
  </div>
 </div>
</body>
The following code extracts the notices and prints them.
scraping.py
# -*- coding: utf-8 -*-
import sys

import requests
from bs4 import BeautifulSoup


def main():
    print("Scraping Program Start")
    # Send a GET request to the specified URL to get the contents of the page
    res = requests.get('http://www.example.com/news.html')
    # Parse the retrieved HTML page into a BeautifulSoup object
    soup = BeautifulSoup(res.text, "html.parser")
    # Extract the element with class "block"
    block = soup.find(class_="block")
    # Extract the dt (date) and dd (link) elements inside the block
    dt = block.find_all("dt")
    dd = block.find_all("dd")
    if len(dt) != len(dd):
        print("ERROR! The number of dt and dd elements didn't match.")
        print("Scraping Program Abend")
        sys.exit(1)
    for i in range(len(dt)):
        try:
            date = dt[i].text
            link = dd[i].find("a")
            title = link.text
            url = link.attrs['href']
            print("Got a news. Date:" + date + ", title:" + title + ", url:" + url)
        except AttributeError:
            # A dd without an <a> tag: report it and keep going
            print("ERROR! Couldn't get a news.")
    print("Scraping Program End")


if __name__ == "__main__":
    main()
The expected output when running the above code is as follows.
Scraping Program Start
Got a news. Date:2019.08.04, title:Notice 3, url:http://www.example.com/notice/0003.html
Got a news. Date:2019.08.03, title:Notice 2, url:http://www.example.com/notice/0002.html
Got a news. Date:2019.08.02, title:Notice 1, url:http://www.example.com/notice/0001.html
Scraping Program End
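For reference, the same dt/dd pairing can also be written with CSS selectors via select(). This is a sketch that parses the sample HTML as an inline string instead of fetching it over the network; the html variable stands in for the fetched page:

```python
from bs4 import BeautifulSoup

# Inline copy of (part of) the sample page, standing in for requests.get(...).text
html = """
<div class="block"><dl>
 <dt>2019.08.04</dt>
 <dd><a href="http://www.example.com/notice/0003.html">Notice 3</a></dd>
 <dt>2019.08.03</dt>
 <dd><a href="http://www.example.com/notice/0002.html">Notice 2</a></dd>
</dl></div>
"""
soup = BeautifulSoup(html, "html.parser")

# CSS selectors scoped to the block class; zip pairs each date with its link
dts = soup.select("div.block dt")
dds = soup.select("div.block dd")
news = [(dt.text, dd.a.text, dd.a["href"]) for dt, dd in zip(dts, dds)]
for date, title, url in news:
    print(date, title, url)
```

zip also papers over a mismatched dt/dd count by stopping at the shorter list, so whether you prefer it over the explicit length check in the main script depends on how loudly you want mismatches reported.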
Compared with the version I previously wrote in Java with Spring Boot, the amount of code in Python is dramatically smaller, which is nice. Please point out any mistakes in the content.