I previously posted code on Qiita for scraping websites in Java. Looking back now, it meets the requirements, but it is hard to call the code clean. Rereading it was embarrassing, so I decided to rewrite it in Python and leave a note here.
There are many similar articles on Qiita, but this is a personal memorandum.
When I scraped with Java, I used a library called jsoup. This time I will use ** Beautiful Soup **.
BeautifulSoup is a scraping library for Python. Since it lets you extract elements from a page with CSS selectors, it is handy for pulling out only the data you want. Official: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Since it is a Python library, it is installed with pip. The sample code below also uses requests, which is installed the same way.
pip install beautifulsoup4
pip install requests
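As a quick illustration of the CSS-selector extraction mentioned above, here is a minimal sketch; the HTML string is made up for this example:

```python
from bs4 import BeautifulSoup

# A made-up HTML fragment for illustration
html = '<ul><li class="item">foo</li><li class="item">bar</li></ul>'
soup = BeautifulSoup(html, "html.parser")

# select() takes a CSS selector and returns a list of matching elements
items = [li.text for li in soup.select("li.item")]
print(items)  # → ['foo', 'bar']
```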
As in my previous article, I want to extract the date, title, and URL of each "Notice" from the following page.
<body> 
 <div class="section"> 
  <div class="block"> 
   <dl>
    <dt>2019.08.04</dt> 
    <dd>
     <a href="http://www.example.com/notice/0003.html">Notice 3</a>
    </dd> 
    <dt>2019.08.03</dt> 
    <dd>
     <a href="http://www.example.com/notice/0002.html">Notice 2</a>
    </dd> 
    <dt>2019.08.02</dt> 
    <dd>
     <a href="http://www.example.com/notice/0001.html">Notice 1</a>
    </dd> 
   </dl>
  </div>
 </div>
</body>
The following code extracts the notices and prints them.
scraping.py
# -*- coding: utf-8 -*-
import sys

import requests
from bs4 import BeautifulSoup


def main():
    print("Scraping Program Start")
    # Send a GET request to the specified URL to get the contents of the page
    res = requests.get('http://www.example.com/news.html')
    # Parse the retrieved HTML page into a BeautifulSoup object
    soup = BeautifulSoup(res.text, "html.parser")
    # Extract the element with class "block"
    block = soup.find(class_="block")
    # Extract the dt (date) and dd (link) elements inside the block
    dt = block.find_all("dt")
    dd = block.find_all("dd")
    if len(dt) != len(dd):
        print("ERROR! The number of dt and dd elements didn't match.")
        print("Scraping Program Abend")
        sys.exit(1)
    for i in range(len(dt)):
        try:
            date = dt[i].text
            link = dd[i].find("a")
            title = link.text
            url = link.attrs['href']
            print("Got a news. Date:" + date + ", title:" + title + ", url:" + url)
        except AttributeError:
            # A dd without an <a> tag: report it and keep going
            print("ERROR! Couldn't get a news.")
    print("Scraping Program End")


if __name__ == "__main__":
    main()
The expected output when running the above code is as follows.
Scraping Program Start
Got a news. Date:2019.08.04, title:Notice 3, url:http://www.example.com/notice/0003.html
Got a news. Date:2019.08.03, title:Notice 2, url:http://www.example.com/notice/0002.html
Got a news. Date:2019.08.02, title:Notice 1, url:http://www.example.com/notice/0001.html
Scraping Program End
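For reference, the same dt/dd pairing can also be written with CSS selectors via select(). This is a sketch that parses the sample HTML as an inline string instead of fetching it over the network; the html variable stands in for the fetched page:

```python
from bs4 import BeautifulSoup

# Inline copy of (part of) the sample page, standing in for requests.get(...).text
html = """
<div class="block"><dl>
 <dt>2019.08.04</dt>
 <dd><a href="http://www.example.com/notice/0003.html">Notice 3</a></dd>
 <dt>2019.08.03</dt>
 <dd><a href="http://www.example.com/notice/0002.html">Notice 2</a></dd>
</dl></div>
"""
soup = BeautifulSoup(html, "html.parser")

# CSS selectors scoped to the block class; zip pairs each date with its link
dts = soup.select("div.block dt")
dds = soup.select("div.block dd")
news = [(dt.text, dd.a.text, dd.a["href"]) for dt, dd in zip(dts, dds)]
for date, title, url in news:
    print(date, title, url)
```

zip also papers over a mismatched dt/dd count by stopping at the shorter list, so whether you prefer it over the explicit length check in the main script depends on how loudly you want mismatches reported.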
Compared with the version I previously wrote in Java with Spring Boot, the amount of code in Python is dramatically smaller, which is nice. Please point out any mistakes in the content.