[For beginners] Web scraping with Python "Access the URL in the page to get the contents"

Introduction

Recap of last time

This is a continuation of the article [For beginners] Trying web scraping with Python. Last time, I got the headlines and URLs of new articles from the electronic edition of Nikkei Business, https://business.nikkei.com/.

However, that alone tells you nothing you could not learn by simply visiting the site yourself.

Goal this time

When you browse a news site and find a story that interests you, you click it to see the details. Many Nikkei Business articles, though not all, open with an introduction of about 150 characters meant to make you want to read on. Displaying this introduction next to each headline gives you a basis for deciding whether to read the article. Opening every article one by one just to read its introduction is tedious, and this is where web scraping shows its strength.

Review of the previous code

code.py


import requests
from bs4 import BeautifulSoup

urlName = "https://business.nikkei.com"
url = requests.get(urlName)
soup = BeautifulSoup(url.content, "html.parser")

elems = soup.find_all("span")

for elem in elems:
  try:
    # Only spans whose first class is "category" (the category label of an article) are processed
    classes = elem.get("class") or []
    if classes and classes[0] == "category":
      print(elem.string)
      title = elem.find_next_sibling("h3")
      print(title.text.replace('\n',''))
      r = elem.find_previous('a')
      # Build and print the URL of the article
      print(urlName + r.get('href'), '\n')

      # The code that fetches the introductory text from this URL goes here

  except Exception:
    # Skip entries that lack a title or a link
    pass

See the previous article for details. Last time, the program stopped at printing the URL that each headline links to. This time, it accesses that URL and retrieves the contents.
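The code above builds each article URL by concatenating the site root with the href value. As a side note, here is a minimal sketch of the same step using urljoin from the standard library, which also copes with an href that is already an absolute URL. The helper name article_url and the example.com address are hypothetical and do not appear in the original code.

```python
from urllib.parse import urljoin

BASE_URL = "https://business.nikkei.com"

def article_url(href):
    """Resolve an href taken from an <a> tag against the site root (hypothetical helper)."""
    return urljoin(BASE_URL, href)

# A root-relative href, like the ones the review code concatenates
print(article_url("/atcl/gen/19/00101/040100009/"))
# → https://business.nikkei.com/atcl/gen/19/00101/040100009/

# An href that is already absolute is returned unchanged
print(article_url("https://example.com/page"))
# → https://example.com/page
```

Plain concatenation works as long as every href starts with a slash; urljoin is just a safer default if that assumption ever fails.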

Programming

First, this time we turn the requests and BeautifulSoup part into a function.

subFunc.py


import requests
from bs4 import BeautifulSoup

def setup(url):
  # Fetch the page and parse its HTML; return both the response and the parsed tree
  response = requests.get(url)
  soup = BeautifulSoup(response.content, "html.parser")
  return response, soup

main.py


import subFunc

urlName = "https://business.nikkei.com"
url, soup = subFunc.setup(urlName)

elems = soup.find_all("span")

for elem in elems:
  try:
    # Only spans whose first class is "category" are processed
    classes = elem.get("class") or []
    if classes and classes[0] == "category":
      print('\n', elem.string)

      title = elem.find_next_sibling("h3")
      print(title.text.replace('\n',''))

      r = elem.find_previous('a')
      nextPage = urlName + r.get('href')
      print(nextPage)

      # Newly written part from here
      nextUrl, nextSoup = subFunc.setup(nextPage)
      abst = nextSoup.find('p', class_="bplead")
      # find() returns None when the article has no introductory text
      if abst is not None:
        print(abst.get_text().replace('\n',''))
  except Exception:
    # Skip entries that lack a title or a link, or whose page cannot be fetched
    pass

To be honest, the processing is the same as before: use requests and BeautifulSoup to get the contents of the linked URL. The introductory text of an article sits in a p element whose class is bplead. Some articles have no introductory text, so the text is printed only when it exists.
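The check for a missing introduction can be tried without any network access. The following is a minimal sketch, assuming the introduction is a p element with class bplead as described above. The function name extract_lead and the two HTML strings are made up for illustration and are not real Nikkei Business pages. BeautifulSoup's find returns None when nothing matches, so the sketch tests for None explicitly.

```python
from bs4 import BeautifulSoup

def extract_lead(html):
    """Return the introductory text of an article page, or None if it has none."""
    soup = BeautifulSoup(html, "html.parser")
    lead = soup.find("p", class_="bplead")
    if lead is None:
        # No <p class="bplead"> on this page
        return None
    return lead.get_text().replace("\n", "")

# Hypothetical stand-ins for an article with and without an introduction
with_lead = '<html><body><p class="bplead">\nShort introduction.\n</p><p>Body</p></body></html>'
without_lead = '<html><body><p>Body only</p></body></html>'

print(extract_lead(with_lead))     # → Short introduction.
print(extract_lead(without_lead))  # → None
```

Returning None instead of raising keeps the calling loop simple: it can print the introduction when there is one and move on otherwise.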

The execution result is as follows (excerpt, partly omitted).

Co-creation / competition / startup
The new corona is a long-term battle xxxxxxxxxxx
https://business.nikkei.com/atcl/gen/19/00101/040100009/    
He complained of the epidemic of the new coronavirus xxxxxxxxxxxx.

Conclusion

When I looked into it, I found other methods as well, but I went with a simple way of getting the contents of the linked page.
