[For beginners] Try web scraping with Python

Intended readers

This article is labeled "for beginners," but I am a beginner myself. After working through some simple web scraping sample code, I wanted to add something of my own, so I tried this while looking things up as I went. Running the reference code as written let me extract things like the page title. If that is level 1, I think this article is about level 2. I may have misunderstood some things, so if you have any suggestions, please leave a comment.

Introduction

Environment

Python 3.7.3, developed in Visual Studio Code.

Library import

Python has a standard HTTP library, urllib (the "urllib2" module existed only in Python 2), but it is not easy to use, so I use the "Requests" and "Beautiful Soup" libraries for web scraping. Requests fetches the web page, and Beautiful Soup parses its HTML so that elements can be extracted.

Let's do it!

Scraping content

I will try to get the headlines and URLs of the new articles from Nikkei Business Electronic Edition (https://business.nikkei.com/).

Open the site in Google Chrome and press F12 to open the developer tools (Inspect mode).

I want to know which part of the HTML holds the new articles, so I press Ctrl + Shift + C and move the cursor over a headline.

From this, I found that the series name of each article is in a span element whose class is category.

The article headline is in an h3 tag, and the URL is in an a tag a little above it. The relationship between these elements is shown in Figure 1. I will explain it later together with the program.

[Figure 1: relationship between the a, span, and h3 elements]

Code description

code.py


import requests
from bs4 import BeautifulSoup

urlName = "https://business.nikkei.com"
# Fetch the top page over HTTP
response = requests.get(urlName)
# Parse the response body with Python's built-in HTML parser
soup = BeautifulSoup(response.content, "html.parser")

Connect over HTTP with the Requests library, then parse the returned HTML with Beautiful Soup.
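The same step can be sketched with some basic error handling. This is only a minimal sketch: the helper name fetch_soup and the 10-second timeout are my own assumptions and are not part of the original code. raise_for_status() is a Requests method that raises an exception when the server answers with a 4xx or 5xx status.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Fetch a page and return it parsed as a BeautifulSoup object.

    fetch_soup and the default timeout are hypothetical choices for this sketch.
    """
    # Without a timeout, requests.get can wait indefinitely on a slow server.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx instead of parsing an error page.
    response.raise_for_status()
    return BeautifulSoup(response.content, "html.parser")
```

It would be called as soup = fetch_soup("https://business.nikkei.com"), after which soup can be used in the same way as in the code above.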

code.py


elems = soup.find_all("span")

First, store all span elements in elems.
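To see what find_all returns and what the class attribute looks like, here is a small sketch on made-up HTML (the markup below is an assumption for illustration, not the real page). Because class can hold several values, Beautiful Soup returns it as a list, and get returns None when the attribute is missing.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; it is not taken from the real site.
html = """
<div>
  <span class="category new">Series A</span>
  <span>No class here</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
elems = soup.find_all("span")

print(len(elems))             # 2 span elements were found
print(elems[0].get("class"))  # ['category', 'new'] (a list, one item per class)
print(elems[1].get("class"))  # None (the span has no class attribute)
```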

code.py


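for elem in elems:
  classes = elem.get("class") or []  # class names as a list; empty if there is no class attribute
  if "category" not in classes:
    continue
  title = elem.find_next_sibling("h3")  # headline at the same depth, after the span
  link = elem.find_previous("a")  # nearest a tag that appears before the span
  if title is None or link is None or not link.get("href"):
    continue  # skip spans that do not match the expected layout
  print(elem.string)  # series name
  print(title.text.replace('\n', ''))  # headline with line breaks removed
  print(urlName + link.get('href'), '\n')  # absolute URL of the article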

Next, get the class names of each span element and check whether the class is category. If it is, the series name is extracted using .string.
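As an alternative sketch, Beautiful Soup can do the class check itself through the class_ argument of find_all. The HTML below is made up for illustration. Note also that Python's in operator applied to two strings is a substring test, so comparing class names with == or filtering with class_ avoids accidental partial matches.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; it is not taken from the real site.
html = """
<span class="category">Series A</span>
<span class="cat">Not a series name</span>
<span class="label category">Series B</span>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ matches when any one of the element's classes equals "category".
categories = soup.find_all("span", class_="category")
names = [span.string for span in categories]
print(names)

# "in" on two strings checks for a substring, so this is True.
print("cat" in "category")
```

Here the span with class "cat" is not selected, while the span that has "category" as its second class is.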

The next step is to get the headline. The headline is in an h3 tag, which sits at the same depth as the span, just below it. So I use find_next_sibling() to find the h3 that follows the element at the same depth.

[Figure 2: the h3 tag as the next sibling of the span]
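The sibling lookup can be tried on a small piece of made-up HTML (an assumed layout, not copied from the site). find_next_sibling skips the whitespace between tags and returns None when no matching sibling exists.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; the real page layout may differ.
html = """
<a href="/atcl/example/">
  <span class="category">Series A</span>
  <h3>Example headline</h3>
</a>
"""
soup = BeautifulSoup(html, "html.parser")
span = soup.find("span", class_="category")

# Skips the whitespace text node between the two tags and returns the h3.
title = span.find_next_sibling("h3")
print(title.text)

# There is no p tag after the span at this depth, so the result is None.
missing = span.find_next_sibling("p")
print(missing)
```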

Depending on the article (for example, when the headline also contains an image), the extracted text may or may not include line breaks, so I remove them when they are present.
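For comparison, here is a hedged sketch of two ways to clean the text, using a made-up h3 element. Removing only '\n' is usually enough for Japanese headlines, but for text that uses spaces between words it can glue words together and leave indentation behind, so splitting on whitespace and rejoining is one alternative.

```python
from bs4 import BeautifulSoup

# Made-up headline containing line breaks and indentation.
html = "<h3>\n  Example\nheadline\n</h3>"
soup = BeautifulSoup(html, "html.parser")
title = soup.find("h3")

# Approach used in this article: delete the newline characters only.
removed = title.text.replace("\n", "")

# Alternative: collapse every run of whitespace into a single space.
normalized = " ".join(title.text.split())

print(repr(removed))     # '  Exampleheadline'
print(repr(normalized))  # 'Example headline'
```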

Finally, I extract the URL. The h3 tag was at the same depth as the span, but the a tag is one level higher. find_previous() searches backward through the document from the element (this includes enclosing tags), so I used it to look for the a tag. I then used the get method, which returns the value of the specified attribute, to read the address from href.

[Figure 3: the a tag one level above the span]
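One thing worth checking is how find_previous differs from find_parent. The sketch below uses made-up HTML: the first span is wrapped in a link, while the second is not. find_previous returns the nearest a tag that starts earlier in the document, even if it is unrelated, whereas find_parent only returns an enclosing a tag.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; the real page layout may differ.
html = """
<a href="/atcl/example/">
  <span class="category">Series A</span>
</a>
<p>
  <span class="category">Series B</span>
</p>
"""
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("span", class_="category")

# The first span is inside the link, so both lookups find the same tag.
print(first.find_previous("a").get("href"))
print(first.find_parent("a").get("href"))

# The second span is not inside any link: find_previous still returns the
# earlier, unrelated link, while find_parent returns None.
print(second.find_previous("a").get("href"))
print(second.find_parent("a"))
```

If the a tag always encloses the span on the target page, find_parent may be the stricter choice; I have not confirmed this for the real site.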

Below are some of the execution results.

Yuka Ikematsu's direct flight from New York
A huge hospital ship of the US Navy enters NY. Still not enough beds
https://business.nikkei.com/atcl/gen/19/00119/033100011/

Yohei Ichishima's Silicon Valley Insai ...
Living the "20% Demand Economy" Post-Corona Thinking and Moving US Food Service Industry
https://business.nikkei.com/atcl/gen/19/00137/033100002/

Muneaki Hashimoto looks ahead of medicine and medical care
Shionogi, President Teshiroki's conspiracy refrain from partnering with Ping An Insurance
https://business.nikkei.com/atcl/gen/19/00110/033100012/ 

In this way, I was able to get the series name, headline, and URL of each article.
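If the results need to be reused rather than only printed, the steps can be gathered into one function that returns a list of dictionaries. This is a sketch under assumptions: extract_articles is a hypothetical name, the sample HTML is made up, and urljoin from the standard library is swapped in for the string concatenation used above.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_articles(html, base_url):
    """Return a list of dicts with series, title, and url for each article.

    extract_articles is a hypothetical helper written for this sketch.
    """
    soup = BeautifulSoup(html, "html.parser")
    articles = []
    for span in soup.find_all("span", class_="category"):
        title = span.find_next_sibling("h3")
        link = span.find_previous("a")
        if title is None or link is None or not link.get("href"):
            continue  # the layout differs from what this sketch assumes
        articles.append({
            "series": span.get_text(strip=True),
            "title": title.get_text(strip=True),
            # urljoin resolves a relative href against the site address.
            "url": urljoin(base_url, link["href"]),
        })
    return articles


# Made-up HTML for illustration; it is not taken from the real site.
sample = """
<a href="/atcl/example/1/">
  <span class="category">Series A</span>
  <h3>Example headline</h3>
</a>
"""
for article in extract_articles(sample, "https://business.nikkei.com"):
    print(article)
```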

In closing

I'm still studying, so there may be misunderstandings here or better ways to do this. I would like to keep practicing while deepening my understanding little by little.
