Python programming: I tried crawling news articles using Selenium and BeautifulSoup4

Introduction

Since my university days, I have been working on collecting stock prices and news articles on a lab PC. Recently, however, I have had to take on the challenge of collecting and accumulating **"English"** news articles at work.

So, let's implement the process of getting "English" news articles in a Python program. This time, the news source is **Reuters**.

What to introduce in this article

- Obtaining headlines (titles and summaries) from Reuters
- Obtaining the article body from Reuters

Based on the code described in the link below, I added code to fetch the article body that each "NEWS HEADLINES" entry links to.

How to scrape news headlines from Reuters?
Business News Headlines

In addition, the author has confirmed that the code works with the following versions.

Not introduced in this article

- How to install and use the Python libraries

For the installation of Selenium, I referred to the following article: [For Selenium] How to install Chrome Driver with pip (no need to add it to PATH, version can be specified)

Sample code

Since the code is not long, I will present it in full. There are two points to note.

1. Explicit wait

Implementing wait processing (sleep) is a must, if only **to avoid putting load on the site you access**. In addition, since it can take time for the web browser to load a URL (page), it is better to implement an explicit wait as well (see the sketch after the references below).

I referred to the following articles:
- [Python] Selenium usage memo
- A story about wait processing with Selenium
- Three settings to make for stable operation of Selenium (also supports Headless mode)
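As a minimal sketch of an explicit wait (using the same Selenium 3 style API and class name as the full script below), the call blocks until the target element is present, up to the timeout:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.implicitly_wait(10)  # implicit wait: applied to every element lookup
driver.get('https://www.reuters.com/news/archive/businessnews')

# Explicit wait: block up to 10 seconds until the "next page" button exists
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "control-nav-next")))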

2. Specifying tag elements

You must look at the source of each page, identify the elements based on the tag structure, and extract the information with Selenium or BeautifulSoup4. This time, I use Selenium for the headlines and BeautifulSoup4 for the article body, as sketched below.
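A minimal sketch of both approaches (the class names are the ones used in the script below and reflect the Reuters page structure at the time of writing, so they may have changed since):

from selenium import webdriver
import requests
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get('https://www.reuters.com/news/archive/businessnews')

# Selenium: query the live browser DOM by class name
first = driver.find_elements_by_class_name("story-content")[0]
print(first.find_element_by_class_name("story-title").text)

# BeautifulSoup4: parse the static HTML of the linked article page
article_url = first.find_element_by_tag_name("a").get_attribute("href")
soup = BeautifulSoup(requests.get(article_url).content, "lxml")
body = soup.find("div", class_="ArticleBodyWrapper")
print(body.get_text() if body is not None else "body not found")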

Introducing the code

The Selenium part is almost the same as the reference code. On top of it, I added the process of getting the link (href attribute) of each article and the process of fetching the article body.

When you run the code, CSV files are output to the folder specified by **outputdirpath** (one CSV file per page). One concern is that I did not seriously implement error handling or character-encoding handling (a possible hardening is sketched after the code).

crawler_reuters.py


import chromedriver_binary  # adds the bundled ChromeDriver to PATH on import
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import os
import datetime
import csv
import codecs
import requests
from bs4 import BeautifulSoup

'''
#Below, for the workplace or internal network (proxy environment).(2020/11/02 Update)
os.environ["HTTP_PROXY"] = "http://${Proxy server IP address}:${Proxy server port number}/"
os.environ["HTTPS_PROXY"] = "http://${Proxy server IP address}:${Proxy server port number}/"
'''

def createOutputDirpath():
  # Create (if necessary) and return a dated output folder under ..\data
  workingdirpath = os.getcwd()
  outputdirname = 'article_{0:%Y%m%d}'.format(datetime.datetime.now())
  outputdirpath = "..\\data\\%s" %(outputdirname)
  if not os.path.exists(os.path.join(workingdirpath, outputdirpath)):
    os.mkdir(os.path.join(workingdirpath, outputdirpath))
  return os.path.join(workingdirpath, outputdirpath)

def getArticleBody(url):
  # Fetch the article page and extract every paragraph of the body text
  html = requests.get(url)
  #soup = BeautifulSoup(html.content, "html.parser")
  soup = BeautifulSoup(html.content, "lxml")
  wrapper = soup.find("div", class_="ArticleBodyWrapper")
  paragraph = [element.text for element in wrapper.find_all("p", class_="Paragraph-paragraph-2Bgue")]
  #paragraph = []
  #for element in wrapper.find_all("p", class_="Paragraph-paragraph-2Bgue"):
  #  paragraph.append(element.text)
  return paragraph

outputdirpath = createOutputDirpath()
driver = webdriver.Chrome()
driver.implicitly_wait(10)
driver.get('https://www.reuters.com/news/archive/businessnews?view=page&page=5&pageSize=10')

count = 0
# Crawl up to 5 pages: write one CSV per page, then click through to the next
for x in range(5):
  try:
    print("=====")
    print(driver.current_url)
    print("-----")
    #f = open(os.path.join(outputdirpath, "reuters_news.csv"), "w", newline = "")
    f = codecs.open(os.path.join(outputdirpath, "reuters_news_%s.csv" %(x)), "w", "UTF-8")
    writer = csv.writer(f, delimiter=',', quoting=csv.QUOTE_ALL, quotechar="\"")
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, "control-nav-next")))
    loadMoreButton = driver.find_element_by_class_name("control-nav-next") # or "control-nav-prev"
    # driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    #news_headlines = driver.find_elements_by_class_name("story-content")
    news_headlines = driver.find_elements_by_class_name("news-headline-list")[0].find_elements_by_class_name("story-content")
    for headline in news_headlines:
      #print(headline.text)
      #print(headline.get_attribute("innerHTML"))
      href = headline.find_element_by_tag_name("a").get_attribute("href")
      title = headline.find_element_by_class_name("story-title").text
      smry = headline.find_element_by_tag_name("p").text
      stmp = headline.find_element_by_class_name("timestamp").text
      body = getArticleBody(href)
      print(href)
      #print(title)
      #print(smry)
      #print(stmp)
      #print(body)      
      writer.writerow([href, title, smry, stmp, '\r\n'.join(body)])
      time.sleep(1)
    f.close()
    count += 1
    loadMoreButton.click()  # click through to the next page of headlines
    time.sleep(10)          # wait politely before crawling the next page
  except Exception as e:
    print(e)
    break
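One concern noted above was error handling. As a hedged sketch (getArticleBodySafe is a hypothetical name, not part of the script), the body-fetching function could be hardened against network failures and page-layout changes like this:

import requests
from bs4 import BeautifulSoup

def getArticleBodySafe(url):
  # Hypothetical, more defensive variant of getArticleBody():
  # returns an empty list instead of raising on network or parse problems.
  try:
    html = requests.get(url, timeout=10)
    html.raise_for_status()  # raise on HTTP 4xx/5xx
  except requests.RequestException as e:
    print("request failed: %s" % e)
    return []
  soup = BeautifulSoup(html.content, "lxml")
  wrapper = soup.find("div", class_="ArticleBodyWrapper")
  if wrapper is None:  # page layout changed or article removed
    return []
  return [element.text for element in wrapper.find_all("p", class_="Paragraph-paragraph-2Bgue")]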

Python really is convenient after all. I will change the Reuters URL parameters (the page number and the number of articles per page) and use this at work, as sketched below.
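A small sketch of that idea (buildArchiveUrl is a hypothetical helper; page and pageSize are the parameter names visible in the URL used by the script):

def buildArchiveUrl(page, page_size):
  # Build the Reuters business-news archive URL for a given page number
  # and number of articles per page.
  return ('https://www.reuters.com/news/archive/businessnews'
          '?view=page&page=%d&pageSize=%d' % (page, page_size))

print(buildArchiveUrl(1, 20))
# -> https://www.reuters.com/news/archive/businessnews?view=page&page=1&pageSize=20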

But maybe the Java version of Selenium is easier to use...?

Summary

I introduced how to get (crawl) news articles (Reuters articles) using Selenium and BeautifulSoup4.
