Introduction

This is a continuation of the previous article (Python programming: I tried to get (crawling) news articles using Selenium and BeautifulSoup4).

There was an additional need to obtain an overview (business description, officers, shareholders, etc.) of the companies appearing in the news articles.

So let's write a Python program that retrieves company information in English. This time, the information source is **Yahoo! Finance**.

The code and the execution examples are based on the state of the site at the time of writing (2020/11/02).

What this article covers

- Obtaining the Profile from Yahoo! Finance
  Ex.) https://finance.yahoo.com/quote/AAPL/profile?p=AAPL
- Obtaining the Holders from Yahoo! Finance
  Ex.) https://finance.yahoo.com/quote/AAPL/holders?p=AAPL

The author has confirmed that the code works with the following versions.

Python: 3.6.8
BeautifulSoup4: 4.9.1

What this article does not cover

- How to install and use the Python libraries
  - requests
  - BeautifulSoup4
- How to get the ticker symbol (equivalent to the Japanese securities code)
  - Processing such as looking up the ticker symbol from a company name and automatically generating the request URL is not implemented.

Sample code

Since there is not much code, I will show all of it. There are two points to note.

1. Explicit wait

Implementing a wait (sleep) is a must, **so as not to put load on the site being accessed**. Unlike the previous article, this one does not use Selenium, but when requests are issued inside a for loop it is still better to implement a wait so that the program does not fire a burst of HTTP requests in a short time.
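The waiting described above could be sketched roughly as follows. The helper name `fetch_with_wait`, its parameters, and the default wait of one second are assumptions made for this illustration and are not part of the original code; the `fetch` argument stands for any function that takes a URL, such as `getSoup` from the listing in this article.

```python
import time

def fetch_with_wait(urls, fetch, wait_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests.

    fetch_with_wait is a hypothetical helper for illustration only.
    The sleep parameter exists so the pause can be replaced in tests.
    """
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            # Pause before every request except the first one
            sleep(wait_seconds)
        results.append(fetch(url))
    return results
```

For example, `fetch_with_wait(url_list, getSoup, wait_seconds=3)` would request each page in `url_list` with a three-second pause in between. A suitable wait depends on the site, so treat the value as something to adjust rather than a recommendation.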

2. Specifying tag elements

You need to look at the HTML source of each page, identify the target element from the tag structure, and extract the information with BeautifulSoup4. In most cases you specify the class attribute attached to a tag and write code that gets the target tag (and the text inside it).
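As a minimal sketch of this pattern, the example below narrows the search to a wrapper tag by its class attribute and then collects the text of the tags inside it. The HTML fragment, its class names, and the function `extract_values` are made up for illustration and do not come from Yahoo! Finance; the built-in `html.parser` is used here so that the example does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment for illustration (not Yahoo! Finance markup)
SAMPLE_HTML = """
<div class="profile-box">
  <p class="label">Sector</p>
  <span class="value">Technology</span>
  <span class="value">Consumer Electronics</span>
</div>
"""

def extract_values(html):
    soup = BeautifulSoup(html, "html.parser")
    # First find the enclosing tag by its class attribute
    wrapper = soup.find("div", class_="profile-box")
    if wrapper is None:
        # find() returns None when nothing matches, e.g. after a page redesign
        return []
    # Then collect the text of every matching tag inside the wrapper
    return [element.text for element in wrapper.find_all("span", class_="value")]
```

Checking for `None` before calling `find_all` is a design choice of this sketch: class names on a live site can change at any time, and without the check the failure shows up as an `AttributeError` on `None`.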

Introducing Code

When you run the code, the output of print() appears on the console.

`crawler_yahoo.py`


import requests
from bs4 import BeautifulSoup

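def getSoup(url):
  # Fetch the page and parse it into a BeautifulSoup object
  html = requests.get(url)
  # "lxml" requires the lxml package; "html.parser" is the built-in alternative
  #soup = BeautifulSoup(html.content, "html.parser")
  soup = BeautifulSoup(html.content, "lxml")
  return soup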

def getAssetProfile(soup):
  # Collect the text of the Fw(600) spans inside the asset profile block
  wrapper = soup.find("div", class_="asset-profile-container")
  paragraph = [element.text for element in wrapper.find_all("span", class_="Fw(600)")]
  return paragraph

def getKeyExecutives(soup):
  # Collect [name, title, pay] for each row of the key executives table
  wrapper = soup.find("section", class_="Bxz(bb) quote-subsection undefined")
  paragraph = []
  for element in wrapper.find_all("tr", class_="C($primaryColor) BdB Bdc($seperatorColor) H(36px)"):
    name = element.find("td", class_="Ta(start)").find("span").text
    title = element.find("td", class_="Ta(start) W(45%)").find("span").text
    pay = element.find("td", class_="Ta(end)").find("span").text
    paragraph.append([name, title, pay])
  return paragraph

def getDescription(soup):
  # Collect the paragraphs of the company description section
  wrapper = soup.find("section", class_="quote-sub-section Mt(30px)")
  paragraph = [element.text for element in wrapper.find_all("p", class_="Mt(15px) Lh(1.6)")]
  return paragraph

def getMajorHolders(soup):
  # Collect [share, held by] for each row of the major holders table
  wrapper = soup.find("div", class_="W(100%) Mb(20px)")
  paragraph = []
  for element in wrapper.find_all("tr", class_="BdT Bdc($seperatorColor)"):
    share = element.find("td", class_="Py(10px) Va(m) Fw(600) W(15%)").text
    heldby = element.find("td", class_="Py(10px) Ta(start) Va(m)").find("span").text
    paragraph.append([share, heldby])
  return paragraph

def getTopHolders(soup, category):
  # category picks the table by position: 'Institutional' (first) or 'MutualFund' (second)
  idx = {'Institutional': 0, 'MutualFund': 1}
  wrapper = soup.find_all("div", class_="Mt(25px) Ovx(a) W(100%)")[idx[category]]
  paragraph = []
  for element in wrapper.find_all("tr", class_="BdT Bdc($seperatorColor) Bgc($hoverBgColor):h Whs(nw) H(36px)"):
    tmp = [element.find("td", class_="Ta(start) Pend(10px)").text, ]
    tmp.extend([col.text for col in element.find_all("td", class_="Ta(end) Pstart(10px)")])
    paragraph.append(tmp)
  return paragraph

How to run the code is shown using Apple (ticker symbol: AAPL), a hot topic thanks to the iPhone 12, as an example. First, the basic information.