Python programming: I tried crawling company information from the US Yahoo! Finance using BeautifulSoup4

Introduction

This is a continuation of the previous article (Python programming: I tried to get (crawling) news articles using Selenium and BeautifulSoup4).

I also needed an overview (business description, officers, shareholders, etc.) of the companies that appear in the news articles.

So, let's implement the retrieval of company information in English with a Python program. This time, the information source is **Yahoo! Finance**.

What to introduce in this article

- Obtaining Profile from Yahoo! Finance

In addition, the author has confirmed the operation with the following version.

Not introduced in this article

- How to install and use the Python libraries

Sample code

Since there is not much code, I will show all of it. There are two points to note.

1. Explicit wait

Implementing a wait (sleep) is a must **so as not to put load on the site being accessed**. Unlike the previous article, Selenium is not used here, but when pages are fetched in a for loop it is still better to wait between requests so that the program does not issue a burst of HTTP requests per unit of time.
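
As a minimal sketch of the wait described above (not part of the original code), the helper below pauses between consecutive requests. The names `fetch_with_wait`, `fetch`, `wait_seconds`, and `sleep` are hypothetical, and the fetch function is passed in so the example runs without network access; in practice it could be something like `getSoup` from the listing below.

```python
import time


def fetch_with_wait(urls, fetch, wait_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            sleep(wait_seconds)
        results.append(fetch(url))
    return results


# Stand-ins so the sketch runs offline: a fake fetch function and a
# fake sleep that only records how long it was asked to wait.
recorded_waits = []
pages = fetch_with_wait(
    ["https://example.com/a", "https://example.com/b", "https://example.com/c"],
    fetch=lambda url: "page for " + url,
    wait_seconds=2.0,
    sleep=recorded_waits.append,
)
print(pages)
print(recorded_waits)
```

Three URLs result in two waits, one between each pair of requests. How long to wait is a judgment call that depends on the site being accessed.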

2. Specifying tag elements

You need to look at the source of each page, identify the target elements from the tag structure, and extract the information with BeautifulSoup4. In many cases, you specify the class attribute attached to a tag to get the target tag (and the text inside it).
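
To illustrate selecting tags by class attribute, here is a small sketch on made-up HTML (the markup and class names are hypothetical and are not Yahoo! Finance's real page source). It also shows one point that is easy to miss: when `class_` is given a string with several class names, BeautifulSoup4 compares it with the exact attribute value, so the order of the classes matters. A CSS selector via `select()` does not have that restriction.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
SAMPLE_HTML = """
<div class="profile box">
  <p class="label">Sector</p>
  <span class="value strong">Technology</span>
  <p class="label">Industry</p>
  <span class="strong value">Consumer Electronics</span>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# A single class name matches any tag carrying that class,
# whatever other classes the tag has.
wrapper = soup.find("div", class_="profile")
by_single_class = [tag.text for tag in wrapper.find_all("span", class_="value")]

# A multi-class string is compared with the exact attribute value,
# so the second span (same classes, different order) is not matched.
by_exact_string = [tag.text for tag in wrapper.find_all("span", class_="value strong")]

# A CSS selector matches both classes regardless of their order.
by_css_selector = [tag.text for tag in wrapper.select("span.value.strong")]

print(by_single_class)
print(by_exact_string)
print(by_css_selector)
```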

Introducing Code

When you run the code, the output of print() appears on the console.

crawler_yahoo.py


import requests
from bs4 import BeautifulSoup

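def getSoup(url):
  # Fetch the page; the timeout keeps the script from hanging on a stalled connection.
  html = requests.get(url, timeout=10)
  # Stop early on an HTTP error instead of parsing an error page.
  html.raise_for_status()
  # "lxml" requires the lxml package; "html.parser" is the standard-library alternative.
  #soup = BeautifulSoup(html.content, "html.parser")
  soup = BeautifulSoup(html.content, "lxml")
  return soup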

def getAssetProfile(soup):
  wrapper = soup.find("div", class_="asset-profile-container")
  paragraph = [element.text for element in wrapper.find_all("span", class_="Fw(600)")]
  return paragraph

def getKeyExecutives(soup):
  wrapper = soup.find("section", class_="Bxz(bb) quote-subsection undefined")
  paragraph = []
  for element in wrapper.find_all("tr", class_="C($primaryColor) BdB Bdc($seperatorColor) H(36px)"):
    name = element.find("td", class_="Ta(start)").find("span").text
    title = element.find("td", class_="Ta(start) W(45%)").find("span").text
    pay = element.find("td", class_="Ta(end)").find("span").text
    paragraph.append([name, title, pay])
  return paragraph

def getDescription(soup):
  wrapper = soup.find("section", class_="quote-sub-section Mt(30px)")
  paragraph = [element.text for element in wrapper.find_all("p", class_="Mt(15px) Lh(1.6)")]
  return paragraph

def getMajorHolders(soup):
  wrapper = soup.find("div", class_="W(100%) Mb(20px)")
  paragraph = []
  for element in wrapper.find_all("tr", class_="BdT Bdc($seperatorColor)"):
    share = element.find("td", class_="Py(10px) Va(m) Fw(600) W(15%)").text
    heldby = element.find("td", class_="Py(10px) Ta(start) Va(m)").find("span").text
    paragraph.append([share, heldby])
  return paragraph

def getTopHolders(soup, category):
  idx = {'Institutional': 0, 'MutualFund': 1}
  wrapper = soup.find_all("div", class_="Mt(25px) Ovx(a) W(100%)")[idx[category]]
  paragraph = []
  for element in wrapper.find_all("tr", class_="BdT Bdc($seperatorColor) Bgc($hoverBgColor):h Whs(nw) H(36px)"):
    tmp = [element.find("td", class_="Ta(start) Pend(10px)").text, ]
    tmp.extend([col.text for col in element.find_all("td", class_="Ta(end) Pstart(10px)")])
    paragraph.append(tmp)
  return paragraph
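
The functions above assume that every `soup.find(...)` call succeeds. `find()` returns `None` when nothing matches (for example, if the site's class names change), and the following `.find_all(...)` would then raise an `AttributeError`. A possible guard, sketched here with a hypothetical helper name `get_texts_or_empty` and a tiny made-up HTML fragment, looks like this:

```python
from bs4 import BeautifulSoup


def get_texts_or_empty(soup, wrapper_tag, wrapper_class, item_tag, item_class):
    """Return the text of matching items, or [] if the wrapper is missing."""
    wrapper = soup.find(wrapper_tag, class_=wrapper_class)
    if wrapper is None:
        # find() returns None when no tag matches.
        return []
    items = wrapper.find_all(item_tag, class_=item_class)
    return [item.get_text(strip=True) for item in items]


# Made-up fragment that reuses two class names from the listing above.
page = BeautifulSoup(
    '<div class="asset-profile-container">'
    '<span class="Fw(600)"> Technology </span></div>',
    "html.parser",
)

found = get_texts_or_empty(page, "div", "asset-profile-container", "span", "Fw(600)")
missing = get_texts_or_empty(page, "div", "no-such-class", "span", "Fw(600)")
print(found)
print(missing)
```

Returning an empty list is only one option; raising an error with a clear message may be preferable when a missing section should not go unnoticed.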

The usage is shown with Apple (ticker symbol: AAPL), currently a hot topic because of the iPhone 12, as an example. First, the basic information.

python


soup = getSoup('https://finance.yahoo.com/quote/AAPL/profile?p=AAPL')

profile = getAssetProfile(soup)
print('\r\n'.join(profile))
#profile[0]: Sector(s)
#profile[1]: Industry
#profile[2]: Full Time Employees

Below is the execution result.

python


Technology
Consumer Electronics
147,000
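
The index comments in the snippet above can be turned into labels so that the values are easier to handle. This is only a sketch: `PROFILE_LABELS` and `label_profile` are hypothetical names, and the order of the labels is taken from the comments above.

```python
# Labels in the order given by the comments in the snippet above.
PROFILE_LABELS = ("Sector(s)", "Industry", "Full Time Employees")


def label_profile(values):
    """Pair each profile value with its label."""
    return dict(zip(PROFILE_LABELS, values))


# Values copied from the execution result above.
labeled = label_profile(["Technology", "Consumer Electronics", "147,000"])
print(labeled)
```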

Next is the list of key executives.

python


exs = getKeyExecutives(soup)
#print('\r\n'.join(exs))
for ex in exs:
  print(ex)
  #ex[0]: Name
  #ex[1]: Title
  #ex[2]: Pay

Below is the execution result.

['Mr. Timothy D. Cook', 'CEO & Director', '11.56M']
['Mr. Luca  Maestri', 'CFO & Sr. VP', '3.58M']
['Mr. Jeffrey E. Williams', 'Chief Operating Officer', '3.57M']
['Ms. Katherine L. Adams', 'Sr. VP, Gen. Counsel & Sec.', '3.6M']
["Ms. Deirdre  O'Brien", 'Sr. VP of People & Retail', '2.69M']
['Mr. Chris  Kondo', 'Sr. Director of Corp. Accounting', 'N/A']
['Mr. James  Wilson', 'Chief Technology Officer', 'N/A']
['Ms. Mary  Demby', 'Chief Information Officer', 'N/A']
['Ms. Nancy  Paxton', 'Sr. Director of Investor Relations & Treasury', 'N/A']
['Mr. Greg  Joswiak', 'Sr. VP of Worldwide Marketing', 'N/A']
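
The pay column comes back as text such as '11.56M' or 'N/A'. If you want to sort or sum it, something like the following sketch could convert it to a number. The helper name `parse_abbreviated_number` is hypothetical, and reading the suffixes K, M, B, and T as thousand, million, billion, and trillion is an assumption about how the site abbreviates values.

```python
def parse_abbreviated_number(text):
    """Convert strings such as '11.56M' to a float; return None for 'N/A'."""
    # Assumed meaning of the suffixes.
    multipliers = {"K": 1e3, "M": 1e6, "B": 1e9, "T": 1e12}
    cleaned = text.strip().replace(",", "")
    if not cleaned or cleaned.upper() == "N/A":
        return None
    suffix = cleaned[-1].upper()
    if suffix in multipliers:
        return float(cleaned[:-1]) * multipliers[suffix]
    return float(cleaned)


# Two rows copied from the execution result above.
rows = [
    ['Mr. Timothy D. Cook', 'CEO & Director', '11.56M'],
    ['Mr. Chris  Kondo', 'Sr. Director of Corp. Accounting', 'N/A'],
]
pays = [parse_abbreviated_number(row[2]) for row in rows]
print(pays)
```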

Next is the business description.

python


desc = getDescription(soup)
print('\r\n'.join(desc))

Below is the execution result.

Apple Inc. designs, manufactures, and markets smartphones, personal computers, tablets, wearables, and accessories worldwide. It also sells various related services. The company offers iPhone, a line of smartphones; Mac, a line of personal computers; iPad, a line of multi-purpose tablets; and wearables, home, and accessories comprising AirPods, Apple TV, Apple Watch, Beats products, HomePod, iPod touch, and other Apple-branded and third-party accessories. It also provides AppleCare support services; cloud services store services; and operates various platforms, including the App Store, that allow customers to discover and download applications and digital content, such as books, music, video, games, and podcasts. In addition, the company offers various services, such as Apple Arcade, a game subscription service; Apple Music, which offers users a curated listening experience with on-demand radio stations; Apple News+, a subscription news and magazine service; Apple TV+, which offers exclusive original content; Apple Card, a co-branded credit card; and Apple Pay, a cashless payment service, as well as licenses its intellectual property. The company serves consumers, and small and mid-sized businesses; and the education, enterprise, and government markets. It sells and delivers third-party applications for its products through the App Store. The company also sells its products through its retail and online stores, and direct sales force; and third-party cellular network carriers, wholesalers, retailers, and resellers. Apple Inc. was founded in 1977 and is headquartered in Cupertino, California.

From here on, the URL changes to the Holders page, which provides shareholder information. First, the summary.

python


soup = getSoup('https://finance.yahoo.com/quote/AAPL/holders?p=AAPL')

holders = getMajorHolders(soup)
for holder in holders:
  print(holder)
  #holder[0]: share
  #holder[1]: heldby

Below is the execution result.

['0.07%', '% of Shares Held by All Insider']
['62.12%', '% of Shares Held by Institutions']
['62.16%', '% of Float Held by Institutions']
['4,296', 'Number of Institutions Holding Shares']

Next is shareholder information (top institutional holders).

python


topholders = getTopHolders(soup, 'Institutional')
for holder in topholders:
  print(holder)
  #holder[0]: Holder
  #holder[1]: Shares
  #holder[2]: Date Reported
  #holder[3]: % Out
  #holder[4]: Value

Below is the execution result.

['Vanguard Group, Inc. (The)', '1,315,961,000', 'Jun 29, 2020', '7.69%', '120,015,643,200']
['Blackrock Inc.', '1,101,824,048', 'Jun 29, 2020', '6.44%', '100,486,353,177']
['Berkshire Hathaway, Inc', '980,622,264', 'Jun 29, 2020', '5.73%', '89,432,750,476']
['State Street Corporation', '709,057,472', 'Jun 29, 2020', '4.15%', '64,666,041,446']
['FMR, LLC', '383,300,188', 'Jun 29, 2020', '2.24%', '34,956,977,145']
['Geode Capital Management, LLC', '251,695,416', 'Jun 29, 2020', '1.47%', '22,954,621,939']
['Price (T.Rowe) Associates Inc', '233,087,540', 'Jun 29, 2020', '1.36%', '21,257,583,648']
['Northern Trust Corporation', '214,144,092', 'Jun 29, 2020', '1.25%', '19,529,941,190']
['Norges Bank Investment Management', '187,425,092', 'Dec 30, 2019', '1.10%', '13,759,344,566']
['Bank Of New York Mellon Corporation', '171,219,584', 'Jun 29, 2020', '1.00%', '15,615,226,060']
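
Each row above is a list of strings. For further analysis it may help to convert the rows into typed records. The sketch below uses a hypothetical `TopHolder` type and `to_top_holder` function, with field names taken from the column comments above. Parsing the month abbreviation with `%b` assumes an English locale.

```python
from datetime import date, datetime
from typing import NamedTuple


class TopHolder(NamedTuple):
    holder: str
    shares: int
    date_reported: date
    percent_out: float
    value: int


def to_top_holder(row):
    """Convert one row of strings from getTopHolders() into a typed record."""
    name, shares, reported, percent_out, value = row
    return TopHolder(
        holder=name,
        shares=int(shares.replace(",", "")),
        # "%b" relies on English month abbreviations such as "Jun".
        date_reported=datetime.strptime(reported, "%b %d, %Y").date(),
        percent_out=float(percent_out.rstrip("%")),
        value=int(value.replace(",", "")),
    )


# First row copied from the execution result above.
record = to_top_holder(
    ['Vanguard Group, Inc. (The)', '1,315,961,000', 'Jun 29, 2020', '7.69%', '120,015,643,200']
)
print(record)
```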

Next is shareholder information (top mutual fund holders).

python


topholders = getTopHolders(soup, 'MutualFund')
for holder in topholders:
  print(holder)
  #holder[0]: Holder
  #holder[1]: Shares
  #holder[2]: Date Reported
  #holder[3]: % Out
  #holder[4]: Value

Below is the execution result.

['Vanguard Total Stock Market Index Fund', '444,698,584', 'Jun 29, 2020', '2.60%', '40,556,510,860']
['Vanguard 500 Index Fund', '338,116,248', 'Jun 29, 2020', '1.98%', '30,836,201,817']
['SPDR S&P 500 ETF Trust', '169,565,200', 'Sep 29, 2020', '0.99%', '19,637,345,812']
['Invesco ETF Tr-Invesco QQQ Tr, Series 1 ETF', '155,032,988', 'Aug 30, 2020', '0.91%', '20,005,456,771']
['Fidelity 500 Index Fund', '145,557,920', 'Aug 30, 2020', '0.85%', '18,782,793,996']
['Vanguard Institutional Index Fund-Institutional Index Fund', '143,016,840', 'Jun 29, 2020', '0.84%', '13,043,135,808']
['iShares Core S&P 500 ETF', '123,444,255', 'Sep 29, 2020', '0.72%', '14,296,079,171']
['Vanguard Growth Index Fund', '123,245,072', 'Jun 29, 2020', '0.72%', '11,239,950,566']
['Vanguard Information Technology Index Fund', '79,770,560', 'Aug 30, 2020', '0.47%', '10,293,593,062']
['Select Sector SPDR Fund-Technology', '69,764,960', 'Sep 29, 2020', '0.41%', '8,079,480,017']

As shown, you can retrieve the information displayed in the web browser. If you collect this information for many companies, you may be able to list the companies in which a famous individual investor appears as a shareholder, or spot trends in holdings.
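
One way to approach the idea above is to build an index from holder name to the tickers in which that holder appears. The sketch below is only an illustration: `companies_by_holder` is a hypothetical helper, and the tickers and holder names in the sample data are placeholders, not real holdings. Real input would be the rows returned by `getTopHolders()` for each ticker.

```python
from collections import defaultdict


def companies_by_holder(top_holders_by_ticker):
    """Map each holder name to the sorted list of tickers it appears in."""
    index = defaultdict(set)
    for ticker, rows in top_holders_by_ticker.items():
        for row in rows:
            holder_name = row[0]  # the first column is the holder name
            index[holder_name].add(ticker)
    return {holder: sorted(tickers) for holder, tickers in index.items()}


# Placeholder data for illustration only (not real holdings).
sample = {
    "AAA": [["Holder A"], ["Holder B"]],
    "BBB": [["Holder A"]],
}
holder_index = companies_by_holder(sample)
print(holder_index)
```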

Summary

This article introduced how to acquire (crawl) company information from **Yahoo! Finance** using BeautifulSoup4.
