This is a continuation of the previous article (Python programming: I tried getting (crawling) news articles using Selenium and BeautifulSoup4).
There was an additional need to get an overview (business description, executives, shareholders, etc.) of the companies that appear in the news articles.
So, let's implement the acquisition of English-language company information as a Python program. This time, the information source is **Yahoo! Finance**.
- Getting a company profile from Yahoo! Finance
In addition, the author has confirmed that the code works with the following versions.
Since the amount of code is small, I will present all of it. There are two key points.
Implementing a wait (sleep) is essential **so as not to place load on the site being accessed**. Unlike the previous article, this one does not use Selenium, but when pages are fetched inside a for loop it is still best to add a sleep so that the program does not fire a burst of HTTP requests within a short period of time.
You need to look at the source of each page, identify the target element from the tag structure, and extract the information with BeautifulSoup4. In most cases, you specify the class attribute attached to a tag and retrieve the target tag (and the text inside it).
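The first point, pausing between requests, can be sketched roughly as follows. This is only an illustration: `fetch_all`, its parameters, and the 3-second default are names and values I made up for this example, and the right interval depends on the site you are accessing. The `fetch` argument is passed in so that a function such as `getSoup` from this article can be plugged in.

```python
import time
from typing import Callable, Dict, Iterable


def fetch_all(urls: Iterable[str],
              fetch: Callable[[str], object],
              wait_seconds: float = 3.0,
              sleep: Callable[[float], None] = time.sleep) -> Dict[str, object]:
    """Call fetch(url) for each URL, pausing between consecutive requests.

    Hypothetical helper: wait_seconds=3.0 is an arbitrary assumption,
    not a value recommended by any particular site.
    """
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            # Pause before every request except the first one.
            sleep(wait_seconds)
        results[url] = fetch(url)
    return results
```

With the functions from this article, a call might look like `fetch_all(list_of_urls, getSoup)`, which returns a dictionary mapping each URL to its parsed page.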
When you run the code in crawler_yahoo.py, the output of print() is shown on the console.
crawler_yahoo.py
import requests
from bs4 import BeautifulSoup


def getSoup(url):
    # Download the page and parse it.
    html = requests.get(url)
    # Alternative: the built-in parser, which does not need lxml.
    #soup = BeautifulSoup(html.content, "html.parser")
    soup = BeautifulSoup(html.content, "lxml")
    return soup


def getAssetProfile(soup):
    # Sector, industry and number of full-time employees.
    wrapper = soup.find("div", class_="asset-profile-container")
    paragraph = [element.text for element in wrapper.find_all("span", class_="Fw(600)")]
    return paragraph


def getKeyExecutives(soup):
    # One [name, title, pay] list per row of the executives table.
    wrapper = soup.find("section", class_="Bxz(bb) quote-subsection undefined")
    paragraph = []
    for element in wrapper.find_all("tr", class_="C($primaryColor) BdB Bdc($seperatorColor) H(36px)"):
        name = element.find("td", class_="Ta(start)").find("span").text
        title = element.find("td", class_="Ta(start) W(45%)").find("span").text
        pay = element.find("td", class_="Ta(end)").find("span").text
        paragraph.append([name, title, pay])
    return paragraph


def getDescription(soup):
    # Paragraphs of the business description.
    wrapper = soup.find("section", class_="quote-sub-section Mt(30px)")
    paragraph = [element.text for element in wrapper.find_all("p", class_="Mt(15px) Lh(1.6)")]
    return paragraph


def getMajorHolders(soup):
    # One [share, heldby] list per row of the major holders summary.
    wrapper = soup.find("div", class_="W(100%) Mb(20px)")
    paragraph = []
    for element in wrapper.find_all("tr", class_="BdT Bdc($seperatorColor)"):
        share = element.find("td", class_="Py(10px) Va(m) Fw(600) W(15%)").text
        heldby = element.find("td", class_="Py(10px) Ta(start) Va(m)").find("span").text
        paragraph.append([share, heldby])
    return paragraph


def getTopHolders(soup, category):
    # category selects which table to read: 'Institutional' or 'MutualFund'.
    idx = {'Institutional': 0, 'MutualFund': 1}
    wrapper = soup.find_all("div", class_="Mt(25px) Ovx(a) W(100%)")[idx[category]]
    paragraph = []
    for element in wrapper.find_all("tr", class_="BdT Bdc($seperatorColor) Bgc($hoverBgColor):h Whs(nw) H(36px)"):
        # The first column is the holder name; the remaining columns are right-aligned values.
        tmp = [element.find("td", class_="Ta(start) Pend(10px)").text, ]
        tmp.extend([col.text for col in element.find_all("td", class_="Ta(end) Pstart(10px)")])
        paragraph.append(tmp)
    return paragraph
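The functions in crawler_yahoo.py depend on how BeautifulSoup4 matches the `class_` argument, so here is a small sketch of that behavior on an inline HTML fragment. The fragment, the class name `row`, and the function `parse_row` are made up for illustration and are much simpler than the real Yahoo! Finance markup. It uses the built-in `html.parser` so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Made-up fragment that imitates one row of an executives table.
SAMPLE_HTML = """
<table>
  <tr class="row">
    <td class="Ta(start)"><span>Jane Doe</span></td>
    <td class="Ta(start) W(45%)"><span>Chief Example Officer</span></td>
    <td class="Ta(end)"><span>1.23M</span></td>
  </tr>
</table>
"""


def parse_row(html):
    soup = BeautifulSoup(html, "html.parser")
    row = soup.find("tr", class_="row")
    # A single class name matches any tag that carries that class,
    # so find() returns the first such <td> (the name column).
    name = row.find("td", class_="Ta(start)").find("span").text
    # A string with several class names is compared with the whole
    # class attribute value, so it must match exactly, including order.
    title = row.find("td", class_="Ta(start) W(45%)").find("span").text
    pay = row.find("td", class_="Ta(end)").find("span").text
    return [name, title, pay]


print(parse_row(SAMPLE_HTML))
```

Because the selectors are tied to exact class strings, they stop matching as soon as the page markup changes, in which case `find()` returns `None` and the chained call raises an `AttributeError`.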
How to run it is shown using Apple (ticker symbol: AAPL), currently a hot topic because of the iPhone 12, as an example. First, the basic information.
python
soup = getSoup('https://finance.yahoo.com/quote/AAPL/profile?p=AAPL')
profile = getAssetProfile(soup)
print('\r\n'.join(profile))
#profile[0]: Sector(s)
#profile[1]: Industry
#profile[2]: Full Time Employees
Here are the results of the run.
python
Technology
Consumer Electronics
147,000
Next is the list of key executives.
python
exs = getKeyExecutives(soup)
#print('\r\n'.join(exs))
for ex in exs:
    print(ex)
#ex[0]: Name
#ex[1]: Title
#ex[2]: Pay
Here are the results of the run.
['Mr. Timothy D. Cook', 'CEO & Director', '11.56M']
['Mr. Luca Maestri', 'CFO & Sr. VP', '3.58M']
['Mr. Jeffrey E. Williams', 'Chief Operating Officer', '3.57M']
['Ms. Katherine L. Adams', 'Sr. VP, Gen. Counsel & Sec.', '3.6M']
["Ms. Deirdre O'Brien", 'Sr. VP of People & Retail', '2.69M']
['Mr. Chris Kondo', 'Sr. Director of Corp. Accounting', 'N/A']
['Mr. James Wilson', 'Chief Technology Officer', 'N/A']
['Ms. Mary Demby', 'Chief Information Officer', 'N/A']
['Ms. Nancy Paxton', 'Sr. Director of Investor Relations & Treasury', 'N/A']
['Mr. Greg Joswiak', 'Sr. VP of Worldwide Marketing', 'N/A']
Next is the business description.
python
desc = getDescription(soup)
print('\r\n'.join(desc))
Here are the results of the run.
Apple Inc. designs, manufactures, and markets smartphones, personal computers, tablets, wearables, and accessories worldwide. It also sells various related services. The company offers iPhone, a line of smartphones; Mac, a line of personal computers; iPad, a line of multi-purpose tablets; and wearables, home, and accessories comprising AirPods, Apple TV, Apple Watch, Beats products, HomePod, iPod touch, and other Apple-branded and third-party accessories. It also provides AppleCare support services; cloud services store services; and operates various platforms, including the App Store, that allow customers to discover and download applications and digital content, such as books, music, video, games, and podcasts. In addition, the company offers various services, such as Apple Arcade, a game subscription service; Apple Music, which offers users a curated listening experience with on-demand radio stations; Apple News+, a subscription news and magazine service; Apple TV+, which offers exclusive original content; Apple Card, a co-branded credit card; and Apple Pay, a cashless payment service, as well as licenses its intellectual property. The company serves consumers, and small and mid-sized businesses; and the education, enterprise, and government markets. It sells and delivers third-party applications for its products through the App Store. The company also sells its products through its retail and online stores, and direct sales force; and third-party cellular network carriers, wholesalers, retailers, and resellers. Apple Inc. was founded in 1977 and is headquartered in Cupertino, California.
From here the URL changes, and we look at shareholder information. First, the summary.
python
soup = getSoup('https://finance.yahoo.com/quote/AAPL/holders?p=AAPL')
holders = getMajorHolders(soup)
for holder in holders:
    print(holder)
#holder[0]: share
#holder[1]: heldby
Here are the results of the run.
['0.07%', '% of Shares Held by All Insider']
['62.12%', '% of Shares Held by Institutions']
['62.16%', '% of Float Held by Institutions']
['4,296', 'Number of Institutions Holding Shares']
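The two URLs used in this article differ only in the page name (`profile` or `holders`), so they can be built from a ticker symbol. The helper below is a hypothetical sketch: `build_quote_url` and `KNOWN_PAGES` are names I introduced, and the URL pattern is simply copied from the two URLs shown in this article, so it may not hold for other pages or after a site change.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://finance.yahoo.com/quote"
# Only the two pages used in this article; other page names are not assumed.
KNOWN_PAGES = ("profile", "holders")


def build_quote_url(ticker: str, page: str) -> str:
    """Build a URL such as .../quote/AAPL/profile?p=AAPL (hypothetical helper)."""
    if page not in KNOWN_PAGES:
        raise ValueError(f"unsupported page: {page!r}")
    symbol = ticker.strip().upper()
    # Escape the symbol for the path and for the query string separately.
    return f"{BASE_URL}/{quote(symbol, safe='')}/{page}?{urlencode({'p': symbol})}"


print(build_quote_url("AAPL", "profile"))
print(build_quote_url("AAPL", "holders"))
```

The result can be passed straight to `getSoup`, which makes it easier to repeat the same steps for other ticker symbols.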
Next is shareholder information (institutional holders).
python
topholders = getTopHolders(soup, 'Institutional')
for holder in topholders:
    print(holder)
#holder[0]: Holder
#holder[1]: Shares
#holder[2]: Date Reported
#holder[3]: % Out
#holder[4]: Value
Here are the results of the run.
['Vanguard Group, Inc. (The)', '1,315,961,000', 'Jun 29, 2020', '7.69%', '120,015,643,200']
['Blackrock Inc.', '1,101,824,048', 'Jun 29, 2020', '6.44%', '100,486,353,177']
['Berkshire Hathaway, Inc', '980,622,264', 'Jun 29, 2020', '5.73%', '89,432,750,476']
['State Street Corporation', '709,057,472', 'Jun 29, 2020', '4.15%', '64,666,041,446']
['FMR, LLC', '383,300,188', 'Jun 29, 2020', '2.24%', '34,956,977,145']
['Geode Capital Management, LLC', '251,695,416', 'Jun 29, 2020', '1.47%', '22,954,621,939']
['Price (T.Rowe) Associates Inc', '233,087,540', 'Jun 29, 2020', '1.36%', '21,257,583,648']
['Northern Trust Corporation', '214,144,092', 'Jun 29, 2020', '1.25%', '19,529,941,190']
['Norges Bank Investment Management', '187,425,092', 'Dec 30, 2019', '1.10%', '13,759,344,566']
['Bank Of New York Mellon Corporation', '171,219,584', 'Jun 29, 2020', '1.00%', '15,615,226,060']
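Every value returned by `getTopHolders` is a string, so it has to be converted before any sorting or arithmetic. The following is a minimal sketch under the assumption that each row has the five columns shown above, in that order and in that format; `parse_holder_row` is a name I introduced for illustration. Note that the `%b` month format relies on English month abbreviations, which is Python's default unless the locale has been changed.

```python
from datetime import datetime


def parse_holder_row(row):
    """Convert one [holder, shares, date, % out, value] row to typed values."""
    holder, shares, reported, pct_out, value = row
    return {
        "holder": holder,
        # '1,315,961,000' -> 1315961000
        "shares": int(shares.replace(",", "")),
        # 'Jun 29, 2020' -> datetime.date(2020, 6, 29)
        "date_reported": datetime.strptime(reported, "%b %d, %Y").date(),
        # '7.69%' -> 7.69
        "pct_out": float(pct_out.rstrip("%")),
        "value": int(value.replace(",", "")),
    }


row = ['Vanguard Group, Inc. (The)', '1,315,961,000', 'Jun 29, 2020', '7.69%', '120,015,643,200']
print(parse_holder_row(row))
```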
Next is shareholder information (mutual fund holders).
python
topholders = getTopHolders(soup, 'MutualFund')
for holder in topholders:
    print(holder)
#holder[0]: Holder
#holder[1]: Shares
#holder[2]: Date Reported
#holder[3]: % Out
#holder[4]: Value
Here are the results of the run.
['Vanguard Total Stock Market Index Fund', '444,698,584', 'Jun 29, 2020', '2.60%', '40,556,510,860']
['Vanguard 500 Index Fund', '338,116,248', 'Jun 29, 2020', '1.98%', '30,836,201,817']
['SPDR S&P 500 ETF Trust', '169,565,200', 'Sep 29, 2020', '0.99%', '19,637,345,812']
['Invesco ETF Tr-Invesco QQQ Tr, Series 1 ETF', '155,032,988', 'Aug 30, 2020', '0.91%', '20,005,456,771']
['Fidelity 500 Index Fund', '145,557,920', 'Aug 30, 2020', '0.85%', '18,782,793,996']
['Vanguard Institutional Index Fund-Institutional Index Fund', '143,016,840', 'Jun 29, 2020', '0.84%', '13,043,135,808']
['iShares Core S&P 500 ETF', '123,444,255', 'Sep 29, 2020', '0.72%', '14,296,079,171']
['Vanguard Growth Index Fund', '123,245,072', 'Jun 29, 2020', '0.72%', '11,239,950,566']
['Vanguard Information Technology Index Fund', '79,770,560', 'Aug 30, 2020', '0.47%', '10,293,593,062']
['Select Sector SPDR Fund-Technology', '69,764,960', 'Sep 29, 2020', '0.41%', '8,079,480,017']
You have now obtained the same information that is displayed in the web browser. If you collect this information for many different companies, you may be able to see, for example, a list of the companies in which a well-known individual investor appears as a shareholder, or something like a trend.
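The idea of listing the companies in which a given holder appears can be sketched as an inverted index from holder name to ticker symbols. This is only an illustration: `index_by_holder` is a hypothetical helper, the rows are shortened to the holder name, and the ticker `XXXX` with its row is a made-up placeholder rather than real data.

```python
from collections import defaultdict


def index_by_holder(rows_by_ticker):
    """Map each holder name to the sorted list of tickers it appears in.

    rows_by_ticker maps a ticker symbol to rows shaped like the output of
    getTopHolders, where the first element of each row is the holder name.
    """
    index = defaultdict(set)
    for ticker, rows in rows_by_ticker.items():
        for row in rows:
            index[row[0]].add(ticker)
    return {holder: sorted(tickers) for holder, tickers in index.items()}


rows_by_ticker = {
    "AAPL": [["Vanguard Group, Inc. (The)"], ["Blackrock Inc."]],
    # Made-up placeholder entry, not real data.
    "XXXX": [["Vanguard Group, Inc. (The)"]],
}
print(index_by_holder(rows_by_ticker))
```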
This article showed how to acquire (crawl) company information from **Yahoo! Finance** using BeautifulSoup4.