Scraping desired data from a website by linking Python and Excel

Purpose of scraping

First of all, it is worth stating what I wanted to achieve, even with an application this small.

I mainly invest in investment trusts (mutual funds). In recent years a wide range of fund products has come onto the market, and good new ones appear all the time.

"Is the fund I'm buying really any good?" "There are probably cheaper, more profitable products out there."

About **once a year**, I catch myself thinking things like that.

So I search the investment trust pages of my broker's site and compare products, but there are many items I have to check, such as the base price (NAV), the Sharpe ratio, and the trust fees, and I want to compare about 10 funds.

**Ah, I just want a rough list of this information in tabular form...!**

This is the purpose of this scraping.

Environment and target site

OS: Windows
Programming language: Python 3.8
Software: Excel 2016
Editor: Visual Studio Code

An example of the target site: the individual product pages of Rakuten Securities (because Rakuten Securities is my main broker), e.g. https://www.rakuten-sec.co.jp/web/fund/detail/?ID=JP90C000BRT6

The page structure is the same for other products.

I used **BeautifulSoup** for scraping and **openpyxl** for writing to Excel. I chose these libraries simply because a search turned up many articles using them; starting where the information is plentiful is a sensible rule of thumb for a beginner. (I had never written Python before.)

Official documentation:
BeautifulSoup https://www.crummy.com/software/BeautifulSoup/bs4/doc/
openpyxl https://openpyxl.readthedocs.io/en/stable/tutorial.html

Advance preparation

I prepared an Excel file in advance. (Screenshot: WS000056.JPG)

The columns, from the left:

- Fund name
- Securities company
- Classification (e.g. whether it is a domestic stock fund or a developed-country stock fund)
- Base price
- Net assets (100 million yen)
- Previous net assets (100 million yen)
- Increase/decrease in net assets
- Latest distribution
- Purchase fee
- Management cost ratio (costs such as trust fees and administrative fees)
- URL

The meanings of the individual items are not relevant to this article, so I will omit them here; in short, think of them as fields for comparing fund specifications. In reality there are many more.

Incidentally, "previous net assets (100 million)" and "increase/decrease in net assets" already have values and formulas set in advance: I want the difference between "net assets (100 million)" and "previous net assets (100 million)".
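As a sketch of what such a pre-set formula could look like via openpyxl (the column letters E/F/G are my assumption based on the column order above; openpyxl simply stores the formula string and Excel evaluates it when the file is opened):

```python
import openpyxl

# Assumed layout: net assets in column E, previous net assets in F,
# and their difference in G (column letters are not confirmed).
wb = openpyxl.Workbook()
ws = wb.active
ws['G2'] = '=E2-F2'  # openpyxl stores the formula string verbatim
```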

The "URL" column is also known in advance, so I fill it in from the start.

Note that the data acquired this time is **only public information that requires no login**. I do not collect any information about which products I bought, for how much, or in what quantity.

The purpose is to compare financial products themselves.

Scraping

First, prepare to use Beautiful Soup.

fund.py


import requests, bs4

Since we will access several Rakuten Securities URLs this time, we define a function in advance that takes a URL as an argument.

fund.py


#Rakuten Securities
def GetRakutenFund(url):
    res = requests.get(url)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, "html.parser")

Since the items to collect are already decided, we also define a class to hold them.

fund.py


#Fund information class
class FundInfo:
    def __init__(self):
        self.name = ''
        self.company = ''
        self.category = ''
        self.baseprice = ''
        self.assets = 0
        self.allotment = 0
        self.commision = ''
        self.cost = 0

**The information scraped by GetRakutenFund is stored in a FundInfo instance.**

Now let's work out how to extract the desired information from this page: https://www.rakuten-sec.co.jp/web/fund/detail/?ID=JP90C000BRT6

It is tedious, but we use the browser's developer tools to identify the elements. The structure turned out to be as follows.

| Item | Class name |
| --- | --- |
| Fund name | fund-name |
| Classification | fund-type |
| Base price | value-01 |
| Net assets | value-02 |
| Most recent distribution | value-02 |
| Purchase fee | no-fee |
| Management cost ratio | (no class name) |

If a class name is unique, getting the data is easy, but that is not the case here: net assets and the most recent distribution share the same class name, and the management cost ratio has no class name at all.

So, where I could not target an element by class name alone, I **grabbed the parent element one level up and collected its children into a list**. (Screenshot: キャプチャ.PNG)

In this image, everything is grouped under a class called tbl-fund-summary; from there I extracted the elements with the class name value-02.

fund.py


fundsummary = soup.find("table", attrs={"class": "tbl-fund-summary"})
elements = fundsummary.find_all("span", attrs={"class": "value-02"})
fundinfo.assets = elements[0].text
fundinfo.allotment = elements[1].text

This identifies elements[0] as the net assets and elements[1] as the most recent distribution.

The management cost ratio is identified in the same way. (Screenshot: キャプチャ2.PNG)

This item was the only td inside the li element with the class trust-fee.

fund.py


costs = soup.find("li", attrs={"class": "trust-fee"})
elements = costs.find_all("td")
fundinfo.cost = elements[0].text

In the end, the GetRakutenFund function looks like this:

fund.py


#Rakuten Securities
def GetRakutenFund(url):
    res = requests.get(url)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, "html.parser")

    fundinfo = FundInfo()
    #Fund name, classification
    fundinfo.name = soup.select_one('.fund-name').text
    fundinfo.company = 'Rakuten'
    fundinfo.category = soup.select_one('.fund-type').text
    #Base price, net assets, most recent distribution
    fundsummary = soup.find("table", attrs={"class": "tbl-fund-summary"})
    element = fundsummary.select_one('.value-01')
    fundinfo.baseprice = element.text + element.next_sibling
    elements = fundsummary.find_all("span", attrs={"class": "value-02"})
    fundinfo.assets = elements[0].text
    fundinfo.allotment = elements[1].text
    #Purchase fee and management costs such as trust fees
    fundinfo.commision = soup.select_one('.no-fee').text
    costs = soup.find("li", attrs={"class": "trust-fee"})
    elements = costs.find_all("td")
    fundinfo.cost = elements[0].text

    return fundinfo

If you are experienced with scraping, there is no doubt a smarter way to write this.
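For what it's worth, one arguably smarter variant is to use CSS selectors throughout, which scopes a non-unique class to its parent in a single expression. A minimal sketch against a stripped-down, hypothetical stand-in for the page's HTML (the values are invented, not real data from the site):

```python
import bs4

# Stripped-down stand-in for the real page structure (values invented)
html = """
<table class="tbl-fund-summary">
  <span class="value-01">10,000</span>
  <span class="value-02">123.45</span>
  <span class="value-02">0</span>
</table>
<li class="trust-fee"><td>0.154%</td></li>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# A descendant selector replaces the find()/find_all() two-step
values = soup.select(".tbl-fund-summary .value-02")
assets, allotment = values[0].text, values[1].text

# The class-less td can be reached the same way
cost = soup.select_one(".trust-fee td").text
print(assets, allotment, cost)
```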

And the caller side. Since the scraping code is kept separate from the main processing file, fund.py is imported as fund.

main.py


nam = fund.GetRakutenFund('https://www.rakuten-sec.co.jp/web/fund/detail/?ID=JP90C000BRT6')

Linkage to Excel

We can now get the desired information as FundInfo instances. Next, pour that data into Excel.

We will use openpyxl, so install it with pip first. After installing, write the import statement.

exceloperator.py


import openpyxl

Then define the method that executes the Excel process.

exceloperator.py


def WriteExcel(fund_info_list):

This time there were **four** URLs I wanted to collect (I had not listed them earlier). So I store the four FundInfo instances in a list (fund_info_list), pass it as an argument to the Excel-writing function, and process them in a loop.

First, load the Excel file prepared in advance, then get the worksheet to process. In this case, the "fund" sheet is the target.

exceloperator.py


#the r prefix makes this a raw string (backslashes are not treated as escapes)
wb = openpyxl.load_workbook(r'Excel file path')
ws = wb['fund']

When passing a path as an argument on Windows, the backslashes in it can be misinterpreted as escape sequences. Prefixing the string with r makes it a raw string, so the backslashes are taken literally.
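A quick illustration of the raw-string prefix (the path here is made up):

```python
# Without r, sequences like \U or \t in a Windows path can be mangled or
# even raise a SyntaxError; doubling the backslashes or r'...' avoids this.
escaped = 'C:\\Users\\taro\\funds.xlsx'  # escaped backslashes
raw = r'C:\Users\taro\funds.xlsx'        # raw string, backslashes literal
print(escaped == raw)  # True: both spell the same string
```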

All that remains is to set each FundInfo field into the corresponding cell. In my sheet, the 6th and 7th columns compute the difference from the previous check, so I do not update them. There is probably a way to pack the values into a list, but for now I set them one by one.

exceloperator.py


row = 2
for fund in fund_info_list:
    col = 1
    #6th and 7th columns are not subject to update
    ws.cell(column=col, row=row, value=fund.name)
    col += 1
    ws.cell(column=col, row=row, value=fund.company)
    col += 1
    ws.cell(column=col, row=row, value=fund.category)
    col += 1
    ws.cell(column=col, row=row, value=fund.baseprice)
    col += 1
    ws.cell(column=col, row=row, value=float(fund.assets.replace(',', '')))
    col += 3
    ws.cell(column=col, row=row, value=int(fund.allotment))
    col += 1
    if fund.commision == 'None':
        ws.cell(column=col, row=row, value=0)
    else:
        ws.cell(column=col, row=row, value=fund.commision)
    col += 1
    ws.cell(column=col, row=row, value=fund.cost)
    row += 1

One more point: I want to treat the net assets (assets) and the latest distribution (allotment) as numeric types, so I convert them before setting the cells. Since the net assets figure may contain thousands-separator commas, I strip the commas first. The purchase fee is shown as "None" on the site (I only buy no-load funds), but a numeric 0 is easier to work with than the string "None", so I convert it here.
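That numeric conversion could be factored into a tiny helper; a minimal sketch (parse_amount is my own hypothetical name, not from the code above):

```python
def parse_amount(text):
    """Strip thousands separators like '12,345' and convert to float."""
    return float(text.replace(',', ''))

print(parse_amount('12,345'))    # 12345.0
print(parse_amount('1,234.56'))  # 1234.56
```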

Ah, how I miss the increment operator... (the murmur of a C# developer)
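One way to avoid the long chain of col += 1 lines is to pack (column, value) pairs into a list and loop over them, which also makes the skipped formula columns explicit. A sketch with a dummy worksheet standing in for the openpyxl sheet (it mimics the cell() keyword signature); the class and helper names are my own:

```python
class DummySheet:
    """Stand-in for an openpyxl worksheet, recording writes in a dict."""
    def __init__(self):
        self.cells = {}
    def cell(self, column, row, value):
        self.cells[(row, column)] = value

def write_fund_row(ws, row, name, company, assets_text):
    # Columns 6 and 7 hold pre-set formulas, so they are simply absent here
    pairs = [
        (1, name),
        (2, company),
        (5, float(assets_text.replace(',', ''))),
    ]
    for col, value in pairs:
        ws.cell(column=col, row=row, value=value)

ws = DummySheet()
write_fund_row(ws, 2, 'Sample Fund', 'Rakuten', '1,234')
print(ws.cells[(2, 5)])  # 1234.0
```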

Finally, save the file properly. If you pass the path of the file you opened, it is overwritten.

exceloperator.py


wb.save(r'Excel file path')

The whole code

It is beginner code and not pretty, but here it is in full.

main.py


import fund, exceloperator

#Main function
#<No purchase/redemption fees> Nissay TOPIX Index Fund
nam = fund.GetRakutenFund('https://www.rakuten-sec.co.jp/web/fund/detail/?ID=JP90C000BRT6')
#Tawara No-Load Developed Country Equity
am_one = fund.GetRakutenFund('https://www.rakuten-sec.co.jp/web/fund/detail/?ID=JP90C000CMK4')
#eMAXIS Slim Emerging Markets Equity Index
emax_emarging = fund.GetRakutenFund('https://www.rakuten-sec.co.jp/web/fund/detail/?ID=JP90C000F7H5')
#eMAXIS Slim US Stock (S & P500)
emax_sp500 = fund.GetRakutenFund('https://www.rakuten-sec.co.jp/web/fund/detail/?ID=JP90C000GKC6')

#Write to EXCEL
fund_info_list = [nam, am_one, emax_emarging, emax_sp500]
exceloperator.WriteExcel(fund_info_list)

fund.py


#Scraping with Beautiful Soup 4
import requests, bs4

#Fund information class
class FundInfo:
    def __init__(self):
        self.name = ''
        self.company = ''
        self.category = ''
        self.baseprice = ''
        self.assets = 0
        self.allotment = 0
        self.commision = ''
        self.cost = 0
    
#Rakuten Securities
def GetRakutenFund(url):
    res = requests.get(url)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, "html.parser")

    fundinfo = FundInfo()
    #Fund name, classification
    fundinfo.name = soup.select_one('.fund-name').text
    fundinfo.company = 'Rakuten'
    fundinfo.category = soup.select_one('.fund-type').text
    #Base price, net assets, most recent distribution
    fundsummary = soup.find("table", attrs={"class": "tbl-fund-summary"})
    element = fundsummary.select_one('.value-01')
    fundinfo.baseprice = element.text + element.next_sibling
    elements = fundsummary.find_all("span", attrs={"class": "value-02"})
    fundinfo.assets = elements[0].text
    fundinfo.allotment = elements[1].text
    #Purchase fee and management costs such as trust fees
    fundinfo.commision = soup.select_one('.no-fee').text
    costs = soup.find("li", attrs={"class": "trust-fee"})
    elements = costs.find_all("td")
    fundinfo.cost = elements[0].text

    return fundinfo

exceloperator.py


#Excel operation using openpyxl
import openpyxl

def WriteExcel(fund_info_list):
    #the r prefix makes this a raw string (backslashes are not treated as escapes)
    wb = openpyxl.load_workbook(r'Excel file path')
    ws = wb['fund']
    
    row = 2
    for fund in fund_info_list:
        col = 1
        #6th and 7th columns are not subject to update
        ws.cell(column=col, row=row, value=fund.name)
        col += 1
        ws.cell(column=col, row=row, value=fund.company)
        col += 1
        ws.cell(column=col, row=row, value=fund.category)
        col += 1
        ws.cell(column=col, row=row, value=fund.baseprice)
        col += 1
        ws.cell(column=col, row=row, value=float(fund.assets.replace(',', '')))
        col += 3
        ws.cell(column=col, row=row, value=int(fund.allotment))
        col += 1
        if fund.commision == 'None':
            ws.cell(column=col, row=row, value=0)
        else:
            ws.cell(column=col, row=row, value=fund.commision)
        col += 1
        ws.cell(column=col, row=row, value=fund.cost)
        row += 1

    wb.save(r'Excel file path')

Excel file after scraping

(Screenshot: WS000055.JPG)

Impressions

This may be the first time I have written proper Python code. I am glad I got the behavior I wanted, though I still have a lot of the language left to learn...

Since my day job is on the business side, it is hard to write about the technology I use at work... there are not many work stories I can freely publish.

Apparently Excel is also going to strengthen its integration with Python. (Would xlwings be the better choice?) Honestly, I secretly love Excel, so if I can keep using it, I want to. (Wishful thinking.)
