Scraping desired data from a website by linking Python and Excel

Purpose of scraping

First of all, it is worth stating what I wanted to achieve, even with an application this small.

I mainly invest in investment trusts (mutual funds). In recent years a wide range of fund products has come onto the market, and good new ones appear all the time.

"Is the fund I'm buying really any good?" "There are probably cheaper, more profitable products out there."

About **once a year**, I catch myself thinking things like that.

So I search the investment trust pages of my broker's site and compare products, but there are many items I have to check, such as the base price (NAV), the Sharpe ratio, and the trust fees, and I want to compare about 10 funds.

**Ah, I just want a rough list of this information in tabular form...!**

This is the purpose of this scraping.

Environment and target site

OS: Windows
Programming language: Python 3.8
Software: Excel 2016
Editor: Visual Studio Code

An example of the target site: the individual product pages of Rakuten Securities (because Rakuten Securities is my main broker), e.g. https://www.rakuten-sec.co.jp/web/fund/detail/?ID=JP90C000BRT6

The page structure is the same for other products.

I used **BeautifulSoup** for scraping and **openpyxl** for writing to Excel. I chose these libraries simply because a search turned up many articles using them; starting where the information is plentiful is a sensible rule of thumb for a beginner. (I had never written Python before.)

Official documentation:
BeautifulSoup https://www.crummy.com/software/BeautifulSoup/bs4/doc/
openpyxl https://openpyxl.readthedocs.io/en/stable/tutorial.html

Advance preparation

I prepared an Excel file in advance. (Screenshot: WS000056.JPG)

The columns, from the left:

- Fund name
- Securities company
- Classification (e.g. whether it is a domestic stock fund or a developed-country stock fund)
- Base price
- Net assets (100 million yen)
- Previous net assets (100 million yen)
- Increase/decrease in net assets
- Latest distribution
- Purchase fee
- Management cost ratio (costs such as trust fees and administrative fees)
- URL

The meanings of the individual items are not relevant to this article, so I will omit them here; in short, think of them as fields for comparing fund specifications. In reality there are many more.

Incidentally, "previous net assets (100 million)" and "increase/decrease in net assets" already have values and formulas set in advance: I want the difference between "net assets (100 million)" and "previous net assets (100 million)".
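As a sketch of what such a pre-set formula could look like via openpyxl (the column letters E/F/G are my assumption based on the column order above; openpyxl simply stores the formula string and Excel evaluates it when the file is opened):

```python
import openpyxl

# Assumed layout: net assets in column E, previous net assets in F,
# and their difference in G (column letters are not confirmed).
wb = openpyxl.Workbook()
ws = wb.active
ws['G2'] = '=E2-F2'  # openpyxl stores the formula string verbatim
```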

The "URL" column is also known in advance, so I fill it in from the start.

Note that the data acquired this time is **only public information that requires no login**. I do not collect any information about which products I bought, for how much, or in what quantity.

The purpose is to compare financial products themselves.

Scraping

First, prepare to use Beautiful Soup.

fund.py


import requests, bs4

Since we will access several Rakuten Securities URLs this time, we define a function in advance that takes a URL as an argument.

fund.py


#Rakuten Securities
def GetRakutenFund(url):
    res = requests.get(url)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, "html.parser")

Since the items to collect are already decided, we also define a class to hold them.

fund.py


#Fund information class
class FundInfo:
    def __init__(self):
        self.name = ''
        self.company = ''
        self.category = ''
        self.baseprice = ''
        self.assets = 0
        self.allotment = 0
        self.commision = ''
        self.cost = 0

**The information scraped by GetRakutenFund is stored in a FundInfo instance.**

Now let's work out how to extract the desired information from this page: https://www.rakuten-sec.co.jp/web/fund/detail/?ID=JP90C000BRT6

It is tedious, but we use the browser's developer tools to identify the elements. The structure turned out to be as follows.

| Item | Class name |
| --- | --- |
| Fund name | fund-name |
| Classification | fund-type |
| Base price | value-01 |
| Net assets | value-02 |
| Most recent distribution | value-02 |
| Purchase fee | no-fee |
| Management cost ratio | (no class name) |

If a class name is unique, getting the data is easy, but that is not the case here: net assets and the most recent distribution share the same class name, and the management cost ratio has no class name at all.

So, where I could not target an element by class name alone, I **grabbed the parent element one level up and collected its children into a list**. (Screenshot: キャプチャ.PNG)

In this image, everything is grouped under a class called tbl-fund-summary; from there I extracted the elements with the class name value-02.

fund.py


fundsummary = soup.find("table", attrs={"class": "tbl-fund-summary"})
elements = fundsummary.find_all("span", attrs={"class": "value-02"})
fundinfo.assets = elements[0].text
fundinfo.allotment = elements[1].text

This identifies elements[0] as the net assets and elements[1] as the most recent distribution.

The management cost ratio is identified in the same way. (Screenshot: キャプチャ2.PNG)

This item was the only td inside the li element with the class trust-fee.

fund.py


costs = soup.find("li", attrs={"class": "trust-fee"})
elements = costs.find_all("td")
fundinfo.cost = elements[0].text

In the end, the GetRakutenFund function looks like this:

fund.py


#Rakuten Securities
def GetRakutenFund(url):
    res = requests.get(url)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, "html.parser")

    fundinfo = FundInfo()
    #Fund name, classification
    fundinfo.name = soup.select_one('.fund-name').text
    fundinfo.company = 'Rakuten'
    fundinfo.category = soup.select_one('.fund-type').text
    #Base price, net assets, most recent distribution
    fundsummary = soup.find("table", attrs={"class": "tbl-fund-summary"})
    element = fundsummary.select_one('.value-01')
    fundinfo.baseprice = element.text + element.next_sibling
    elements = fundsummary.find_all("span", attrs={"class": "value-02"})
    fundinfo.assets = elements[0].text
    fundinfo.allotment = elements[1].text
    #Purchase fee and management costs such as trust fees
    fundinfo.commision = soup.select_one('.no-fee').text
    costs = soup.find("li", attrs={"class": "trust-fee"})
    elements = costs.find_all("td")
    fundinfo.cost = elements[0].text

    return fundinfo

If you are experienced with scraping, there is no doubt a smarter way to write this.
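For what it's worth, one arguably smarter variant is to use CSS selectors throughout, which scopes a non-unique class to its parent in a single expression. A minimal sketch against a stripped-down, hypothetical stand-in for the page's HTML (the values are invented, not real data from the site):

```python
import bs4

# Stripped-down stand-in for the real page structure (values invented)
html = """
<table class="tbl-fund-summary">
  <span class="value-01">10,000</span>
  <span class="value-02">123.45</span>
  <span class="value-02">0</span>
</table>
<li class="trust-fee"><td>0.154%</td></li>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# A descendant selector replaces the find()/find_all() two-step
values = soup.select(".tbl-fund-summary .value-02")
assets, allotment = values[0].text, values[1].text

# The class-less td can be reached the same way
cost = soup.select_one(".trust-fee td").text
print(assets, allotment, cost)
```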

And the caller side. Since the scraping code is kept separate from the main processing file, fund.py is imported as fund.

main.py


nam = fund.GetRakutenFund('https://www.rakuten-sec.co.jp/web/fund/detail/?ID=JP90C000BRT6')

Linkage to Excel

We can now get the desired information as FundInfo instances. Next, pour that data into Excel.

We will use openpyxl, so install it with pip first. After installing, write the import statement.

exceloperator.py


import openpyxl

Then define the method that executes the Excel process.

exceloperator.py


def WriteExcel(fund_info_list):

This time there were **four** URLs I wanted to collect (I had not listed them earlier). So I store the four FundInfo instances in a list (fund_info_list), pass it as an argument to the Excel-writing function, and process them in a loop.

First, load the Excel file prepared in advance, then get the worksheet to process. In this case, the "fund" sheet is the target.

exceloperator.py


#the r prefix makes this a raw string (backslashes are not treated as escapes)
wb = openpyxl.load_workbook(r'Excel file path')
ws = wb['fund']

When passing a path as an argument on Windows, the backslashes in it can be misinterpreted as escape sequences. Prefixing the string with r makes it a raw string, so the backslashes are taken literally.
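A quick illustration of the raw-string prefix (the path here is made up):

```python
# Without r, sequences like \U or \t in a Windows path can be mangled or
# even raise a SyntaxError; doubling the backslashes or r'...' avoids this.
escaped = 'C:\\Users\\taro\\funds.xlsx'  # escaped backslashes
raw = r'C:\Users\taro\funds.xlsx'        # raw string, backslashes literal
print(escaped == raw)  # True: both spell the same string
```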

All that remains is to set each FundInfo field into the corresponding cell. In my sheet, the 6th and 7th columns compute the difference from the previous check, so I do not update them. There is probably a way to pack the values into a list, but for now I set them one by one.

exceloperator.py


row = 2
for fund in fund_info_list:
    col = 1
    #6th and 7th columns are not subject to update
    ws.cell(column=col, row=row, value=fund.name)
    col += 1
    ws.cell(column=col, row=row, value=fund.company)
    col += 1
    ws.cell(column=col, row=row, value=fund.category)
    col += 1
    ws.cell(column=col, row=row, value=fund.baseprice)
    col += 1
    ws.cell(column=col, row=row, value=float(fund.assets.replace(',', '')))
    col += 3
    ws.cell(column=col, row=row, value=int(fund.allotment))
    col += 1
    if fund.commision == 'None':
        ws.cell(column=col, row=row, value=0)
    else:
        ws.cell(column=col, row=row, value=fund.commision)
    col += 1
    ws.cell(column=col, row=row, value=fund.cost)
    row += 1

One more point: I want to treat the net assets (assets) and the latest distribution (allotment) as numeric types, so I convert them before setting the cells. Since the net assets figure may contain thousands-separator commas, I strip the commas first. The purchase fee is shown as "None" on the site (I only buy no-load funds), but a numeric 0 is easier to work with than the string "None", so I convert it here.
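That numeric conversion could be factored into a tiny helper; a minimal sketch (parse_amount is my own hypothetical name, not from the code above):

```python
def parse_amount(text):
    """Strip thousands separators like '12,345' and convert to float."""
    return float(text.replace(',', ''))

print(parse_amount('12,345'))    # 12345.0
print(parse_amount('1,234.56'))  # 1234.56
```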

Ah, how I miss the increment operator... (the murmur of a C# developer)
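One way to avoid the long chain of col += 1 lines is to pack (column, value) pairs into a list and loop over them, which also makes the skipped formula columns explicit. A sketch with a dummy worksheet standing in for the openpyxl sheet (it mimics the cell() keyword signature); the class and helper names are my own:

```python
class DummySheet:
    """Stand-in for an openpyxl worksheet, recording writes in a dict."""
    def __init__(self):
        self.cells = {}
    def cell(self, column, row, value):
        self.cells[(row, column)] = value

def write_fund_row(ws, row, name, company, assets_text):
    # Columns 6 and 7 hold pre-set formulas, so they are simply absent here
    pairs = [
        (1, name),
        (2, company),
        (5, float(assets_text.replace(',', ''))),
    ]
    for col, value in pairs:
        ws.cell(column=col, row=row, value=value)

ws = DummySheet()
write_fund_row(ws, 2, 'Sample Fund', 'Rakuten', '1,234')
print(ws.cells[(2, 5)])  # 1234.0
```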

Finally, save the file properly. If you pass the path of the file you opened, it is overwritten.

exceloperator.py


wb.save(r'Excel file path')

The whole code

It is beginner code and not pretty, but here it is in full.

main.py


import fund, exceloperator

#Main function
#<No purchase/redemption fees> Nissay TOPIX Index Fund
nam = fund.GetRakutenFund('https://www.rakuten-sec.co.jp/web/fund/detail/?ID=JP90C000BRT6')
#Tawara No-Load Developed Country Equity
am_one = fund.GetRakutenFund('https://www.rakuten-sec.co.jp/web/fund/detail/?ID=JP90C000CMK4')
#eMAXIS Slim Emerging Markets Equity Index
emax_emarging = fund.GetRakutenFund('https://www.rakuten-sec.co.jp/web/fund/detail/?ID=JP90C000F7H5')
#eMAXIS Slim US Stock (S & P500)
emax_sp500 = fund.GetRakutenFund('https://www.rakuten-sec.co.jp/web/fund/detail/?ID=JP90C000GKC6')

#Write to EXCEL
fund_info_list = [nam, am_one, emax_emarging, emax_sp500]
exceloperator.WriteExcel(fund_info_list)

fund.py


#Scraping with Beautiful Soup 4
import requests, bs4

#Fund information class
class FundInfo:
    def __init__(self):
        self.name = ''
        self.company = ''
        self.category = ''
        self.baseprice = ''
        self.assets = 0
        self.allotment = 0
        self.commision = ''
        self.cost = 0
    
#Rakuten Securities
def GetRakutenFund(url):
    res = requests.get(url)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, "html.parser")

    fundinfo = FundInfo()
    #Fund name, classification
    fundinfo.name = soup.select_one('.fund-name').text
    fundinfo.company = 'Rakuten'
    fundinfo.category = soup.select_one('.fund-type').text
    #Base price, net assets, most recent distribution
    fundsummary = soup.find("table", attrs={"class": "tbl-fund-summary"})
    element = fundsummary.select_one('.value-01')
    fundinfo.baseprice = element.text + element.next_sibling
    elements = fundsummary.find_all("span", attrs={"class": "value-02"})
    fundinfo.assets = elements[0].text
    fundinfo.allotment = elements[1].text
    #Purchase fee and management costs such as trust fees
    fundinfo.commision = soup.select_one('.no-fee').text
    costs = soup.find("li", attrs={"class": "trust-fee"})
    elements = costs.find_all("td")
    fundinfo.cost = elements[0].text

    return fundinfo

exceloperator.py


#Excel operation using openpyxl
import openpyxl

def WriteExcel(fund_info_list):
    #the r prefix makes this a raw string (backslashes are not treated as escapes)
    wb = openpyxl.load_workbook(r'Excel file path')
    ws = wb['fund']
    
    row = 2
    for fund in fund_info_list:
        col = 1
        #6th and 7th columns are not subject to update
        ws.cell(column=col, row=row, value=fund.name)
        col += 1
        ws.cell(column=col, row=row, value=fund.company)
        col += 1
        ws.cell(column=col, row=row, value=fund.category)
        col += 1
        ws.cell(column=col, row=row, value=fund.baseprice)
        col += 1
        ws.cell(column=col, row=row, value=float(fund.assets.replace(',', '')))
        col += 3
        ws.cell(column=col, row=row, value=int(fund.allotment))
        col += 1
        if fund.commision == 'None':
            ws.cell(column=col, row=row, value=0)
        else:
            ws.cell(column=col, row=row, value=fund.commision)
        col += 1
        ws.cell(column=col, row=row, value=fund.cost)
        row += 1

    wb.save(r'Excel file path')

Excel file after scraping

(Screenshot: WS000055.JPG)

Impressions

This may be the first time I have written proper Python code. I am glad I got the behavior I wanted, though I still have a lot of the language left to learn...

Since my day job is on the business side, it is hard to write about the technology I use at work... there are not many work stories I can freely publish.

Apparently Excel is also going to strengthen its integration with Python. (Would xlwings be the better choice?) Honestly, I secretly love Excel, so if I can keep using it, I want to. (Wishful thinking.)
