Scraping with Python: Getting the base price of mutual funds from Yahoo! Finance

About this article

This article explains how to get the base price (NAV) of a mutual fund from Yahoo! Finance by web scraping with Python and lxml.

**[Addendum] Scraping data from Yahoo! Finance appears to be prohibited by its terms of use, so please use an alternative method instead:** Scraping with Python: Getting the base price of mutual funds from the Investment Trusts Association website

Environment

- Windows 10 x64
- Python 2.7.11
- lxml 3.5.0

Change history

2016/1/16

- Changed to pass the URL directly to lxml.html.parse(); removed the import of urllib2.
- When generating the URL, the arguments are now collected into a dict and then expanded with format().
- Changed how the for loop iterates.
- The .encode('utf-8') processing of the elements obtained by XPath from the ElementTree is now done up front with map().
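The dict-plus-format() change listed above can be sketched in isolation. This is a minimal illustration; the parameter names and URL pattern mirror the script later in the article:

```python
# A minimal sketch of building a query URL by unpacking a dict with format().
params = dict(fundcode='64311104', sy='2015', sm='12', sd='1',
              ey='2015', em='12', ed='20')

# Each {name} placeholder in the template is filled from the matching dict key.
url = ('http://info.finance.yahoo.co.jp/history/?code={fundcode}'
       '&sy={sy}&sm={sm}&sd={sd}&ey={ey}&em={em}&ed={ed}&tm=d').format(**params)

print(url)
```

Splitting the template into two adjacent string literals keeps the line short without introducing stray whitespace into the URL, which a backslash line continuation inside the string would do.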

Procedure

Check the position of the data you want to get

Since the goal of web scraping is to extract the text at a specific position in an HTML / XML document, first locate the target data. Chrome's Inspect feature (developer tools) is handy for this. (See below.)

Basic usage of Chrome developer tools (inspect elements)

Right-click on the page and select Inspect. (You can also press Ctrl + Shift + I.) The HTML elements appear in the right half of the screen, and when you select a tag, the corresponding part of the page is highlighted. Use this to drill down until you can pinpoint the data you want to extract.

In the case of Yahoo! Finance, the hierarchy is as follows:

```
div id="main"
 └ div
    └ table
       └ tr
          └ td  ← here is the base price
```

Write the data position you want to get in XPath

XPath is a notation for expressing the location of arbitrary content in an HTML / XML document. In Chrome, you can get the XPath of a node with right-click -> Copy -> Copy XPath. (See below.)

How to easily get the XPath of any node using only Chrome

This time I want all the `td` elements of the table under the `div` element with `id="main"`, so the XPath is: `//*[@id="main"]/div/table//td`
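To check that this XPath selects what we expect, it can be tried against a small hand-written HTML fragment. The fragment below is an invented stand-in for the real Yahoo! Finance page, shaped to match the hierarchy described above:

```python
# -*- coding: utf-8 -*-
import lxml.html

# Invented stand-in for the real page: rows of [date, price, net assets]
# cells in a table under <div id="main">.
html = u'''
<html><body>
  <div id="main">
    <div>
      <table>
        <tr><td>2015年12月1日</td><td>10,000</td><td>123,456</td></tr>
        <tr><td>2015年12月2日</td><td>10,050</td><td>123,500</td></tr>
      </table>
    </div>
  </div>
</body></html>
'''

root = lxml.html.fromstring(html)
# Same XPath as in the article: every td of the table under div#main.
cells = [td.text for td in root.xpath('//*[@id="main"]/div/table//td')]
print(cells)
```

The result is one flat list of cell texts (six entries here), which is exactly the shape the script below has to reassemble into rows.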

Get the HTML with parse() and extract the required elements with XPath

From here on, we work in Python. Pass the URL to lxml.html.parse() to get an ElementTree, and extract the elements specified by the XPath from it. Finally, reshape the result and output it as a list of [date, base price, total net assets] entries. The date ends up as a string of the form yyyymmdd.

getNAV.py


# -*- coding: utf-8 -*-
# python 2.7
import lxml.html
import datetime

def getNAV(fundcode, sy, sm, sd, ey, em, ed):
    # Collect the arguments into a dict
    d = dict(fundcode=fundcode, sy=sy, sm=sm, sd=sd, ey=ey, em=em, ed=ed)

    # Unpack the dict to generate the URL
    # (adjacent string literals avoid stray whitespace inside the URL)
    url = ('http://info.finance.yahoo.co.jp/history/?code={fundcode}'
           '&sy={sy}&sm={sm}&sd={sd}&ey={ey}&em={em}&ed={ed}&tm=d').format(**d)

    # Get the ElementTree
    tree = lxml.html.parse(url)

    # Get all the date / base price / net assets cells, applying
    # utf-8 encoding and comma removal with map()
    contents = map(lambda html: html.text.encode('utf-8').replace(',', ''),
                   tree.xpath('//*[@id="main"]/div/table//td'))

    # The result is one flat list, so split it into
    # [[date, price, cap], [date, price, cap], ...]
    res = []
    for i in range(0, len(contents)-1, 3):
        # The page shows dates in Japanese (YYYY年MM月DD日); normalize to yyyymmdd
        date = datetime.datetime.strptime(contents[i], '%Y年%m月%d日').strftime('%Y%m%d')
        price = int(contents[i+1])
        cap = int(contents[i+2])
        res.append([date, price, cap])

    return res

if __name__ == '__main__':
    # Collect the parameters into a dict
    args = dict(fundcode='64311104', sy='2015', sm='12', sd='1', ey='2015', em='12', ed='20')
    # Pass the dict, unpacking it into keyword arguments
    print getNAV(**args)
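The reshaping step at the end (splitting the flat cell list into rows and normalizing the date) can be exercised on its own without touching the network. The cell values below are invented sample data:

```python
# -*- coding: utf-8 -*-
import datetime

# Invented sample of the flat cell list: [date, price, cap, date, price, cap, ...]
contents = [u'2015年12月1日', '10000', '123456',
            u'2015年12月2日', '10050', '123500']

res = []
# Walk the flat list three cells at a time, one row per step.
for i in range(0, len(contents) - 1, 3):
    # Parse the Japanese date (YYYY年MM月DD日) and reformat it as yyyymmdd.
    date = datetime.datetime.strptime(contents[i], u'%Y年%m月%d日').strftime('%Y%m%d')
    price = int(contents[i + 1])
    cap = int(contents[i + 2])
    res.append([date, price, cap])

print(res)
```

strptime treats the Japanese characters in the format string as literals that must appear between the %Y, %m, and %d fields, so no locale configuration is needed for this conversion.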

Referenced articles

- lxml - Processing XML and HTML with Python
- Tips for scraping with lxml
- [Python] Scraping notes with lxml