Store stock prices scraped with Python in a DB

0. Introduction

**This is the third (and final) installment** of a series summarizing what I investigated about putting Python to use for investing. As a recap of the series so far, I will walk through the whole flow of scraping stock data and storing it in a DB. I will basically assume that everything in the articles below is already known. (This article is also commented as much as possible, but please see the past articles for details.)

**Previous articles in this series**
・Handling databases easily with Python (SQLite3)
・Web scraping with Python ① (prior knowledge for scraping)
・Web scraping with Python ② (actually scraping a stock site)

1. Goal of this article

  1. Extract the four daily prices (open/high/low/close) plus trading volume for all listed stocks in 2020 (1/6–12/28) from Stock Investment Memo
    * The site's robots.txt is Allow: /, so scraping permission is granted (a sketch for checking this yourself follows this list).
  2. Store the retrieved information in a database (SQLite)
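
As a quick aside, you can check robots.txt permissions from Python itself with the standard library. A minimal sketch, assuming the robots.txt lives at the site's standard /robots.txt path:

python

import urllib.robotparser

# Parse the site's robots.txt (assumed to be at the standard /robots.txt path)
rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://kabuoji3.com/robots.txt')
rp.read()

# True if a generic crawler is allowed to fetch a stock page
print(rp.can_fetch('*', 'https://kabuoji3.com/stock/7203/2020/'))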

2. Confirming the URL and HTML tags for extracting individual stock data (one year's worth, 2020)

2-1. First, confirm which URL the data can be fetched from

For example, if you search for Toyota Motor Corporation from the top page of Stock Investment Memo, you can jump to its annual history DB page, and selecting "2020" there brings up the 2020 data. If you check the URL at that point, you can see it has the form /stock/<securities code>/<year>. In other words, by changing the securities code part, you can switch to the data of any stock.
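
In code, that pattern is just string assembly; for example:

python

# Build the page URL from a securities code and a year (pattern: /stock/<code>/<year>/)
code, year = 7203, 2020
url = f'https://kabuoji3.com/stock/{code}/{year}/'
print(url)  # https://kabuoji3.com/stock/7203/2020/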

2-2. Next, check which HTML tags hold the information you want

**The details were already explained in the previous article, so please check it for specifics.** If you inspect the target with Chrome's developer tools, you can see that it sits in the thead and tbody parts under `table class="stock_table"`. One difference from the previous article, however, is that the stock prices are not all stored directly under a single tbody; each row sits in its own tbody. But it's not a big deal.

2-3. As an experiment, try fetching Toyota Motor's 2020 stock price data

User-agent check: https://www.ugtop.com/spill.shtml (displays the user agent of your current browser)

python


import requests
from bs4 import BeautifulSoup

# URL of the page to scrape
url = 'https://kabuoji3.com/stock/7203/2020/'
# Copy and paste the user agent of your current browser (checked at the confirmation URL above). * Rewrite to match your environment
headers = {"User-Agent": "Mozilla/*** Chrome/*** Safari/***"}
# Get the page (HTML) from the website using the requests library
response = requests.get(url, headers = headers)
# Create a BeautifulSoup object from the fetched HTML
soup = BeautifulSoup(response.content, "html.parser")

# First find the thead with the find method, then extract all the tr elements inside it with find_all
tag_thead_tr = soup.find('thead').find_all('tr')

print(tag_thead_tr)

Execution result


[<tr>
 <th>date</th>
 <th>Open price</th>
 <th>High price</th>
 <th>Low price</th>
 <th>closing price</th>
 <th>Volume</th>
 <th>Closing price adjustment</th>
 </tr>]

python


'''
Under table class="stock_table stock_data_table", the stock price data is stored one tbody per row.
Therefore, first get the table itself, then extract its contents.
'''
# Search for the table with findAll and take out the 0th element (there is only one)
table = soup.findAll("table", {"class":"stock_table stock_data_table"})[0]

# Get the tr tags inside it. The first row is the thead, so skip it with [1:]. * Why we don't use tbody is explained below.
tag_tbody_tr = table.findAll("tr")[1:]

# Show only the first row as an example
print(tag_tbody_tr[0])

Execution result


[<tr>
<td>2020-01-06</td>
<td>7599</td>
<td>7626</td>
<td>7530</td>
<td>7565</td>
<td>6672300</td>
<td>7565</td>
</tr>]

Regarding the stock price rows: if you check the actual HTML (shown in the figure below), a findAll on tbody returns only the first `<tbody> ~ </tbody>`, i.e. only the 2020/1/6 row. This comes from the HTML rule that once `<tbody>` has been written, it can be omitted for the rows that follow. So this time, unlike the previous article, we had to search for the `<tr>` tags, which are written out without omission for every date.

(Image: キャプチャ6.PNG — the actual HTML of the stock table, where only the first row has an explicit `<tbody>` tag)
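
To see the difference for yourself, a quick comparison like the following (reusing the `soup` object from section 2-3) should show that a tbody search does not pick up every row the way a tr search does; the exact counts depend on how the parser handles the omitted tags:

python

# Compare searching by tbody vs tr within the same table
table = soup.findAll("table", {"class":"stock_table stock_data_table"})[0]
# tbody search misses the rows whose tags were omitted in the HTML
print(len(table.findAll("tbody")))
# tr search finds one row per trading day (the thead row is skipped with [1:])
print(len(table.findAll("tr")[1:]))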

In addition, the brand name and securities code needed when storing to the DB are obtained from the same page. As the execution result below shows, the securities code cannot be obtained by itself from this page, but in the actual run we only need to slice out the first 4 digits.

python


# You can see the information is embedded in a span tag with class 'jp' inside the header's base_box block
name = soup.findAll("span", {"class":"jp"})[0]

print(name.get_text())

Execution result


'7203 Toyota Motor(stock)'
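
Since the securities code always comes first in that string, slicing off the first four characters gives the code on its own:

python

# Extract just the securities code: the first 4 characters of the span text
code = name.get_text()[:4]
print(code)  # '7203'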

3. Extract the data for all stocks and store it in the DB

Now that section 2 showed how to extract the data, let's apply it to all stocks and write the results to the DB.

3-1. Think about the DB structure (schema)

You have to think about how to store the data as a database (**this design is called a schema**). Since this is a simple example, jumping straight to an ER diagram (a diagram representing the DB design) gives something like the figure below.

The ER diagram was drawn with draw.io, which is simple to use. Reference: How to make an ER diagram [simplified version]

(Image: キャプチャ7.PNG — the ER diagram: a brand master and a stock price table)

・The brand master stores the securities code and stock name (+ ideally the listed market as well)
・The stock price table holds all the price data for each stock (which stock a row belongs to is identified by code; a minimal SQL sketch follows this list)
・**This article only walks through acquiring the prices (the stock price table). The names (brand master) are easy, so please try that yourself, referring to this article**
・**Strictly speaking, further processing related to the "closing price adjustment" (stock splits, etc.) is also required, but it is not covered in this article**
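
For reference, here is a minimal sketch of that schema in SQL, executed through sqlite3. The table and column names are my own choice to match the ER diagram's intent; note that in section 3-3 the price table is actually created automatically by pandas' to_sql, so you don't need to run this yourself:

python

import sqlite3

db = sqlite3.connect('sample.db')
# Brand master: one row per stock
db.execute("""CREATE TABLE IF NOT EXISTS brand (
    code TEXT PRIMARY KEY,  -- securities code
    name TEXT               -- stock name
)""")
# Stock price table: one row per stock per trading day, linked to the brand master by code
db.execute("""CREATE TABLE IF NOT EXISTS price_sketch (
    date TEXT,
    open REAL, high REAL, low REAL, close REAL,
    volume INTEGER,
    adj_close REAL,         -- closing price adjustment
    code TEXT REFERENCES brand(code)
)""")
db.close()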

3-2. Create a function that retrieves the information to store in the DB and holds it in a DataFrame first

I think there are various ways to do this part, but since I'm used to pandas, I decided to build the stock table with pandas first and then convert it to SQL with to_sql, so that is how it is written here. You don't have to go through pandas; feel free to do it your own way.

**Note that the `time.sleep(1)` part is the measure against putting load on the site while scraping.** If the site specifies a number of seconds, replace the number accordingly.

python


import time
import requests
import pandas as pd
from bs4 import BeautifulSoup

def get_brand(code):
    """
    A function that scrapes the page for a securities code, converts it to a DataFrame, and returns it.
    """

    # Copy and paste the user agent of your current browser (checked at the confirmation URL above). * Rewrite to match your environment
    headers = {"User-Agent": "Mozilla/*** Chrome/*** Safari/***"}

    # Build the URL of the stock you want to fetch
    url = 'https://kabuoji3.com/stock/' + str(code) + '/2020/'

    # If the stock price page does not exist, return None via exception handling
    try:
        # Create a BeautifulSoup object from the fetched HTML
        response = requests.get(url, headers = headers)
        soup = BeautifulSoup(response.content, "html.parser")
        
        # Get the securities code and stock name
        code_name = soup.findAll("span", {"class":"jp"})[0]
        
        # Get the header (column) information
        tag_thead_tr = soup.find('thead').find_all('tr')
        head = [h.text for h in tag_thead_tr[0].find_all('th')] 
        
        # Get the stock price data and convert it to a DataFrame
        table = soup.findAll("table", {"class":"stock_table stock_data_table"})[0]
        tag_tbody_tr = table.findAll("tr")[1:]
        
        data = []
        for i in range(len(tag_tbody_tr)):
            data.append([d.text for d in tag_tbody_tr[i].find_all('td')])
        df = pd.DataFrame(data, columns = head)
            
        # Add a new code column with assign. * The securities code is the first 4 characters of code_name
        df = df.assign(code=code_name.get_text()[:4])
        
        # For debugging: print the fetched URL. * Not needed in production code
        print(url)
    
    except (ValueError, IndexError, AttributeError):
        return None
    
    return df

def brands_generator(code_range):
    """
    A function that generates securities codes one by one and concatenates the retrieved information.
    """

    for i,code in enumerate(code_range):
        # Create a new empty DataFrame only on the first iteration
        if i==0:
            cols = ['date', 'Open price', 'High price', 'Low price', 'closing price', 'Volume', 'Closing price adjustment', 'code']
            df = pd.DataFrame(index=[], columns=cols)

        # Pass the generated securities code to the scraping function
        brand = get_brand(code)
        
        # If information was retrieved, concatenate it
        if brand is not None:
            df = pd.concat([df, brand]).reset_index(drop=True)

        # Pause the program for 1 second (measure against scraping load)
        time.sleep(1)
        
    return df

"""
As an example, we fetch securities codes 1301 through 1309 here.
For the real run you can go from 1301 to 9999 (which naturally takes time).
"""
df = brands_generator(range(1301,1310))

Execution result


https://kabuoji3.com/stock/1301/2020/
https://kabuoji3.com/stock/1305/2020/
https://kabuoji3.com/stock/1306/2020/
https://kabuoji3.com/stock/1308/2020/
https://kabuoji3.com/stock/1309/2020/

The URLs of the securities codes that were fetched appear in the output. Codes such as 1302 and 1307 simply don't exist, so nothing could be fetched for them, which is the correct result. The scraping appears to be working as intended.
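
If you want to double-check programmatically which codes made it into the result, the code column added earlier already holds that information:

python

# List the securities codes that were actually retrieved
print(sorted(df['code'].unique()))  # ['1301', '1305', '1306', '1308', '1309']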

python


# Show only the first two rows as a sample
df.head(2)

Displaying the resulting df with pandas shows the data in chronological order, as below. The securities code is also stored in the code column we added, so there seems to be no problem.

date Open price High price Low price closing price Volume Closing price adjustment code
0 2020-01-06 7599 7626 7530 7565 6672300 7565 1301
1 2020-01-07 7654 7722 7619 7715 4960700 7715 1301

3-3. Pass the DataFrame to the DB and save it

Now that the data is in DataFrame format, the final step is to load it into the DB.

python


import sqlite3

# Database name, given a .db extension
dbname = ('sample.db')
# Create the database
db = sqlite3.connect(dbname, isolation_level=None)
# Convert df to a database table with to_sql. The table name is price
df.to_sql('price', db, if_exists='append', index=None)

# Create a cursor object for the database
cursor = db.cursor()
# Check with SQL whether the price table was really created
sql = """SELECT name FROM sqlite_master WHERE TYPE='table'"""
for t in cursor.execute(sql):
    print(t)

Execution result


('price',)

You can see that the price table was created properly. Finally, let's read this database back into a DataFrame and confirm that the contents are the same.

python


#Read the price table of the created database again with pandas
df_db = pd.read_sql('SELECT * FROM price', db)
# Show only the first two rows, as before
df_db.head(2)

Naturally, the result is the same. This confirms that the DB contents match the DataFrame extracted by scraping earlier. (You can also inspect the contents with SQL commands, but personally I find **pandas 100 times easier to read**.)

date Open price High price Low price closing price Volume Closing price adjustment code
0 2020-01-06 7599 7626 7530 7565 6672300 7565 1301
1 2020-01-07 7654 7722 7619 7715 4960700 7715 1301

python


# Finally, close the database connection
db.close()

4. Finally

How was it? Saving to your own DB was surprisingly easy, wasn't it? Of course, fetching every securities code takes time, but if, for example, you fetch everything through the end of 2020 and then create a batch that stores each day's data in the DB from 2021 onward, the DB will keep itself up to date with little effort; a rough sketch of such a batch follows.
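
As one possible shape for that batch, here is a sketch only. It assumes `get_brand` has been adapted to fetch the current year's page instead of the hard-coded 2020, and reuses `brands_generator` from section 3-2; the function name and details here are hypothetical:

python

import sqlite3
import pandas as pd

def daily_update(db_path='sample.db', code_range=range(1301, 1310)):
    """Hypothetical daily batch: fetch the latest pages and append only rows newer than the DB holds."""
    db = sqlite3.connect(db_path)
    # The most recent stored date, per securities code
    latest = pd.read_sql('SELECT code, MAX(date) AS last_date FROM price GROUP BY code', db)
    # Re-scrape (assumes get_brand now targets the current year's page)
    df = brands_generator(code_range)
    # Keep only rows newer than what is already stored (ISO dates compare correctly as strings)
    merged = df.merge(latest, on='code', how='left')
    new_rows = merged[merged['last_date'].isna() | (merged['date'] > merged['last_date'])]
    new_rows.drop(columns=['last_date']).to_sql('price', db, if_exists='append', index=None)
    db.close()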

Over these three articles I have shared what I researched about investing, but since Qiita is a site for sharing programming knowledge, I will refrain from going into the detailed analysis beyond this point. I hope you who read this article will scrape various sites yourself and build up your own analysis data, so that together we can turn a profit in 2021. Let's go!

Also, although it is not investment related, I post machine learning and other articles every week, so if you liked this, please LGTM or stock it to keep my motivation up!
