[PYTHON] Stock price acquisition code by scraping (Selenium)

1 About this article

When performing technical analysis and the like, you need stock price data covering a set period for each ticker. For US stocks this is easy: pandas_datareader can call a dedicated API to fetch price data for any period. For Japanese stocks, however, there is no API that provides stock prices for free. Web scraping is the standard workaround for collecting price data, but most sites that publish Japanese stock prices prohibit scraping; Yahoo! Finance, for one, forbids it. For Japanese stocks, the site Stock Investment Memo does permit web scraping, and you can obtain price data for individual stocks there. However, Stock Investment Memo does not cover indices (such as the Nikkei 225) or US stocks.
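
Before scraping any site, it is worth confirming its policy yourself in robots.txt (together with the site's terms of service). Below is a minimal sketch using only the standard library, pointed at the Stock Investment Memo URL that appears later in this article; treat the result as a hint, not legal clearance.

python

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://kabuoji3.com/robots.txt')  #robots.txt of the Stock Investment Memo site
rp.read()
#True if a generic crawler may fetch the given path
print(rp.can_fetch('*', 'https://kabuoji3.com/stock/1570/2020/'))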

Therefore, I fetch the price data of individual Japanese stocks from Stock Investment Memo, and fetch indices such as the Nikkei 225 and US stocks with pandas_datareader.

(Screenshot: the Stock Investment Memo site)

2 The code

The following code fetches the stock prices of individual Japanese stocks as well as indices (the Nikkei 225) and US stocks. It retrieves prices for the past three years, counting back from the base year. Individual Japanese stocks are scraped from the Stock Investment Memo site, while indices and US stocks are fetched through the API via pandas_datareader. Scraping is slower than calling an API, so for Japanese stocks whose prices have already been saved once, the code scrapes only the missing portion, which finishes much faster.
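
The essence of the incremental update is to compare the newest date in the saved csv with the newest date scraped from the site, and fetch only the gap. The following is a minimal sketch of that idea; the function name and csv layout are illustrative assumptions, and the real logic lives in get_add_stockdat below.

python

import datetime
import pandas as pd

def days_behind(csv_path, latest_scraped_date):
    """Return how many calendar days the saved csv lags behind the site (sketch)."""
    saved = pd.read_csv(csv_path)  #csv assumed to hold a DATE column in YYYY/MM/DD format
    last_saved = datetime.datetime.strptime(saved['DATE'].iloc[-1], '%Y/%m/%d').date()
    return (latest_scraped_date - last_saved).days  #0 means the csv is already up to date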

test.py


#-*- coding:utf-8 -*-
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.chrome.options import Options
import time
import pandas as pd
import datetime
from pyvirtualdisplay import Display
from selenium.webdriver.common.by import By
import os.path
import pandas_datareader.data as web #Load the US stock data acquisition library


####(1)Enter the ticker codes to fetch and the base year.#######

STOCKNUMS=[4800,1570,7752,'^N225','VOOV','^DJI','JPY=X']   #Enter the ticker codes to fetch.
#STOCKNUMS=[1570]
YEAR=2020 #Enter the base year. Stock price data is fetched for the past three years from the base year.

####(2)Set the chrome driver.#######

options = Options() 
options.add_argument('--headless')
options.add_argument('--disable-gpu')

####(3)If there is no X environment (running on a console), enable the display commands below. When running under Jupyter, disable them.#######

#Launch a virtual display (Xvfb) via the pyvirtualdisplay package so Selenium can run without a screen
display = Display(visible=0, size=(1024, 1024))
display.start()


####(4)Set various folders##############

EXECUTABLE_PATH="xxxxxxxxxxxxxxxxxx" #Specify the path to chromedriver.exe.
HOME_PATH="yyyyyyyyyyyyyyyyyy" #Specify the folder where stock price data is saved.


#### (5)Stock price acquisition class GET_KABUDATA #####

class GET_KABUDATA():

    ### (5-1)Define a constructor###
    def __init__(self,stocknum,year):
    
        stocknum=str(stocknum)

                
        try:
            print('Ticker ' + str(stocknum) + ': fetching stock price data')

            #If the csv file does not exist (first time fetching this ticker)
            if os.path.exists(HOME_PATH + str(stocknum)+'.csv')==False:
                #If stocknum (ticker) is not numeric (= not an individual Japanese stock)
                if stocknum.isnumeric()==False:
                    self.get_PandasDR(stocknum) #Run pandas_datareader

                #If stocknum (ticker) is numeric (= an individual Japanese stock)
                else:
                    self.get_new_stockdat(stocknum,year) #Scrape the past 3 years of data from Stock Investment Memo.

            #If the csv file exists (a ticker whose data was fetched in the past)
            else:
                #If stocknum (ticker) is not numeric (= not an individual Japanese stock)
                if str(stocknum).isnumeric()==False:
                    self.get_PandasDR(stocknum) #Run pandas_datareader

                #If stocknum (ticker) is numeric (= an individual Japanese stock)
                else:
                    self.get_add_stockdat(stocknum,year) #Fetch the price data for dates not yet saved from Stock Investment Memo and append it to the csv file.

            print('Ticker ' + str(stocknum) + ': finished fetching stock price data')
            print('**********')

        #Error handling (e.g. a non-existent ticker was specified, or the Stock Investment Memo site is down and cannot be scraped)
        except Exception as e:
            print(e)
            print('Failed to fetch the stock price')
                
                
    ### (5-2)For a ticker whose prices were fetched in the past, compare the price data on the site with the saved data, fetch only the difference, and append it to the csv file.
    def get_add_stockdat(self,stocknum,year):
        

        #(5-2-1)Initialize various variables.
        s_date=[]
        s_open=[]
        s_high=[]
        s_low=[]
        s_close=[]
        s_volume=[]
        dfstock=[]
        add_s_date=[]
        add_s_open=[]
        add_s_high=[]
        add_s_low=[]
        add_s_close=[]
        add_s_volume=[]  
        add_s_stock=[] 
        add_dfstock=[]       
        
        #(5-2-2)Access the price table for the ticker on the Stock Investment Memo site (scraping).
        browser = webdriver.Chrome(options=options,executable_path=EXECUTABLE_PATH)
        url='https://kabuoji3.com/stock/'+ str(stocknum) + '/'+ str(year) + '/'
        browser.get(url)
        elem_tmp0 = browser.find_element_by_class_name('data_contents')
        elem_tmp1 = elem_tmp0.find_element_by_class_name('data_block')
        elem_tmp2 = elem_tmp1.find_element_by_class_name('data_block_in')
        elem_tmp3 = elem_tmp2.find_element_by_class_name('table_wrap')
        elem_table= elem_tmp3.find_element_by_class_name('stock_table.stock_data_table')
        elem_table_kabuka=elem_table.find_elements(By.TAG_NAME, "tbody")

        #(5-2-3)Read each row of the price table to get each day's prices
        for i in range(0,len(elem_table_kabuka)):
    
            kabudat=elem_table_kabuka[i].text.split()   
            s_date.append(str(kabudat[0].split('-')[0]) +'/'+ str(kabudat[0].split('-')[1]) +'/'+ str(kabudat[0].split('-')[2])) #Get date
            s_open.append(kabudat[1]) #Get the opening price
            s_high.append(kabudat[2]) #Get the high price.
            s_low.append(kabudat[3]) #Get the low price.
            s_close.append(kabudat[4]) #Get the closing price.
            s_volume.append(kabudat[5]) #Get the volume.
            s_stock={'DATE':s_date,'CLOSE':s_close,'OPEN':s_open,'HIGH':s_high,'LOW':s_low,'VOL':s_volume} #Bundle the date, open, high, low, close, and volume lists into a dict


        dfstock=pd.DataFrame(s_stock,columns=["DATE","CLOSE","OPEN","HIGH","LOW","VOL"]) #Convert s_stock to a DataFrame.
        dfstock.set_index("DATE",inplace=True)
        dfstock=dfstock.sort_index() #Arrange the acquired stock price data in chronological order.
        dfstock.reset_index("DATE",inplace=True)

                
        dfstock_csv= pd.read_csv(HOME_PATH + str(stocknum)+'.csv', index_col=0) #Read the stock price data saved on the server from the csv file.
        dfstock_csv.reset_index("DATE",inplace=True)  #Turn the DATE index back into a column
        

        #(5-2-4)Get the latest date in the stock price data newly scraped from the site
        dfstock_latest = dfstock['DATE'].iloc[dfstock['DATE'].count()-1]
        dfstock_latest=datetime.datetime.strptime(dfstock_latest, '%Y/%m/%d') #Parse the date string into a datetime
        dfstock_latest_date=datetime.date(dfstock_latest.year, dfstock_latest.month, dfstock_latest.day) #Convert the datetime to a date
        
        
        #(5-2-5)Get the latest date in the stock price data stored on the server.
        dfstock_csv_latest = dfstock_csv['DATE'].iloc[dfstock_csv['DATE'].count()-1]
        dfstock_csv_latest=datetime.datetime.strptime(dfstock_csv_latest, '%Y/%m/%d') #Parse the date string into a datetime
        dfstock_csv_latest_date =datetime.date(dfstock_csv_latest.year, dfstock_csv_latest.month, dfstock_csv_latest.day) #Convert the datetime to a date
      
        #(5-2-6)Calculate the difference between the latest date of the stock price data newly scraped from the site and the latest date of the stock price data stored on the server.
        difday=dfstock_latest_date - dfstock_csv_latest_date 


       
        #(5-2-7)Collect only the rows newer than the saved data. The per-row date check avoids duplicates: the table holds trading days only, while difday counts calendar days, so the raw slice can overlap rows already saved.
        for i in range(max(0,len(elem_table_kabuka)-difday.days),len(elem_table_kabuka)):

            kabudat=elem_table_kabuka[i].text.split()
            row_date=datetime.date(int(kabudat[0].split('-')[0]),int(kabudat[0].split('-')[1]),int(kabudat[0].split('-')[2]))
            if row_date<=dfstock_csv_latest_date:
                continue #Skip rows that are already in the csv
            add_s_date.append(str(kabudat[0].split('-')[0]) +'/'+ str(kabudat[0].split('-')[1]) +'/'+ str(kabudat[0].split('-')[2]))
            add_s_open.append(kabudat[1])
            add_s_high.append(kabudat[2])
            add_s_low.append(kabudat[3])
            add_s_close.append(kabudat[4])
            add_s_volume.append(kabudat[5])
            add_s_stock={'DATE':add_s_date,'CLOSE':add_s_close,'OPEN':add_s_open,'HIGH':add_s_high,'LOW':add_s_low,'VOL':add_s_volume}
            

        #Convert the missing rows from list format to a DataFrame
        add_dfstock=pd.DataFrame(add_s_stock,columns=["DATE","CLOSE","OPEN","HIGH","LOW","VOL"])    

        #(5-2-8)Append the missing rows to the saved stock price data.
        dfstock=pd.concat([dfstock_csv, add_dfstock])  

        #(5-2-9)Write the updated stock price data back to csv
        dfstock.set_index("DATE",inplace=True)
        dfstock.to_csv(HOME_PATH + str(stocknum)+'.csv')
        
        
        browser.close() #Close the browser outside the for loop; closing it inside the loop raises an error.
        
 
    ### (5-3)Fetch stock price data for a new ticker.
    def get_new_stockdat(self,stocknum,year):

         #(5-3-1)Initialize various variables.
        s_date=[]
        s_open=[]
        s_high=[]
        s_low=[]
        s_close=[]
        s_volume=[]
        dfstock=[]

        #(5-3-2)Launch the browser to scrape the ticker's price pages on Stock Investment Memo.
        browser = webdriver.Chrome(options=options,executable_path=EXECUTABLE_PATH)

        #(5-3-3)Get stock price data for the past 3 years from Stock Investment Memo
        for j in range(0,3):
            url='https://kabuoji3.com/stock/'+ str(stocknum) + '/'+ str(year-j) + '/'
            browser.get(url)
            elem_tmp0 = browser.find_element_by_class_name('data_contents')
            elem_tmp1 = elem_tmp0.find_element_by_class_name('data_block')
            elem_tmp2 = elem_tmp1.find_element_by_class_name('data_block_in')
            elem_tmp3 = elem_tmp2.find_element_by_class_name('table_wrap')
            elem_table= elem_tmp3.find_element_by_class_name('stock_table.stock_data_table')
            elem_table_kabuka=elem_table.find_elements(By.TAG_NAME, "tbody")

            #(5-3-4)Read each row of the price table to get each day's prices
            for i in range(0,len(elem_table_kabuka)):
    
                kabudat=elem_table_kabuka[i].text.split()   
                s_date.append(str(kabudat[0].split('-')[0]) +'/'+ str(kabudat[0].split('-')[1]) +'/'+ str(kabudat[0].split('-')[2]))     
                s_open.append(kabudat[1])
                s_high.append(kabudat[2])
                s_low.append(kabudat[3])
                s_close.append(kabudat[4])
                s_volume.append(kabudat[5])
                s_stock={'DATE':s_date,'CLOSE':s_close,'OPEN':s_open,'HIGH':s_high,'LOW':s_low,'VOL':s_volume}

                
        dfstock=pd.DataFrame(s_stock,columns=["DATE","CLOSE","OPEN","HIGH","LOW","VOL"]) #Convert s_stock to a DataFrame
        dfstock.set_index("DATE",inplace=True)
        dfstock=dfstock.sort_index() #Arrange the acquired stock price data in chronological order.
        dfstock.to_csv(HOME_PATH + str(stocknum)+'.csv') #Export stock price data to a CSV file.

        browser.close() #Close the browser outside the for loop; closing it inside the loop raises an error.

        
 

    ### (5-4)Fetch stock price data for US stocks and indices.
    def get_PandasDR(self,stocknum):
        ed=datetime.datetime.now()
        st=datetime.datetime.now()- datetime.timedelta(days=600)        
        df=web.DataReader(stocknum, 'yahoo',st,ed) #Fetch the ticker's price data from Yahoo Finance
        df=df.drop(columns='Adj Close') #Drop the Adj Close column.
        df.reset_index("Date",inplace=True)
        df= df.rename(columns={'Date': 'DATE','High': 'HIGH', 'Low': 'LOW',  'Open': 'OPEN', 'Close': 'CLOSE',  'Volume': 'VOL' })#Change each column name to the desired column name.
        df = df[['DATE','CLOSE','HIGH','LOW','OPEN','VOL']]
        df.set_index("DATE",inplace=True)
        df.to_csv(HOME_PATH + str(stocknum)+'.csv') #Write df to external csv file
        
        
        
 
#############Main program##################

#Fetch the stock prices of the specified tickers.
if __name__ == '__main__':        
    
    for stock in STOCKNUMS:
        GET_KABUDATA(stock,YEAR)

    print('Stock price acquisition has been completed.')

3 Notes on running the code

3-1 Setting the tickers to fetch

Enter the ticker codes of the stocks to fetch in the variable STOCKNUMS.

python


STOCKNUMS=[4800,1570,7752,'^N225','VOOV','^DJI','JPY=X']   #Enter the ticker codes to fetch.
#STOCKNUMS=[1570]
YEAR=2020 #Enter the base year. Stock price data is fetched for the past three years from the base year.

^N225: Nikkei 225, VOOV: Vanguard S&P 500 Value ETF, ^DJI: Dow Jones Industrial Average, JPY=X: USD/JPY exchange rate
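
For reference, the same DataReader call that get_PandasDR makes can be tried on its own. A minimal sketch follows; the 180-day window is only for illustration.

python

import datetime
import pandas_datareader.data as web

end = datetime.datetime.now()
start = end - datetime.timedelta(days=180)
df = web.DataReader('^N225', 'yahoo', start, end)  #Nikkei 225 via Yahoo Finance
print(df.tail())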

3-2 Scraping settings

If you run the code in an environment where no browser window can be shown (a console environment), enable the following code; pyvirtualdisplay launches a virtual display (Xvfb), so the Xvfb package must be installed on the machine. Conversely, in an environment where a browser can start, such as Jupyter, disable it.

python


display = Display(visible=0, size=(1024, 1024))
display.start()

3-3 PATH setting

Set the path to chromedriver.exe (EXECUTABLE_PATH) and the folder where the stock price data will be saved (HOME_PATH).

python


EXECUTABLE_PATH="xxxxxxxxxxxxxxxxxx" #Specify the path to chromedriver.exe.
HOME_PATH="yyyyyyyyyyyyyyyyyy" #Specify the folder where stock price data is saved.
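
One caveat: the code in this article uses the Selenium 3 style API (find_element_by_class_name and the executable_path argument), which Selenium 4 deprecates and later removes. If the script fails on a newer Selenium, the following sketch shows the equivalent construction; this is an adaptation for Selenium 4.x, not part of the original code.

python

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

EXECUTABLE_PATH = "xxxxxxxxxxxxxxxxxx"  #same chromedriver placeholder as above

options = Options()
options.add_argument('--headless')
#Selenium 4: pass the driver path through a Service object instead of executable_path
browser = webdriver.Chrome(service=Service(EXECUTABLE_PATH), options=options)
#Selenium 4: find_element_by_* is replaced by find_element(By.*, ...)
elem = browser.find_element(By.CLASS_NAME, 'data_contents')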