Until you get daily data for multiple years of Japanese stocks and save it in a single CSV (Python)

Introduction

This article is the 22nd day article of SFC Advent Calendar 2019. I was wondering what to write about, but this time I decided to rewrite the code I wrote recently and make it an article.

background

In recent years, with the development of various software and libraries, stock price data has come to be used by individuals. However, the stock price data of Japanese stocks is difficult to handle due to copyright law issues (see here for details), so a large amount of stock price data is currently available. There is no service that allows you to download all of them for free.

However, there are some services that allow you to download daily data for each issue for one year. For example, Stock Investment Memo and Buffet Code.

These services are very useful when you want to analyze stock prices for one year, but they are a little inconvenient because you have to process a large number of files when you want to analyze stock prices for multiple years.

Therefore, this time, I will make it possible to use stock price data for multiple years by creating a code that combines multiple CSVs downloaded from Stock Investment Memo. Let's do it. Also, by using the DataFrame of Pandas, which is a Python library, it is designed to be easy to use when performing analysis with Python as it is.

About data download

If you read the above, you may be wondering, "Is there a code for the download part?" In that regard, of course, it is possible to automate, but since it is unclear whether scraping and automation are permitted on the stock investment memo's site, I will not introduce the code in this article.

Instead, Longing for freelance is introduced this code You can fully automate the download by using.

In addition, there are two points that need to be changed at that time, so they are described below.

About the file storage location

It is recommended to replace download_dir ='C: \\ Python \\' with download_dir ='./csv'. By doing this, you can place the downloaded file in the folder directly under the current directory.

About iterative processing

If you want to get the data up to 2019 where it is for i in range (year, 2019):, get the data up to for i in range (year, 2020):, 2020 If you want to, you need to increase it by one year like for i in range (year, 2021):.

Data preparation

First of all, download the stock price data of a specific stock from Stock Investment Memo for multiple years using the method described above and save it.

important point

-Be sure to take time to download __data and be careful not to put a load on the server. __ ・ If there is a leak in __ data, an error will occur, so be sure to download all the data for the required number of years. __ -Do not change the file name. __ - Collect all files in one folder, and do not put data of multiple brands in the same folder. __

Reading data from CSV

Create the code to read the downloaded csv.

code

import pandas as pd
import codecs

def read_csv(file_name: str) -> pd.core.frame.DataFrame:
    '''
    kabuoji3.File name of csv downloaded from com[file_name]Specify and read, format and DataFrame[df]Returns as.
    '''
    with codecs.open(file_name, 'r', 'Shift_JIS', 'ignore') as f:
        df = pd.read_csv(f)
    df.columns = df.iloc[0]
    df.drop(df.index[[0,1]],inplace=True)
    df.index = pd.to_datetime(df.index, format='%Y-%m-%d')
    df.index.name = "date"
    return df

Commentary

The CSV character code that can be downloaded in the stock investment memo is Shift_JIS, so if you read the CSV in the usual way, garbled characters may occur. To prevent this, here, codecs is used to specify the character code to open the file, and pandas function read_csv () is used to read the CSV as DataFrame. In addition, the CSV file downloaded in the stock investment memo contains brand information in the header part, and columns and ʻindexare not recognized correctly, so it includes the process of deleting the header after directly specifying these. I will. Furthermore, by performingto_datetime processing on the date column of ʻindex, the date of ʻindex can be treated as it is as datetimetype. The final output is theDataFrame of pandas`, which contains the stock price data for one year.

Read file list

Create a list of CSV filenames in a single folder, as well as recognize and retrieve how many years stock price data each file contains.

code

from glob import glob

def get_price_data_by_year(year: int) -> pd.core.frame.DataFrame:
    '''
File list in the folder[FILES_DICT]Designated year from[year]Refer to the file name of the stock price csv of, read and DataFrame[df]Returns as.
    '''
    file_name = FILES_DICT[str(year)]
    df = read_csv(file_name)
    return df

if __name__ == "__main__":
    
    #Specify the directory of the folder containing the downloaded CSV (relative path or absolute path is acceptable, after the folder name/*Don't forget)
    CSV_FOLDER_DIRECTORY = './csv/*'

    #CSV above_FOLDER_CSV file dictionary FILES based on DIRECTORY_Part that automatically creates DICT (no editing required)
    FILES_DICT = {}
    files = glob(CSV_FOLDER_DIRECTORY)
    files.sort()
    for file_name in files:
        FILES_DICT[file_name[-8:-4]] = file_name

Commentary

Specify the folder where CSV is collected in the part of CSV_FOLDER_DIRECTORY ='./csv/*'. You can specify either a relative path or an absolute path, but be sure to add / * after the folder name. In the rest of the process, we will create a CSV file dictionary FILES_DICT. This should be defined as a global variable so that it can be referenced from within the function. The CSV that can be downloaded in the stock investment memo is saved with a name such as 7203_2012.csv. By extracting the 2012 (years) part from this, creating aDictionary with the key and the file name value, a CSV file containing stock price data for a specific year You will be able to easily refer to the name.

Data combination (year specified)

Stock price data for several years from the specified year to the specified year are read in order, combined, and output as one data.

code

def create_historical_data(open: int,last: int) -> pd.core.frame.DataFrame:
    '''
Designated year[open]Designated year from[last]Read stock price data up to and combine them into one DataFrame[df]Returns as.
    '''
    df = get_price_data_by_year(open)
    for i in range(int(open) + 1,int(last) + 1):
        df = pd.concat([df, get_price_data_by_year(i)])
    return df

Commentary

Stock price data from the specified year ʻopen to the specified year lastare read in order, and the obtainedDataFrame is combined using pd.concat () . Finally, a DataFrame` with all the data combined is output. If the file of the specified year does not exist, an error will occur. Please make sure that you have downloaded the data for the specified number of years in advance before executing.

Combine data (specify years)

Creates stock price data for the specified number of years retroactively from the execution date and outputs it as one data. This is output in units of dates, but if the date that is exactly the specified number of years retroactively is not the business day of the exchange, the DataFrame containing the stock price data from the business day immediately after that to the present is output. Will be done. Similarly, if the execution date is not the business day of the exchange, a DataFrame containing the data up to the previous business day will be output.

code

import datetime as dt
from dateutil import relativedelta

def create_historical_data_by_date(years: int) -> pd.core.frame.DataFrame:
    '''
Just the specified number of years from the execution date[years]Create minute stock price data and one DataFrame[df]Returns as.
    '''
    this_year = int(dt.datetime.now().year)
    df = create_historical_data(this_year - years,this_year)
    open = dt.datetime.now() - relativedelta.relativedelta(years=years)
    df = df[df.index >= open]
    return df

Commentary

First, use datetime to get the current year, and then usecreate_historical_data ()above to get the stock price data for the specified number of years years. Next, using relativedelta, we get the date that is exactly the specified number of years from the present in datetime type. Since the index of the DataFrame obtained in the previous step has already been converted to the datetime type, you can easily filter the DataFrame by using the comparison operator > =. You can retrieve data for the specified number of years. The final output is a DataFrame that contains stock price data for the specified number of years.

Export to CSV (Save)

I was able to output the data I wanted to use in the above section, but in order to handle this data externally, I need to save it in a file. Here, as an example, we will introduce how to save to CSV.

code

if __name__ == "__main__":

    #When you want to save stock price data from the specified year to the specified year
    df = create_historical_data(2015,2019)
    df.to_csv("2015-2019.csv")

    #When you want to save stock price data for the specified number of years retroactively from the execution date
    df = create_historical_data_by_date(5)
    df.to_csv("5years_price.csv")

Commentary

First, use the functions such as create_historical_data () and create_historical_data_by_date () created in the above section, and save the obtained result in the variable df. This df can be easily exported as CSV usingto_csv (), which is a function of pandas. At this time, it is necessary to specify the file name to be saved, and pass the character string enclosed in " " as the argument of to_csv (). Be sure to include .csv at the end of this. The character code of the file saved at this time is ʻUTF-8, and the delimiter is ,. When importing with Excel, select Data> Text File` and specify the character code and delimiter.

Code (Summary)

The code used this time is published on GitHub. If you are interested, please check here [https://github.com/shota4/jp_stock_price_data/blob/master/edit_price_data.py).

Finally

Until the end Thank you for reading. We hope that your stock analysis and Christmas will be enriched.