[PYTHON] Let's automatically collect company information (XBRL data) using the EDINET API (4/10)

In this fourth post of the Advent Calendar, we programmatically collect the data described in XBRL disclosed on EDINET.

(The program in this article is provided as-is, without any warranty. XBRL Japan assumes no responsibility for any disadvantage or problem caused by using this program, regardless of the cause.)

1. What is EDINET API?

The EDINET API is an API for retrieving (XBRL) data efficiently from the EDINET database via a program, rather than through the EDINET screens. It enables EDINET users to acquire disclosure information efficiently. Before using the API, please check the **Terms of Service** on the EDINET page.


2. Collect XBRL data with EDINET API

2.1 Program Overview

This Python program downloads, via the EDINET API, the XBRL data for the securities reports disclosed on EDINET during the collection period. (The full code is given in "3. Source code".) For the **detailed specifications of the EDINET API**, please check the EDINET site. Quarterly reports, semi-annual reports, and amended securities reports are not covered.

2.2 Preparation

Please complete the following steps before running the program. You also need to install the required libraries in advance (requests is a third-party package; datetime is part of the Python standard library).

2.2.1 Endpoint settings

The endpoint below is current as of December 2019. Check for the latest version (e.g. v1) each time you use the API.

https://disclosure.edinet-fsa.go.jp/api/v1/documents.json
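Because the version segment (v1) may change, it can help to keep it in one place; a small sketch with illustrative variable names (not from the EDINET documentation):

EDINET_API_BASE = "https://disclosure.edinet-fsa.go.jp/api"
EDINET_API_VERSION = "v1"  # update here when a new version is released
DOC_LIST_ENDPOINT = EDINET_API_BASE + "/" + EDINET_API_VERSION + "/documents.json"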

2.2.2 Setting the XBRL collection period

Change `start_date` (collection start date) and `end_date` (collection end date) to the period for which you want to collect XBRL data. For `start_date`, you can specify a date up to five years before the program execution date.

start_date = datetime.date(2019, 11, 1)
end_date = datetime.date(2019, 11, 30)
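Since `start_date` can be at most about five years before the execution date (older dates may return 404, as noted in section 2.3), a small guard like the following sketch catches an out-of-range value before any requests are sent:

import datetime

start_date = datetime.date(2019, 11, 1)
earliest = datetime.date.today() - datetime.timedelta(days=5 * 365)  # roughly five years
if start_date < earliest:
    raise ValueError("start_date is more than about five years back; EDINET may return 404")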

2.2.3 Proxy settings

Set the proxy according to your network environment. If you do not need a proxy, remove `proxies=proxies` from the `requests.get` calls.

"http": "http://username:[email protected]:8080"
"https": "https://username:[email protected]:8080"
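The requests library looks up the proxy by URL scheme, so the keys of the proxies dict are "http" and "https". A sketch of defining the dict and passing it only when a proxy is needed:

import requests

proxies = {
    "http": "http://username:[email protected]:8080",
    "https": "https://username:[email protected]:8080",
}
use_proxy = False  # set to True in a proxied network environment

res = requests.get(
    "https://disclosure.edinet-fsa.go.jp/api/v1/documents.json",
    params={"date": "2019-11-01", "type": 2},
    proxies=proxies if use_proxy else None,
)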

2.2.4 Determining the output folder

Decide on the folder to which the XBRL data will be downloaded.

C://Users//xxx//Desktop//xbrlReport//SR//
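If the folder may not exist yet, it can be created at program start; a minimal sketch using the standard library (the path is the example above):

import os

save_dir = "C://Users//xxx//Desktop//xbrlReport//SR//"
os.makedirs(save_dir, exist_ok=True)  # does nothing if the folder already exists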

2.3 Execution result

First, `day_list`, covering every date in the collection period, is created from the collection start date and collection end date.

Code1


start_date = datetime.date(2019, 11, 1)
end_date = datetime.date(2019, 11, 30)

Result1


day_list [datetime.date(2019, 11, 1), datetime.date(2019, 11, 2), datetime.date(2019, 11, 3), datetime.date(2019, 11, 4), datetime.date(2019, 11, 5), datetime.date(2019, 11, 6), datetime.date(2019, 11, 7), datetime.date(2019, 11, 8), datetime.date(2019, 11, 9), datetime.date(2019, 11, 10), datetime.date(2019, 11, 11), datetime.date(2019, 11, 12), datetime.date(2019, 11, 13), datetime.date(2019, 11, 14), datetime.date(2019, 11, 15), datetime.date(2019, 11, 16), datetime.date(2019, 11, 17), datetime.date(2019, 11, 18), datetime.date(2019, 11, 19), datetime.date(2019, 11, 20), datetime.date(2019, 11, 21), datetime.date(2019, 11, 22), datetime.date(2019, 11, 23), datetime.date(2019, 11, 24), datetime.date(2019, 11, 25), datetime.date(2019, 11, 26), datetime.date(2019, 11, 27), datetime.date(2019, 11, 28), datetime.date(2019, 11, 29), datetime.date(2019, 11, 30)]
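For reference, `make_day_list` in section 3 builds this list with a loop; an equivalent sketch using a single list comprehension (both endpoints inclusive):

import datetime

start_date = datetime.date(2019, 11, 1)
end_date = datetime.date(2019, 11, 30)
day_list = [start_date + datetime.timedelta(days=i)
            for i in range((end_date - start_date).days + 1)]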

Looping over `day_list`, we set `url` (the endpoint), `params` (the date information), and, if necessary, `proxies` (the proxy information) for each date, then obtain `res` (a Response object) by calling `requests.get(url, params=params, proxies=proxies)`. `url` specifies the endpoint of the document list API. Specifying `"type": 2` in `params` returns the list of submitted documents, which is needed to identify securities reports in the subsequent processing.

Code2



for index, day in enumerate(day_list):
    url = "https://disclosure.edinet-fsa.go.jp/api/v1/documents.json"
    params = {"date": day, "type": 2}

    # requests expects the keys "http" and "https" in the proxies dict
    proxies = {
        "http": "http://username:[email protected]:8080/",
        "https": "https://username:[email protected]:8080/"
    }

    res = requests.get(url, params=params, proxies=proxies)

The structure of `res` is defined in section 2-1-2-1 "Document List API (metadata)" of the EDINET API specification PDF. The following is the content of `res` for a `day` of 2019-11-01.

Result2




2019-11-01
{
    "metadata": {
        "title": "API for grasping submitted documents",
        "parameter": {
            "date": "2019-11-01",
            "type": "2"
        },
        "resultset": {
            "count": 315
        },
        "processDateTime": "2019-12-05 00:00",
        "status": "200",
        "message": "OK"
    },
    "results": [
        {
            "seqNumber": 1,
            "docID": "S100H5LU",
            "edinetCode": "E12422",
            "secCode": null,
            "JCN": "4010001046310",
            "filerName": "Shinkin Asset Management Investment Trust Co., Ltd.",
            "fundCode": "G03385",
            "ordinanceCode": "030",
            "formCode": "07A000",
            "docTypeCode": "120",
            "periodStart": "2018-08-07",
            "periodEnd": "2019-08-06",
            "submitDateTime": "2019-11-01 09:00",
            "docDescription": "Securities Report (Domestic Investment Trust Beneficiary Securities) -17th Term(August 7, 2018-August 6, 2018-Reiwa 1)",
            "issuerEdinetCode": null,
            "subjectEdinetCode": null,
            "subsidiaryEdinetCode": null,
            "currentReportReason": null,
            "parentDocID": null,
            "opeDateTime": null,
            "withdrawalStatus": "0",
            "docInfoEditStatus": "0",
            "disclosureStatus": "0",
            "xbrlFlag": "1",
            "pdfFlag": "1",
            "attachDocFlag": "1",
            "englishDocFlag": "0"
        },
        ...
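Before looping over `results`, the `metadata` block shown above can serve as a quick sanity check; a minimal self-contained sketch:

import requests

res = requests.get(
    "https://disclosure.edinet-fsa.go.jp/api/v1/documents.json",
    params={"date": "2019-11-01", "type": 2},
)
json_data = res.json()
# "status" should be "200"; "count" is the number of documents disclosed that day
print(json_data["metadata"]["status"], json_data["metadata"]["resultset"]["count"])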

Since the list of submitted documents is held in the `results` field of `res`, we loop over `results`. For each submitted document, we obtain `ordinanceCode` (the Cabinet Office ordinance code) and `formCode` (the form code). Since securities reports are the target this time, only submitted documents with an `ordinanceCode` of 010 and a `formCode` of 030000 are processed. The `docID` (document management number) of each matching document is stored in `securities_report_doc_list` (the list of securities reports).

Code3



for num in range(len(json_data["results"])):
    ordinance_code = json_data["results"][num]["ordinanceCode"]
    form_code = json_data["results"][num]["formCode"]

    if ordinance_code == "010" and form_code == "030000":
        securities_report_doc_list.append(json_data["results"][num]["docID"])

This produces a list of docIDs corresponding to securities reports.

Result3


number_of_lists: 77
get_list: ['S100H8TT', 'S100HE9U', 'S100HC6W', 'S100HFA0', 'S100HFBC', 'S100HFB3', 'S100HG9S', 'S100HG62', 'S100HGJL', 'S100HFMG', 'S100HGM1', 'S100HGMZ', 'S100HGFM', 'S100HFC2', 'S100HGNQ', 'S100HGS3', 'S100HGYR', 'S100HGMB', 'S100HGKE', 'S100HFJG', 'S100HGTC', 'S100HH1G', 'S100HH9I', 'S100HGTF', 'S100HHAL', 'S100HHC0', 'S100HFIB', 'S100HH1I', 'S100HH36', 'S100HHDF', 'S100HH9L', 'S100HHGB', 'S100HHGJ', 'S100HHCR', 'S100HHJJ', 'S100HHH0', 'S100HHLH', 'S100HHL6', 'S100HHD4', 'S100HHM7', 'S100HHL9', 'S100HHN6', 'S100HHO8', 'S100HHHV', 'S100HHE3', 'S100HGB5', 'S100HHQ0', 'S100HHP5', 'S100HHMK', 'S100HHE6', 'S100HHPR', 'S100HHDA', 'S100HHR7', 'S100HHSB', 'S100HHML', 'S100HH9H', 'S100HH2F', 'S100H8W1', 'S100HHRP', 'S100HHTM', 'S100HHAF', 'S100HHUD', 'S100HHK9', 'S100HHT4', 'S100HHCI', 'S100HHXQ', 'S100HHO8', 'S100HHSS', 'S100HHRL', 'S100HI19', 'S100HHXS', 'S100HI1W', 'S100HHSP', 'S100HHN4', 'S100HI3J', 'S100HI3K', 'S100HI4G']

The following code downloads the XBRL data using this list, looping over `securities_report_doc_list`. `url` specifies the endpoint of the document acquisition API (note that this is not the document list API). Specifying `"type": 1` in `params` retrieves the submitted document and the audit report. The XBRL data is downloaded only when the status code of `res` is 200 (i.e. the request succeeded). If you specify a date outside the period covered by EDINET, for example more than five years ago, a status code of 404 (resource does not exist) may be returned.

Code4


for index,doc_id in enumerate(securities_report_doc_list):
    url = "https://disclosure.edinet-fsa.go.jp/api/v1/documents/" + doc_id
    params = {"type": 1}
    filename = "C:\\Users\\XXX\\Desktop\\xbrlReport\\SR\\" + doc_id + ".zip"
    res = requests.get(url, params=params, stream=True)

    if res.status_code == 200:
        with open(filename, 'wb') as file:
            for chunk in res.iter_content(chunk_size=1024):
                file.write(chunk)
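Code4 silently skips any non-200 response. To see which documents were not downloaded (for example the 404 case mentioned above), the download step can report failures; a sketch with an illustrative helper function, not part of the article's code:

import requests

def download_doc(doc_id, save_dir="C:\\Users\\XXX\\Desktop\\xbrlReport\\SR\\"):
    # Document acquisition API; "type": 1 returns the submitted document and audit report
    url = "https://disclosure.edinet-fsa.go.jp/api/v1/documents/" + doc_id
    res = requests.get(url, params={"type": 1}, stream=True)
    if res.status_code == 200:
        with open(save_dir + doc_id + ".zip", "wb") as file:
            for chunk in res.iter_content(chunk_size=1024):
                file.write(chunk)
    else:
        print(doc_id, "not downloaded; HTTP status", res.status_code)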

After execution, 77 zip files were downloaded to the specified folder. Unzipping one of them reveals the familiar AuditDoc and PublicDoc folders. This completes the download of the XBRL data.
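To unpack all of the downloaded files programmatically rather than by hand, the standard library's zipfile module is enough; a minimal sketch (paths follow the example above):

import glob
import os
import zipfile

save_dir = "C://Users//xxx//Desktop//xbrlReport//SR//"
for path in glob.glob(os.path.join(save_dir, "*.zip")):
    # extract each docID.zip into a folder of the same name (AuditDoc/PublicDoc appear inside)
    with zipfile.ZipFile(path) as archive:
        archive.extractall(path[:-4])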

3. Source code

# -*- coding: utf-8 -*-
import requests
import datetime


def make_day_list(start_date, end_date):
    # Build a list of every date from start_date to end_date, inclusive
    print("start_date:", start_date)
    print("end_day:", end_date)

    period = end_date - start_date
    period = int(period.days)
    day_list = []
    for d in range(period):
        day = start_date + datetime.timedelta(days=d)
        day_list.append(day)

    day_list.append(end_date)

    return day_list


def make_doc_id_list(day_list):
    # Query the document list API for each date and collect docIDs of securities reports
    securities_report_doc_list = []
    for index, day in enumerate(day_list):
        url = "https://disclosure.edinet-fsa.go.jp/api/v1/documents.json"
        params = {"date": day, "type": 2}

        # requests expects the keys "http" and "https" in the proxies dict
        proxies = {
            "http": "http://username:[email protected]:8080",
            "https": "https://username:[email protected]:8080"
        }

        res = requests.get(url, params=params, proxies=proxies)
        json_data = res.json()
        print(day)

        for num in range(len(json_data["results"])):

            ordinance_code = json_data["results"][num]["ordinanceCode"]
            form_code = json_data["results"][num]["formCode"]

            if ordinance_code == "010" and form_code == "030000":
                print(json_data["results"][num]["filerName"], json_data["results"][num]["docDescription"],
                      json_data["results"][num]["docID"])
                securities_report_doc_list.append(json_data["results"][num]["docID"])

    return securities_report_doc_list


def download_xbrl_in_zip(securities_report_doc_list, number_of_lists):
    # Download the XBRL zip file for each docID via the document acquisition API
    for index, doc_id in enumerate(securities_report_doc_list):
        print(doc_id, ":", index + 1, "/", number_of_lists)
        url = "https://disclosure.edinet-fsa.go.jp/api/v1/documents/" + doc_id
        params = {"type": 1}
        filename = "C://Users//xxx//Desktop//xbrlReport//SR//" + doc_id + ".zip"
        res = requests.get(url, params=params, stream=True)

        if res.status_code == 200:
            with open(filename, 'wb') as file:
                for chunk in res.iter_content(chunk_size=1024):
                    file.write(chunk)

def main():
    start_date = datetime.date(2019, 11, 1)
    end_date = datetime.date(2019, 11, 30)
    day_list = make_day_list(start_date, end_date)

    securities_report_doc_list = make_doc_id_list(day_list)
    number_of_lists = len(securities_report_doc_list)
    print("number_of_lists:", len(securities_report_doc_list))
    print("get_list:", securities_report_doc_list)

    download_xbrl_in_zip(securities_report_doc_list, number_of_lists)
    print("download finish")


if __name__ == "__main__":
    main()

4. How to collect reports other than securities reports

This time we targeted securities reports, but by changing `ordinanceCode` and `formCode` you can automatically collect the data of other reports written in XBRL. For example, for quarterly reports, change the condition so that only submitted documents with an `ordinanceCode` of 010 and a `formCode` of 043000 are processed. The combinations, including amended reports, are summarized briefly below, with a combined sketch after the list.

〇 Securities report: `if ordinanceCode == "010" and formCode == "030000":`

〇 Amended securities report: `if ordinanceCode == "010" and formCode == "030001":`

〇 Quarterly report: `if ordinanceCode == "010" and formCode == "043000":`

〇 Amended quarterly report: `if ordinanceCode == "010" and formCode == "043001":`
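When collecting several of these report types in one run, the code pairs can be kept in a set and tested with a single membership check inside the loop of `make_doc_id_list`; an illustrative sketch (the set name is not from the article's code):

# (ordinanceCode, formCode) pairs from the list above
TARGET_FORMS = {
    ("010", "030000"),  # securities report
    ("010", "030001"),  # amended securities report
    ("010", "043000"),  # quarterly report
    ("010", "043001"),  # amended quarterly report
}

if (ordinance_code, form_code) in TARGET_FORMS:
    securities_report_doc_list.append(json_data["results"][num]["docID"])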

For the ordinance codes and form codes of all forms, refer to "Attachment 1_Form Code List.xlsx", which is included in the zip file of the EDINET API-related materials (released on March 17, 2019) downloadable from the [API specification] page.

5. Contact

For inquiries regarding this article, please contact the following e-mail address (comments on Qiita are, of course, also welcome). e-mail: [email protected]

This e-mail address is the contact point for the Development Committee of XBRL Japan, which writes these Qiita articles. Depending on the content, we may therefore be unable to answer general inquiries about the organization, but please feel free to contact us with technical questions, opinions, requests, and advice regarding XBRL. Please note that responses may take some time, as the committee members are volunteers.
