[PYTHON] Let's automatically extract employee information such as average salary from the XBRL data disclosed on EDINET (5/10)

In the 5th post of Advent Calendar, I will try to extract the desired information from the XBRL data acquired by the EDINET API. This time, I tried to target the average salary and average age of companies that are often seen in job hunting.

(The program in this article is provided as it is without any guarantee, and XBRL Japan takes all responsibility for any disadvantages or problems caused by using this program, regardless of the cause. I will not bear it.)

1. Extract specific information from XBRL data

1.1 Program Overview

This is a Python language program that decompresses and analyzes the zip file (including XBRL data) acquired by the EDINET API, extracts information such as salary from the securities report, and then outputs the necessary information as CSV. (All codes are listed in "3. Source code") The items to be extracted from the XBRL data are as shown in the table below.

item                                        Element name defined in XBRL                                                     Acquisition example                                    
EDINETCODE EDINETCodeDEI E00004
Company name FilerNameInJapaneseDEI Kaneko Seed Co., Ltd.
Average annual salary (yen) AverageAnnualSalaryInformationAboutReportingCompanyInformationAboutEmployees 5,554,933
Average years of service (years) AverageLengthOfServiceYearsInformationAboutReportingCompanyInformationAboutEmployees 13.0
Average years of service (month) AverageLengthOfServiceMonthsInformationAboutReportingCompanyInformationAboutEmployees -
Average age (years) AverageAgeYearsInformationAboutReportingCompanyInformationAboutEmployees 42.2
Average age (month) AverageAgeMonthsInformationAboutReportingCompanyInformationAboutEmployees -
Number of employees (persons) NumberOfEmployees 625

Other characteristic points are described below. ○ Use the EDINET code list to identify the industry of the company ○ It does not correspond to the acquisition of the company's unique extension item ○ An OSS parser called ʻArellewas used for XBRL analysis. ○ If there is an average length of service (month) and average age (month),annual conversion` (rounded to the first decimal place).

1.2 Preparation

Please take the following actions before executing the program. In addition, it is necessary to install other libraries in advance. Depending on the execution environment, change the encoding and path specification method as appropriate.

1.2.1 Preparation of XBRL data

Download the zip file (including XBRL data) from EDINET by utilizing "Let's search for the data described in XBRL" and "Let's collect the data described in XBRL". The average annual salary (yen) and other items are newly added items on EDINET, so for the time being, 2019 Download the XBRL data disclosed after April 1st.

1.2.2 Installing XBRL Parser

Install the XBRL Parser called Arelle. When ʻArelle` loads the instance, it analyzes with the DTS set including external taxonomy information. Therefore, it is suitable for utilizing advanced XBRL data, but if you use only the instance, it takes time to process, so consider replacing it with the OSS XML library or the parser you are developing. ..

The qiita article also has information on how to install Arelle and other parsers under development, so I will list some of them here. ・ XBRL analysis that does not start from zero (utilization of Arelle) -I made a Python library for EDINET's XBRL --ParserI made a class to download and parse XBRL from UFO CatcherAcquire the securities report XBRL of a listed company and read the value (← Mainly improve the acquisition part of namespace

1.2.3 EDINET code list preparation

You can download it from the ʻEDINET code list` at the bottom of the page after the [Download] tab transition of the EDINET site. This list is listed with ʻEDINET CODE and industry. Therefore, by using this list, you can also acquire the type of business of the company. スクリーンショット 2019-12-01 20.34.47.png

1.2.4 Determining storage / decompression / output folder

○ Decide the storage folder to read the downloaded EDINET code list. edinetcodedlinfo_filepath = 'C://Users//xxx//Desktop//xbrlReport//EdinetcodeDlInfo.csv'

○ Decide the folder to store the zip file obtained by EDINET API. zip_dir = 'C://Users//xxx//Desktop//xbrlReport//SR//'

○ Decide the folder and file name to output CSV. employee_frame.to_csv("C://Users//xxx//Desktop//xbrlReport//xbrl_qiita.csv", encoding='cp932')

1.3 Execution result

Unzip the zip file obtained by EDINET API, and get the list of instance files (.xbrl extension) under the placed PublicDoc folder with glob. (The explanation for the process per zip decompression is omitted)

Code1


xbrl_file_expressions = glob.glob('C:\\Users\\xxx\\Desktop\\xbrlReport\\SR\\XBRL\\PublicDoc\\*.xbrl')

Result1


['C:\\Users\\XXX\\Desktop\\xbrlReport\\SR\\XBRL\\PublicDoc\\jpcrp030000-asr-001_E00112-000_2019-03-31_01_2019-06-27.xbrl', 'C:\\Users\\XXX\\Desktop\\xbrlReport\\SR\\XBRL\\PublicDoc\\jpcrp030000-asr-001_E01422-000_2019-03-31_01_2019-06-27.xbrl', 'C:\\Users\\XXX\\Desktop\\xbrlReport\\SR\\XBRL\\PublicDoc\\jpcrp030000-asr-001_E02770-000_2019-03-31_01_2019-06-21.xbrl', 'C:\\Users\\XXX\\Desktop\\xbrlReport\\SR\\XBRL\\PublicDoc\\jpcrp030000-asr-001_E04510-000_2019-03-31_01_2019-06-27.xbrl']

Loop processing is performed using xbrl_files, and XBRL data is read by Arelle's modelManager.load (xbrl_file) for each instance file. As a result, you can get Arelle's model_xbrl object that holds the XBRL data information.

Code2


 for index, xbrl_file in enumerate(xbrl_files):

    ctrl = Cntlr.Cntlr()
    model_manager= ModelManager.initialize(ctrl)
    model_xbrl= modelManager.load(xbrl_file)

Result2



<arelle.ModelXbrl.Model_xbrl object at 0x0EF6F350>

Next, loop processing is performed using model_xbrl.facts in which all facts (actual data such as the numerical values of the elements) included in the instance are defined in list format, and the element name of each fact (fact. get concept.qname.localName). When the element name is the element name to be extracted this time, it is a mechanism to acquire the numerical value (fact.value) of the element. When processing EDINETCODE, incorporate the logic to identify the industry of the company from ʻEdinetcodeDlInfo.csv. Regarding the number of employees, since the same element name exists multiple times, the current period (CurrentYearInstant) is specified in the period attribute called context`.

Code3


for fact in model_xbrl.facts:
    if fact.concept.qname.localName == 'EDINETCodeDEI':
        edinet_code = fact.value

        for code_name in edinet_info_list:
            if code_name[0] == edinet_code:
                industry_code = code_name[1]
                break

 elif fact.concept.qname.localName=='FilerNameInJapaneseDEI':
 elif fact.concept.qname.localName=='AverageAnnualSalaryInformationAboutReportingCompanyInformationAboutEmployees':
 elif fact.concept.qname.localName=='AverageLengthOfServiceYearsInformationAboutReportingCompanyInformationAboutEmployees':
 elif fact.concept.qname.localName=='AverageLengthOfServiceMonthsInformationAboutReportingCompanyInformationAboutEmployees':
 elif fact.concept.qname.localName=='AverageAgeYearsInformationAboutReportingCompanyInformationAboutEmployees':
 elif fact.concept.qname.localName=='AverageAgeMonthsInformationAboutReportingCompanyInformationAboutEmployees':
 elif fact.concept.qname.localName=='NumberOfEmployees':
     if fact.contextID == 'CurrentYearInstant_NonConsolidatedMember':

After that, the information of each acquired element (ʻedinet_code, filer_name_jp, ʻindustry_code, salary_info, service_years, ʻage_years, number_of_employees) is stored in company_info_list. At this time, if the average years of service (months) (service_months) and average age (months) (ʻage_months) exist, we have incorporated a process to convert them into years. After that, if you store company_info_list in EDINETCODE unit (instance file unit) in ʻedinet_company_info_list`, a list containing information of employees of each company will be completed.

Code4


    company_info_list.append(edinet_code)
    company_info_list.append(filer_name_jp)
    company_info_list.append(industry_code)
    company_info_list.append(salary_info)

    if len(service_months) != 0:
        service_years_decimal= round(int(service_months)/12,1)
        service_years = int(service_years) + service_years_decimal
        service_years = str(service_years)

    company_info_list.append(service_years)

    if len(age_months) != 0:
        age_years_decimal= round(int(age_months)/12,1)
        age_years = int(age_years) + age_years_decimal
        age_years = str(age_years)

    company_info_list.append(age_years)
    company_info_list.append(number_of_employees)

    edinet_company_info_list.append(Company_info_list)

ʻEdinet_company_info_list` looks like this:

Result4


[['E00112', 'Totetsu Kogyo Co., Ltd.', 'Construction industry', '8547489', '13.8', '41.2', '1673'], ['E01422', 'Fuji Sash Co., Ltd.', 'Metal products', '5527000', '20.7', '44.7', '850'], ['E02770', 'MISUMI Group Inc. Headquarters', 'Wholesale business', '', '', '', '1293'], ['E04510', 'Electric Power Development Co., Ltd.', 'Electricity and gas industry', '7980312', '19.6', '40.9', '2445']・ ・ ・]

Finally, the obtained ʻedinet_company_info_list` is output in CSV format. This completes the extraction of the required elements.

Code5


employee_frame = pd.DataFrame(edinet_company_info_list,
                         columns=['EDINETCODE', 'Company name', 'Industry', 'Average annual salary (yen)', 'Average years of service (years)', 'Average age (years)', 'Number of employees (persons)'])
employee_frame.to_csv("C://Users//xxx//Desktop//xbrlReport//xbrl_qiita.csv", encoding='cp932')

By the way, when you run the program, the following CSV file will be finally output. There are several types of average annual salary units (yen, 1,000 yen, etc.) depending on the method of corporate disclosure. In addition, although the data is targeted for data from April 2019 onward, due to the fiscal year end of each company, there are some that have not been tagged yet and salary-related data cannot be obtained.

2. Source code


# -*- coding: utf-8 -*-

from arelle import ModelManager
from arelle import Cntlr
import os
import zipfile
import glob
import pandas as pd


def make_edinet_info_list(edinetcodedlinfo_filepath):
    edinet_info = pd.read_csv(edinetcodedlinfo_filepath, skiprows=1,
                                 encoding='cp932')
    edinet_info = edinet_info[["EDINET code", "Submitter industry"]]
    edinet_info_list = edinet_info.values.tolist()
    return edinet_info_list

def unzip_file(zip_dir,xbrl_file_expressions):
    zip_files = glob.glob(os.path.join(zip_dir, '*.zip'))

    number_of_zip_lists = len(zip_files)
    print("number_of_zip_lists:", number_of_zip_lists)

    for index, zip_file in enumerate(zip_files):
        print(zip_file, ":", index + 1, "/", number_of_zip_lists)
        with zipfile.ZipFile(zip_file) as zip_f:
            zip_f.extractall(zip_dir)
            zip_f.close()

    xbrl_files = glob.glob(xbrl_file_expressions)
    return xbrl_files

def make_edinet_company_info_list(xbrl_files,edinet_info_list):
    edinet_company_info_list = []
    for index, xbrl_file in enumerate(xbrl_files):
        edinet_code = ""  # EDINETCODE
        filer_name_jp = ""  #Company name
        industry_code = ""  #Industry
        salary_info = ""  #Average annual salary (yen)
        service_years = ""  #Average years of service (years)
        service_months = ""  #Average years of service (month)
        age_years = ""  #Average age (years)
        age_months = ""  #Average age (month)
        number_of_employees = ""  #Number of employees (persons)
        company_info_list = []  #Corporate information

        ctrl = Cntlr.Cntlr()
        model_manager = ModelManager.initialize(ctrl)
        model_xbrl = model_manager.load(xbrl_file)

        print(xbrl_file, ":", index + 1, "/", len(xbrl_files))

        for fact in model_xbrl.facts:

            if fact.concept.qname.localName == 'EDINETCodeDEI':
                print("EDINET code", fact.value)
                edinet_code = fact.value

                for code_name in edinet_info_list:
                    if code_name[0] == edinet_code:
                        print("Industry",code_name[1])
                        industry_code = code_name[1]
                        break

            elif fact.concept.qname.localName == 'FilerNameInJapaneseDEI':
                print("Company name", fact.value)
                filer_name_jp = fact.value

            elif fact.concept.qname.localName == 'AverageAnnualSalaryInformationAboutReportingCompanyInformationAboutEmployees':
                print("Average annual salary (yen)", fact.value)
                salary_info = fact.value

            elif fact.concept.qname.localName == 'AverageLengthOfServiceYearsInformationAboutReportingCompanyInformationAboutEmployees':
                print("Average years of service (years)", fact.value)
                service_years = fact.value

            elif fact.concept.qname.localName == 'AverageLengthOfServiceMonthsInformationAboutReportingCompanyInformationAboutEmployees':
                print("Average years of service (month)", fact.value)
                service_months = fact.value

            elif fact.concept.qname.localName == 'AverageAgeYearsInformationAboutReportingCompanyInformationAboutEmployees':
                print("Average age (years)", fact.value)
                age_years = fact.value

            elif fact.concept.qname.localName == 'AverageAgeMonthsInformationAboutReportingCompanyInformationAboutEmployees':
                print("Average age (month)", fact.value)
                age_months = fact.value

            elif fact.concept.qname.localName == 'NumberOfEmployees':
                if fact.contextID == 'CurrentYearInstant_NonConsolidatedMember':
                    print("Number of employees (persons)", fact.value)
                    number_of_employees = fact.value

        print("")
        company_info_list.append(edinet_code)
        company_info_list.append(filer_name_jp)
        company_info_list.append(industry_code)
        company_info_list.append(salary_info)

        if len(service_months) != 0:
            service_years_decimal = round(int(service_months) / 12, 1)
            service_years = int(service_years) + service_years_decimal
            service_years = str(service_years)

        company_info_list.append(service_years)

        if len(age_months) != 0:
            age_years_decimal = round(int(age_months) / 12, 1)
            age_years = int(age_years) + age_years_decimal
            age_years = str(age_years)

        company_info_list.append(age_years)
        company_info_list.append(number_of_employees)

        edinet_company_info_list.append(company_info_list)

    return edinet_company_info_list

def write_csv_of_employee_info(edinet_company_info_list):

    employee_frame = pd.DataFrame(edinet_company_info_list,
                         columns=['EDINETCODE', 'Company name', 'Industry', 'Average annual salary (yen)', 'Average years of service (years)', 'Average age (years)', 'Number of employees (persons)'])

    print(employee_frame)
    employee_frame.to_csv("C://Users//xxx//Desktop//xbrlReport//xbrl_qiita.csv", encoding='cp932')


def main():
    edinetcodedlinfo_filepath = 'C://Users//xxx//Desktop//xbrlReport//EdinetcodeDlInfo.csv'
    edinet_info_list = make_edinet_info_list(edinetcodedlinfo_filepath)

    zip_dir = 'C://Users//xxx//Desktop//xbrlReport//SR//'
    xbrl_file_expressions = 'C://Users//xxx//Desktop//xbrlReport//SR//XBRL//PublicDoc//*.xbrl'
    xbrl_files = unzip_file(zip_dir,xbrl_file_expressions)

    edinet_company_info_list = make_edinet_company_info_list(xbrl_files,edinet_info_list)
    print(edinet_company_info_list)

    write_csv_of_employee_info(edinet_company_info_list)
    print("extract finish")

if __name__ == "__main__":
    main()


3. How to extract non-employee information

The element name defined in XBRL (ex.EDINETCodeDEI) was described without any explanation, but all of them are published on the EDINET site. The latest version as of December 2019 is the taxonomy element list andaccount list of [About the publication of the 2020 EDINET taxonomy](https://www.fsa.go.jp/search/20191101.html). You can check from. The subjects extracted this time are listed in the taxonomy element list (1e_ElementList.xlsx). For details, see [Cabinet Office Ordinance No. 3 Form Securities Report on Disclosure of Corporate Information, etc. (jpcrp030000-asr)]-[9] Sheet --212000b Employee Status (jpcrp_212000-002_2019-11-01_pre.xml) ), Please check. By utilizing these lists, it is possible to extract various subjects other than employee information. The EDINET taxonomy may be updated in order to respond to the revision of laws and accounting standards. Therefore, when processing an instance, make sure that the EDINET taxonomy version is correct.

4. Contact information

For inquiries regarding this article, please contact the following e-mail address. e-mail:[email protected] (Of course, comments on qiita are also welcome)

This e-mail address will be the contact point for inquiries about the Development Committee of XBRL Japan, which writes the article for qiita. I will. Therefore, we cannot answer general inquiries about the organization depending on the content, but please feel free to contact us with any technical questions, opinions, requests, advice, etc. regarding XBRL. Please note that it may take some time to respond because the committee members are volunteers.

Recommended Posts

Let's automatically extract employee information such as average salary from the XBRL data disclosed on EDINET (5/10)
Let's visualize the relationship between average salary and industry with XBRL data and seaborn! (7/10)
Set information such as length on the edge of NetworkX
Information extraction from EDINET XBRL files