I never quite know what a securities report is trying to tell me, and even if I read one company's report for a specific year, I can't make sense of it without comparing it against other companies and other years :frowning2: So I looked into how to use **COTOHA to read a large amount of securities reports** :wink:
* For basic usage of COTOHA, see @gossy5454's article
- I want an easy way to do natural language processing with COTOHA
- I want to use CoARiJ, which doesn't seem to be well known
- I want to read securities reports efficiently and put them to use for investing
The code used for this verification is here: https://github.com/ice-github/CoARiJAndCOTOHA
NTT Group's natural language processing platform https://api.ce-cotoha.com/contents/index.html
- **Similarity judgment**: should allow comparing a report against past reports
- **Summary (β)**: I'm curious what happens when a report that is already written concisely gets summarized
- **Named entity recognition**: should give a rough idea of which topics appear in a report
- **Sentiment analysis**: should surface discrepancies such as negative wording even though profit is increasing
* (β) marks a beta feature
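Before diving in, here is a minimal sketch of how these APIs are called, based on the public COTOHA API reference at the time of writing. The base URL and the client ID/secret are account-specific, so treat the values below as placeholders and verify the endpoint paths against your own developer portal.

```python
import requests

# Placeholders: both URLs and the credentials come from your COTOHA account page.
TOKEN_URL = 'https://api.ce-cotoha.com/v1/oauth/accesstokens'
BASE_URL = 'https://api.ce-cotoha.com/api/dev/'

def get_access_token(client_id: str, client_secret: str) -> str:
    # Exchange the client credentials for a bearer token
    body = {'grantType': 'client_credentials',
            'clientId': client_id,
            'clientSecret': client_secret}
    return requests.post(TOKEN_URL, json=body).json()['access_token']

def call_cotoha(token: str, path: str, body: dict) -> dict:
    # POST a JSON body to one of the COTOHA endpoints
    headers = {'Authorization': 'Bearer ' + token}
    return requests.post(BASE_URL + path, json=body, headers=headers).json()

# The four features used in this article map to endpoints roughly like this:
# call_cotoha(token, 'nlp/v1/similarity', {'s1': text_a, 's2': text_b})
# call_cotoha(token, 'nlp/beta/summary', {'document': text, 'sent_len': 1})
# call_cotoha(token, 'nlp/v1/ne', {'sentence': text})
# call_cotoha(token, 'nlp/v1/sentiment', {'sentence': text})
```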
A dataset published by TIS for corporate analysis https://www.tis.co.jp/news/2019/tis_news/20191114_1.html
Install coarij and download the 2014-2018 data in extracted form (kind `E`). A `data` directory will be created under `<WorkspacePath>`:
$ cd <WorkspacePath>
$ pip install coarij
$ coarij download --kind E --year 2014
$ coarij download --kind E --year 2015
$ coarij download --kind E --year 2016
$ coarij download --kind E --year 2017
$ coarij download --kind E --year 2018
By opening `data/interim/<year>/documents.csv` in pandas as shown below, you can access the text files under the `data/interim/<year>/docs/` directory.
CompanyInformation.py
...
csv_path = os.path.join(data_directory_path, 'interim', str(year),
'documents.csv')
self.csv_document = pd.read_csv(filepath_or_buffer=csv_path,
encoding='UTF-8',
sep='\t',
index_col='sec_code') # use sec_code as index
...
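As a usage sketch (assuming the data layout above), a single company's row can then be looked up by its securities code:

```python
import os
import pandas as pd

year = 2018
csv_path = os.path.join('data', 'interim', str(year), 'documents.csv')
df = pd.read_csv(filepath_or_buffer=csv_path, encoding='UTF-8',
                 sep='\t', index_col='sec_code')

# Look up a single company by its securities code; 7203 is just an
# illustrative code, and the columns are whatever documents.csv ships with.
print(df.loc[7203])
```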
There are three main problems.
1. Data may be missing: **a company present in one year can be absent in another**.
2. The [Business status] items changed from 2017 onward, so **files may be empty**. *This is an EDINET issue rather than a CoARiJ one => see the reference*
3. **Figures such as total sales and operating income that should be in `data/interim/<year>/documents.csv` may be missing.** *When I checked the TOPIX Core 30, 10 of the 30 companies were missing data somewhere in 2014-2018*
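Because of problem 3, it is safer to check for missing figures before computing any ratios. A minimal guard, using hypothetical column names (`net_sales`, `operating_income`) that you should replace with the real ones found in documents.csv:

```python
import pandas as pd

def has_complete_figures(row: pd.Series, columns) -> bool:
    # True only when every requested figure is present (not NaN)
    return not row[list(columns)].isna().any()

# Usage with the DataFrame loaded above. 'net_sales' and 'operating_income'
# are hypothetical names: substitute the actual documents.csv columns.
# for sec_code, row in df.iterrows():
#     if not has_complete_figures(row, ['net_sales', 'operating_income']):
#         continue  # problem 3: skip company/year pairs with missing figures
```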
A securities report has many sections, so how should you read it? In my case, the style is:

1. Read **[Company overview]** for the changes in total sales, operating income, ordinary income, and capital adequacy ratio
2. Read **[Business status]** carefully
3. Read the rest afterwards

So this time I will focus on **1. [Company overview]** and **2. [Business status]**. In CoARiJ, the figures for **1. [Company overview]** are already organized in the .csv file, so the .txt files for **2. [Business status]** become the input to COTOHA.
* Since the .csv file contains non-consolidated figures, holding companies can show unusually high profit margins (80% or more for Konami and NTT)
- [Management policy, business environment, and issues to be addressed]: business_policy_environment_issue_etc.txt
- [Business risks, etc.]: business_risks.txt
These items often don't change much from year to year, so I use **similarity judgment** to check whether they differ from the previous year, and only when they differ do I use **Summary (β)** to shorten the text.
I wanted to send the text as-is where possible, but when a request exceeds **3,000 characters I sometimes get 500 (Internal Server Error)** (request too long?), so I split the text:
main.py
...
from typing import List

def GetDividedSubstring(text: str, max_length: int) -> List[str]:
    # Short enough to send in one request
    if len(text) <= max_length:
        return [text]
    # Prefer splitting at the end of a sentence ('。')
    index = text.rfind('。', 0, max_length)
    if index < 0:
        # Fall back to a paragraph break (double newline)
        index = text.rfind('\n\n', 0, max_length)
    if index < 0:
        # No natural break found: cut hard at the limit
        index = max_length - 1
    head = text[:index + 1]
    tail = text[index + 1:]
    # Recursively divide the remainder
    return [head] + GetDividedSubstring(tail, max_length)
...
I compare the similarity between the divided strings, but the number of chunks can differ between the two years; in that case **0 is returned without comparing** (see the sketch below). * This string division is also used in the sentiment analysis, named entity extraction, and summarization that follow.
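A concrete sketch of this comparison, assuming the `call_cotoha` helper sketched earlier and the documented similarity response shape (`{'result': {'score': ...}}`):

```python
def CompareSections(token: str, last_year_text: str, this_year_text: str,
                    max_length: int = 3000) -> float:
    chunks_a = GetDividedSubstring(last_year_text, max_length)
    chunks_b = GetDividedSubstring(this_year_text, max_length)
    # Different chunk counts: return 0 without comparing (treated as "changed")
    if len(chunks_a) != len(chunks_b):
        return 0.0
    scores = []
    for s1, s2 in zip(chunks_a, chunks_b):
        response = call_cotoha(token, 'nlp/v1/similarity', {'s1': s1, 's2': s2})
        scores.append(response['result']['score'])
    # Average similarity across all paired chunks
    return sum(scores) / len(scores)
```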
- [Analysis of financial position, operating results and cash flows] (2017 and earlier): business_analysis_of_finance.txt
- [Management's analysis of financial position, operating results and cash flows] (2018 and later): business_management_analysis.txt
- [R&D activities]: business_research_and_development.txt
For **cash flow**, I check whether the tone is positive with **sentiment analysis**; if it is judged negative even though operating income or ordinary income is increasing, I keep the original text, and otherwise I shorten it with **Summary (β)** (a sketch follows below). As for **[R&D activities]**, the section is full of unfamiliar terms, so I ran **named entity extraction** to get a rough idea of what it covers, and then applied **Summary (β)**.
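A sketch of that cash-flow decision, again assuming the `call_cotoha` helper and the documented response shapes (`sentiment` is one of Positive/Negative/Neutral; the summary text comes back under `result`). Long texts would still need the splitting above before each call:

```python
def SummarizeOrKeep(token: str, text: str, income_increased: bool) -> str:
    sentiment = call_cotoha(token, 'nlp/v1/sentiment', {'sentence': text})
    if income_increased and sentiment['result']['sentiment'] == 'Negative':
        # Discrepancy: profit is up but the wording is negative,
        # so keep the original text for a closer read
        return text
    summary = call_cotoha(token, 'nlp/beta/summary',
                          {'document': text, 'sent_len': 1})
    return summary['result']
```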
**Named entity extraction** supports a number of named entity classes, but as shown below I filter to **unique names (ART)**, **person names (PSN)**, and **locations (LOC)** so as not to pick up values such as amounts, which are meaningless when extracted in isolation.
* I use sentence (one sentence) and document (multiple sentences) without distinction here; they appear to be separated purely for API design reasons.
main.py
...
for item in ne['result']:
    word_class = item['class']
    if word_class in ('ART', 'PSN', 'LOC'):
        word = item['form']
        if word not in words:  # deduplicate extracted entities
            words.append(word)
...
Let's pick out a few companies whose results should be easy to interpret :wink:
- Total sales: 568,032 million yen (2017) => 580,144 million yen (2018)
- Operating income: 29,897 million yen (2017) => 29,511 million yen (2018); operating income margin 5.26% (2017) => 5.09% (2018)
- Ordinary income: 30,650 million yen (2017) => 29,864 million yen (2018); ordinary income margin 5.40% (2017) => 5.15% (2018)
I tried turning a large volume of securities reports into an easier-to-read form by analyzing and summarizing CoARiJ data with COTOHA. As a result, I think it can at least support **"judging whether the original text of a securities report is worth reading"**. ~~* It also seems you can judge that the business is so large that you can't tell what is written.~~
Requests sent to COTOHA have a character limit, so texts must be split and the API called multiple times; be careful when processing large amounts of data :sweat_smile: * While writing this article I ran into 429 (Too Many Requests) and verification stalled... (a retry sketch follows below)
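One way to soften the 429s would be a simple backoff-and-retry wrapper. This is a sketch under the same assumptions as the earlier `call_cotoha` helper (`BASE_URL` is account-specific), not part of the verification code:

```python
import time
import requests

def call_cotoha_with_retry(token: str, path: str, body: dict,
                           max_retries: int = 5) -> dict:
    for attempt in range(max_retries):
        response = requests.post(BASE_URL + path, json=body,
                                 headers={'Authorization': 'Bearer ' + token})
        if response.status_code != 429:  # not rate-limited: return the result
            return response.json()
        time.sleep(2 ** attempt)  # exponential backoff: 1, 2, 4, 8, ... seconds
    response.raise_for_status()  # still rate-limited after all retries
```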