[PYTHON] Search for patent data (create dashboard) while looking at the R & D part of the securities report

0. Introduction

Motivation

When analyzing patent information, it has long been said that we should look at other information as well as patents, but for me, apart from patent search, I squeeze in online search and write notes in Excel. It was quite annoying to go back and forth between patent information and other information while taking it. Therefore, I thought it would be nice to have a dashboard where you can refer to various information while focusing on patent search, so I decided to make it. First of all, I considered to list the securities report that contains fairly neat information and the text of the "Research and Development" part in it. .. The text in the "About R & D" section looks like this, and it is helpful as it describes what kind of R & D the company is trying to promote.

6 [R & D activities] The Group is engaged in strengthening product competitiveness, developing new products, and developing new business products in each segment of electronic device systems, fine tech systems, scientific / medical systems, and industrial / IT systems. We are working on it. R & D expenses for these activities in the current consolidated fiscal year were 20,163 million yen, and the R & D results for each segment are as follows. In the electronic device system semiconductor market, application processors and memory (NAND, DRAM) for mobile devices were strong due to the growth of servers and smartphones while the demand for PCs decreased. In advanced devices, the mass production ratio of 14 to 16 nm generation products using Multi-Patterning technology has increased, and the construction of a 10 nm generation prototype line and the development of the 7 nm generation have started. In addition, each memory maker is focusing on mass production of 3D-NAND. In this way, along with miniaturization, three-dimensional structuring of both logic and memory has progressed, and in cutting-edge device manufacturing, high productivity and complicated structuring corresponding to the increase in the number of processes are supported ...

Is there anything that can be used in the preceding services?

I wondered if there was a service that would provide text data for securities reports via API, etc., and found Buffet Code and [CoARiJ]. There was (https://github.com/chakki-works/CoARiJ), but neither of them met the requirements. I can't help it, so I decided to extract the text of the R & D part and store it in BiGQuery by myself.

1. Result

2. Configuration

2-1 Configuration

image.png

2-2 Codes and data used

*CoARiJ

2-3 Explanation such as SQL

Patented SQL

WITH gpat AS (
SELECT 
      publication_number as pubnum,
      top_terms,
      url
FROM patents-public-data.google_patents_research.publications
),

pat AS (
SELECT publication_number as pubnum,
       filing_date as appday,
       STRING_AGG(DISTINCT title.text) as title,
       #STRING_AGG(DISTINCT abstract.text) as abst,
       STRING_AGG(DISTINCT appls.name,'|') as applicants
FROM `patents-public-data.patents.publications`,UNNEST(title_localized) as title,UNNEST(abstract_localized) as abstract,UNNEST(assignee_harmonized) as appls
WHERE SUBSTR(publication_number,0,2) = 'JP' AND filing_date > 20100101
GROUP BY pubnum,filing_date
)

SELECT gpat.pubnum,
       gpat.url,
       gpat.top_terms,
       pat.title,
       #pat.abst,
       pat.applicants,
       pat.appday
FROM gpat
INNER JOIN pat
ON gpat.pubnum = pat.pubnum

The table that was finally output to Data Studio

Field name Type Mode Description
pubnum STRING NULLABLE Publication number
url STRING NULLABLE Link URL to google patent
top_terms STRING REPEATED Characteristic words of google extraction
title STRING NULLABLE title
applicants STRING NULLABLE applicant
appday INTEGER NULLABLE Filing date

* Securities report side

Use python to fetch xbrl format data from EDINET, and use coalij to extract the R & D part ("About R & D" part). At the same time, use sudachi to divide the word, put it in BigQuery, and calculate and store tfidf with SQL created last time.

Securities report side SQL
↓ Table generation for tfidf calculation

#SQL to calculate and store tfidf from the divided text
#~ SQL was issued too much and disappeared somewhere from the history \(^o^)/ Ota

SQL extracted with tfidf> 0.01 ↓


SELECT id,
      year,
      comname,
      secCode,
      wakatiwd,
      STRING_AGG(tfidf.word) as tfidf_word,
      COUNT(tfidf.word) as tfidf_count,
      LENGTH(wakatiwd) as textlen 
FROM ~~(secret).SecuritiesReport_tfidf,UNNEST(tfidfs) as tfidf
WHERE tfidf.tf_idf > 0.01
GROUP BY id,year,comname,secCode,wakatiwd

The table that was finally output to Data Studio

Field name Type Mode Description
id STRING NULLABLE Document ID given by EDINET
year INTEGER NULLABLE Report submission year
comname STRING NULLABLE Company name
secCode INTEGER NULLABLE Securities code
wakatiwd STRING NULLABLE R&The result of dividing the D part with sudachi
tfidf_word STRING NULLABLE with tfidf
tfidf_count INTEGER NULLABLE tfidf
textlen INTEGER NULLABLE

3. Completed version

Like this. While looking at the text in the R & D part of the securities report on the left, if you come across a technical term that interests you, set a search word in English and reflect the result in the patent list on the right. image.png

4. Version upgrade

List of features you want to add next

Recommended Posts

Search for patent data (create dashboard) while looking at the R & D part of the securities report
Create data for series labeling (part of speech tagging) quickly