0. Introduction

Motivation

When analyzing patent information, it has long been said that we should look at other information as well as patents, but for me, apart from patent search, I squeeze in online search and write notes in Excel. It was quite annoying to go back and forth between patent information and other information while taking it. Therefore, I thought it would be nice to have a dashboard where you can refer to various information while focusing on patent search, so I decided to make it. First of all, I considered to list the securities report that contains fairly neat information and the text of the "Research and Development" part in it. .. The text in the "About R & D" section looks like this, and it is helpful as it describes what kind of R & D the company is trying to promote.

6 [R & D activities] The Group is engaged in strengthening product competitiveness, developing new products, and developing new business products in each segment of electronic device systems, fine tech systems, scientific / medical systems, and industrial / IT systems. We are working on it. R & D expenses for these activities in the current consolidated fiscal year were 20,163 million yen, and the R & D results for each segment are as follows. In the electronic device system semiconductor market, application processors and memory (NAND, DRAM) for mobile devices were strong due to the growth of servers and smartphones while the demand for PCs decreased. In advanced devices, the mass production ratio of 14 to 16 nm generation products using Multi-Patterning technology has increased, and the construction of a 10 nm generation prototype line and the development of the 7 nm generation have started. In addition, each memory maker is focusing on mass production of 3D-NAND. In this way, along with miniaturization, three-dimensional structuring of both logic and memory has progressed, and in cutting-edge device manufacturing, high productivity and complicated structuring corresponding to the increase in the number of processes are supported ...

Is there anything that can be used in the preceding services?

I wondered if there was a service that would provide text data for securities reports via API, etc., and found Buffet Code and [CoARiJ]. There was (https://github.com/chakki-works/CoARiJ), but neither of them met the requirements. I can't help it, so I decided to extract the text of the R & D part and store it in BiGQuery by myself.

1. Result

Made
DataStudio is convenient!

2. Configuration

2-1 Configuration

2-2 Codes and data used

*CoARiJ

Obtain securities report data from EDINET this article
Familiar Google Patents Public Datasets

2-3 Explanation such as SQL

Information acquisition SQL part from Google Patents: The url to Google patents and the characteristic words in the patent text extracted by Google are not included in the normal patent table, so two tables are used. JOIN had a little trouble.

Patented SQL


WITH gpat AS (
SELECT 
      publication_number as pubnum,
      top_terms,
      url
FROM patents-public-data.google_patents_research.publications
),

pat AS (
SELECT publication_number as pubnum,
       filing_date as appday,
       STRING_AGG(DISTINCT title.text) as title,
       #STRING_AGG(DISTINCT abstract.text) as abst,
       STRING_AGG(DISTINCT appls.name,'|') as applicants
FROM `patents-public-data.patents.publications`,UNNEST(title_localized) as title,UNNEST(abstract_localized) as abstract,UNNEST(assignee_harmonized) as appls
WHERE SUBSTR(publication_number,0,2) = 'JP' AND filing_date > 20100101
GROUP BY pubnum,filing_date
)

SELECT gpat.pubnum,
       gpat.url,
       gpat.top_terms,
       pat.title,
       #pat.abst,
       pat.applicants,
       pat.appday
FROM gpat
INNER JOIN pat
ON gpat.pubnum = pat.pubnum

The table that was finally output to Data Studio

Field name	Type	Mode	Description
pubnum	STRING	NULLABLE	Publication number
url	STRING	NULLABLE	Link URL to google patent
top_terms	STRING	REPEATED	Characteristic words of google extraction
title	STRING	NULLABLE	title
applicants	STRING	NULLABLE	applicant
appday	INTEGER	NULLABLE	Filing date

* Securities report side

Use python to fetch xbrl format data from EDINET, and use coalij to extract the R & D part ("About R & D" part). At the same time, use sudachi to divide the word, put it in BigQuery, and calculate and store tfidf with SQL created last time.

Securities report side SQL

↓ Table generation for tfidf calculation


#SQL to calculate and store tfidf from the divided text
#~ SQL was issued too much and disappeared somewhere from the history \(^o^)/ Ota

SQL extracted with tfidf> 0.01 ↓


SELECT id,
      year,
      comname,
      secCode,
      wakatiwd,
      STRING_AGG(tfidf.word) as tfidf_word,
      COUNT(tfidf.word) as tfidf_count,
      LENGTH(wakatiwd) as textlen 
FROM ~~(secret).SecuritiesReport_tfidf,UNNEST(tfidfs) as tfidf
WHERE tfidf.tf_idf > 0.01
GROUP BY id,year,comname,secCode,wakatiwd

The table that was finally output to Data Studio

Field name	Type	Mode	Description
id	STRING	NULLABLE	Document ID given by EDINET
year	INTEGER	NULLABLE	Report submission year
comname	STRING	NULLABLE	Company name
secCode	INTEGER	NULLABLE	Securities code
wakatiwd	STRING	NULLABLE	R&The result of dividing the D part with sudachi
tfidf_word	STRING	NULLABLE	with tfidf
tfidf_count	INTEGER	NULLABLE	tfidf
textlen	INTEGER	NULLABLE

3. Completed version

Like this. While looking at the text in the R & D part of the securities report on the left, if you come across a technical term that interests you, set a search word in English and reflect the result in the patent list on the right.

4. Version upgrade

List of features you want to add next

From the words extracted from the R & D text on the left, extract words that look like technical terms and put them in the search field on the right.
Word expansion (in patent version word2vec, technical term dictionary, etc.)
pipeline (This time it was the securities report data until June 2020, but I would like to automatically put it in BigQuery about every month rather than doing it in batch)
Actually, it's easier to get a foreign one (BigQuery's Public DataSet has something like that )))
We are considering seeing other data sources (legal regulations, social trends, etc.) together. ..
I don't like the screen design. It seems that the layout of BI tools and some training are required.
There was also this kind of, so I should have acquired the other text parts as well!
Changes tfidf in real time according to the population. When I made it for the first time this time, it was so heavy that it often froze, so I calculated it as a whole. Since the initial state of tfidf is ideal in the first place, we are considering what to do.

[PYTHON] Search for patent data (create dashboard) while looking at the R & D part of the securities report