When analyzing patent information, it has long been said that we should look at other information as well as patents, but for me, apart from patent search, I squeeze in online search and write notes in Excel. It was quite annoying to go back and forth between patent information and other information while taking it. Therefore, I thought it would be nice to have a dashboard where you can refer to various information while focusing on patent search, so I decided to make it. First of all, I considered to list the securities report that contains fairly neat information and the text of the "Research and Development" part in it. .. The text in the "About R & D" section looks like this, and it is helpful as it describes what kind of R & D the company is trying to promote.
6 [R & D activities] The Group is engaged in strengthening product competitiveness, developing new products, and developing new business products in each segment of electronic device systems, fine tech systems, scientific / medical systems, and industrial / IT systems. We are working on it. R & D expenses for these activities in the current consolidated fiscal year were 20,163 million yen, and the R & D results for each segment are as follows. In the electronic device system semiconductor market, application processors and memory (NAND, DRAM) for mobile devices were strong due to the growth of servers and smartphones while the demand for PCs decreased. In advanced devices, the mass production ratio of 14 to 16 nm generation products using Multi-Patterning technology has increased, and the construction of a 10 nm generation prototype line and the development of the 7 nm generation have started. In addition, each memory maker is focusing on mass production of 3D-NAND. In this way, along with miniaturization, three-dimensional structuring of both logic and memory has progressed, and in cutting-edge device manufacturing, high productivity and complicated structuring corresponding to the increase in the number of processes are supported ...
I wondered if there was a service that would provide text data for securities reports via API, etc., and found Buffet Code and [CoARiJ]. There was (https://github.com/chakki-works/CoARiJ), but neither of them met the requirements. I can't help it, so I decided to extract the text of the R & D part and store it in BiGQuery by myself.
WITH gpat AS (
SELECT
publication_number as pubnum,
top_terms,
url
FROM patents-public-data.google_patents_research.publications
),
pat AS (
SELECT publication_number as pubnum,
filing_date as appday,
STRING_AGG(DISTINCT title.text) as title,
#STRING_AGG(DISTINCT abstract.text) as abst,
STRING_AGG(DISTINCT appls.name,'|') as applicants
FROM `patents-public-data.patents.publications`,UNNEST(title_localized) as title,UNNEST(abstract_localized) as abstract,UNNEST(assignee_harmonized) as appls
WHERE SUBSTR(publication_number,0,2) = 'JP' AND filing_date > 20100101
GROUP BY pubnum,filing_date
)
SELECT gpat.pubnum,
gpat.url,
gpat.top_terms,
pat.title,
#pat.abst,
pat.applicants,
pat.appday
FROM gpat
INNER JOIN pat
ON gpat.pubnum = pat.pubnum
The table that was finally output to Data Studio
Field name | Type | Mode | Description |
---|---|---|---|
pubnum | STRING | NULLABLE | Publication number |
url | STRING | NULLABLE | Link URL to google patent |
top_terms | STRING | REPEATED | Characteristic words of google extraction |
title | STRING | NULLABLE | title |
applicants | STRING | NULLABLE | applicant |
appday | INTEGER | NULLABLE | Filing date |
Use python to fetch xbrl format data from EDINET, and use coalij to extract the R & D part ("About R & D" part). At the same time, use sudachi to divide the word, put it in BigQuery, and calculate and store tfidf with SQL created last time.
#SQL to calculate and store tfidf from the divided text
#~ SQL was issued too much and disappeared somewhere from the history \(^o^)/ Ota
SQL extracted with tfidf> 0.01 ↓
SELECT id,
year,
comname,
secCode,
wakatiwd,
STRING_AGG(tfidf.word) as tfidf_word,
COUNT(tfidf.word) as tfidf_count,
LENGTH(wakatiwd) as textlen
FROM ~~(secret).SecuritiesReport_tfidf,UNNEST(tfidfs) as tfidf
WHERE tfidf.tf_idf > 0.01
GROUP BY id,year,comname,secCode,wakatiwd
The table that was finally output to Data Studio
Field name | Type | Mode | Description |
---|---|---|---|
id | STRING | NULLABLE | Document ID given by EDINET |
year | INTEGER | NULLABLE | Report submission year |
comname | STRING | NULLABLE | Company name |
secCode | INTEGER | NULLABLE | Securities code |
wakatiwd | STRING | NULLABLE | R&The result of dividing the D part with sudachi |
tfidf_word | STRING | NULLABLE | with tfidf |
tfidf_count | INTEGER | NULLABLE | tfidf |
textlen | INTEGER | NULLABLE |
Like this. While looking at the text in the R & D part of the securities report on the left, if you come across a technical term that interests you, set a search word in English and reflect the result in the patent list on the right.
List of features you want to add next