Thank you for your hard work, all you researchers writing Grants-in-Aid for Scientific Research (Kakenhi) applications. As you know, research projects adopted in the past are listed in the Kakenhi database, but it is hard to read through all of them. I wanted a rough picture of past trends, so this time I tried extracting keywords from the research outlines in the Kakenhi database with natural language processing. I use the morphological analysis package MeCab and the terminology extraction tool termextract.

Environment

Use Python and Jupyter Notebook.

OS etc.

macOS Mojave 10.14.5
Anaconda 2020.02
Python 3.7.6
Jupyter Notebook 6.0.3

MeCab

Following the guide here, install MeCab and mecab-python3 for morphological analysis, and set neologd as the default dictionary. Once installed, try it from bash.

Standard dictionary ipadic (default for MeCab)

`bash`


echo "Eukaryote" | mecab
True	prefix,noun connection,*,*,*,*,true,Ma,Ma
Nucleus	noun,general,*,*,*,*,nucleus,Kaku,Kaku
Organism	noun,general,*,*,*,*,organism,Seibutsu,Seibutsu
EOS

The default ipadic does not recognize "eukaryote" as a single word; it splits it into three morphemes.

`bash`


echo "Grant-in-Aid for Scientific Research" | mecab
Science	noun,general,*,*,*,*,science,Kagaku,Kagaku
Research	noun,suru-verb connection,*,*,*,*,research,Kenkyu,Kenkyu
Expense	noun,suffix,general,*,*,*,expense,Hi,Hi
Aid	noun,suru-verb connection,*,*,*,*,aid,Hojo,Hojo
Money	noun,suffix,general,*,*,*,money,Kin,Kin
EOS

ipadic does not recognize "Grants-in-Aid for Scientific Research" as a single word either.

Standard dictionary neologd

`bash`


echo "Eukaryote" | mecab
Eukaryote	noun,proper noun,general,*,*,*,eukaryote,Shinkakuseibutsu,Shinkakuseibutsu
EOS

neologd recognizes "eukaryote" as a single word! With this, maybe we can expect a little more from keyword extraction.
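
To compare the two dictionaries without eyeballing the terminal, the output can be parsed in Python. This is only a minimal sketch: it assumes MeCab's default output format (one morpheme per line, surface form and comma-separated features separated by a tab, then `EOS`), the function name `parse_mecab_output` is my own, and the two sample strings are written by hand to mirror the results above for the Japanese word for "eukaryote" (真核生物).

```python
def parse_mecab_output(output):
    """Split MeCab's default text output into (surface, features) pairs."""
    morphemes = []
    for line in output.splitlines():
        if line == "EOS" or not line.strip():
            continue  # skip the end-of-sentence marker and blank lines
        surface, _, feature_str = line.partition("\t")
        morphemes.append((surface, feature_str.split(",")))
    return morphemes


# Hand-written sample outputs (not captured from a real run)
ipadic_out = (
    "真\t接頭詞,名詞接続,*,*,*,*,真,マ,マ\n"
    "核\t名詞,一般,*,*,*,*,核,カク,カク\n"
    "生物\t名詞,一般,*,*,*,*,生物,セイブツ,セイブツ\n"
    "EOS\n"
)
neologd_out = (
    "真核生物\t名詞,固有名詞,一般,*,*,*,真核生物,シンカクセイブツ,シンカクセイブツ\n"
    "EOS\n"
)

ipadic_surfaces = [surface for surface, _ in parse_mecab_output(ipadic_out)]
neologd_surfaces = [surface for surface, _ in parse_mecab_output(neologd_out)]
print(len(ipadic_surfaces), ipadic_surfaces)
print(len(neologd_surfaces), neologd_surfaces)
```

Counting morphemes like this makes it easy to check many candidate terms at once: a term that comes back as one morpheme is in the dictionary as a single word.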

`bash`


echo "Grant-in-Aid for Scientific Research" | mecab
Science	noun,general,*,*,*,*,science,Kagaku,Kagaku
Research	noun,suru-verb connection,*,*,*,*,research,Kenkyu,Kenkyu
Expense	noun,suffix,general,*,*,*,expense,Hi,Hi
Subsidy	noun,proper noun,general,*,*,*,subsidy,Hojokin,Hojokin
EOS

Even with neologd, "Grants-in-Aid for Scientific Research" is not recognized as a single word, although "subsidy" is now kept together.

mecab-python

Let's try MeCab from Python. For the test, I borrowed the first sentence of the data used below.

`python`


import MeCab

# Create a tagger with the default settings (the dictionary configured in mecabrc)
tagger = MeCab.Tagger()
print(tagger.parse("Eukaryotes can be broadly divided into Unikont and Bikont."))

`Output result`


Eukaryote	noun,proper noun,general,*,*,*,eukaryote,Shinkakuseibutsu,Shinkakuseibutsu
wa	particle,binding particle,*,*,*,*,wa,Ha,Wa
Unikont	noun,proper noun,general,*,*,*,Unikont,Unikont,Unikont
to	particle,parallel particle,*,*,*,*,to,To,To
Bikont	noun,proper noun,general,*,*,*,Bikont,Bikont,Bikont
ni	particle,case particle,general,*,*,*,ni,Ni,Ni
Broad division	noun,suru-verb connection,*,*,*,*,broad division,Taibetsu,Taibetsu
can	verb,independent,*,*,ichidan,base form,can,Dekiru,Dekiru
。	symbol,punctuation,*,*,*,*,。,。,。
EOS

Morphological analysis works from Python as well.

termextract

termextract is a package that extracts technical terms. Its input must be in the format of MeCab's analysis output. I installed it by following the guide here.

Download csv data from Kakenhi database

Finally, on to the Kakenhi data. At first I planned to scrape it with Python and was looking into things such as whether scraping is prohibited, but then I realized that the search results can be downloaded as CSV, so none of that was needed. I downloaded all the items for the search term "Chlamydomonas". If you are not familiar with Chlamydomonas, please see here.

Data reading and formatting with pandas

Read the data with pandas and check it. I forgot to specify the encoding, but the file was read without any errors.

`python`


import pandas as pd
kaken = pd.read_csv('kaken.nii.ac.jp_2020-10-23_22-31-59.csv')

Check the first few rows with kaken.head(). There seem to be a lot of NaN values. Then check the whole data set with kaken.info().

`Output result`


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 528 entries, 0 to 527
Data columns (total 40 columns):
 #   Column                                              Non-Null Count  Dtype
---  ------                                              --------------  -----
 0   Research subject name                               528 non-null    object
 1   Research subject name(English)                      269 non-null    object
 2   Research subject/Region number                      528 non-null    object
 3   Research period(year)                               528 non-null    object
 4   Principal                                           471 non-null    object
 5   Research Coordinator                                160 non-null    object
 6   Collaborative Researcher                            31 non-null     object
 7   Collaborators                                       20 non-null     object
 8   Research Fellow                                     53 non-null     object
 9   Foreign Research Fellow                             4 non-null      object
 10  Accepted Researcher                                 4 non-null      object
 11  Keywords                                            505 non-null    object
 12  Research fields                                     380 non-null    object
 13  Examination category                                102 non-null    object
 14  Research items                                      528 non-null    object
 15  Research Institute                                  528 non-null    object
 16  Application category                                212 non-null    object
 17  Total allocation amount                             526 non-null    float64
 18  Total allocation(Direct expenses)                   526 non-null    float64
 19  Total allocation(Indirect expenses)                 249 non-null    float64
 20  Allocation amount for each year                     526 non-null    object
 21  Allocation amount for each year(Direct expenses)    526 non-null    object
 22  Allocation amount for each year(Indirect expenses)  526 non-null    object
 23  Achievement to date(Classification code)            46 non-null     float64
 24  Achievement to date(Classification)                 46 non-null     object
 25  Reason                                              46 non-null     object
 26  Outline of research at the beginning of research    14 non-null     object
 27  Research outline                                    323 non-null    object
 28  Research outline(English)                           156 non-null    object
 29  Outline of research results                         85 non-null     object
 30  Outline of research results(English)                85 non-null     object
 31  Outline of research achievements                    84 non-null     object
 32  Achievement to date(Paragraph)                      90 non-null     object
 33  Measures to promote future research                 94 non-null     object
 34  Next year's research funding plan                   0 non-null      float64
 35  Reason for the amount used in the next fiscal year  0 non-null      float64
 36  Usage plan for next year                            0 non-null      float64
 37  Free description field                              0 non-null      float64
 38  Evaluation symbol                                   3 non-null      object
 39  Remarks                                             0 non-null      float64
dtypes: float64(9), object(31)
memory usage: 165.1+ KB

Free text appears to be stored in four columns: "Outline of research at the beginning of research", "Research outline", "Outline of research results", and "Outline of research achievements" (columns 26, 27, 29, and 31 in the listing above). There is also a "Keywords" column, but this time I want to extract keywords from the text, so I will ignore it. Probably because the required items have changed from year to year, there are many NaN values and the rows that contain text do not line up across columns. I decided to extract only the text from the data frame and put it into a list.

`python`


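# Free-text columns (columns 26, 27, 29 and 31 in the kaken.info() listing)
column_list = ['Outline of research at the beginning of research', 'Research outline', 'Outline of research results', 'Outline of research achievements']
abstracts = []

# Collect the non-missing text of each column into one flat list
for column in column_list:
    abstracts.extend(kaken[column].dropna().tolist())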

The text is now ready for morphological analysis. Let's run it on each element of this list.
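
If the `dropna().tolist()` step is hard to picture, here is a small sketch on a toy data frame. The frame `toy` and its contents are made up for illustration; only the column handling mirrors what I did with the Kakenhi data.

```python
import pandas as pd

# Toy stand-in for the Kakenhi CSV: two text columns with missing values
toy = pd.DataFrame({
    "Research outline": ["abstract A", None, "abstract B"],
    "Outline of research results": [None, "abstract C", None],
})

texts = []
for column in toy.columns:
    # dropna() removes the missing rows; tolist() turns the Series into a plain list
    texts.extend(toy[column].dropna().tolist())

print(texts)
```

The rows no longer line up with the original records after this step, which is fine here because only the pooled text is needed.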

Morphological analysis with MeCab

Following the article here, I defined a function that runs MeCab and returns the result as a list of words. By default it extracts only nouns, verbs, and adjectives, and converts verbs and adjectives to their base (dictionary) form.

`python`


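import MeCab

tagger = MeCab.Tagger('')
tagger.parse('')  # parse once first; a common workaround so that node.surface is read correctly

# ipadic/neologd report parts of speech in Japanese, so map the English names to those tags
POS_TAGS = {'noun': '名詞', 'verb': '動詞', 'adjective': '形容詞'}

def wakati_text(text, word_class=('verb', 'adjective', 'noun')):
    """Return the words of the requested parts of speech found in text."""
    targets = {POS_TAGS[name] for name in word_class}
    # Walk through the nodes (morphemes) one by one
    node = tagger.parseToNode(text)
    terms = []

    while node:
        features = node.feature.split(',')
        pos = features[0]  # part of speech

        # Keep the word only if its part of speech was requested
        if pos in targets:
            if pos == POS_TAGS['noun']:
                terms.append(node.surface)  # nouns: form used in the sentence
            else:
                terms.append(features[6])   # verbs/adjectives: base form

        node = node.next

    return terms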
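
The indices 0 and 6 used on the feature string may look arbitrary, so here is a small sketch that works on hand-written feature strings instead of a live MeCab tagger. It assumes the ipadic feature layout (part of speech first, base form seventh) with Japanese tags; the helper name `pick_term` and the sample pairs are my own illustrations.

```python
def pick_term(surface, feature):
    """Return the string to keep for one morpheme, or None to skip it."""
    fields = feature.split(",")
    pos = fields[0]
    if pos == "名詞":              # noun: keep the form used in the sentence
        return surface
    if pos in ("動詞", "形容詞"):  # verb / adjective: keep the base form
        return fields[6]
    return None                    # particles, symbols, etc. are dropped


# Hand-written (surface, feature) pairs in ipadic's feature layout
samples = [
    ("泳い", "動詞,自立,*,*,五段・ガ行,連用タ接続,泳ぐ,オヨイ,オヨイ"),
    ("細胞", "名詞,一般,*,*,*,*,細胞,サイボウ,サイボウ"),
    ("は", "助詞,係助詞,*,*,*,*,は,ハ,ワ"),
]
terms = [t for t in (pick_term(s, f) for s, f in samples) if t is not None]
print(terms)
```

The inflected verb comes back in its dictionary form while the noun is kept as written, which is the behavior described for the function above.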

Let's test it on part of the data extracted earlier. Only nouns, verbs, and adjectives are extracted. (The term "9 + 2 structure" cannot be extracted as one word...) Apply the function wakati_text to the entire list `abstracts` to get a list of nouns, verbs, and adjectives.

`python`


wakati_abstracts = []

# Tokenize every abstract and pool the resulting words
for abstract in abstracts:
    wakati_abstracts.extend(wakati_text(abstract))

You now have a list of nouns, verbs, and adjectives.

Visualization

Count the elements of the list wakati_abstracts and draw a bar chart of the 50 most frequent words.

`python`


import collections
import matplotlib.pyplot as plt
import matplotlib as mpl

# Count the words and unpack them in descending order of frequency
words, counts = zip(*collections.Counter(wakati_abstracts).most_common())

# Use a font that can render Japanese labels
mpl.rcParams['font.family'] = 'Noto Sans JP Regular'
plt.figure(figsize=[12, 6])
plt.bar(words[0:50], counts[0:50])  # top 50 words
plt.xticks(rotation=90)
plt.ylabel('freq')
plt.savefig('kaken_bar.png', dpi=200, bbox_inches="tight")

Because stop words were not removed, words such as "do", "koto", "reru", "is", and "target" rank high. Apart from the search term "Chlamydomonas" itself, words familiar to anyone working on Chlamydomonas, such as "gene", "light", "cell", "flagella", "protein", and "dynein", also appear. The result suggests that verbs and adjectives were not really needed.
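
I did not remove stop words here, but a minimal sketch of how it could be done before counting looks like this. The stop-word list `STOP_WORDS` and the helper `count_terms` are hypothetical; the entries are just examples of the kind of words that ranked high, not a curated list.

```python
from collections import Counter

# Hypothetical stop-word list; the entries are only examples
STOP_WORDS = {"する", "こと", "れる", "いる", "ある"}


def count_terms(terms, stop_words=STOP_WORDS):
    """Count terms after dropping the stop words."""
    return Counter(term for term in terms if term not in stop_words)


# Made-up token list standing in for the real morphological analysis result
sample_terms = ["する", "遺伝子", "こと", "鞭毛", "遺伝子", "する"]
counts = count_terms(sample_terms)
print(counts.most_common(2))
```

The resulting Counter can be passed to the same plotting code in place of the unfiltered one.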

Extraction of nouns only

I extracted only nouns using the same procedure as above. Just set the second argument of the function wakati_text to ['noun'].

`python`


noun_abstracts = []

# Tokenize every abstract, keeping nouns only
for abstract in abstracts:
    noun_abstracts.extend(wakati_text(abstract, ['noun']))

The rest of the code is the same as above, so I will omit it and show only the visualization result. It bothers me that "koto" is in first place and that the numbers "1", "2", and "3" are included, but the result looks a little more like keywords than before.
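
One way to deal with "koto" and the numbers might be to look at the second field of MeCab's feature string as well. The sketch below assumes ipadic-style tags, where numerals are tagged 名詞,数 and dependent nouns such as こと are tagged 名詞,非自立. The names `EXCLUDED_SUBTYPES` and `is_keyword_noun` and the sample feature strings are my own, written by hand.

```python
# Noun subtypes to skip: numerals (数) and dependent nouns (非自立)
EXCLUDED_SUBTYPES = {"数", "非自立"}


def is_keyword_noun(feature):
    """True for nouns whose subtype is not in EXCLUDED_SUBTYPES."""
    fields = feature.split(",")
    return fields[0] == "名詞" and fields[1] not in EXCLUDED_SUBTYPES


# Hand-written (surface, feature) pairs in ipadic's feature layout
samples = [
    ("鞭毛", "名詞,一般,*,*,*,*,鞭毛,ベンモウ,ベンモウ"),
    ("2", "名詞,数,*,*,*,*,*"),
    ("こと", "名詞,非自立,一般,*,*,*,こと,コト,コト"),
]
kept = [surface for surface, feature in samples if is_keyword_noun(feature)]
print(kept)
```

This keeps the filtering inside the part-of-speech check, so no separate stop-word list is needed for these two cases.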

Terminology extraction using termextract

Next, let's use termextract to extract technical terms. I used its MeCab-based (morphological analysis) mode, following the article here.

Data shaping

termextract takes MeCab's morphological analysis output as its input. Parse each element of the list `abstracts` with MeCab and join the results into a single newline-separated string.

`python`


# Build the input in MeCab's output format
mecab_abstracts = []

for abstract in abstracts:
    mecab_abstracts.append(tagger.parse(abstract))

# Join the per-abstract results with newlines
input_text = '\n'.join(mecab_abstracts)

Analyze with term extract

The code is taken almost entirely from here.

`python`


import collections

import termextract.mecab
import termextract.core

word_list = []
value_list = []

# Compound nouns and their frequencies, taken from the MeCab output
frequency = termextract.mecab.cmp_noun_dict(input_text)
# LR score for each compound noun
LR = termextract.core.score_lr(
    frequency,
    ignore_words=termextract.mecab.IGNORE_WORDS,
    lr_mode=1,
    average_rate=1,
)
# Importance combines frequency and LR
term_imp = termextract.core.term_importance(frequency, LR)

# Sort and output in descending order of importance
data_collection = collections.Counter(term_imp)
for cmp_noun, value in data_collection.most_common():
    word = termextract.core.modify_agglutinative_lang(cmp_noun)
    word_list.append(word)
    value_list.append(value)
    print(word, value, sep="\t")

I'm not sure exactly what the score means, but the words I was hoping for do show up. Let's visualize this as well.

Visualization

The code is the same as above, so I will omit it. More plausible terms such as "photosystem II", "transformant", "flagellar movement", and "gene group" are picked up. One concern is that "Chlamydomonas" and "green alga Chlamydomonas", or "dynein" and "axoneme dynein", are counted as separate items.
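
If the separate items are a problem, one rough idea is to group each extracted term under the shortest term it contains. This is only a sketch of that idea, not something termextract does for you: `group_variants` is a hypothetical helper, and plain substring matching is crude (a short term could match inside an unrelated longer word).

```python
def group_variants(terms):
    """Group each term under the shortest listed term it contains."""
    heads = sorted(terms, key=len)
    groups = {}
    for term in terms:
        # First (shortest) head contained in the term; the term itself always matches
        head = next(h for h in heads if h in term)
        groups.setdefault(head, []).append(term)
    return groups


terms = [
    "Chlamydomonas",
    "green alga Chlamydomonas",
    "dynein",
    "axoneme dynein",
    "photosystem II",
]
groups = group_variants(terms)
for head, members in groups.items():
    print(head, members)
```

Whether merging is desirable depends on the purpose: "axoneme dynein" is arguably a more informative keyword than "dynein" alone, so keeping them apart can also be the right call.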

Summary

I extracted keywords from search results in the Kakenhi database. Compared with morphological analysis by MeCab alone, termextract extracted terms that look more like keywords.

Bonus: GiNZA

I also tried named entity recognition with GiNZA.

`python`


import spacy
from spacy import displacy

nlp = spacy.load('ja_ginza')
doc = nlp(abstracts[0]) 

#Drawing the result of named entity extraction
displacy.render(doc, style="ent", jupyter=True)

These are not named entities, so it can't be helped, but the expressions I wanted, such as "Unikont", "Bikont", "cilia", and "Chlamydomonas", are not picked up. And, as expected, "9 + 2 structure" is not extracted either.

References

- Prepare an environment where MeCab can be used on Mac
- Extract only words with specific part of speech in Python and Mecab
- Easy keyword extraction with TermExtract for Python
- I tried to extract named entities with the natural language processing library GiNZA
- Biological Exercise Machinery Picture Book Chlamydomonas (Swimming Exercise)
- 9 + 2 structure from ancient times, the mystery of cilia

[PYTHON] [Natural language processing] Extract keywords from Kakenhi database with MeCab-ipadic-neologd and termextract
