Good luck to all the researchers who are writing Grants-in-Aid for Scientific Research (Kakenhi) applications. As you know, the projects adopted in the past are listed in the Kakenhi database, but reading through all of them is hard work. To get a rough idea of past trends, I tried extracting keywords from the research outlines in the Kakenhi database with natural language processing. I use the morphological analysis package MeCab and the terminology extraction tool termextract.
I used Python and Jupyter Notebook.
MeCab
Following "Prepare an environment where MeCab can be used on Mac" (listed at the end), I installed MeCab and mecab-python3 for morphological analysis and set neologd as the standard dictionary. Once installed, try it in bash.
bash
echo "Eukaryote" | mecab
True prefix,Noun connection,*,*,*,*,true,Ma,Ma
Nuclear noun,General,*,*,*,*,Nuclear,write,write
Biological noun,General,*,*,*,*,Organism,Saves,Saves
EOS
The default ipadic dictionary does not recognize "eukaryote" as a single word; it is split into three tokens.
bash
echo "Grant-in-Aid for Scientific Research" | mecab
Scientific nouns,General,*,*,*,*,Science,Science,Science
Research nouns,Change connection,*,*,*,*,the study,Kenkyu,Kenkyu
Expense noun,suffix,General,*,*,*,Expenses,Hi,Hi
Auxiliary noun,Change connection,*,*,*,*,auxiliary,Hojo,Hojo
Gold noun,suffix,General,*,*,*,Money,Kin,Kin
EOS
ipadic does not recognize "Grants-in-Aid for Scientific Research" as a single word either. Next, the same two inputs with neologd as the dictionary.
bash
echo "Eukaryote" | mecab
Eukaryotic noun,Proper noun,General,*,*,*,Eukaryote,Shinkaku Saves,Shinkaku Saves
EOS
neologd recognizes "eukaryote" as a single word! With this, maybe we can expect a little from keyword extraction.
bash
echo "Grant-in-Aid for Scientific Research" | mecab
Scientific nouns,General,*,*,*,*,Science,Science,Science
Research nouns,Change connection,*,*,*,*,the study,Kenkyu,Kenkyu
Expense noun,suffix,General,*,*,*,Expenses,Hi,Hi
Subsidy noun,Proper noun,General,*,*,*,Subsidy,Hojokin,Hojokin
EOS
"Grants-in-Aid for Scientific Research" does not seem to be recognized as one word.
mecab-python
Let's try MeCab from Python. For testing, I borrowed the first sentence of the data used below.
python
import MeCab

# Create a tagger and print the analysis of one sentence
tagger = MeCab.Tagger("mecabrc")
print(tagger.parse("Eukaryotes can be broadly divided into Unikont and Bikont."))
Output result
Eukaryotic noun,Proper noun,General,*,*,*,Eukaryote,Shinkaku Saves,Shinkaku Saves
Is a particle,Particle,*,*,*,*,Is,C,Wa
Unikont noun,Proper noun,General,*,*,*,Unikont,Unikont,Unikont
And particles,Parallel particles,*,*,*,*,When,To,To
Bikont noun,Proper noun,General,*,*,*,Bikont,Bikont,Bikont
Particles,Case particles,General,*,*,*,To,D,D
Great noun,Change connection,*,*,*,*,Roughly divided,Taibetsu,Taibetsu
Verbs that can,Independence,*,*,One step,Uninflected word,it can,Dekill,Dekill
.. symbol,Punctuation,*,*,*,*,。,。,。
EOS
I was able to morphologically analyze from Python.
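The text that `tagger.parse` returns has one token per line, with the surface form, a tab, and then comma-separated features, and it ends with an `EOS` line. As a minimal sketch, the output can be split into pairs like this. The function name `parse_mecab_output` and the English sample string are made up for illustration and are not part of MeCab.

```python
def parse_mecab_output(output):
    """Split MeCab's default text output into (surface, features) pairs."""
    tokens = []
    for line in output.splitlines():
        if line == "EOS" or not line:
            continue  # EOS marks the end of a sentence
        surface, _, feature_text = line.partition("\t")
        tokens.append((surface, feature_text.split(",")))
    return tokens


# Hypothetical sample in MeCab's layout: surface<TAB>comma-separated features
sample = "cell\tnoun,general,*,*,*,*,cell\nmove\tverb,independent,*,*,*,*,move\nEOS\n"
for surface, features in parse_mecab_output(sample):
    # features[0] is the part of speech, features[6] the base form
    print(surface, features[0], features[6])
```

With a real dictionary the features are in Japanese, but the layout is the same.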
termextract
termextract is a package that extracts technical terms. The input has to be passed in the form of MeCab analysis results. I installed it following "Easy keyword extraction with TermExtract for Python" (listed at the end).
Finally, on to the Kakenhi data. At first I was thinking of scraping with Python and was looking into things such as whether scraping is prohibited, but then I realized that the search results can be downloaded as CSV, so none of that was necessary. I downloaded all the items for the search word "Chlamydomonas". If you are not familiar with Chlamydomonas, please see the Chlamydomonas link listed at the end.
Read the data with pandas and check it. I forgot to specify the encoding, but the file was read without any error (pandas assumes UTF-8 by default).
python
import pandas as pd
kaken = pd.read_csv('kaken.nii.ac.jp_2020-10-23_22-31-59.csv')
Check the first few rows of the data with `kaken.head()`. There seem to be a lot of NaN values.
Check the entire data frame with `kaken.info()`.
Output result
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 528 entries, 0 to 527
Data columns (total 40 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Research subject name 528 non-null object
1 Research subject name(English) 269 non-null object
2 Research subject/Region number 528 non-null object
3 Research period(year) 528 non-null object
4 Principal 471 non-null object
5 Research Coordinator 160 non-null object
6 Collaborative Researcher 31 non-null object
7 Collaborators 20 non-null object
8 Research Fellow 53 non-null object
9 Foreign Research Fellow 4 non-null object
10 Accepted Researcher 4 non-null object
11 Keywords 505 non-null object
12 Research fields 380 non-null object
13 Examination category 102 non-null object
14 Research items 528 non-null object
15 Research Institute 528 non-null object
16 Application category 212 non-null object
17 Total allocation amount 526 non-null float64
18 Total allocation(Direct expenses) 526 non-null float64
19 Total allocation(Indirect expenses) 249 non-null float64
20 Allocation amount for each year 526 non-null object
21 Allocation amount for each year(Direct expenses) 526 non-null object
22 Allocation amount for each year(Indirect expenses) 526 non-null object
23 Achievement to date(Classification code) 46 non-null float64
24 Achievement to date(Classification) 46 non-null object
25 Reason 46 non-null object
26 Outline of research at the beginning of research 14 non-null object
27 Research outline 323 non-null object
28 Research outline(English) 156 non-null object
29 Outline of research results 85 non-null object
30 Outline of research results(English) 85 non-null object
31 Outline of research results 84 non-null object
32 Achievement to date(Paragraph) 90 non-null object
33 Measures to promote future research 94 non-null object
34 Next year's research funding plan 0 non-null float64
35 Reason for the amount used in the next fiscal year 0 non-null float64
36 Usage plan for next year 0 non-null float64
37 Free description field 0 non-null float64
38 Evaluation symbol 3 non-null object
39 Remarks 0 non-null float64
dtypes: float64(9), object(31)
memory usage: 165.1+ KB
Free text appears in four columns: "Outline of research at the beginning of research" (column 26), "Research outline" (27), and the two columns labeled "Outline of research results" (29 and 31). There is also a "Keywords" column, but this time I want to extract keywords from the text, so I will ignore it. Probably because the items to be filled in have changed from year to year, there are many NaNs, and the rows that contain text differ from column to column. I decided to make a list by extracting only the text from the data frame.
python
# Positions of the four text columns in the kaken.info() output.
# Selecting by position avoids trouble with the two columns that share a label.
text_column_positions = [26, 27, 29, 31]
abstracts = []
for position in text_column_positions:
    # Drop NaN cells and keep only the text
    abstracts.extend(kaken.iloc[:, position].dropna().tolist())
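The collection step can be tried on a tiny made-up table without downloading anything. This sketch uses the standard `csv` module instead of pandas, so empty cells play the role of NaN. The column names, rows, and the helper `collect_texts` are all invented for illustration.

```python
import csv
import io

# Made-up rows: each project fills in only some of the text columns
sample_csv = """title,outline_a,outline_b
Project 1,Flagella beat.,
Project 2,,Dynein drives motion.
Project 3,,
"""


def collect_texts(csv_text, columns):
    """Gather every non-empty cell from the given columns, column by column."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    texts = []
    for column in columns:
        texts.extend(row[column] for row in rows if row[column])
    return texts


texts = collect_texts(sample_csv, ["outline_a", "outline_b"])
print(texts)
```

The result is one flat list of sentences, no matter which column each one came from, which is all the later steps need.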
The text is now ready for morphological analysis. Let's run it on each element of this list.
Following "Extract only words with specific part of speech in Python and Mecab" (listed at the end), I defined a function that returns a list of words from MeCab's analysis. By default, only nouns, verbs, and adjectives are extracted, and verbs and adjectives are converted to their base (dictionary) form.
python
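import MeCab

tagger = MeCab.Tagger('')
tagger.parse('')  # parse once first; a common workaround before using parseToNode

# MeCab's Japanese dictionaries (ipadic, neologd) emit Japanese part-of-speech tags
POS_TAGS = {'noun': '名詞', 'verb': '動詞', 'adjective': '形容詞'}

def wakati_text(text, word_class=('verb', 'adjective', 'noun')):
    """Return the words in text whose part of speech is in word_class."""
    targets = {POS_TAGS.get(name, name) for name in word_class}
    node = tagger.parseToNode(text)  # first node of the analyzed sentence
    terms = []
    while node:
        features = node.feature.split(',')
        pos = features[0]  # part of speech
        if pos in targets:
            if pos == POS_TAGS['noun']:
                terms.append(node.surface)  # nouns: form as it appears in the sentence
            else:
                terms.append(features[6])  # verbs and adjectives: base form
        node = node.next
    return terms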
Let's test it on part of the data extracted earlier.
Only nouns, verbs, and adjectives were extracted. ("9 + 2 structure" could not be extracted, though...)
Apply the function `wakati_text` to the entire list `abstracts` to get a list of nouns, verbs, and adjectives.
python
wakati_abstracts = []
for abstract in abstracts:
    wakati_abstracts.extend(wakati_text(abstract))
You now have a list of nouns, verbs, and adjectives.
Count the elements of the list `wakati_abstracts` and draw a bar graph of the 50 most frequent words.
python
import collections
import matplotlib.pyplot as plt
import matplotlib as mpl

# Count every word and keep the 50 most frequent
words, counts = zip(*collections.Counter(wakati_abstracts).most_common(50))
mpl.rcParams['font.family'] = 'Noto Sans JP Regular'  # font that can render Japanese labels
plt.figure(figsize=[12, 6])
plt.bar(words, counts)
plt.xticks(rotation=90)
plt.ylabel('freq')
plt.savefig('kaken_bar.png', dpi=200, bbox_inches="tight")
Because stop words were not removed, words such as "do", "koto", "reru", "is", and "target" rank high. Besides the search word "Chlamydomonas", words familiar to anyone who works with Chlamydomonas, such as "gene", "light", "cell", "flagella", "protein", and "dynein", also appear. The result suggests that the verbs and adjectives were not really needed.
I extracted only nouns by the same procedure as above.
Just set the second argument of the function `wakati_text` to `['noun']`.
python
noun_abstracts = []
for abstract in abstracts:
    noun_abstracts.extend(wakati_text(abstract, ['noun']))
The plotting code is the same as above, so I will omit it and only describe the result. It bothers me that "koto" is in first place and that the numbers "1", "2", and "3" are included, but the result looks a little more like keywords than before.
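One way to deal with "koto" and the bare numbers would be to filter the word list before counting. The following is only a sketch: the stop-word set is a hypothetical placeholder (a real one would hold the Japanese tokens and be tuned to the corpus), and the sample tokens are written in romanized form as in the text above.

```python
from collections import Counter

# Hypothetical stop-word list, for illustration only
STOP_WORDS = {"koto", "target"}


def filter_terms(terms, stop_words=STOP_WORDS):
    """Drop stop words and tokens that consist only of digits."""
    return [t for t in terms if t not in stop_words and not t.isdigit()]


sample_terms = ["koto", "gene", "1", "flagella", "gene", "2", "koto", "dynein"]
counts = Counter(filter_terms(sample_terms))
print(counts.most_common(2))
```

`str.isdigit()` also returns True for full-width digits, so numbers written that way in Japanese text are dropped as well.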
Next, let's use termextract to extract technical terms. For how to pass in the morphological analysis results, I followed "Easy keyword extraction with TermExtract for Python" (listed at the end).
The input format of termextract is MeCab's morphological analysis output. Parse each element of the list `abstracts` with MeCab and join the results into one string separated by line breaks.
python
# Pass the text in MeCab's output format
mecab_abstracts = []
for abstract in abstracts:
    mecab_abstracts.append(tagger.parse(abstract))
# Join the per-abstract results with newlines
input_text = '\n'.join(mecab_abstracts)
The code below is taken almost as-is from "Easy keyword extraction with TermExtract for Python".
python
import collections
import termextract.mecab
import termextract.core

word_list = []
value_list = []
# Count compound nouns in the MeCab-formatted text
frequency = termextract.mecab.cmp_noun_dict(input_text)
LR = termextract.core.score_lr(
    frequency,
    ignore_words=termextract.mecab.IGNORE_WORDS,
    lr_mode=1, average_rate=1
)
term_imp = termextract.core.term_importance(frequency, LR)
# Sort and output in descending order of importance
data_collection = collections.Counter(term_imp)
for cmp_noun, value in data_collection.most_common():
    word = termextract.core.modify_agglutinative_lang(cmp_noun)
    word_list.append(word)
    value_list.append(value)
    print(word, value, sep="\t")
I'm not sure exactly what the score means, but plausible terms are coming out. Let's visualize this as well.
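As far as I understand, termextract follows the FLR idea from Nakagawa's term extraction work: a compound noun scores higher when the simple nouns in it often connect to other nouns on their left and right, and that LR value is multiplied by the compound's own frequency. The sketch below is a simplified illustration of that idea under those assumptions, not termextract's actual implementation. The function `flr_scores` and the sample counts are made up.

```python
from collections import Counter


def flr_scores(compound_counts):
    """Score compound nouns, given {tuple of simple nouns: frequency}."""
    left = Counter()   # left[n]: how often another noun sits directly before n
    right = Counter()  # right[n]: how often another noun sits directly after n
    for nouns, freq in compound_counts.items():
        for first, second in zip(nouns, nouns[1:]):
            right[first] += freq
            left[second] += freq

    scores = {}
    for nouns, freq in compound_counts.items():
        product = 1
        for noun in nouns:
            product *= (left[noun] + 1) * (right[noun] + 1)
        lr = product ** (1 / (2 * len(nouns)))  # geometric mean over 2L factors
        scores[" ".join(nouns)] = freq * lr
    return scores


# Made-up counts, for illustration only
sample = {
    ("flagellar", "movement"): 3,
    ("flagellar", "dynein"): 1,
    ("gene", "group"): 2,
    ("gene",): 5,
}
for term, score in sorted(flr_scores(sample).items(), key=lambda kv: -kv[1]):
    print(term, round(score, 2), sep="\t")
```

Under this reading, a term gets a high score by being frequent itself, by being built from nouns that combine productively with other nouns, or both.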
The plotting code is the same as above, so I'll omit it. More plausible terms such as "photosystem II", "transformant", "flagellar movement", and "gene group" are extracted. One open question is whether it is acceptable that "Chlamydomonas" and "green alga Chlamydomonas", or "dynein" and "axonemal dynein", are counted as separate items.
Keywords were extracted from the search results of the Kakenhi database. Compared with morphological analysis by MeCab alone, termextract extracted words that look more like keywords.
I also tried named entity recognition with GiNZA.
python
import spacy
from spacy import displacy
nlp = spacy.load('ja_ginza')
doc = nlp(abstracts[0])
#Drawing the result of named entity extraction
displacy.render(doc, style="ent", jupyter=True)
They are not named entities, so it can't be helped, but the expressions I actually want, such as "Unikont," "Bikont," "cilia," and "Chlamydomonas," are not picked up. And once again, "9 + 2 structure" is not extracted.
- Prepare an environment where MeCab can be used on Mac
- Extract only words with specific part of speech in Python and Mecab
- Easy keyword extraction with TermExtract for Python
- I tried to extract named entities with the natural language processing library GiNZA
- Biological Exercise Machinery Picture Book Chlamydomonas (Swimming Exercise)
- 9 + 2 structure from ancient times, the mystery of cilia