I read the following article about termextract: "Use termextract to extract jargon from retained data and create mecab user dictionary" (Qiita).
Morphological analysis tends to segment text more sensibly when you supply a dictionary of technical terms specific to your field, so the idea is to build a MeCab user dictionary with termextract.
In my case I only wanted to apply the current extraction result and check it, so generating a dictionary was more than I needed. Instead, I wrote a class that returns a string in the same format as MeCab's output.
Environment: Python 3.7.5, mecab-python 0.996.3, termextract 0.12b0
Create an object from the result of `MeCab.parse()`, then retrieve a string in the same format.
main
import MeCab
from my_termextract import TermExtract  # assuming the class lives in my_termextract.py

# Opening passage of Rashomon (the original article passes the Japanese sentence to MeCab)
text = "As long as Rashomon is on Suzaku Avenue, there are likely to be a few more people besides this man, such as Ichimekasa and Eboshi, who are raining."
mecab = MeCab.Tagger()
mecab_text = mecab.parse(text)
# Pass the MeCab result to the class
TX = TermExtract(mecab_text)
extracted = TX.get_extracted_words()  # extract important words
modified_text = TX.get_modified_mecab_text()  # MeCab-format text with morphemes joined according to the important words
print(modified_text)
Execution result
Rashomon	noun,proper noun,general,*,*,*,Rashomon,Rashoumon,Rashomon
ga	particle,case particle,general,*,*,*,ga,ga,ga
、	symbol,comma,*,*,*,*,、,、,、
Suzaku Avenue	noun,general,*,*,*,*,Suzaku Avenue,Suzaku Ooji,Suzakuoji
ni	particle,case particle,general,*,*,*,ni,ni,ni
aru	verb,independent,*,*,godan (ra row),base form,aru,aru,aru
ijou	noun,dependent,adverbial,*,*,*,ijou,ijou,ijo
wa	particle,binding particle,*,*,*,*,wa,ha,wa
、	symbol,comma,*,*,*,*,、,、,、
...
In MeCab, Suzaku Avenue (朱雀大路) is split into Suzaku (朱雀) and Oji (大路), but here the two are concatenated because termextract extracts them as a run of consecutive words.
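For reference, the round trip between MeCab's text output and per-morpheme tuples can be sketched as follows. This is only a minimal illustration, not the code from the repository: the function names `parse_mecab_text` and `format_mecab_text` are hypothetical, and the sketch assumes the IPA dictionary layout of a surface form, a tab, and nine comma-separated features.

```python
def parse_mecab_text(mecab_text):
    """Turn MeCab's text output into a list of 10-field tuples (hypothetical helper)."""
    morphs = []
    for line in mecab_text.splitlines():
        # Skip the end-of-sentence marker and blank lines.
        if not line or line == "EOS":
            continue
        surface, features = line.split("\t", 1)
        fields = features.split(",")
        # Unknown words may come with fewer features; pad with "*" up to nine.
        fields += ["*"] * (9 - len(fields))
        morphs.append((surface, *fields[:9]))
    return morphs


def format_mecab_text(morphs):
    """Rebuild a MeCab-style string from the tuples (hypothetical helper)."""
    lines = [m[0] + "\t" + ",".join(m[1:]) for m in morphs]
    return "\n".join(lines + ["EOS"]) + "\n"


# Two morphemes as the IPA dictionary would print them: 朱雀 (Suzaku) and 大路 (Oji).
sample = (
    "朱雀\t名詞,固有名詞,地域,一般,*,*,朱雀,スザク,スザク\n"
    "大路\t名詞,一般,*,*,*,*,大路,オオジ,オージ\n"
    "EOS\n"
)
morphs = parse_mecab_text(sample)
print(format_mecab_text(morphs) == sample)
```

With tuples in this shape, index 0 is the surface form and indices 7 to 9 are the base form, reading, and pronunciation.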
The full source is posted on GitHub. Below I describe the parts where I made my own design choices.
When concatenating multiple morphemes, only the [surface form, base form, reading, pronunciation] fields are concatenated. The other fields are left alone to avoid creating part-of-speech values that do not exist, such as a doubled "noun". For the fields that are not concatenated, the value from the last morpheme in the group is used.
my_termextract.py
import copy


def concat_morph(morphs):
    '''
    Combine multiple morphemes.
    Only [surface form, base form, reading, pronunciation] are joined.
    All other fields are taken from the last element of the list.
    Input: list of morphemes
    Output: combined morpheme
    '''
    new_morph = list(copy.deepcopy(morphs[-1]))
    # Surface form
    new_morph[0] = "".join(x[0] for x in morphs)
    # Base form
    new_morph[7] = "".join(x[7] for x in morphs if x[7] != "*")
    # Reading
    new_morph[8] = "".join(x[8] for x in morphs if x[8] != "*")
    # Pronunciation
    new_morph[9] = "".join(x[9] for x in morphs if x[9] != "*")
    return tuple(new_morph)
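As a worked example of the rule described above, the following self-contained sketch merges the two morphemes 朱雀 (Suzaku) and 大路 (Oji). It restates the function in compact form so that it runs on its own. One detail is an assumption of this sketch rather than the author's code: when every morpheme has "*" in a joined field, the sketch keeps "*" instead of leaving an empty string.

```python
def concat_morph(morphs):
    """Join fields 0, 7, 8 and 9; take every other field from the last morpheme."""
    merged = list(morphs[-1])
    merged[0] = "".join(m[0] for m in morphs)
    for i in (7, 8, 9):
        joined = "".join(m[i] for m in morphs if m[i] != "*")
        # Assumption of this sketch: fall back to "*" when nothing was joined.
        merged[i] = joined or "*"
    return tuple(merged)


# (surface, 6 part-of-speech/inflection fields, base form, reading, pronunciation)
suzaku = ("朱雀", "名詞", "固有名詞", "地域", "一般", "*", "*", "朱雀", "スザク", "スザク")
oji = ("大路", "名詞", "一般", "*", "*", "*", "*", "大路", "オオジ", "オージ")

merged = concat_morph([suzaku, oji])
print(merged)
```

The part-of-speech fields of the result come from 大路 alone, which is why the merged entry is a general noun rather than a proper noun.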
Suzaku	noun,proper noun,region,general,*,*,Suzaku,Suzaku,Suzaku
Oji	noun,general,*,*,*,*,Oji,Ooji,Oji
↓
Suzaku Avenue	noun,general,*,*,*,*,Suzaku Avenue,Suzaku Ooji,Suzakuoji
Only the entries in `extracted_words` that consist of two or more words are targeted. By default `extracted_words` holds the termextract result, but if there are other words you want to concatenate, or words you do not want concatenated, you can handle that by overwriting `extracted_words`.
my_termextract.py
for cmp_noun in self.extracted_words:
    # Get the surface forms
    surfaces, *_ = zip(*self.morphs)
    # Split the compound on spaces
    cmp_list = cmp_noun.split(" ")
    len_cmp = len(cmp_list)
    # Skip entries that are not compounds
    if len_cmp < 2:
        continue
    # Indices where the compound's morphemes appear consecutively
    match_indeces = [
        i for i in range(len(surfaces) - len_cmp + 1)
        if surfaces[i:i + len_cmp] == tuple(cmp_list)
    ]
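The excerpt above stops once the match positions are found. One way the matched spans could then be replaced is sketched below. It is an illustration under stated assumptions, not the author's implementation: `merge_compound` is a hypothetical name, the morphemes are reduced to two fields (surface form, part of speech) to keep the example short, and a single left-to-right scan is used instead of an index list so that positions do not shift after a merge.

```python
def merge_compound(morphs, cmp_noun, concat):
    """Replace each consecutive run matching cmp_noun with one merged morpheme."""
    cmp_list = cmp_noun.split(" ")
    n = len(cmp_list)
    # Single words are not compounds, so there is nothing to merge.
    if n < 2:
        return list(morphs)
    surfaces = [m[0] for m in morphs]
    result = []
    i = 0
    while i < len(morphs):
        if surfaces[i:i + n] == cmp_list:
            result.append(concat(morphs[i:i + n]))
            i += n  # jump past the merged span
        else:
            result.append(morphs[i])
            i += 1
    return result


def simple_concat(group):
    """Join the surface forms; keep the remaining fields of the last morpheme."""
    return ("".join(m[0] for m in group),) + tuple(group[-1][1:])


morphs = [
    ("朱雀", "名詞"), ("大路", "名詞"), ("に", "助詞"),
    ("朱雀", "名詞"), ("大路", "名詞"),
]
merged = merge_compound(morphs, "朱雀 大路", simple_concat)
print(merged)
```

Scanning from the left also means overlapping candidates are resolved in favour of the earliest match.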
For the extraction itself I followed the article mentioned at the beginning: "Use termextract to extract jargon from retained data and create mecab user dictionary" (Qiita).
my_termextract.py
import termextract.core
import termextract.mecab

# Extract compound words and calculate their importance
frequency = termextract.mecab.cmp_noun_dict(self.mecab_text)
LR = termextract.core.score_lr(
    frequency,
    ignore_words=termextract.mecab.IGNORE_WORDS,
    lr_mode=1, average_rate=1
)
term_imp = termextract.core.term_importance(frequency, LR)
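The scored terms can then be narrowed down to the compounds that are worth concatenating. The sketch below assumes `term_imp` is a dict mapping space-separated terms to scores, which matches the `split(" ")` handling shown earlier. The helper `select_compounds`, its `add` and `drop` parameters, the sample scores, and the sample segmentations are all made up for illustration and run without termextract installed.

```python
def select_compounds(term_imp, add=(), drop=()):
    """Return multi-morpheme terms, highest score first, with manual overrides."""
    ranked = sorted(term_imp.items(), key=lambda kv: kv[1], reverse=True)
    # Keep only terms made of two or more space-separated morphemes.
    words = [term for term, _ in ranked if len(term.split(" ")) >= 2]
    # Remove terms that should not be concatenated.
    dropped = set(drop)
    words = [w for w in words if w not in dropped]
    # Append extra terms that should be concatenated.
    words += [w for w in add if w not in words]
    return words


# Illustrative scores only; real values come from termextract.
term_imp = {"朱雀 大路": 2.0, "羅生門": 1.0, "市女 笠": 1.5}

extracted_words = select_compounds(term_imp, add=["揉 烏帽子"], drop=["市女 笠"])
print(extracted_words)
```

Keeping the overrides in one place makes it easy to compare the segmentation with and without a given compound.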
I wrote this class because it seemed convenient to have something I could drop in right after receiving the MeCab result, in code whose later processing is already implemented. For now, I think it is useful for checking the results.
Use termextract to extract jargon from retained data and create mecab user dictionary --Qiita