Introduction

I read the following article about termextract: "Use termextract to extract jargon from retained data and create mecab user dictionary" (Qiita).

When performing morphological analysis, segmentation tends to come out well if you first build a technical-term dictionary of words specific to your industry, so that article creates a MeCab user dictionary with termextract.

I only wanted to apply the current extraction result and check it, so writing out a dictionary was more than I needed. That is why I created a class that returns a string in the same format as MeCab's output.

Environment

Python 3.7.5, mecab-python 0.996.3, termextract 0.12b0

How to use

Create an object from the result of `parse()` on a `MeCab.Tagger`, then get back a string in the same format.

`main`


import MeCab
from my_termextract import TermExtract  # assumes the class is defined in my_termextract.py

text = "As long as Rashomon is on Suzaku Avenue, there are likely to be a few more people besides this man, such as Ichimekasa and Eboshi, who are raining."

mecab = MeCab.Tagger()
mecab_text = mecab.parse(text)

# Pass MeCab's output to the class
TX = TermExtract(mecab_text)
extracted = TX.get_extracted_words()  # extract the important terms
modified_text = TX.get_modified_mecab_text()  # MeCab-format text with morphemes merged according to the important terms

print(modified_text)

`Execution result`


Rashomon noun,Proper noun,General,*,*,*,Rashomon,Rashomon,La Chaumont
Is a particle,Case particles,General,*,*,*,But,Moth,Moth
, Symbol,Comma,*,*,*,*,、,、,、
Suzaku Avenue Noun,General,*,*,*,*,Suzaku Avenue,Suzaku Ooji,Suzakuoji
Particles,Case particles,General,*,*,*,To,D,D
A verb,Independence,*,*,Five steps, La line,Uninflected word,is there,Al,Al
Nouns,Non-independent,Adverbs possible,*,*,*,that's all,Ijo,Ijo
Is a particle,Particle,*,*,*,*,Is,C,Wow
, Symbol,Comma,*,*,*,*,、,、,、
...

MeCab splits "Suzaku Avenue" into "Suzaku" and "Oji", but the two morphemes are concatenated here because termextract extracts them as one run of consecutive words.
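To make "the same format as the output of MeCab" concrete, here is a minimal sketch of reading MeCab-style text into morpheme tuples and writing it back. The helper names `parse_mecab_text` and `to_mecab_text` are hypothetical and not taken from the author's class, and the sample lines are a hand-written illustration rather than real MeCab output. It assumes each line is a surface form, a tab, then comma-separated features, with `EOS` closing the sentence.

```python
def parse_mecab_text(mecab_text):
    """Split MeCab-style output into (surface, feature1, ..., featureN) tuples."""
    morphs = []
    for line in mecab_text.splitlines():
        # Skip blank lines and the end-of-sentence marker
        if not line or line == "EOS":
            continue
        surface, features = line.split("\t", 1)
        morphs.append((surface, *features.split(",")))
    return morphs


def to_mecab_text(morphs):
    """Serialize morpheme tuples back into the same line format."""
    lines = [m[0] + "\t" + ",".join(m[1:]) for m in morphs]
    return "\n".join(lines + ["EOS"]) + "\n"


# Hand-written sample (illustrative, not actual MeCab output)
sample = (
    "朱雀\t名詞,固有名詞,地域,一般,*,*,朱雀,スザク,スザク\n"
    "大路\t名詞,一般,*,*,*,*,大路,オオジ,オージ\n"
    "EOS\n"
)
morphs = parse_mecab_text(sample)
roundtrip = to_mecab_text(morphs)
```

Each tuple here has ten fields, which is the layout the index-based code later in this article relies on. Words the dictionary does not know may come back with fewer features, so a real implementation would probably need to pad those entries.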

Source code

The full source is posted on GitHub. Here I describe the parts I coded with particular care.

Concatenation of morphemes

When concatenating multiple morphemes, only the [surface form, base form, reading, pronunciation] fields are concatenated. The other fields are left alone so that joining does not create a new part of speech, such as "noun" and "noun" glued into "nounnoun". For every field that is not string-concatenated, the value from the last morpheme being joined is adopted.

`my_termextract.py`


def concat_morph(morphs):
    '''
    Combine multiple morphemes into one.
    Only [surface form, base form, reading, pronunciation] are concatenated.
    All other fields take the value of the last morpheme in the list.

    Input: list of morphemes
    Output: the combined morpheme
    '''
    # Start from a copy of the last morpheme (its fields are immutable strings,
    # so a shallow copy is sufficient)
    new_morph = list(morphs[-1])

    # Surface form
    new_morph[0] = "".join(x[0] for x in morphs)
    # Base form ("*" means the field is empty, so skip it)
    new_morph[7] = "".join(x[7] for x in morphs if x[7] != "*")
    # Reading
    new_morph[8] = "".join(x[8] for x in morphs if x[8] != "*")
    # Pronunciation
    new_morph[9] = "".join(x[9] for x in morphs if x[9] != "*")
    return tuple(new_morph)

Example

Suzaku noun,Proper noun,area,General,*,*,Suzaku,Suzaku,Suzaku
Oji noun,General,*,*,*,*,Oji,Oji,Oji

↓

Suzaku Avenue Noun,General,*,*,*,*,Suzaku Avenue,Suzaku Ooji,Suzakuoji
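The example above can also be checked in code. The sketch below restates the concatenation rule compactly so that it runs on its own. The name `join_fields` is hypothetical, and the Japanese field values are my back-translation of the example, not verified MeCab output. It also falls back to "*" when every component has "*" in a field, which is an added assumption rather than behavior described in this article.

```python
# Indices of the fields that are string-concatenated:
# surface form, base form, reading, pronunciation
JOINED = (0, 7, 8, 9)


def join_fields(morphs):
    """Concatenate the JOINED fields; copy all other fields from the last morpheme."""
    merged = list(morphs[-1])
    for idx in JOINED:
        # The surface form is always kept; other fields skip the "*" placeholder
        parts = [m[idx] for m in morphs if idx == 0 or m[idx] != "*"]
        # Assumption: keep "*" if no component supplied a value
        merged[idx] = "".join(parts) or "*"
    return tuple(merged)


suzaku = ("朱雀", "名詞", "固有名詞", "地域", "一般", "*", "*", "朱雀", "スザク", "スザク")
oji = ("大路", "名詞", "一般", "*", "*", "*", "*", "大路", "オオジ", "オージ")
merged = join_fields([suzaku, oji])
```

The part-of-speech fields of `merged` come from `oji` alone, so the result stays a plain general noun instead of mixing in the proper-noun tags of the first morpheme.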

Selecting the words to concatenate

The targets are the entries in `extracted_words` that are made up of two or more words. By default it holds the termextract result, but if there are other words you want to concatenate, or words you do not want concatenated, you can handle that by overwriting `extracted_words`.

`my_termextract.py`


for cmp_noun in self.extracted_words:
    # Get the surface forms of the current morphemes
    surfaces, *_ = zip(*self.morphs)

    # A compound term is stored as its component words separated by spaces
    cmp_list = cmp_noun.split(" ")
    len_cmp = len(cmp_list)
    # Skip terms that consist of a single word
    if len_cmp < 2:
        continue

    # Start indices where the run of surface forms matches the compound term
    match_indeces = [i for i in range(len(surfaces)-len_cmp+1) if surfaces[i:i+len_cmp]==tuple(cmp_list)]
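The excerpt stops once the matching positions are found. As a rough sketch of how the replacement step could continue, the function below scans the morphemes left to right and swaps each matching run for a single merged morpheme. `merge_terms` and its `concat` parameter are hypothetical names of mine, not the author's implementation, and the two-field morphemes are simplified for illustration.

```python
def merge_terms(morphs, terms, concat):
    """Replace each run of morphemes that spells a multi-word term with one morpheme.

    morphs: list of tuples whose first element is the surface form
    terms:  space-separated compound terms
    concat: function that turns a list of morphemes into a single morpheme
    """
    morphs = list(morphs)
    for term in terms:
        parts = term.split(" ")
        n = len(parts)
        if n < 2:
            continue  # single words need no merging
        i = 0
        # Re-check the length on every pass because merging shortens the list
        while i <= len(morphs) - n:
            window = morphs[i:i + n]
            if [m[0] for m in window] == parts:
                morphs[i:i + n] = [concat(window)]
            i += 1
    return morphs


def join_surfaces(ms):
    """Simplified stand-in for a morpheme joiner: join surfaces, keep the last POS."""
    return ("".join(m[0] for m in ms),) + tuple(ms[-1][1:])


morphs = [("朱雀", "名詞"), ("大路", "名詞"), ("に", "助詞")]
result = merge_terms(morphs, ["朱雀 大路", "羅生門"], join_surfaces)
```

Scanning and replacing in one pass avoids stale indices: after a merge the list is shorter, so positions computed beforehand would no longer line up.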

termextract parameters

For the parameters, I followed the article mentioned at the beginning: "Use termextract to extract jargon from retained data and create mecab user dictionary" (Qiita).

`my_termextract.py`


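import termextract.core
import termextract.mecab

# Extract compound nouns from the MeCab output and count their frequency
frequency = termextract.mecab.cmp_noun_dict(self.mecab_text)
# Compute the LR score of each term
LR = termextract.core.score_lr(frequency,
    ignore_words=termextract.mecab.IGNORE_WORDS,
    lr_mode=1, average_rate=1
    )
# Combine frequency and LR into an importance score per term
term_imp = termextract.core.term_importance(frequency, LR)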
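How the scores become the list of words to concatenate is not shown in this article. As one possible sketch, assuming `term_imp` behaves like a dict that maps space-separated terms to importance scores, the multi-word terms could be ranked and filtered as below. `select_compound_terms`, `min_score`, and the scores themselves are hypothetical placeholders, not termextract output.

```python
def select_compound_terms(term_imp, min_score=0.0):
    """Return multi-word terms scoring above min_score, highest score first."""
    ranked = sorted(term_imp.items(), key=lambda kv: kv[1], reverse=True)
    # A space in the key means the term is made of two or more words
    return [term for term, score in ranked if " " in term and score > min_score]


# Placeholder scores for illustration only
term_imp = {"朱雀 大路": 2.0, "羅生門": 1.0, "市女 笠": 1.5}
extracted_words = select_compound_terms(term_imp, min_score=1.0)
```

A threshold like this is one way to keep low-scoring candidates from being merged. The author's class may simply keep every extracted term.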

Finally

I built this because I thought it would be handy to have something I could drop in between receiving MeCab's result and the downstream processing I had already implemented. For now, I think it is good enough for checking extraction results.

The code is messy, so I plan to refactor it. If the parts described in this article change, I will update the article.

Referenced page

"Use termextract to extract jargon from retained data and create mecab user dictionary" (Qiita)

[Python] Replace the text output by MeCab with the important words extracted by MeCab + Term Extract.
