[Python] Rewrite MeCab's text output using the important words extracted with MeCab + termextract.

Introduction

I came across termextract through the following article: Use termextract to extract jargon from retained data and create mecab user dictionary --Qiita

When doing morphological analysis, the text tends to be segmented much better if you first build a dictionary of the technical terms specific to your domain, so that article uses termextract to create a MeCab user dictionary.

In my case I only wanted to reflect the current extraction results and check them, so producing a dictionary was more than I needed. Instead, I wrote a class that emits a string in the same format as MeCab's output.

Environment

Python 3.7.5
mecab-python 0.996.3
termextract 0.12b0

How to use

Create the object by passing it the result of MeCab's `parse()`, and it can return a string in the same format.

main


import MeCab
from my_termextract import TermExtract  # the class described below; module name assumed from the file name my_termextract.py

# Opening passage of Akutagawa's "Rashomon"
text = "羅生門が、朱雀大路にある以上は、この男のほかにも、雨やみをする市女笠や揉烏帽子が、もう二三人はありそうなものである。"

mecab = MeCab.Tagger()
mecab_text = mecab.parse(text)

# Pass the MeCab result
TX = TermExtract(mecab_text)
extracted = TX.get_extracted_words()          # extract the important words
modified_text = TX.get_modified_mecab_text()  # MeCab-format text with the important words concatenated

print(modified_text)

Execution result


羅生門	名詞,固有名詞,一般,*,*,*,羅生門,ラショウモン,ラショーモン
が	助詞,格助詞,一般,*,*,*,が,ガ,ガ
、	記号,読点,*,*,*,*,、,、,、
朱雀大路	名詞,一般,*,*,*,*,朱雀大路,スザクオオジ,スザクオージ
に	助詞,格助詞,一般,*,*,*,に,ニ,ニ
ある	動詞,自立,*,*,五段・ラ行,基本形,ある,アル,アル
以上	名詞,非自立,副詞可能,*,*,*,以上,イジョウ,イジョー
は	助詞,係助詞,*,*,*,*,は,ハ,ワ
、	記号,読点,*,*,*,*,、,、,、
...

MeCab by itself splits 朱雀大路 (Suzaku Avenue) into 朱雀 and 大路, but because termextract extracts them as a single compound term, they are concatenated in the output above.
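For reference, here is a minimal check you could run on the result, under the assumption that `get_extracted_words()` returns termextract's compound nouns as space-separated strings (the matching code shown later treats them that way):

# Was the compound behind the concatenation actually extracted?
# (assumes extracted holds space-separated compound nouns such as "朱雀 大路")
print("朱雀 大路" in extracted)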

Source code

The full source is posted on GitHub. Below I go over the parts I implemented my own way.

Concatenation of morphemes

When concatenating multiple morphemes, only the [surface form, base form, reading, pronunciation] fields are joined as strings. The other fields are left unjoined to avoid creating part-of-speech values that do not exist (a "noun-noun", for example); for those fields, the value of the last morpheme in the sequence is used.

my_termextract.py


def concat_morph(morphs):
    '''
    Combine multiple morphemes into one.
    Only the [surface form, base form, reading, pronunciation] fields are joined;
    every other field takes the value of the last morpheme in the list.

    Input: list of morphemes
    Output: combined morpheme (tuple)
    '''
    import copy
    new_morph = list(copy.deepcopy(morphs[-1]))

    # Surface form
    new_morph[0] = "".join(x[0] for x in morphs)
    # Base form
    new_morph[7] = "".join(x[7] for x in morphs if x[7] != "*")
    # Reading
    new_morph[8] = "".join(x[8] for x in morphs if x[8] != "*")
    # Pronunciation
    new_morph[9] = "".join(x[9] for x in morphs if x[9] != "*")
    return tuple(new_morph)

Example

朱雀	名詞,固有名詞,地域,一般,*,*,朱雀,スザク,スザク
大路	名詞,一般,*,*,*,*,大路,オオジ,オージ

朱雀大路	名詞,一般,*,*,*,*,朱雀大路,スザクオオジ,スザクオージ
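As a minimal sketch (not part of the original code), this is roughly how the example above is produced, assuming `concat_morph` is available as the module-level function shown earlier and each morpheme is a 10-element tuple of the surface form followed by the nine IPAdic feature fields:

suzaku = ("朱雀", "名詞", "固有名詞", "地域", "一般", "*", "*", "朱雀", "スザク", "スザク")
oji = ("大路", "名詞", "一般", "*", "*", "*", "*", "大路", "オオジ", "オージ")

combined = concat_morph([suzaku, oji])
print(combined)
# -> ('朱雀大路', '名詞', '一般', '*', '*', '*', '*', '朱雀大路', 'スザクオオジ', 'スザクオージ')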

About selecting the words to concatenate

"Words composed of two or more words" included in ʻextracted_words are targeted. Basically, the result of termextract is stored, but if there are other words that you want to concatenate or you do not want to concatenate, you can handle it by overwriting ʻextracted_words.

my_termextract.py


for cmp_noun in self.extracted_words:
    # Get the surface forms
    surfaces, *_ = zip(*self.morphs)

    # Compound nouns from termextract are space-separated
    cmp_list = cmp_noun.split(" ")
    len_cmp = len(cmp_list)
    # Skip terms that are not compounds (nothing to concatenate)
    if len_cmp < 2:
        continue

    # Indices where the surface forms match the compound term
    match_indeces = [i for i in range(len(surfaces)-len_cmp+1) if surfaces[i:i+len_cmp]==tuple(cmp_list)]
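For example, here is a rough sketch of overriding the word selection before regenerating the text. It assumes `extracted_words` is an ordinary attribute holding space-separated compound nouns, which is how the loop above uses it; the specific words dropped and added are purely illustrative:

TX = TermExtract(mecab_text)
words = TX.get_extracted_words()

# Drop a compound you do not want concatenated, add one you do
words = [w for w in words if w != "雨 やみ"]  # hypothetical compound to drop
words.append("揉 烏帽子")                      # hypothetical compound to add

TX.extracted_words = words
modified_text = TX.get_modified_mecab_text()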

About the termextract parameters

These follow the article mentioned at the beginning: Use termextract to extract jargon from retained data and create mecab user dictionary --Qiita

my_termextract.py


# Extract compound nouns and compute their importance
frequency = termextract.mecab.cmp_noun_dict(self.mecab_text)
LR = termextract.core.score_lr(frequency,
    ignore_words=termextract.mecab.IGNORE_WORDS,
    lr_mode=1, average_rate=1
    )
term_imp = termextract.core.term_importance(frequency, LR)
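As a sketch of what can be done with the result (not shown in the snippet above), the importance scores can be turned into a ranked list of compound terms. This assumes `term_importance` returns a dict-like mapping from space-separated compound nouns to scores:

# Rank compound nouns by importance, highest first
ranked = sorted(term_imp.items(), key=lambda kv: kv[1], reverse=True)
extracted_words = [cmp_noun for cmp_noun, score in ranked]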

Finally

I wrote this because I thought it would be handy to drop in between receiving the MeCab result and the downstream processing I had already implemented. For now, at least, it is useful for sanity-checking the extraction results.

Referenced page

Use termextract to extract jargon from retained data and create mecab user dictionary --Qiita
