I wrote a class that makes it easier to split text by part of speech when using MeCab in Python

When using MeCab from Python, I found myself rewriting the same code in various ways every time I wanted to extract words freely by part of speech. I wrote this class to get rid of that inconvenience, and I am publishing it here.

import MeCab
import unicodedata
import re


class MecabParser():
    def __init__(self, word_classes=None, word_class_details=None):
        """
        Args:
            word_classes (list, optional): Parts of speech to keep, specified in Japanese. Defaults to None.
            word_class_details (list, optional): Part-of-speech details (subcategories) to keep. Defaults to None.

        See below for the parts of speech defined by MeCab:
        https://taku910.github.io/mecab/posid.html
        """
        self._word_classes = word_classes
        self._word_class_details = word_class_details

    def _format_text(self, text):
        """
        Format the text before passing it to MeCab.
        """
        text = re.sub(r'http(s)?://[\w/:%#\$&\?\(\)~\.=\+\-…]+', "", text)
        text = re.sub(r'[ -/:-@\[-~_]', "", text)  # Half-width symbols (the \[-~ range also covers lowercase ASCII letters)
        text = re.sub(r'[︰-＠]', "", text)  # Full-width symbols
        text = re.sub(r'\d', "", text)  # Digits
        text = re.sub('\n', " ", text)  # Line feed
        text = re.sub('\r', " ", text)  # Carriage return

        return text

    def parse(self, text, is_base=False):
        text = self._format_text(text)

        # Unicode normalization (NFC). Without it, voiced (dakuten) and semi-voiced (handakuten) sound marks may remain separated from their base characters.
        text = unicodedata.normalize('NFC', text)

        result = []
        tagger = MeCab.Tagger(
            '-d /usr/local/lib/mecab/dic/mecab-ipadic-neologd')
        # Calling parse once before parseToNode avoids an error when reading node.surface
        tagger.parse('')

        node = tagger.parseToNode(text)
        while node:
            wclass = node.feature.split(',')
            # Skip the empty BOS/EOS sentinel nodes at the start and end of the text.
            if wclass[0] == 'BOS/EOS':
                node = node.next
                continue

            word = wclass[6] if is_base else node.surface
            if not self._word_classes:
                # No part of speech specified: keep every word.
                result.append(word)
            elif wclass[0] in self._word_classes:
                # Keep the word if no details are specified, or if its detail matches.
                if not self._word_class_details or wclass[1] in self._word_class_details:
                    result.append(word)
            node = node.next

        return result
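
The preprocessing can be tried on its own, without installing MeCab. The following is a minimal sketch, assuming the same regular expressions as `_format_text` and assuming the full-width symbol range runs from `︰` to `＠`. The standalone function name `format_text` and the sample string are my own, for illustration only.

```python
import re
import unicodedata


def format_text(text):
    """Standalone copy of the cleanup steps (hypothetical helper for illustration)."""
    text = re.sub(r'http(s)?://[\w/:%#\$&\?\(\)~\.=\+\-…]+', "", text)  # URLs
    text = re.sub(r'[ -/:-@\[-~_]', "", text)  # half-width symbols (also lowercase ASCII)
    text = re.sub(r'[︰-＠]', "", text)  # full-width symbols, U+FE30 to U+FF20
    text = re.sub(r'\d', "", text)  # digits
    text = re.sub(r'[\r\n]', " ", text)  # line breaks become spaces
    return text


# Made-up sample: full-width digits, a full-width "!", a URL, and a newline.
sample = "ラーメン２０２０！https://example.com/a?b=1\nおいしい"
cleaned = format_text(sample)
print(cleaned)  # ラーメン おいしい

# NFC joins a base kana and a combining dakuten into one code point.
decomposed = "か\u3099"  # "か" followed by U+3099 (combining voiced sound mark)
composed = unicodedata.normalize('NFC', decomposed)
print(len(decomposed), len(composed))  # 2 1
```

Note that the full-width range also removes full-width digits, and that the half-width range removes lowercase ASCII letters, so this cleanup is best suited to text that is almost entirely Japanese.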

I save this in a file called parser.py and use it from there. Usage looks like the following; extracting words by part of speech becomes fairly easy.

>>> from parser import MecabParser
>>> mp = MecabParser(word_classes=['名詞'], word_class_details=['一般', '固有名詞'])
>>> # The input is a Japanese sentence; the input and output are shown here in English translation.
>>> text = "I'm hungry today, so I came to eat one of the best ramen in the neighborhood."
>>> mp.parse(text)
['stomach', 'Neighborhood', 'Tenkaippin', 'ramen']
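
The filtering relies on the comma-separated feature string that MeCab attaches to each node. In the IPADIC layout, field 0 is the part of speech, field 1 is its detail, and field 6 is the base form. Below is a minimal sketch of the same filtering rules applied to hand-written (surface, feature) pairs, so it runs without MeCab. The helper name `select_words` and the sample pairs are my own illustration written in the IPADIC layout, not captured MeCab output, and the sketch skips the BOS/EOS nodes explicitly.

```python
def select_words(tokens, word_classes=None, word_class_details=None, is_base=False):
    """Filter (surface, feature) pairs by part of speech (hypothetical helper)."""
    result = []
    for surface, feature in tokens:
        fields = feature.split(',')
        if fields[0] == 'BOS/EOS':
            continue  # sentence boundary markers carry no word
        if word_classes and fields[0] not in word_classes:
            continue  # part of speech not requested
        if word_classes and word_class_details and fields[1] not in word_class_details:
            continue  # detail not requested
        result.append(fields[6] if is_base else surface)
    return result


# Hand-written samples in the IPADIC feature layout (illustrative, not real MeCab output).
tokens = [
    ('', 'BOS/EOS,*,*,*,*,*,*,*,*'),
    ('近所', '名詞,一般,*,*,*,*,近所,キンジョ,キンジョ'),
    ('の', '助詞,連体化,*,*,*,*,の,ノ,ノ'),
    ('ラーメン', '名詞,一般,*,*,*,*,ラーメン,ラーメン,ラーメン'),
    ('食べ', '動詞,自立,*,*,一段,連用形,食べる,タベ,タベ'),
    ('', 'BOS/EOS,*,*,*,*,*,*,*,*'),
]

nouns = select_words(tokens, word_classes=['名詞'])
verbs = select_words(tokens, word_classes=['動詞'], is_base=True)
print(nouns)  # ['近所', 'ラーメン']
print(verbs)  # ['食べる']
```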

The class uses the mecab-ipadic-neologd dictionary, can convert inflected words to their base form (with is_base=True), and formats the text in advance, so I hope users will modify it to their liking.
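
One modification that may be useful is making the dictionary location configurable instead of hard-coding it. This is only a sketch under my own assumptions: `build_tagger_args` and `make_tagger` are hypothetical helpers that are not part of the class above, and the path shown is the one used in this article, which may differ on your machine.

```python
def build_tagger_args(dic_dir=None):
    """Build the option string for MeCab.Tagger (hypothetical helper).

    With no directory given, an empty string is returned so that MeCab
    falls back to its default dictionary.
    """
    return f'-d {dic_dir}' if dic_dir else ''


def make_tagger(dic_dir=None):
    """Create a tagger for the given dictionary directory (hypothetical helper)."""
    # Imported here so build_tagger_args can be used without MeCab installed.
    import MeCab
    return MeCab.Tagger(build_tagger_args(dic_dir))


neologd_args = build_tagger_args('/usr/local/lib/mecab/dic/mecab-ipadic-neologd')
print(neologd_args)  # -d /usr/local/lib/mecab/dic/mecab-ipadic-neologd
```

A directory path containing spaces would need quoting, which this sketch does not handle.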
