I made a Python library that makes morpheme division (tokenization) easy to handle

Content of this article

What you can do with this library

Installation

MeCab and Mecab-neologd

First of all, you need to have MeCab installed.

I expect the pre-processing craftsmen reading this already have it set up, so I'll skip the instructions.

The library can also call the Mecab-neologd dictionary, so it is a good idea to install that as well.

Installing the library itself

You can install it by cloning the repository (https://github.com/Kensuke-Mitsuzawa/JapaneseTokenizers) with `git clone` and then running `python setup.py install` in the cloned directory.

Or you can install it directly with `pip install git+https://github.com/Kensuke-Mitsuzawa/JapaneseTokenizers`.

How to use

The same content is covered in the example, so I will keep this brief.

Morpheme division

Prepare the input sentence.

`sentence = u'Tehran (Persian: تهران; Tehrān Tehran.ogg pronunciation [help / file] / teɦˈrɔːn /, English: Tehran) is the capital of West Asia, Iran and the capital of Tehran Province. Population 12,223,598. The metropolitan population reaches 13,413,348. '`

In Python 2.x, the input must be `unicode`.

In Python 3.x, it doesn't matter which type you use.
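If the same script has to run under both Python 2.x and 3.x, one way to meet the `unicode` requirement is to normalize the input before tokenizing. This is only a minimal sketch: `ensure_text` is a hypothetical helper written for this illustration and is not part of JapaneseTokenizers, and UTF-8 is an assumed encoding.

```python
import sys

# Hypothetical helper for illustration; it is not part of JapaneseTokenizers.
if sys.version_info[0] < 3:
    text_type = unicode  # noqa: F821  (this name exists only on Python 2)
else:
    text_type = str


def ensure_text(value, encoding='utf-8'):
    """Return value as a text string, decoding byte strings if needed."""
    if isinstance(value, text_type):
        return value
    if isinstance(value, bytes):
        # On Python 2, a plain str is a byte string and ends up here.
        # The UTF-8 default is an assumption; pass another encoding if needed.
        return value.decode(encoding)
    raise TypeError('expected text or bytes, got %r' % type(value))


sentence = ensure_text(b'Tehran is the capital of Iran.')
```

After this step, `sentence` is a text string on either Python version and can be handed to the tokenizer.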

Specify the OS type. For anything other than CentOS, `osType = "generic"` is fine. Only on CentOS should you set `osType = "centos"`. (This is because the MeCab system command differs only on CentOS. There may be other OSs like that... I have confirmed that it works on Ubuntu and Mac.)

Specify the dictionary type in `dictType`.

Initialize the instance: `mecab_wrapper = MecabWrapper(dictType=dictType, osType=osType)`

Split the sentence into words: `tokenized_obj = mecab_wrapper.tokenize(sentence=sentence)`

By default, each word and its part of speech are returned as a tuple pair.
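As a rough illustration of working with such pairs, the sketch below unpacks them in plain Python. The exact shape of the returned value is not spelled out in this article, so the `tokens` list here is hand-written sample data, not real library output, and it assumes each element is a (word, part-of-speech tuple) pair.

```python
from collections import Counter

# Hand-written sample data, NOT real output of MecabWrapper.tokenize().
# Assumption: each element is a (word, part-of-speech tuple) pair.
tokens = [
    (u'Tehran', (u'noun', u'proper noun', u'region')),
    (u'capital', (u'noun', u'general', u'*')),
    (u'reaches', (u'verb', u'independence', u'*')),
]

# Unpack each pair to collect the surface words.
words = [word for word, _pos in tokens]

# Count tokens by the top level of their part of speech.
pos_counts = Counter(pos[0] for _word, pos in tokens)
```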

`tokenized_obj = mecab_wrapper.tokenize(sentence=sentence, return_list=False)`

With `return_list=False`, a class object is returned instead, so it is better to set this flag if you want to pass the result on to other processing.

Filtering

Stop words are specified like this:

stopwords = [u'Tehran']

That is, put the strings in a list. (Both `str` and `unicode` are acceptable.)

To filter by part of speech, specify a list of part-of-speech tuples: [(part-of-speech tuple), ...].

A part of speech can be specified up to 3 levels deep. For example, in the IPADIC part-of-speech system, if you want noun - proper noun - personal name, write `(u'noun', u'proper noun', u'personal name')`.

If you only want to specify down to noun - proper noun, use `(u'noun', u'proper noun')`.

Again, both `str` and `unicode` are acceptable.

Put the part-of-speech tuples you want to keep in a list.

pos_condition = [(u'noun', u'proper noun'), (u'verb', u'independence')]

Perform filtering.

filtered_obj = mecab_wrapper.filter(
    parsed_sentence=tokenized_obj,
    pos_condition=pos_condition
)

The return value is a class object.
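To make the two kinds of conditions concrete, here is a rough sketch of how such filtering could work in plain Python. It is not the library's implementation: `matches_pos`, `filter_tokens`, and the sample `tokens` are all hypothetical. It also assumes that a condition matches when it is a prefix of the token's part-of-speech tuple, which is one reading of the "up to 3 levels" rule above.

```python
def matches_pos(pos, condition):
    """True if condition is a prefix of pos, e.g. (u'noun',) matches any noun."""
    return tuple(pos[:len(condition)]) == tuple(condition)


def filter_tokens(tokens, stopwords=(), pos_condition=None):
    """Drop stop words, then keep tokens matching any part-of-speech condition."""
    stopword_set = set(stopwords)
    kept = []
    for word, pos in tokens:
        if word in stopword_set:
            continue
        if pos_condition is not None and not any(
                matches_pos(pos, cond) for cond in pos_condition):
            continue
        kept.append((word, pos))
    return kept


# Hand-written sample data, NOT real output of the library.
tokens = [
    (u'Tehran', (u'noun', u'proper noun', u'region')),
    (u'Iran', (u'noun', u'proper noun', u'region')),
    (u'capital', (u'noun', u'general', u'*')),
    (u'reaches', (u'verb', u'independence', u'*')),
]

filtered = filter_tokens(
    tokens,
    stopwords=[u'Tehran'],
    pos_condition=[(u'noun', u'proper noun'), (u'verb', u'independence')],
)
```

In this sketch the stop word is removed first, and a shorter tuple such as `(u'noun',)` would keep every noun regardless of its lower levels.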

Why did I make this?

Briefly, here is my motivation.

I've been working on natural language processing for a long time... I'm a saint who does pre-processing day after day, and sometimes even pre-processing for other people's research...

But at one point, I suddenly noticed: "Am I not writing the same morpheme division code every time?"

So I packaged just the processing that I have used most often (and will probably keep using) while doing the same thing over and over again.

A similar package for Python is natto.

However, I found it inconvenient that with natto I had to write the filtering myself and couldn't add a dictionary, so I made a new one.

To every pre-processing craftsman out there: I hope this reduces your workload as much as possible so that you can enjoy NLP.
