TL;DR
Introducing konoha (formerly tiny_tokenizer), a library for tokenizing Japanese sentences. You can use it like this:
from konoha import WordTokenizer
sentence = '自然言語処理を勉強しています'
tokenizer = WordTokenizer('MeCab')
print(tokenizer.tokenize(sentence))  # -> [自然, 言語, 処理, を, 勉強, し, て, い, ます]
tokenizer = WordTokenizer('Kytea')
print(tokenizer.tokenize(sentence))  # -> [自然, 言語, 処理, を, 勉強, し, て, い, ま, す]
tokenizer = WordTokenizer('Sentencepiece', model_path="data/model.spm")
print(tokenizer.tokenize(sentence))  # -> [▁, 自然, 言語, 処理, を, 勉強, し, ています]
Unlike languages such as English, Japanese does not mark word boundaries with explicit delimiters. For this reason, analyzing Japanese first requires splitting a sentence into some unit, for example words. Besides word-level segmentation, there is subword segmentation, which splits words further into smaller pieces, and character-level segmentation, which splits the string into individual characters. In this article, the substrings produced by any of these units are called tokens. There are various methods for tokenization. Morphological analyzers, which are widely used for analyzing Japanese text, also perform word segmentation (in addition to word segmentation, morphological analysis estimates lemmas and part-of-speech tags).
There are several algorithms for word segmentation: MeCab builds a lattice using a dictionary and then determines the optimal word sequence, while Kytea determines word boundaries at the character level. These algorithms may return the same segmentation or different ones. Also, even with the same segmentation algorithm, the word units change if the part-of-speech system differs.
- IPADic part-of-speech system
- UniDic part-of-speech system
Subwords are units obtained by splitting words into smaller pieces. Their effectiveness has been confirmed in neural machine translation. Sentencepiece is a well-known subword tokenizer used for analyzing Japanese text.
- Detailed description of MeCab (word segmentation is also covered)
- Explanation of Kytea's word segmentation method
- Comparison of various part-of-speech systems and tokenizers
- Explanation of Sentencepiece
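As a supplement to the Sentencepiece link above, here is a minimal sketch (not from this article) of training and applying a subword model with the sentencepiece Python package. The corpus file "corpus.txt" and the "data/model" prefix are placeholder paths, and the keyword-argument call assumes a recent version of the package.
import sentencepiece as spm

# Train a subword model on a raw-text corpus (one sentence per line).
# "corpus.txt" and "data/model" are placeholder paths, not files from this article.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="data/model",  # writes data/model.model and data/model.vocab
    vocab_size=8000,
)

# Load the trained model and split a sentence into subword tokens.
sp = spm.SentencePieceProcessor(model_file="data/model.model")
print(sp.encode("自然言語処理を勉強しています", out_type=str))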
People doing Japanese text mining probably use MeCab + NEologd most of the time, while research often uses Kytea. Also, Ginza, a dependency parsing library that has been attracting attention recently, uses a morphological analyzer called SudachiPy (a Python implementation of the morphological analyzer Sudachi). In short, a variety of analyzers are used for word-level segmentation, and it is hard to decide which one is best for your purpose.
Furthermore, in recent years, mainly in the context of machine translation, it has been reported that task performance is better when tokens are split into subwords than when the word segmentation produced by a morphological analyzer is used, and subword-based segmentation is often adopted.
Character-level tokenization is characterized by the small number of character types compared to the number of word types. Since the number of word types is generally much larger than the number of character types, tokenizing at the character level reduces the vocabulary size. For example, in named entity recognition there is an approach that adds character-level features to LSTM features, and many recent studies adopt it. (The paper in the link is old, but it is a favorite of mine.)
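As a trivial illustration (plain Python rather than anything konoha-specific), character-level tokenization amounts to splitting the string into individual characters:
sentence = "自然言語処理を勉強しています"

# Character-level tokenization: every character becomes a token,
# so the vocabulary is bounded by the number of distinct characters.
tokens = list(sentence)
print(tokens)
print(len(set(tokens)))  # size of the character vocabulary for this sentence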
Given this situation, in what unit should we tokenize sentences? In general, my understanding is that the answer is "task dependent" and there is no single right answer. As a result, choices like "I use MeCab + NEologd because many people use it" or "the paper I am following uses subwords, so I will use subwords for now" tend to be made.
In modern natural language processing tasks (especially when using neural networks), there are many other things that demand as much attention as tokenization (for example the architecture, hidden layer dimensions, optimizer, learning rate, and so on). Against this background, I think the current situation is that the tokenization method is often fixed at the very beginning of tackling a problem and not revisited. But aren't other tokenization methods worth trying? I believe they are, so I developed a library that makes it easy to switch the tokenization method. This is the motivation behind konoha.
Switching tokenizers usually comes at a small cost. All of the morphological analyzers and tokenizers mentioned above have Python wrappers, and users can call these tools from Python by installing the wrapper libraries.
However, each wrapper library provides an API with its own idioms (which is natural, since the analyzers and the authors of their wrapper libraries differ). Therefore, if you want to switch between the outputs of multiple analyzers depending on the situation, you need to implement your own layer that absorbs the differences between those APIs.
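To make the difference in idioms concrete, here is a rough sketch that calls the MeCab and KyTea wrappers directly (this is not konoha code). It assumes mecab-python3 and Mykytea-python are installed and that KyTea can find a model; the option strings are illustrative.
import MeCab
import Mykytea

sentence = "自然言語処理を勉強しています"

# mecab-python3: parse() returns a single string that you split yourself.
tagger = MeCab.Tagger("-Owakati")
mecab_tokens = tagger.parse(sentence).strip().split()

# Mykytea-python: a different object, method name, and return type.
kytea = Mykytea.Mykytea("")  # KyTea options go here, e.g. "-model /path/to/model.bin"
kytea_tokens = list(kytea.getWS(sentence))

print(mecab_tokens)
print(kytea_tokens)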
There is a library called JapaneseTokenizer (GitHub repository: Kensuke-Mitsuzawa/JapaneseTokenizers). Like konoha, JapaneseTokenizer provides wrappers for multiple tokenizers and an interface that handles multiple morphological analyzers. It also implements many practical functions useful for text analysis, such as filtering analysis results by specific part-of-speech tags. It is a very convenient tool for natural language processing that makes use of the results of multiple morphological analyzers.
konoha
On the other hand, konoha does not currently provide functions such as part-of-speech filtering. konoha is a library that abstracts the tokenization step of each analyzer, and it provides subword-level and character-level segmentation, which JapaneseTokenizer does not cover.
This library is positioned as a wrapper of Python wrappers. Thanks to everyone who provides the Python wrappers for the analyzers, the purpose of this library is to absorb the differences between the interfaces of those libraries. By using konoha, users can work with multiple analyzers through a unified API.
First, here is an example using MeCab. In this example, mecab-ipadic is used as the dictionary.
If you are using macOS, install mecab and mecab-ipadic; if you are using Ubuntu, also install libmecab-dev in addition to those. Operation has not been verified on other distributions (as long as mecab and mecab-config can be run and a dictionary is installed, it should work fine). If you build the Dockerfile in the GitHub repository, the environment will be ready to go.
--Code
from konoha import WordTokenizer
sentence = '自然言語処理を勉強しています'
tokenizer = WordTokenizer('MeCab')
print(tokenizer.tokenize(sentence))
--Output
[自然, 言語, 処理, を, 勉強, し, て, い, ます]
Next, let's use Kytea. Again, you need to build Kytea yourself (please refer to the Dockerfile in the repository).
--Code
from konoha import WordTokenizer
sentence = '自然言語処理を勉強しています'
tokenizer = WordTokenizer('Kytea')
print(tokenizer.tokenize(sentence))
--Output
[自然, 言語, 処理, を, 勉強, し, て, い, ま, す]
Also, if you want to split a sentence into subwords, you can use Sentencepiece. When using Sentencepiece, you need to specify a model file: pass the path to the model file via the model_path argument.
--Code
from konoha import WordTokenizer
sentence = '自然言語処理を勉強しています'
tokenizer = WordTokenizer('Sentencepiece', model_path="data/model.spm")
print(tokenizer.tokenize(sentence))
--Output
[▁, 自然, 言語, 処理, を, 勉強, し, ています]
In this way, multiple analyzers can be used in a unified manner simply by changing the argument passed to WordTokenizer. This makes it easy to try different tokenizers during the experimental phase.
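For example, comparing several analyzers becomes a simple loop over tokenizer names (a minimal sketch, assuming MeCab and Kytea are installed as described above):
from konoha import WordTokenizer

sentence = '自然言語処理を勉強しています'

# The same code path works for every supported analyzer.
for name in ['MeCab', 'Kytea']:
    tokenizer = WordTokenizer(name)
    print(name, tokenizer.tokenize(sentence))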
Some of the supported tokenizers are morphological analyzers. Of the tokenizers currently supported by konoha, the morphological analyzers are MeCab, Kytea, and Sudachi (SudachiPy). For these, whether to also obtain the information produced by the morphological analyzer, such as part-of-speech tags, at tokenization time can be controlled with an option. An example using SudachiPy is shown below.
--Code
from konoha import WordTokenizer
sentence = '自然言語処理を勉強しています'
tokenizer = WordTokenizer('Sudachi', mode='A', with_postag=True)
print(tokenizer.tokenize(sentence))
--Output
[自然 (名詞), 言語 (名詞), 処理 (名詞), を (助詞), 勉強 (名詞), し (動詞), て (助詞), い (動詞), ます (助動詞)]
The output of tokenizer.tokenize is a list of instances of the Token class. The following instance variables are defined in the Token class (an excerpt from the docstring of the Token class); a short example of accessing them follows the docstring. Information that an analyzer does not return is set to None. For example, token.normalized_form has a value other than None only when SudachiPy is used and with_postag is True. (A Token is one element of the token sequence returned by tokenizer.tokenize.)
"""
surface (str)
surface (original form) of a word
postag (str, default: None)
part-of-speech tag of a word (optional)
postag2 (str, default: None)
detailed part-of-speech tag of a word (optional)
postag3 (str, default: None)
detailed part-of-speech tag of a word (optional)
postag4 (str, default: None)
detailed part-of-speech tag of a word (optional)
inflection (str, default: None)
conjugate type of word (optional)
conjugation (str, default: None)
conjugate type of word (optional)
base_form (str, default: None)
base form of a word
yomi (str, default: None)
yomi of a word (optional)
pron (str, default: None)
pronounciation of a word (optional)
normalized_form (str, default: None)
normalized form of a word (optional)
Note that normalized_form is only
available on SudachiPy
"""
If you want to use a user dictionary, pass its path to WordTokenizer via the argument named user_dictionary_path.
from konoha import WordTokenizer
sentence = '自然言語処理を勉強しています'
tokenizer = WordTokenizer('MeCab', user_dictionary_path="path/to/user_dict")
print(tokenizer.tokenize(sentence))
If you want to use mecab-ipadic-NEologd, or a system dictionary that you retrained on your own corpus, you can create a tokenizer that uses a specific system dictionary. Pass the path to the system dictionary you want to use via the system_dictionary_path argument of WordTokenizer.
from konoha import WordTokenizer
sentence = '自然言語処理を勉強しています'
tokenizer = WordTokenizer('MeCab', system_dictionary_path="path/to/system_dict")
print(tokenizer.tokenize(sentence))
In this article, we introduced konoha, a library for using multiple tokenizers through the same interface. With it, you can easily switch analyzers when you are unsure which one to use at the start of a text analysis project. It also helps when you plan to run experiments with MeCab but prior work uses a different analyzer and you would otherwise have to write separate code for comparison: by inserting konoha, you can run the same code with either analyzer without hassle. I hope it is useful both to people doing natural language processing in practice and to those doing it in research. Please give it a try.
To build Kytea on Ubuntu 18.04, this pull request needs to be applied. Please refer to the Dockerfile in the konoha repository.