TL;DR
Introducing konoha (formerly tiny_tokenizer), a library for tokenizing Japanese sentences. You can use it like this:
from konoha import WordTokenizer
sentence = '自然言語処理を勉強しています'
tokenizer = WordTokenizer('MeCab')
print(tokenizer.tokenize(sentence))  # -> [自然, 言語, 処理, を, 勉強, し, て, い, ます]
tokenizer = WordTokenizer('Kytea')
print(tokenizer.tokenize(sentence))  # -> [自然, 言語, 処理, を, 勉強, し, て, い, ま, す]
tokenizer = WordTokenizer('Sentencepiece', model_path="data/model.spm")
print(tokenizer.tokenize(sentence))  # -> [▁, 自然, 言語, 処理, を, 勉強, し, ています]
Unlike languages such as English, Japanese does not mark word boundaries with explicit delimiters. For this reason, analyzing Japanese first requires splitting a sentence into some unit, for example words. Besides word-level segmentation, there is subword segmentation, which splits words further into smaller pieces, and character-level segmentation, which splits the string into individual characters. In this article, the substrings produced by any of these units are called tokens. There are various methods for tokenization. Morphological analyzers, which are widely used for analyzing Japanese text, also perform word segmentation (in addition to word segmentation, morphological analysis estimates lemmas and part-of-speech tags).
There are several algorithms for word segmentation: MeCab builds a lattice using a dictionary and then determines the optimal word sequence, while Kytea determines word boundaries at the character level. These algorithms may return the same segmentation or different ones. Also, even with the same segmentation algorithm, the word units change if the part-of-speech system differs.
- IPADic part-of-speech system
- UniDic part-of-speech system
Subwords are units obtained by splitting words into smaller pieces. Their effectiveness has been confirmed in neural machine translation. Sentencepiece is a well-known subword tokenizer used for analyzing Japanese text.
- Detailed description of MeCab (word segmentation is also covered)
- Explanation of Kytea's word segmentation method
- Comparison of various part-of-speech systems and tokenizers
- Explanation of Sentencepiece
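As a supplement to the Sentencepiece link above, here is a minimal sketch (not from this article) of training and applying a subword model with the sentencepiece Python package. The corpus file "corpus.txt" and the "data/model" prefix are placeholder paths, and the keyword-argument call assumes a recent version of the package.
import sentencepiece as spm

# Train a subword model on a raw-text corpus (one sentence per line).
# "corpus.txt" and "data/model" are placeholder paths, not files from this article.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="data/model",  # writes data/model.model and data/model.vocab
    vocab_size=8000,
)

# Load the trained model and split a sentence into subword tokens.
sp = spm.SentencePieceProcessor(model_file="data/model.model")
print(sp.encode("自然言語処理を勉強しています", out_type=str))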
People doing Japanese text mining probably use MeCab + NEologd most of the time, while research often uses Kytea. Also, Ginza, a dependency parsing library that has been attracting attention recently, uses a morphological analyzer called SudachiPy (a Python implementation of the morphological analyzer Sudachi). In short, a variety of analyzers are used for word-level segmentation, and it is hard to decide which one is best for your purpose.
Furthermore, in recent years, mainly in the context of machine translation, it has been reported that task performance is better when tokens are split into subwords than when the word segmentation produced by a morphological analyzer is used, and subword-based segmentation is often adopted.
Character-level tokenization is characterized by the small number of character types compared to the number of word types. Since the number of word types is generally much larger than the number of character types, tokenizing at the character level reduces the vocabulary size. For example, in named entity recognition there is an approach that adds character-level features to LSTM features, and many recent studies adopt it. (The paper in the link is old, but it is a favorite of mine.)
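As a trivial illustration (plain Python rather than anything konoha-specific), character-level tokenization amounts to splitting the string into individual characters:
sentence = "自然言語処理を勉強しています"

# Character-level tokenization: every character becomes a token,
# so the vocabulary is bounded by the number of distinct characters.
tokens = list(sentence)
print(tokens)
print(len(set(tokens)))  # size of the character vocabulary for this sentence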
Given this situation, in what unit should we tokenize sentences? In general, my understanding is that the answer is "task dependent" and there is no single right answer. As a result, choices like "I use MeCab + NEologd because many people use it" or "the paper I am following uses subwords, so I will use subwords for now" tend to be made.
In modern natural language processing tasks (especially when using neural networks), there are many other things that demand as much attention as tokenization (for example the architecture, hidden layer dimensions, optimizer, learning rate, and so on). Against this background, I think the current situation is that the tokenization method is often fixed at the very beginning of tackling a problem and not revisited. But aren't other tokenization methods worth trying? I believe they are, so I developed a library that makes it easy to switch the tokenization method. This is the motivation behind konoha.
Switching tokenizers usually comes at a small cost. All of the morphological analyzers and tokenizers mentioned above have Python wrappers, and users can call these tools from Python by installing the wrapper libraries.
However, each wrapper library provides an API with its own idioms (which is natural, since the analyzers and the authors of their wrapper libraries differ). Therefore, if you want to switch between the outputs of multiple analyzers depending on the situation, you need to implement your own layer that absorbs the differences between those APIs.
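To make the difference in idioms concrete, here is a rough sketch that calls the MeCab and KyTea wrappers directly (this is not konoha code). It assumes mecab-python3 and Mykytea-python are installed and that KyTea can find a model; the option strings are illustrative.
import MeCab
import Mykytea

sentence = "自然言語処理を勉強しています"

# mecab-python3: parse() returns a single string that you split yourself.
tagger = MeCab.Tagger("-Owakati")
mecab_tokens = tagger.parse(sentence).strip().split()

# Mykytea-python: a different object, method name, and return type.
kytea = Mykytea.Mykytea("")  # KyTea options go here, e.g. "-model /path/to/model.bin"
kytea_tokens = list(kytea.getWS(sentence))

print(mecab_tokens)
print(kytea_tokens)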
There is a library called JapaneseTokenizer (GitHub repository: Kensuke-Mitsuzawa/JapaneseTokenizers). Like konoha, JapaneseTokenizer provides wrappers for multiple tokenizers and an interface that handles multiple morphological analyzers. It also implements many practical functions useful for text analysis, such as filtering analysis results by specific part-of-speech tags. It is a very convenient tool for natural language processing that makes use of the results of multiple morphological analyzers.
konoha
On the other hand, konoha does not currently provide functions such as part-of-speech filtering. konoha is a library that abstracts the tokenization step of each analyzer, and it provides subword-level and character-level segmentation, which JapaneseTokenizer does not cover.
This library is positioned as a wrapper of Python wrappers. Thanks to everyone who provides the Python wrappers for the analyzers, the purpose of this library is to absorb the differences between the interfaces of those libraries. By using konoha, users can work with multiple analyzers through a unified API.
First, here is an example using MeCab. In this example, mecab-ipadic is used as the dictionary.
If you are using macOS, install mecab and mecab-ipadic; if you are using Ubuntu, also install libmecab-dev in addition to those. Operation has not been verified on other distributions (as long as mecab and mecab-config can be run and a dictionary is installed, it should work fine). If you build the Dockerfile in the GitHub repository, the environment will be ready to go.
--Code
from konoha import WordTokenizer
sentence = '自然言語処理を勉強しています'
tokenizer = WordTokenizer('MeCab')
print(tokenizer.tokenize(sentence))
--Output
[自然, 言語, 処理, を, 勉強, し, て, い, ます]
Next, let's use Kytea. Again, you need to build Kytea yourself (please refer to the Dockerfile in the repository).
--Code
from konoha import WordTokenizer
sentence = '自然言語処理を勉強しています'
tokenizer = WordTokenizer('Kytea')
print(tokenizer.tokenize(sentence))
--Output
[自然, 言語, 処理, を, 勉強, し, て, い, ま, す]
Also, if you want to split a sentence into subwords, you can use Sentencepiece. When using Sentencepiece, you need to specify a model file: pass the path to the model file via the model_path argument.
--Code
from konoha import WordTokenizer
sentence = '自然言語処理を勉強しています'
tokenizer = WordTokenizer('Sentencepiece', model_path="data/model.spm")
print(tokenizer.tokenize(sentence))
--Output
[▁, 自然, 言語, 処理, を, 勉強, し, ています]
In this way, multiple analyzers can be used in a unified manner simply by changing the argument passed to WordTokenizer. This makes it easy to try different tokenizers during the experimental phase.
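For example, comparing several analyzers becomes a simple loop over tokenizer names (a minimal sketch, assuming MeCab and Kytea are installed as described above):
from konoha import WordTokenizer

sentence = '自然言語処理を勉強しています'

# The same code path works for every supported analyzer.
for name in ['MeCab', 'Kytea']:
    tokenizer = WordTokenizer(name)
    print(name, tokenizer.tokenize(sentence))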
Some of the supported tokenizers are morphological analyzers. Of the tokenizers currently supported by konoha, the morphological analyzers are MeCab, Kytea, and Sudachi (SudachiPy). For these, whether to also obtain the information produced by the morphological analyzer, such as part-of-speech tags, at tokenization time can be controlled with an option. An example using SudachiPy is shown below.
--Code
from konoha import WordTokenizer
sentence = '自然言語処理を勉強しています'
tokenizer = WordTokenizer('Sudachi', mode='A', with_postag=True)
print(tokenizer.tokenize(sentence))
--Output
[自然 (名詞), 言語 (名詞), 処理 (名詞), を (助詞), 勉強 (名詞), し (動詞), て (助詞), い (動詞), ます (助動詞)]
The output of tokenizer.tokenize is a list of instances of the Token class. The following instance variables are defined in the Token class (an excerpt from the docstring of the Token class); a short example of accessing them follows the docstring. Information that an analyzer does not return is set to None. For example, token.normalized_form has a value other than None only when SudachiPy is used and with_postag is True. (A Token is one element of the token sequence returned by tokenizer.tokenize.)
"""
surface (str)
surface (original form) of a word
postag (str, default: None)
part-of-speech tag of a word (optional)
postag2 (str, default: None)
detailed part-of-speech tag of a word (optional)
postag3 (str, default: None)
detailed part-of-speech tag of a word (optional)
postag4 (str, default: None)
detailed part-of-speech tag of a word (optional)
inflection (str, default: None)
conjugate type of word (optional)
conjugation (str, default: None)
conjugate type of word (optional)
base_form (str, default: None)
base form of a word
yomi (str, default: None)
yomi of a word (optional)
pron (str, default: None)
pronounciation of a word (optional)
normalized_form (str, default: None)
normalized form of a word (optional)
Note that normalized_form is only
available on SudachiPy
"""
If you want to use a user dictionary, pass its path to WordTokenizer via the argument named user_dictionary_path.
from konoha import WordTokenizer
sentence = '自然言語処理を勉強しています'
tokenizer = WordTokenizer('MeCab', user_dictionary_path="path/to/user_dict")
print(tokenizer.tokenize(sentence))
If you want to use mecab-ipadic-NEologd, or a system dictionary that you retrained on your own corpus, you can create a tokenizer that uses a specific system dictionary. Pass the path to the system dictionary you want to use via the system_dictionary_path argument of WordTokenizer.
from konoha import WordTokenizer
sentence = '自然言語処理を勉強しています'
tokenizer = WordTokenizer('MeCab', system_dictionary_path="path/to/system_dict")
print(tokenizer.tokenize(sentence))
In this article, we introduced konoha, a library for using multiple tokenizers through the same interface. With it, you can easily switch analyzers when you are unsure which one to use at the start of a text analysis project. It also helps when you plan to run experiments with MeCab but prior work uses a different analyzer and you would otherwise have to write separate code for comparison: by inserting konoha, you can run the same code with either analyzer without hassle. I hope it is useful both to people doing natural language processing in practice and to those doing it in research. Please give it a try.
To build Kytea on Ubuntu 18.04, this pull request needs to be applied. Please refer to the Dockerfile in the konoha repository.