I recently started using the Japanese language processing library "GiNZA" for Python. I had been using MeCab, but only recently (somewhat embarrassingly) learned that Python has a library incorporating state-of-the-art machine learning techniques, so I am currently migrating to GiNZA. Since this was my first time using GiNZA, I summarized the processing flow as a memo while referring to various sites. I am a beginner in natural language analysis and there is much I have not covered, so if you want to learn more deeply, please refer to the official documentation. I wrote this article hoping that beginners like me will think, "GiNZA can do this! I'll try it myself!"
As many people have already written, GiNZA is a natural language processing library built on a model trained through joint research between Megagon Labs (Recruit's AI research institute) and the National Institute for Japanese Language and Linguistics.
Overview of "GiNZA" "GiNZA" is a Japanese natural language processing open source library with features such as one-step introduction, high-speed and high-precision analysis processing, and internationalization support for word-dependent structure analysis level. "GiNZA" uses the natural language processing library "spaCy" (* 5) that incorporates the latest machine learning technology as a framework, and also has an open source morphological analyzer "SudachiPy" (* 6) inside. It is incorporated in and used for tokenization processing. The "GiNZA Japanese UD Model" incorporates the results of joint research between Megagon Labs and the National Institute for Japanese Language and Language.
Quoted from https://www.recruit.co.jp/newsroom/2019/0402_18331.html
So the language processing library "spaCy" is used inside GiNZA. As described here, I roughly interpret GiNZA as the library that brought Japanese support to spaCy. In addition, "SudachiPy" is used for morphological analysis. Since many people reading this article will want to parse Japanese, it is an attractive library for Python users!
(As of January 4, 2020, the latest version of GiNZA was 2.2.1.)
GiNZA can be installed with a single pip command:
$ pip install "https://github.com/megagonlabs/ginza/releases/download/latest/ginza-latest.tar.gz"
More recent versions can also be installed simply with:
$ pip install ginza
(As of January 21, 2020, the latest version of GiNZA is 3.1.1.) Please check the official site for details.
First, perform basic morphological analysis.
Dependency analysis
import spacy

nlp = spacy.load('ja_ginza')
# "This year's zodiac sign is kanoe-ne (Metal Rat). I'm looking forward to the Tokyo Olympics."
doc = nlp('今年の干支は庚子です。東京オリンピック楽しみだなあ。')

for sent in doc.sents:
    for token in sent:
        print(token.i, token.orth_, token.lemma_, token.pos_,
              token.tag_, token.dep_, token.head.i)
Output result
0 今年 今年 NOUN 名詞-普通名詞-副詞可能 nmod 2
1 の の ADP 助詞-格助詞 case 0
2 干支 干支 NOUN 名詞-普通名詞-一般 nsubj 4
3 は は ADP 助詞-係助詞 case 2
4 庚子 庚子 NOUN 名詞-普通名詞-一般 ROOT 4
5 です です AUX 助動詞 aux 4
6 。 。 PUNCT 補助記号-句点 punct 4
7 東京 東京 PROPN 名詞-固有名詞-地名-一般 compound 9
8 オリンピック オリンピック NOUN 名詞-普通名詞-一般 compound 9
9 楽しみ 楽しみ NOUN 名詞-普通名詞-一般 ROOT 9
10 だ だ AUX 助動詞 cop 9
11 なあ なあ PART 助詞-終助詞 aux 9
12 。 。 PUNCT 補助記号-句点 punct 9
The text is nicely divided into morphemes. Each output line shows, from the left, the token index, the surface form, the lemma (base form), the Universal POS tag, the detailed part-of-speech tag, the dependency label, and the index of the head token (see the spaCy API documentation for details on Token attributes). GiNZA also supports dependency structure analysis, estimating the head token that each word depends on and the relation to that head (see the spaCy annotation documentation for details on token.dep_).
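As a side note (this snippet is my own sketch, not part of the original article), the same attributes can be used to walk the dependency tree directly, for example to list each token together with its relation to its head, using the doc parsed above:
# A minimal sketch using standard spaCy attributes
# (token.dep_, token.head, token.orth_, Span.root).
for sent in doc.sents:
    root = sent.root  # the ROOT token of this sentence
    print('ROOT:', root.orth_)
    for token in sent:
        if token is not root:
            # dependent --relation--> head
            print(f'{token.orth_} --{token.dep_}--> {token.head.orth_}')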
GiNZA also allows you to visualize dependencies as a graph. displacy is used for the visualization.
Dependency visualization
from spacy import displacy

# Starts a local web server that renders the dependency tree
displacy.serve(doc, style='dep', options={'compact': True})
After execution, Serving on http://0.0.0.0:5000 ... is displayed; when you open that address in a browser, the dependency graph is shown.
Having only used MeCab before, I find it great to be able to see the structure with a single line of code. For more information on visualization techniques, see spaCy Visualizers.
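If you would rather save the figure than start a server, displacy.render returns the SVG markup as a string, which you can write to a file yourself. Below is a small sketch of that approach (the file name dependency_tree.svg is just an example of mine):
from pathlib import Path
from spacy import displacy

# render() returns the markup as a string instead of starting a server
svg = displacy.render(doc, style='dep', options={'compact': True}, jupyter=False)
Path('dependency_tree.svg').write_text(svg, encoding='utf-8')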
Several methods have been proposed for estimating word vectors, and GiNZA already comes with pretrained word vectors, which can be referenced through the Token.vector attribute.
Word vector
# "If you give up, that's the end of the match."
doc = nlp('諦めたらそこで試合終了だよ')
token = doc[4]
print(token)
print(token.vector)
print(token.vector.shape)
Execution result
試合
[-1.7299166 1.3438352 0.51212436 0.8338855 0.42193085 -1.4436126
4.331309 -0.59857213 2.091658 3.1512427 -2.0446565 -0.41324708
...
1.1213776 1.1430703 -1.231743 -2.3723211 ]
(100,)
The number of dimensions of the word vector is 100.
You can also measure the cosine similarity between word vectors using the similarity() method.
similarity
word1 = nlp('おにぎり')  # onigiri (rice ball)
word2 = nlp('おむすび')  # omusubi (another word for rice ball)
word3 = nlp('カレー')    # curry
print(word1.similarity(word2))
#0.8016603151410209
print(word1.similarity(word3))
#0.5304326270109458
Cosine similarity ranges from -1 to 1, and the closer it is to 1, the more similar the words are. Indeed, おにぎり is closer to おむすび than to カレー. Vectorization and cosine similarity can also be computed in exactly the same way for documents instead of single words. A document's vector seems to be the average of the word vectors of the tokens that make up the sentence.
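To sanity-check this, you can compare a Doc's vector with the average of its token vectors and recompute the cosine similarity by hand with NumPy. The following is my own verification sketch (the example texts are arbitrary), not code from the original article:
import numpy as np

doc1 = nlp('おにぎり')          # arbitrary example word
doc2 = nlp('カレーが食べたい')  # "I want to eat curry" (arbitrary example sentence)

# The Doc vector should match the mean of its token vectors
mean_vec = np.mean([t.vector for t in doc2], axis=0)
print(np.allclose(doc2.vector, mean_vec))

# Cosine similarity computed manually, for comparison with similarity()
def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(doc1.vector, doc2.vector))
print(doc1.similarity(doc2))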
Finally, since we can now express words and documents as vectors, let's plot them in vector space. The vectors are 100-dimensional, so this time we reduce them to two dimensions with principal component analysis (PCA) before plotting.
plot
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# text2vector
vec1 = nlp('あけましておめでとう').vector      # "Happy New Year"
vec2 = nlp('昨日キャベツを買った').vector      # "I bought cabbage yesterday"
vec3 = nlp('映画を見に行こう').vector          # "Let's go see a movie"
vec4 = nlp('カレーが食べたい').vector          # "I want to eat curry"
vec5 = nlp('買い物しに街へ出かけた').vector    # "I went into town to shop"
vec6 = nlp('昨日食べたチョコレート').vector    # "The chocolate I ate yesterday"

# pca
vectors = np.vstack((vec1, vec2, vec3, vec4, vec5, vec6))
pca = PCA(n_components=2)
trans = pca.fit_transform(vectors)
pc_ratio = pca.explained_variance_ratio_

# plot
plt.figure()
plt.scatter(trans[:, 0], trans[:, 1])
for i, txt in enumerate(['text1', 'text2', 'text3', 'text4', 'text5', 'text6']):
    plt.text(trans[i, 0] - 0.2, trans[i, 1] + 0.1, txt)
plt.hlines(0, min(trans[:, 0]), max(trans[:, 0]), linestyle='dashed', linewidth=1)
plt.vlines(0, min(trans[:, 1]), max(trans[:, 1]), linestyle='dashed', linewidth=1)
plt.xlabel('PC1 (' + str(round(pc_ratio[0] * 100, 2)) + '%)')
plt.ylabel('PC2 (' + str(round(pc_ratio[1] * 100, 2)) + '%)')
plt.tight_layout()
plt.show()
Execution result
Some information is lost in the projection, but I think it makes things easier to grasp when dealing with a large number of texts. Looking at this figure, text3 and text5 appear to be close to each other, as do text4 and text6, which matches my intuition.
Although I am a beginner in natural language processing, I was able to easily run everything from morphological analysis to vectorization using GiNZA. I recommend it to anyone who wants to get started with language processing. I would appreciate it if you could point out any mistakes or awkward expressions.