[PYTHON] [For beginners] Language analysis using the natural language processing tool "GiNZA" (from morphological analysis to vectorization)

Introduction

Recently, I started using the Python language processing library "GiNZA". I had been using MeCab, but I only recently (and somewhat embarrassingly) learned that Python has a library incorporating state-of-the-art machine learning techniques, so I am currently migrating to GiNZA. Since this was my first time using GiNZA, I summarized the processing flow as a memo while consulting various sites. I am a beginner in natural language analysis and there is much I have not covered, so if you want to learn more deeply, please refer to the official documentation. I wrote this article hoping that beginners like me will think, "GiNZA can do this! Let me try it myself!"

About GiNZA

As many people have already written, GiNZA is a natural language processing library built on a model trained through joint research between Megagon Labs (Recruit's AI research institute) and the National Institute for Japanese Language and Linguistics.

Overview of "GiNZA": "GiNZA" is an open-source Japanese natural language processing library featuring one-step installation, fast and accurate analysis, and word-level dependency structure analysis with internationalization support. "GiNZA" uses the natural language processing library "spaCy" (*5), which incorporates the latest machine learning technology, as its framework, and internally incorporates the open-source morphological analyzer "SudachiPy" (*6) for tokenization. The "GiNZA Japanese UD Model" incorporates the results of joint research between Megagon Labs and the National Institute for Japanese Language and Linguistics.

Quoted from https://www.recruit.co.jp/newsroom/2019/0402_18331.html

So the language processing library "spaCy" is used inside GiNZA. As described here, I roughly interpret GiNZA as a library that brings Japanese support to spaCy. In addition, "SudachiPy" is used for morphological analysis. Many people reading this article will want to analyze Japanese text, so it is an attractive library for Python users!

Development environment

GiNZA can be installed with a single pip command. (As of 2020-01-04, the latest version of GiNZA was 2.2.1.)

$ pip install "https://github.com/megagonlabs/ginza/releases/download/latest/ginza-latest.tar.gz"

Update: as of 2020-01-21, the latest version of GiNZA is 3.1.1, and it can now be installed simply with:

$ pip install ginza

Please check the official site for details.
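After installation, the following minimal check confirms that the bundled model loads (a sketch; 'ja_ginza' is the model name that GiNZA registers with spaCy):

import spacy

# Load the Japanese model bundled with GiNZA; raises OSError if not installed
nlp = spacy.load('ja_ginza')
print(nlp.meta['lang'], nlp.meta['name'])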

Morphological analysis

First, perform basic morphological analysis.

Dependency analysis


import spacy

nlp = spacy.load('ja_ginza')
doc = nlp("This year's zodiac is Yang Metal Rat. I'm looking forward to the Tokyo Olympics.")

for sent in doc.sents:
    for token in sent:
        print(token.i, token.orth_, token.lemma_, token.pos_, 
              token.tag_, token.dep_, token.head.i)

Output result

0  this year       this year       NOUN   noun-common-adverbial        nmod      2
1  no              no              ADP    particle-case particle       case      0
2  zodiac          zodiac          NOUN   noun-common-general          nsubj     4
3  wa              wa              ADP    particle-binding particle    case      2
4  Yang Metal Rat  Yang Metal Rat  NOUN   noun-common-general          ROOT      4
5  desu            desu            AUX    auxiliary verb               aux       4
6  .               .               PUNCT  auxiliary symbol-period      punct     4
7  Tokyo           Tokyo           PROPN  noun-proper-place-general    compound  9
8  Olympics        Olympics        NOUN   noun-common-general          compound  9
9  fun             fun             NOUN   noun-common-general          ROOT      9
10 desu            desu            AUX    auxiliary verb               cop       9
11 ne              ne              PART   particle-final particle      aux       9
12 .               .               PUNCT  auxiliary symbol-period      punct     9

The text is cleanly split into morphemes. Each output line shows, from the left, the token index, the surface form, the lemma (base form), the part of speech, the part-of-speech details, the dependency label, and the index of the head token (see the spaCy API for details on Token attributes). GiNZA also supports dependency structure analysis: for every token, the index of its head and the relation to that head are estimated (see the spaCy annotation documentation for details on token.dep_).
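As a small sketch of how these attributes can be used, the following prints each token together with its head, which makes the dependency structure explicit (the sentence is shortened from the example above):

import spacy

nlp = spacy.load('ja_ginza')
doc = nlp("This year's zodiac is Yang Metal Rat.")

# Print each token as: dependent --relation--> head
for token in doc:
    print(f'{token.text} --{token.dep_}--> {token.head.text}')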

GiNZA also allows you to visualize dependencies in a graph. Use displacy for visualization.

Dependency visualization


from spacy import displacy

displacy.serve(doc, style='dep', options={'compact':True})

After execution, Serving on http://0.0.0.0:5000 ... is displayed; open that URL in a browser and the figure will appear.

[Figure: dependency tree rendered by displacy (displacy_tri.png)]

I had only ever used MeCab, so being able to see the structure with a single line of code is great. For more information on visualization, see spaCy's Visualizers documentation.
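If you prefer a file to a local server, displacy.render() returns the markup directly; a minimal sketch that saves the tree as an SVG (the file name is arbitrary):

from pathlib import Path

import spacy
from spacy import displacy

nlp = spacy.load('ja_ginza')
doc = nlp("This year's zodiac is Yang Metal Rat.")

# render() returns the SVG markup as a string instead of starting a server
svg = displacy.render(doc, style='dep', options={'compact': True}, jupyter=False)
Path('dependency.svg').write_text(svg, encoding='utf-8')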

Text vectorization

There are several established methods for estimating word vectors, and GiNZA ships with pretrained word vectors that can be accessed through the Token.vector attribute.

Word vector


doc = nlp('If you give up, the match ends there')
token = doc[4]

print(token)
print(token.vector)
print(token.vector.shape)

Execution result

match
[-1.7299166   1.3438352   0.51212436  0.8338855   0.42193085 -1.4436126
  4.331309   -0.59857213  2.091658    3.1512427  -2.0446565  -0.41324708
 ...
  1.1213776   1.1430703  -1.231743   -2.3723211 ]
(100,)

The word vectors have 100 dimensions. You can also measure the cosine similarity between vectors with the similarity() method.

similarity


word1 = nlp('Rice ball')
word2 = nlp('rice ball')
word3 = nlp('curry')

print(word1.similarity(word2))
#0.8016603151410209
print(word1.similarity(word3))
#0.5304326270109458
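As far as I understand, similarity() is simply the cosine similarity of the two vectors; a sketch that reproduces the value with NumPy (word1 and word3 are the Docs from the snippet above):

import numpy as np

# Cosine similarity: dot product divided by the product of the norms
v1, v3 = word1.vector, word3.vector
print(np.dot(v1, v3) / (np.linalg.norm(v1) * np.linalg.norm(v3)))
# should match word1.similarity(word3)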

Cosine similarity ranges from -1 to 1, and the closer it is to 1, the more similar the words are. As expected, the two expressions for rice ball are closer to each other than to curry. The same procedure also works for documents instead of single words, for both vectorization and cosine similarity. A document's vector appears to be the average of the word vectors of the tokens that compose it.
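A quick check of that claim (this held with the model I used, but treat it as an implementation detail that may change; the sentence is arbitrary):

import numpy as np

doc = nlp('I want to eat curry')

# Compare the Doc vector with the mean of its token vectors
mean_vec = np.mean([token.vector for token in doc], axis=0)
print(np.allclose(doc.vector, mean_vec))  # True if the Doc vector is the average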

Finally, since we can now express words and documents as vectors, let's visualize them in vector space. The vectors are 100-dimensional, so this time we reduce them to two dimensions with principal component analysis (PCA) before plotting.

plot


import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

#text2vector
vec1 = nlp('happy New Year').vector
vec2 = nlp('I bought cabbage yesterday').vector
vec3 = nlp("Let's go see a movie").vector
vec4 = nlp('I want to eat curry').vector
vec5 = nlp('I went to the town to shop').vector
vec6 = nlp('Chocolate I ate yesterday').vector

#pca
vectors = np.vstack((vec1, vec2, vec3, vec4, vec5, vec6))
pca = PCA(n_components=2)
trans = pca.fit_transform(vectors)
pc_ratio = pca.explained_variance_ratio_

#plot
plt.figure()
plt.scatter(trans[:,0], trans[:,1])

# Label each point with its text number
for i, txt in enumerate(['text1','text2','text3','text4','text5','text6']):
    plt.text(trans[i,0]-0.2, trans[i,1]+0.1, txt)

plt.hlines(0, min(trans[:,0]), max(trans[:,0]), linestyle='dashed', linewidth=1)
plt.vlines(0, min(trans[:,1]), max(trans[:,1]), linestyle='dashed', linewidth=1)
plt.xlabel('PC1 ('+str(round(pc_ratio[0]*100,2))+'%)')
plt.ylabel('PC2 ('+str(round(pc_ratio[1]*100,2))+'%)')
plt.tight_layout()
plt.show()

Execution result

[Figure: 2-D PCA scatter plot of the six sentence vectors (PCA.png)]

Some information is lost in the projection, but I think this makes things easier to grasp when dealing with a large amount of data. Looking at the figure, text3 and text5 appear close together, as do text4 and text6. That matches my intuition.

Finally

Although I am a beginner in natural language processing, GiNZA made it easy to do everything from morphological analysis to vectorization. I recommend it to anyone who wants to get started with language processing. I would appreciate it if you could point out any mistakes or unclear expressions.
