[PYTHON] I tried to display the analysis result of the natural language processing library GiNZA in an easy-to-understand manner

Introduction

This article is intended to help people who have never touched spaCy or GiNZA understand what kind of analysis results they produce.

What is spaCy/GiNZA?

GiNZA is an open-source Japanese NLP library based on Universal Dependencies (UD). It is built on spaCy, a commercial-grade natural language processing framework, and is released under the MIT license.

If you have Python installed, it's easy to install.

$ pip install -U ginza
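
Note: in recent GiNZA releases (v5 and later, to my understanding) the analysis model is packaged separately, so you may also need to install it explicitly:

$ pip install -U ginza ja_ginza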

First, try running it as-is

Installation also provides a ginza command, so you can analyze text right away.

$ ginza
銀座でランチをご一緒しましょう。今度の日曜日はどうですか。
# text = 銀座でランチをご一緒しましょう。
1  銀座  銀座  PROPN  名詞-固有名詞-地名-一般  _  6  obl  _  SpaceAfter=No|BunsetuBILabel=B|BunsetuPositionType=SEM_HEAD|NP_B|Reading=ギンザ|NE=B-GPE|ENE=B-City
2  で  で  ADP  助詞-格助詞  _  1  case  _  SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Reading=デ
3  ランチ  ランチ  NOUN  名詞-普通名詞-一般  _  6  obj  _  SpaceAfter=No|BunsetuBILabel=B|BunsetuPositionType=SEM_HEAD|NP_B|Reading=ランチ
4  を  を  ADP  助詞-格助詞  _  3  case  _  SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Reading=ヲ
5  ご  ご  NOUN  接頭辞  _  6  compound  _  SpaceAfter=No|BunsetuBILabel=B|BunsetuPositionType=CONT|Reading=ゴ
6  一緒  一緒  VERB  名詞-普通名詞-サ変可能  _  0  root  _  SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=ROOT|Reading=イッショ
7  し  する  AUX  動詞-非自立可能  _  6  advcl  _  SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Inf=サ行変格,連用形-一般|Reading=シ
8  ましょう  ます  AUX  助動詞  _  6  aux  _  SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Inf=助動詞-マス,意志推量形|Reading=マショウ
9  。  。  PUNCT  補助記号-句点  _  6  punct  _  SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=CONT|Reading=。

# text = 今度の日曜日はどうですか。
1  今度  今度  NOUN  名詞-普通名詞-副詞可能  _  3  nmod  _  SpaceAfter=No|BunsetuBILabel=B|BunsetuPositionType=SEM_HEAD|NP_I|Reading=コンド
2  の  の  ADP  助詞-格助詞  _  1  case  _  SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Reading=ノ
3  日曜日  日曜日  NOUN  名詞-普通名詞-副詞可能  _  5  nsubj  _  SpaceAfter=No|BunsetuBILabel=B|BunsetuPositionType=SEM_HEAD|NP_I|Reading=ニチヨウビ|NE=B-DATE|ENE=B-Day_Of_Week
4  は  は  ADP  助詞-係助詞  _  3  case  _  SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Reading=ハ
5  どう  どう  ADV  副詞  _  0  root  _  SpaceAfter=No|BunsetuBILabel=B|BunsetuPositionType=ROOT|Reading=ドウ
6  です  です  AUX  助動詞  _  5  aux  _  SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Inf=助動詞-デス,終止形-一般|Reading=デス
7  か  か  PART  助詞-終助詞  _  5  mark  _  SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Reading=カ
8  。  。  PUNCT  補助記号-句点  _  5  punct  _  SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=CONT|Reading=。

The analysis ran without problems, but the raw output is hard to read on the console.
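
Incidentally, the same analysis can also be run from Python. Here is a minimal sketch of mine (the printed columns are my own selection, not the exact CoNLL-U format the ginza command emits):

import spacy

nlp = spacy.load("ja_ginza")
doc = nlp("銀座でランチをご一緒しましょう。今度の日曜日はどうですか。")
for sent in doc.sents:
    for token in sent:
        # token index, surface form, lemma, UPOS, detailed POS, dependency label, head index
        print(token.i, token.orth_, token.lemma_, token.pos_, token.tag_, token.dep_, token.head.i)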

Now let's display it in an easy-to-understand manner

This time, I used spaCy's visualizer (displaCy) and Streamlit to make the syntax dependencies and tables easier to read. To draw the dependencies, a create_manual() helper converts each sentence into displaCy's "manual" format, replacing UD terms such as PROPN, ADP, obl, and advcl with Japanese labels; the resulting SVG is then drawn with st.image().

import spacy
import streamlit as st

# create_manual() is defined in the full source linked below; it converts a
# sentence into displaCy's "manual" format with UD labels translated to Japanese.
input_list = st.text_area("Input string").splitlines()
nlp = spacy.load('ja_ginza')
for input_str in input_list:
    doc = nlp(input_str)
    for sent in doc.sents:
        svg = spacy.displacy.render(create_manual(sent), style="dep", manual=True)
        st.image(svg)
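
The actual create_manual() lives in the repository linked below; the following is just a minimal sketch of what such a helper could look like. The POS_JA and DEP_JA lookup tables here are my own abbreviated assumptions, not the repository's:

POS_JA = {"PROPN": "固有名詞", "ADP": "助詞", "NOUN": "名詞", "VERB": "動詞",
          "AUX": "助動詞", "ADV": "副詞", "PART": "助詞", "PUNCT": "句読点"}
DEP_JA = {"obl": "斜格要素", "case": "格表示", "obj": "目的語", "compound": "複合",
          "advcl": "副詞節修飾語", "aux": "助動詞", "punct": "句読点",
          "nmod": "名詞修飾語", "nsubj": "主語", "mark": "標識"}

def create_manual(sent):
    # displaCy's "manual" format is a dict of "words" and "arcs"
    words = [{"text": t.orth_, "tag": POS_JA.get(t.pos_, t.pos_)} for t in sent]
    arcs = []
    for t in sent:
        if t.dep_.lower() == "root":
            continue  # the root token has no incoming arc
        start, end = sorted((t.i, t.head.i))
        arcs.append({
            "start": start - sent.start,  # indices relative to the sentence
            "end": end - sent.start,
            "label": DEP_JA.get(t.dep_, t.dep_),
            "dir": "left" if t.head.i > t.i else "right",
        })
    return {"words": words, "arcs": arcs}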

The table is drawn with streamlit.table(), and the named entities are drawn with streamlit.components.v1.html(). The full source code is here: https://github.com/chai3/ginza-streamlit
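
For the named entities, one way to do this (a sketch using displaCy's standard entity renderer, which may differ from the repository's exact code) is:

import spacy
import streamlit.components.v1 as components

nlp = spacy.load("ja_ginza")
doc = nlp("銀座でランチをご一緒しましょう。")
# displaCy renders entity highlighting as HTML, which Streamlit can embed
html = spacy.displacy.render(doc, style="ent")
components.html(html, scrolling=True)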

The app in action looks like this.

(Animated demo: ginza-streamlit.gif)

The input and analysis results are as follows.

Input string

銀座でランチをご一緒しましょう。今度の日曜日はどうですか。吾輩は猫である。名前はまだ無い。
(Let's have lunch together in Ginza. How about next Sunday? I am a cat. There is no name yet.)

1-1. 銀座でランチをご一緒しましょう。 (Let's have lunch together in Ginza.)

Syntax dependency

(Image: dependency tree for sentence 1-1)

Details

i(index) 0 / 1 / 2 / 3 / 4 / 5 / 6 / 7 / 8
orth(text) 銀座 / で / ランチ / を / ご / 一緒 / し / ましょう / 。
lemma(base form) 銀座 / で / ランチ / を / ご / 一緒 / する / ます / 。
reading_form(reading kana) ギンザ / デ / ランチ / ヲ / ゴ / イッショ / シ / マショウ / 。
pos(UD part of speech) PROPN / ADP / NOUN / ADP / NOUN / VERB / AUX / AUX / PUNCT
pos(part of speech) proper noun / adposition / noun / adposition / noun / verb / auxiliary / auxiliary / punctuation
tag(detailed POS) 名詞-固有名詞-地名-一般 / 助詞-格助詞 / 名詞-普通名詞-一般 / 助詞-格助詞 / 接頭辞 / 名詞-普通名詞-サ変可能 / 動詞-非自立可能 / 助動詞 / 補助記号-句点
inflection(inflection info) - / - / - / - / - / - / サ行変格,連用形-一般 / 助動詞-マス,意志推量形 / -
ent_type(entity type) City / - / - / - / - / - / - / - / -
ent_iob(entity IOB) B / O / O / O / O / O / O / O / O
lang(language) ja / ja / ja / ja / ja / ja / ja / ja / ja
dep(dependency) obl / case / obj / case / compound / ROOT / advcl / aux / punct
dep(syntactic dependency) oblique nominal / case marker / object / case marker / compound / ROOT / adverbial clause modifier / auxiliary / punctuation
head.i(head index) 5 / 0 / 5 / 2 / 5 / 5 / 5 / 5 / 5
bunsetu_bi_label B / I / B / I / B / I / I / I / I
bunsetu_position_type SEM_HEAD / SYN_HEAD / SEM_HEAD / SYN_HEAD / CONT / ROOT / SYN_HEAD / SYN_HEAD / CONT
is_bunsetu_head TRUE / FALSE / TRUE / FALSE / FALSE / TRUE / FALSE / FALSE / FALSE
ent_label_ontonotes B-GPE / O / O / O / O / O / O / O / O
ent_label_ene B-City / O / O / O / O / O / O / O / O
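
Most of these rows are standard spaCy token attributes; the reading and inflection columns come from helper functions in the ginza module. A minimal sketch (assuming GiNZA v4 or later, where these helpers are available):

import spacy
import ginza

nlp = spacy.load("ja_ginza")
doc = nlp("銀座でランチをご一緒しましょう。")
for token in doc:
    print(
        token.i, token.orth_, token.lemma_,
        ginza.reading_form(token),   # reading in katakana
        token.pos_, token.tag_,
        ginza.inflection(token),     # inflection info (empty for uninflected tokens)
        token.ent_type_, token.ent_iob_, token.lang_,
        token.dep_, token.head.i,
    )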

Bunsetsu boundaries

銀座で / ランチを / ご一緒しましょう。

Bunsetsu head phrases

銀座 (NP) / ランチ (NP) / 一緒 (VP)
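
These splits come from GiNZA's bunsetsu APIs. A minimal sketch (again assuming GiNZA v4 or later):

import spacy
import ginza

nlp = spacy.load("ja_ginza")
doc = nlp("銀座でランチをご一緒しましょう。")
print(ginza.bunsetu_spans(doc))         # bunsetsu spans: 銀座で / ランチを / ご一緒しましょう。
print(ginza.bunsetu_phrase_spans(doc))  # head phrase of each bunsetsu: 銀座 / ランチ / 一緒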

Named entities

(Image: named entities in sentence 1-1)

1-2. 今度の日曜日はどうですか。 (How about next Sunday?)

Syntax dependency

(Image: dependency tree for sentence 1-2)

Details

i(index) 9 / 10 / 11 / 12 / 13 / 14 / 15 / 16
orth(text) 今度 / の / 日曜日 / は / どう / です / か / 。
lemma(base form) 今度 / の / 日曜日 / は / どう / です / か / 。
reading_form(reading kana) コンド / ノ / ニチヨウビ / ハ / ドウ / デス / カ / 。
pos(UD part of speech) NOUN / ADP / NOUN / ADP / ADV / AUX / PART / PUNCT
pos(part of speech) noun / adposition / noun / adposition / adverb / auxiliary / particle / punctuation
tag(detailed POS) 名詞-普通名詞-副詞可能 / 助詞-格助詞 / 名詞-普通名詞-副詞可能 / 助詞-係助詞 / 副詞 / 助動詞 / 助詞-終助詞 / 補助記号-句点
inflection(inflection info) - / - / - / - / - / 助動詞-デス,終止形-一般 / - / -
ent_type(entity type) - / - / Day_Of_Week / - / - / - / - / -
ent_iob(entity IOB) O / O / B / O / O / O / O / O
lang(language) ja / ja / ja / ja / ja / ja / ja / ja
dep(dependency) nmod / case / nsubj / case / ROOT / aux / mark / punct
dep(syntactic dependency) nominal modifier / case marker / nominal subject / case marker / ROOT / auxiliary / marker / punctuation
head.i(head index) 11 / 9 / 13 / 11 / 13 / 13 / 13 / 13
bunsetu_bi_label B / I / B / I / B / I / I / I
bunsetu_position_type SEM_HEAD / SYN_HEAD / SEM_HEAD / SYN_HEAD / ROOT / SYN_HEAD / SYN_HEAD / CONT
is_bunsetu_head TRUE / FALSE / TRUE / FALSE / TRUE / FALSE / FALSE / FALSE
ent_label_ontonotes O / O / B-DATE / O / O / O / O / O
ent_label_ene O / O / B-Day_Of_Week / O / O / O / O / O

Bunsetsu boundaries

今度の / 日曜日は / どうですか。

Bunsetsu head phrases

今度 (NP) / 日曜日 (NP) / どう (ADVP)

Named entities

(Image: named entities in sentence 1-2)


Did the visualization make the syntactic dependencies easier to understand? I hope this has sparked your interest in GiNZA.
