This article is aimed at people who have never touched spaCy or GiNZA, to show what kind of analysis results they produce.
GiNZA is an open-source Japanese NLP library based on Universal Dependencies (UD). It is built on spaCy, a commercial-grade natural language processing framework released under the MIT license.
If you have Python installed, installation is a single command:

```shell
$ pip install -U ginza
```

This also installs the ginza command, so you can start analyzing right away.
Type a sentence and it is analyzed into CoNLL-U format. Here the input is 銀座でランチをご一緒しましょう。今度の日曜日はどうですか。 ("Let's have lunch together in Ginza. How about next Sunday?"):

```
$ ginza
銀座でランチをご一緒しましょう。今度の日曜日はどうですか。
# text = 銀座でランチをご一緒しましょう。
1	銀座	銀座	PROPN	名詞-固有名詞-地名-一般	_	6	obl	_	SpaceAfter=No|BunsetuBILabel=B|BunsetuPositionType=SEM_HEAD|NP_B|Reading=ギンザ|NE=B-GPE|ENE=B-City
2	で	で	ADP	助詞-格助詞	_	1	case	_	SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Reading=デ
3	ランチ	ランチ	NOUN	名詞-普通名詞-一般	_	6	obj	_	SpaceAfter=No|BunsetuBILabel=B|BunsetuPositionType=SEM_HEAD|NP_B|Reading=ランチ
4	を	を	ADP	助詞-格助詞	_	3	case	_	SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Reading=ヲ
5	ご	ご	NOUN	接頭辞	_	6	compound	_	SpaceAfter=No|BunsetuBILabel=B|BunsetuPositionType=CONT|Reading=ゴ
6	一緒	一緒	VERB	名詞-普通名詞-サ変可能	_	0	root	_	SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=ROOT|Reading=イッショ
7	し	する	AUX	動詞-非自立可能	_	6	advcl	_	SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Inf=サ行変格,連用形-一般|Reading=シ
8	ましょう	ます	AUX	助動詞	_	6	aux	_	SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Inf=助動詞-マス,意志推量形|Reading=マショウ
9	。	。	PUNCT	補助記号-句点	_	6	punct	_	SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=CONT|Reading=。

# text = 今度の日曜日はどうですか。
1	今度	今度	NOUN	名詞-普通名詞-副詞可能	_	3	nmod	_	SpaceAfter=No|BunsetuBILabel=B|BunsetuPositionType=SEM_HEAD|NP_I|Reading=コンド
2	の	の	ADP	助詞-格助詞	_	1	case	_	SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Reading=ノ
3	日曜日	日曜日	NOUN	名詞-普通名詞-副詞可能	_	5	nsubj	_	SpaceAfter=No|BunsetuBILabel=B|BunsetuPositionType=SEM_HEAD|NP_I|Reading=ニチヨウビ|NE=B-DATE|ENE=B-Day_Of_Week
4	は	は	ADP	助詞-係助詞	_	3	case	_	SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Reading=ハ
5	どう	どう	ADV	副詞	_	0	root	_	SpaceAfter=No|BunsetuBILabel=B|BunsetuPositionType=ROOT|Reading=ドウ
6	です	です	AUX	助動詞	_	5	aux	_	SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Inf=助動詞-デス,終止形-一般|Reading=デス
7	か	か	PART	助詞-終助詞	_	5	mark	_	SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=SYN_HEAD|Reading=カ
8	。	。	PUNCT	補助記号-句点	_	5	punct	_	SpaceAfter=No|BunsetuBILabel=I|BunsetuPositionType=CONT|Reading=。
```
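Each token line of this output is tab-separated CoNLL-U; the tenth (MISC) column packs GiNZA-specific attributes such as Reading and BunsetuBILabel as |-separated key=value pairs, plus bare flags like NP_B. A minimal parser sketch for that column (the helper name is mine, not part of GiNZA):

```python
def parse_misc(misc):
    """Parse a CoNLL-U MISC column like 'SpaceAfter=No|Reading=ギンザ|NP_B'."""
    attrs = {}
    for item in misc.split("|"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key] = value
        else:
            attrs[item] = True  # bare flags such as NP_B
    return attrs
```

This makes it easy to pull out, say, the katakana reading of each token programmatically.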
The analysis worked, but the raw output is hard to read on the console.
So this time I visualized it with spaCy's visualizer and Streamlit, to make the syntax dependencies and token tables easier to see.
To draw the dependency tree, an SVG is generated via create_manual(), which replaces UD terms such as PROPN, ADP, obl, and advcl with Japanese, and the result is drawn with st.image().
```python
import spacy
import streamlit as st

nlp = spacy.load('ja_ginza')

input_list = st.text_area("Input string").splitlines()
for input_str in input_list:
    doc = nlp(input_str)
    for sent in doc.sents:
        # create_manual() converts the sentence for displacy's manual mode;
        # it is defined in the repository linked below
        svg = spacy.displacy.render(create_manual(sent), style="dep", manual=True)
        st.image(svg)
```
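The actual create_manual() lives in the linked repository; below is my own sketch of what such a helper might look like, assuming the goal is to produce the {"words": ..., "arcs": ...} dict that displacy's manual mode expects, with UD labels mapped to Japanese. The mapping tables here are illustrative, not the repository's:

```python
# Illustrative UD-to-Japanese label mappings (not the repository's tables)
POS_JA = {"PROPN": "固有名詞", "ADP": "接置詞", "NOUN": "名詞",
          "VERB": "動詞", "AUX": "助動詞", "PUNCT": "句読点"}
DEP_JA = {"obl": "斜格要素", "case": "格表示", "obj": "目的語",
          "compound": "複合語", "advcl": "副詞的修飾節",
          "aux": "助動詞", "punct": "句読点"}

def create_manual(sent):
    """Build displacy manual-render input from a sequence of tokens.

    Each token needs .text, .pos_, .dep_, .i and .head (spaCy's Token API).
    """
    offset = sent[0].i  # token indices are document-wide in spaCy
    words = [{"text": t.text, "tag": POS_JA.get(t.pos_, t.pos_)} for t in sent]
    arcs = []
    for t in sent:
        if t.dep_.lower() == "root":
            continue  # the root token has no incoming arc
        start, end = sorted((t.i - offset, t.head.i - offset))
        arcs.append({
            "start": start,
            "end": end,
            "label": DEP_JA.get(t.dep_, t.dep_),
            # displacy draws the arrowhead at the dependent's end
            "dir": "left" if t.i < t.head.i else "right",
        })
    return {"words": words, "arcs": arcs}
```

Passing this dict to spacy.displacy.render(..., manual=True) skips spaCy's own extraction and draws exactly the words and arcs you supply.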
The table is drawn with streamlit.table(), and the named entities are rendered with streamlit.components.v1.html().
The full source code is available at https://github.com/chai3/ginza-streamlit.
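The per-token table could be assembled with a small helper like the one below (token_table_rows is my name, not from the repository); it collects the same attributes shown in the tables that follow:

```python
def token_table_rows(tokens):
    """Collect per-token attributes into rows suitable for st.table()
    or pandas.DataFrame. Tokens follow spaCy's Token API."""
    return [
        {
            "i": t.i,
            "orth": t.text,
            "lemma": t.lemma_,
            "pos": t.pos_,
            "tag": t.tag_,
            "dep": t.dep_,
            "head.i": t.head.i,
        }
        for t in tokens
    ]

# In the Streamlit app this would be used roughly as:
#   st.table(pd.DataFrame(token_table_rows(doc)))
```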
The result looks like the following. The input text was:

銀座でランチをご一緒しましょう。今度の日曜日はどうですか。吾輩は猫である。名前はまだ無い。
("Let's have lunch together in Ginza. How about next Sunday? I am a cat. There is no name yet.")
| i (index) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| orth (text) | 銀座 | で | ランチ | を | ご | 一緒 | し | ましょう | 。 |
| lemma (base form) | 銀座 | で | ランチ | を | ご | 一緒 | する | ます | 。 |
| reading_form (reading kana) | ギンザ | デ | ランチ | ヲ | ゴ | イッショ | シ | マショウ | 。 |
| pos (UD part of speech) | PROPN | ADP | NOUN | ADP | NOUN | VERB | AUX | AUX | PUNCT |
| pos (glossed) | proper noun | adposition | noun | adposition | noun | verb | auxiliary | auxiliary | punctuation |
| tag (detailed POS) | 名詞-固有名詞-地名-一般 | 助詞-格助詞 | 名詞-普通名詞-一般 | 助詞-格助詞 | 接頭辞 | 名詞-普通名詞-サ変可能 | 動詞-非自立可能 | 助動詞 | 補助記号-句点 |
| inflection (conjugation info) | - | - | - | - | - | - | サ行変格,連用形-一般 | 助動詞-マス,意志推量形 | - |
| ent_type (entity type) | City | - | - | - | - | - | - | - | - |
| ent_iob (entity IOB) | B | O | O | O | O | O | O | O | O |
| lang (language) | ja | ja | ja | ja | ja | ja | ja | ja | ja |
| dep (UD dependency) | obl | case | obj | case | compound | ROOT | advcl | aux | punct |
| dep (glossed) | oblique nominal | case marker | object | case marker | compound | ROOT | adverbial clause | auxiliary | punctuation |
| head.i (head index) | 5 | 0 | 5 | 2 | 5 | 5 | 5 | 5 | 5 |
| bunsetu_bi_label | B | I | B | I | B | I | I | I | I |
| bunsetu_position_type | SEM_HEAD | SYN_HEAD | SEM_HEAD | SYN_HEAD | CONT | ROOT | SYN_HEAD | SYN_HEAD | CONT |
| is_bunsetu_head | TRUE | FALSE | TRUE | FALSE | FALSE | TRUE | FALSE | FALSE | FALSE |
| ent_label_ontonotes | B-GPE | O | O | O | O | O | O | O | O |
| ent_label_ene | B-City | O | O | O | O | O | O | O | O |
Bunsetsu segmentation: 銀座で / ランチを / ご一緒しましょう。
Bunsetsu phrase heads: 銀座 (NP) / ランチ (NP) / 一緒 (VP)
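The bunsetu_bi_label row in the table is enough to reconstruct this segmentation: B starts a new bunsetsu and I continues the current one. (GiNZA also exposes this directly via its bunsetu_spans() API; the helper below only illustrates the B/I grouping.)

```python
def group_bunsetu(texts, bi_labels):
    """Group token texts into bunsetsu chunks using B/I labels."""
    chunks = []
    for text, label in zip(texts, bi_labels):
        if label == "B" or not chunks:
            chunks.append(text)   # start a new bunsetsu
        else:
            chunks[-1] += text    # extend the current one
    return chunks

tokens = ["銀座", "で", "ランチ", "を", "ご", "一緒", "し", "ましょう", "。"]
labels = ["B", "I", "B", "I", "B", "I", "I", "I", "I"]
print(" / ".join(group_bunsetu(tokens, labels)))
# → 銀座で / ランチを / ご一緒しましょう。
```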
| i (index) | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|
| orth (text) | 今度 | の | 日曜日 | は | どう | です | か | 。 |
| lemma (base form) | 今度 | の | 日曜日 | は | どう | です | か | 。 |
| reading_form (reading kana) | コンド | ノ | ニチヨウビ | ハ | ドウ | デス | カ | 。 |
| pos (UD part of speech) | NOUN | ADP | NOUN | ADP | ADV | AUX | PART | PUNCT |
| pos (glossed) | noun | adposition | noun | adposition | adverb | auxiliary | particle | punctuation |
| tag (detailed POS) | 名詞-普通名詞-副詞可能 | 助詞-格助詞 | 名詞-普通名詞-副詞可能 | 助詞-係助詞 | 副詞 | 助動詞 | 助詞-終助詞 | 補助記号-句点 |
| inflection (conjugation info) | - | - | - | - | - | 助動詞-デス,終止形-一般 | - | - |
| ent_type (entity type) | - | - | Day_Of_Week | - | - | - | - | - |
| ent_iob (entity IOB) | O | O | B | O | O | O | O | O |
| lang (language) | ja | ja | ja | ja | ja | ja | ja | ja |
| dep (UD dependency) | nmod | case | nsubj | case | ROOT | aux | mark | punct |
| dep (glossed) | nominal modifier | case marker | nominal subject | case marker | ROOT | auxiliary | marker | punctuation |
| head.i (head index) | 11 | 9 | 13 | 11 | 13 | 13 | 13 | 13 |
| bunsetu_bi_label | B | I | B | I | B | I | I | I |
| bunsetu_position_type | SEM_HEAD | SYN_HEAD | SEM_HEAD | SYN_HEAD | ROOT | SYN_HEAD | SYN_HEAD | CONT |
| is_bunsetu_head | TRUE | FALSE | TRUE | FALSE | TRUE | FALSE | FALSE | FALSE |
| ent_label_ontonotes | O | O | B-DATE | O | O | O | O | O |
| ent_label_ene | O | O | B-Day_Of_Week | O | O | O | O | O |
Bunsetsu segmentation: 今度の / 日曜日は / どうですか。
Bunsetsu phrase heads: 今度 (NP) / 日曜日 (NP) / どう (ADVP)
Hopefully this makes the syntactic dependencies easier to follow. I hope it sparks your interest in GiNZA.