[PYTHON] Dependency analysis with CaboCha

I wrote some code to examine Japanese dependency structure using CaboCha. As a first step, I tried to extract the relationships between nouns and adjectives.

References

For installing CaboCha in a Windows environment, please refer to mima_ita's blog post "Put Cabocha in Windows and analyze the dependency with Python".

A brief summary of parts of speech and word order

I have not studied natural language processing formally, nor have I surveyed the research literature, so what follows is just an amateur's take.

Focusing on nouns, verbs, and adjectives, and considering only the part of speech of the immediately preceding word, the relationships can be divided into the following patterns.
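One way to write such patterns down is a small lookup table from the previous part of speech to the parts of speech accepted next. The sketch below is an assumption read off from the branches of get_next_chunk in the reference code (a noun may be followed by a noun, verb, or adjective; a verb or adjective only by a noun). The names ALLOWED_NEXT and is_allowed are made up for illustration.

```python
# Hypothetical lookup table: previous part of speech -> parts of speech
# accepted for the next chunk. The entries mirror the branches of
# get_next_chunk in the reference code; the names are illustrative only.
ALLOWED_NEXT = {
    "noun": {"noun", "verb", "adjective"},
    "verb": {"noun"},
    "adjective": {"noun"},
}


def is_allowed(prev_part, next_part):
    """Return True if next_part may follow prev_part under the table above."""
    return next_part in ALLOWED_NEXT.get(prev_part, set())


print(is_allowed("noun", "adjective"))  # a noun followed by an adjective
print(is_allowed("verb", "adjective"))  # a verb followed by an adjective
```

Any part of speech that is not a key of the table (a symbol, for example) accepts nothing, which matches the reference code returning None in that case.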

Based on this, I wrote code to extract the relationship between nouns and adjectives. The code is shown below.

Reference code

python


# coding: utf-8

import CaboCha
import xml.etree.ElementTree as ET
from collections import defaultdict

# Part-of-speech labels are written in English here. CaboCha with an IPADIC
# dictionary emits Japanese labels (e.g. 名詞, 形容詞, 動詞, 一般, 固有名詞),
# so adjust these literals to match your actual output.
OBJECT_TYPES = ('General', 'Proper noun')


def get_reputation(ET_tree):
    """Print the first noun/adjective pair found by following chunk links."""
    found = False
    reputation = defaultdict(str)

    for el in ET_tree.findall(".//chunk"):
        # Only the first tok of each chunk is inspected.
        tok = el.find("tok")
        feature = tok.attrib["feature"].strip().split(',')
        part = feature[0]
        typ = feature[1]

        if part == 'noun' and typ in OBJECT_TYPES:
            reputation["object"] = tok.text
        if part == 'adjective':
            reputation['adjective'] = feature[6]  # base form
        link = el.attrib["link"]
        if link == '-1':
            break
        while True:
            res = get_next_chunk(ET_tree, link, part)
            if res is None:
                break
            part, typ, word, link = res

            if part == 'noun' and typ in OBJECT_TYPES:
                reputation["object"] = word
            if part == 'adjective':
                reputation['adjective'] = word
                # defaultdict(str) returns '' for a missing key, so test
                # truthiness instead of comparing with None.
                if reputation["object"]:
                    found = True
                    break
        if found:
            break

    print(reputation["object"])
    print(reputation["adjective"])


def get_next_chunk(ET_tree, linkid, ex_part):
    """Return (part, type, word, link) of the linked chunk, or None.

    None is returned at the root (link == '-1') or when the linked chunk's
    part of speech is not an accepted successor of ex_part.
    """
    if linkid == '-1':
        return None
    this_chunk = ET_tree.find(".//chunk[@id='%s']" % linkid)
    link = this_chunk.attrib["link"]

    tok = this_chunk.find('tok')
    feature = tok.attrib["feature"].strip().split(',')
    if ex_part == 'noun':
        if feature[0] == 'noun':
            return feature[0], feature[1], tok.text, link
        elif feature[0] in ('verb', 'adjective'):
            return feature[0], feature[1], feature[6], link
    elif ex_part in ('verb', 'adjective'):
        if feature[0] == 'noun':
            return feature[0], feature[1], tok.text, link
    return None


if __name__ == '__main__':
    c = CaboCha.Parser('--charset=UTF8')

    # English rendering of the input; CaboCha itself parses Japanese text.
    sentence = ("The most interesting article in the June issue of Courier, "
                "\"Introduction to 'Liberal Arts', which is fun to learn,\" "
                "was an article by Hidehiro Ikegami. "
                "\"How to see a painting\" is extremely important, isn't it?")

    tree = c.parse(sentence)
    #print(tree.toString(CaboCha.FORMAT_TREE))
    #print(tree.toString(CaboCha.FORMAT_LATTICE))
    print(tree.toString(CaboCha.FORMAT_XML))
    ET_tree = ET.fromstring(tree.toString(CaboCha.FORMAT_XML))

    get_reputation(ET_tree)

Only the first tok of each chunk is used, on the assumption that if the sentence is split into chunks well, the first tok should carry the meaningful word.
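As a minimal sketch of what "first tok of each chunk" means in practice, the snippet below reads two chunks copied from the XML output shown in this article and lists the part of speech and surface form of each chunk's first tok. The helper name first_tokens is hypothetical.

```python
import xml.etree.ElementTree as ET

# Two chunks copied from the CaboCha XML output shown in this article.
SAMPLE_XML = """<sentence>
 <chunk id="0" link="1" rel="D" score="0.328005" head="0" func="1">
  <tok id="0" feature="noun,General,*,*,*,*,*,*,*,Wikipedia">Courier</tok>
  <tok id="1" feature="Particle,Attributive,*,*,*,*,of,No,No">of</tok>
 </chunk>
 <chunk id="2" link="3" rel="D" score="1.140087" head="5" func="5">
  <tok id="4" feature="symbol,Open parentheses,*,*,*,*,「,「,「">「</tok>
  <tok id="5" feature="adjective,Independence,*,*,adjective・イ段,Continuous connection,pleasant,Tanoshiku,Tanoshiku">Happily</tok>
 </chunk>
</sentence>"""


def first_tokens(root):
    """Return (chunk id, part of speech, surface) for each chunk's first tok."""
    rows = []
    for chunk in root.findall(".//chunk"):
        tok = chunk.find("tok")  # find() returns the first matching child
        part = tok.attrib["feature"].split(",")[0]
        rows.append((chunk.attrib["id"], part, tok.text))
    return rows


root = ET.fromstring(SAMPLE_XML)
for row in first_tokens(root):
    print(row)
```

For chunk 2 the first tok is the opening bracket rather than the adjective, so a chunk that begins with a symbol is one case where the first-tok assumption does not pick up the meaningful word.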

Output result

The most interesting article in the June issue of Courier, "Introduction to 'Liberal Arts', which is fun to learn," was an article by Hidehiro Ikegami. "How to see a painting" is extremely important, isn't it?

I analyzed this sentence with CaboCha. The XML output of the analysis is shown below.

<sentence>
 <chunk id="0" link="1" rel="D" score="0.328005" head="0" func="1">
  <tok id="0" feature="noun,General,*,*,*,*,*,*,*,Wikipedia">Courier</tok>
  <tok id="1" feature="Particle,Attributive,*,*,*,*,of,No,No">of</tok>
 </chunk>
 <chunk id="1" link="4" rel="D" score="0.018011" head="3" func="3">
  <tok id="2" feature="noun,Adverbs possible,*,*,*,*,June,Rokugatsu,Rokugatsu">June</tok>
  <tok id="3" feature="noun,suffix,General,*,*,*,issue,Go,Go">issue</tok>
 </chunk>
 <chunk id="2" link="3" rel="D" score="1.140087" head="5" func="5">
  <tok id="4" feature="symbol,Open parentheses,*,*,*,*,「,「,「">「</tok>
  <tok id="5" feature="adjective,Independence,*,*,adjective・イ段,Continuous connection,pleasant,Tanoshiku,Tanoshiku">Happily</tok>
 </chunk>
 <chunk id="3" link="4" rel="D" score="2.891449" head="6" func="6">
  <tok id="6" feature="verb,Independence,*,*,Five steps / ba line,Uninflected word,learn,Manab,Manab">learn</tok>
 </chunk>
 <chunk id="4" link="6" rel="D" score="1.654218" head="10" func="12">
  <tok id="7" feature="symbol,Open parentheses,*,*,*,*,『,『,『">『</tok>
  <tok id="8" feature="noun,General,*,*,*,*,Liberal arts,Kyoyo,Kyoyo">Liberal arts</tok>
  <tok id="9" feature="symbol,Parentheses closed,*,*,*,*,』,』,』">』</tok>
  <tok id="10" feature="noun,Change connection,*,*,*,*,getting started,Newmon,Newmon">getting started</tok>
  <tok id="11" feature="symbol,Parentheses closed,*,*,*,*,」,」,」">」</tok>
  <tok id="12" feature="Particle,Case particle,General,*,*,*,so,De,De">so</tok>
 </chunk>
 <chunk id="5" link="6" rel="D" score="2.796273" head="13" func="13">
  <tok id="13" feature="noun,Adverbs possible,*,*,*,*,Ichiban,Ichiban,Ichiban">Ichiban</tok>
 </chunk>
 <chunk id="6" link="8" rel="D" score="1.346988" head="16" func="17">
  <tok id="14" feature="adjective,Independence,*,*,adjective・アウオ段,Continuous connection,Interesting,Funny,Funny">Interesting</tok>
  <tok id="15" feature="Auxiliary verb,*,*,*,Special,Uninflected word,Ta,Ta,Ta">Ta</tok>
  <tok id="16" feature="noun,Non-independent,General,*,*,*,of,No,No">of</tok>
  <tok id="17" feature="Particle,Binding particle,*,*,*,*,Is,Ha,Wa">Is</tok>
  <tok id="18" feature="symbol,Comma,*,*,*,*,、,、,、">、</tok>
 </chunk>
 <chunk id="7" link="8" rel="D" score="3.189624" head="21" func="22">
  <tok id="19" feature="noun,Proper noun,Personal name,Surname,*,*,Ikegami,Ikegami,Ikegami">Ikegami</tok>
  <tok id="20" feature="noun,Proper noun,Personal name,Name,*,*,Hidehiro,Hidehiro,Hidehiro">Hidehiro</tok>
  <tok id="21" feature="noun,suffix,Personal name,*,*,*,Mr.,San,San">Mr.</tok>
  <tok id="22" feature="Particle,Attributive,*,*,*,*,of,No,No">of</tok>
 </chunk>
 <chunk id="8" link="11" rel="D" score="-0.885691" head="23" func="23">
  <tok id="23" feature="noun,General,*,*,*,*,article,Kiji,Kiji">article</tok>
  <tok id="24" feature="symbol,Kuten,*,*,*,*,。,。,。">。</tok>
 </chunk>
 <chunk id="9" link="10" rel="D" score="4.381583" head="26" func="27">
  <tok id="25" feature="symbol,Open parentheses,*,*,*,*,「,「,「">「</tok>
  <tok id="26" feature="noun,General,*,*,*,*,Painting,Kaiga,Kaiga">Painting</tok>
  <tok id="27" feature="Particle,Attributive,*,*,*,*,of,No,No">of</tok>
 </chunk>
 <chunk id="10" link="11" rel="D" score="-0.885691" head="29" func="31">
  <tok id="28" feature="verb,Independence,*,*,One step,Continuous form,to see,Mi,Mi">You see</tok>
  <tok id="29" feature="noun,suffix,General,*,*,*,How,Kata,Kata">How</tok>
  <tok id="30" feature="symbol,Parentheses closed,*,*,*,*,」,」,」">」</tok>
  <tok id="31" feature="Particle,Case particle,Collocation,*,*,*,What,Itte,Itte">What</tok>
 </chunk>
 <chunk id="11" link="-1" rel="D" score="0.000000" head="33" func="36">
  <tok id="32" feature="noun,Change connection,*,*,*,*,Transcendence,Chozetsu,Chozetsu">Transcendence</tok>
  <tok id="33" feature="noun,Adjectival noun stem,*,*,*,*,Important,Daiji,Daiji">Important</tok>
  <tok id="34" feature="Auxiliary verb,*,*,*,Special,Uninflected word,Is,Da,Da">Is</tok>
  <tok id="35" feature="Particle,Sentence-final particle,*,*,*,*,Yo,Yo,Yo">Yo</tok>
  <tok id="36" feature="Particle,Sentence-final particle,*,*,*,*,Ne,Ne,Ne">Ne</tok>
  <tok id="37" feature="symbol,Kuten,*,*,*,*,。,。,。">。</tok>
 </chunk>
</sentence>

The link attribute of each chunk gives the id of the chunk it depends on. A chunk with link="-1" (chunk 11 here) depends on nothing further.
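To make the link attribute concrete, here is a minimal sketch that follows link from one chunk until it reaches link="-1". It uses only the id and link values copied from the XML above. The function name path_to_root is hypothetical and not part of CaboCha.

```python
import xml.etree.ElementTree as ET

# Chunk skeleton taken from the XML above: only id and link are kept.
SAMPLE_XML = """<sentence>
 <chunk id="0" link="1"/>
 <chunk id="1" link="4"/>
 <chunk id="2" link="3"/>
 <chunk id="3" link="4"/>
 <chunk id="4" link="6"/>
 <chunk id="5" link="6"/>
 <chunk id="6" link="8"/>
 <chunk id="7" link="8"/>
 <chunk id="8" link="11"/>
 <chunk id="9" link="10"/>
 <chunk id="10" link="11"/>
 <chunk id="11" link="-1"/>
</sentence>"""


def path_to_root(root, chunk_id):
    """Follow link attributes from chunk_id until a chunk with link == "-1"."""
    links = {c.attrib["id"]: c.attrib["link"] for c in root.findall(".//chunk")}
    path = [chunk_id]
    while links[path[-1]] != "-1":
        path.append(links[path[-1]])
    return path


root = ET.fromstring(SAMPLE_XML)
print(path_to_root(root, "0"))  # ['0', '1', '4', '6', '8', '11']
```

Chunk 0 ("Courier") passes through chunk 6 ("Interesting") on its way to the final chunk, which is the connection the reference code is trying to pick up.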

Experimental result

I was able to extract the relationship "Courier" <- "Interesting". However, the code above can only retrieve one relationship, so handling sentences that contain more than one will take further work.

If you spot any mistakes, I would appreciate it if you could point them out.
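One possible direction, sketched here under several assumptions and not as the author's method, is to return a list of pairs instead of stopping at the first one. For every chunk whose first tok is a general or proper noun, the sketch follows link until it meets a chunk whose first tok is an adjective. Intermediate chunks are skipped regardless of their part of speech, which is looser than the reference code. The sample XML is a reduced copy of the output above: selected chunks only, first tok only, and feature strings shortened to part of speech, subtype, and base form, with "Proper noun" written the way the reference code expects. The names first_tok_info and collect_pairs are hypothetical.

```python
import xml.etree.ElementTree as ET

OBJECT_TYPES = ("General", "Proper noun")

# Reduced copy of the XML above (assumption: shortened feature strings).
SAMPLE_XML = """<sentence>
 <chunk id="0" link="1"><tok feature="noun,General,*,*,*,*,*">Courier</tok></chunk>
 <chunk id="1" link="4"><tok feature="noun,Adverbs possible,*,*,*,*,June">June</tok></chunk>
 <chunk id="4" link="6"><tok feature="symbol,Open parentheses,*,*,*,*,*">*</tok></chunk>
 <chunk id="6" link="8"><tok feature="adjective,Independence,*,*,*,*,Interesting">Interesting</tok></chunk>
 <chunk id="7" link="8"><tok feature="noun,Proper noun,Personal name,Surname,*,*,Ikegami">Ikegami</tok></chunk>
 <chunk id="8" link="11"><tok feature="noun,General,*,*,*,*,article">article</tok></chunk>
 <chunk id="11" link="-1"><tok feature="noun,Change connection,*,*,*,*,Transcendence">Transcendence</tok></chunk>
</sentence>"""


def first_tok_info(chunk):
    """Return (part of speech, subtype, word) for the first tok of a chunk."""
    tok = chunk.find("tok")
    feature = tok.attrib["feature"].split(",")
    # Use the base form (7th field) for adjectives, the surface form otherwise.
    word = feature[6] if feature[0] == "adjective" else tok.text
    return feature[0], feature[1], word


def collect_pairs(root):
    """Collect (noun, adjective) pairs, at most one per qualifying noun chunk."""
    chunk_list = root.findall(".//chunk")
    by_id = {c.attrib["id"]: c for c in chunk_list}
    pairs = []
    for chunk in chunk_list:
        part, typ, word = first_tok_info(chunk)
        if part != "noun" or typ not in OBJECT_TYPES:
            continue
        link = chunk.attrib["link"]
        while link != "-1":
            next_part, _, next_word = first_tok_info(by_id[link])
            if next_part == "adjective":
                pairs.append((word, next_word))
                break
            link = by_id[link].attrib["link"]
    return pairs


print(collect_pairs(ET.fromstring(SAMPLE_XML)))
```

On this sample the list still holds only the Courier/Interesting pair, because the other qualifying nouns (Ikegami, article) never reach an adjective chunk. The point is that the result is a list, so a sentence with several noun-to-adjective chains would yield several entries.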
