I wrote some code to examine Japanese dependency structure using CaboCha. As a first step, I tried to extract relationships between nouns and adjectives.
For installing CaboCha on Windows, please refer to mima_ita's blog post, "Put CaboCha in Windows and analyze the dependency with Python".
I have not studied natural language processing formally, nor have I surveyed the papers on it, so what follows is just an amateur's idea.
Focusing on nouns, verbs, and adjectives, and considering only the part of speech of the immediately preceding word, the relationships can be divided into the following patterns.
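As a rough summary, the transitions that `get_next_chunk` in the code below accepts can be written as a small lookup table. This is only a sketch read off that function; the names `ALLOWED_NEXT` and `is_allowed` are made up for illustration and do not appear in the original code.

```python
# Transitions accepted by get_next_chunk, keyed by the part of speech
# of the preceding chunk's first word (labels as used in this article).
ALLOWED_NEXT = {
    "noun": {"noun", "verb", "adjective"},
    "verb": {"noun"},
    "adjective": {"noun"},
}


def is_allowed(prev_part, next_part):
    """Return True if next_part may follow prev_part under the table above."""
    return next_part in ALLOWED_NEXT.get(prev_part, set())


print(is_allowed("noun", "adjective"))
print(is_allowed("verb", "adjective"))
```

Any other combination, for example a chunk that starts with a symbol, ends the search along that chain.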
Based on this, I wrote code to extract the relationship between nouns and adjectives. The code is shown below.
python
# coding: utf-8
# Note: the part-of-speech labels compared below ('noun', 'adjective', ...) are
# English renderings. CaboCha with the IPA dictionary reports them in Japanese
# (e.g. 名詞, 形容詞, 動詞), so adjust the literals to match your actual output.
import CaboCha
import xml.etree.ElementTree as ET
from collections import defaultdict


def get_reputation(ET_tree):
    found = False
    reputation = defaultdict(str)
    for el in ET_tree.findall(".//chunk"):
        # Only the first tok of each chunk is examined.
        tok = el.find("tok")
        feature = tok.attrib["feature"].strip().split(',')
        part = feature[0]
        typ = feature[1]
        if part == 'noun' and typ in ('General', 'Proper noun'):
            reputation["object"] = tok.text
        if part == 'adjective':
            reputation['adjective'] = feature[6]  # base form
        link = el.attrib["link"]
        if link == '-1':
            break
        # Follow the dependency links starting from this chunk.
        while True:
            res = get_next_chunk(ET_tree, link, part)
            if res is None:
                break
            part, typ, word, link = res
            if part == 'noun' and typ in ('General', 'Proper noun'):
                reputation["object"] = word
            if part == 'adjective':
                reputation['adjective'] = word
                # defaultdict(str) returns '' for a missing key, so test
                # truthiness instead of comparing with None.
                if reputation["object"]:
                    found = True
                    break
        if found:
            break
    print(reputation["object"])
    print(reputation["adjective"])


def get_next_chunk(ET_tree, linkid, ex_part):
    """Return (part, type, word, link) for the chunk with id linkid, or None
    when the transition from ex_part to that chunk is not handled."""
    if linkid == '-1':
        return None
    this_chunk = ET_tree.find(".//chunk[@id='%s']" % linkid)
    link = this_chunk.attrib["link"]
    tok = this_chunk.find('tok')
    feature = tok.attrib["feature"].strip().split(',')
    if ex_part == 'noun':
        if feature[0] == 'noun':
            return feature[0], feature[1], tok.text, link
        elif feature[0] == 'verb' or feature[0] == 'adjective':
            return feature[0], feature[1], feature[6], link
    elif ex_part == 'verb':
        if feature[0] == 'noun':
            return feature[0], feature[1], tok.text, link
    elif ex_part == 'adjective':
        if feature[0] == 'noun':
            return feature[0], feature[1], tok.text, link
    return None


if __name__ == '__main__':
    c = CaboCha.Parser('--charset=UTF8')
    # English rendering of the input; the sentence actually parsed was Japanese.
    sentence = ("The most interesting article in the June issue of Courier, "
                "\"Introduction to 'Liberal Arts', which is fun to learn,\" "
                "was an article by Hidehiro Ikegami. "
                "\"How to see a painting\" is extremely important, isn't it?")
    tree = c.parse(sentence)
    #print(tree.toString(CaboCha.FORMAT_TREE))
    #print(tree.toString(CaboCha.FORMAT_LATTICE))
    xml_text = tree.toString(CaboCha.FORMAT_XML)
    print(xml_text)
    ET_tree = ET.fromstring(xml_text)
    get_reputation(ET_tree)
The code looks only at the first tok of each chunk, on the assumption that if the sentence is split into chunks well, the first tok carries the meaningful word.
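To make the "first tok only" idea concrete, here is a minimal sketch that parses a small hand-written fragment in the same shape as CaboCha's XML output and lists the first tok of each chunk. The fragment and the helper name `first_toks` are assumptions for illustration, not part of the original code, and CaboCha itself is not needed to run it.

```python
import xml.etree.ElementTree as ET

# Hand-written sample shaped like CaboCha's XML output (illustrative only).
SAMPLE_XML = """
<sentence>
 <chunk id="0" link="1">
  <tok id="0" feature="noun,General,*,*,*,*,*">Courier</tok>
  <tok id="1" feature="Particle,Attributive,*,*,*,*,of">of</tok>
 </chunk>
 <chunk id="1" link="-1">
  <tok id="2" feature="adjective,Independence,*,*,*,*,Interesting">Interesting</tok>
 </chunk>
</sentence>
"""


def first_toks(root):
    """Return (chunk id, part of speech, surface) for the first tok of each chunk."""
    result = []
    for chunk in root.findall(".//chunk"):
        tok = chunk.find("tok")  # find() returns the first matching child
        part = tok.attrib["feature"].split(",")[0]
        result.append((chunk.attrib["id"], part, tok.text))
    return result


root = ET.fromstring(SAMPLE_XML.strip())
for row in first_toks(root):
    print(row)
```

One caveat with this shortcut: when a chunk begins with a symbol such as an opening bracket, as chunks 2 and 4 do in the XML output further down, the first tok is the bracket rather than the content word.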
I analyzed the sentence "The most interesting article in the June issue of Courier, 'Introduction to Liberal Arts, which is fun to learn,' was an article by Hidehiro Ikegami. 'How to see a painting' is extremely important, isn't it?" with CaboCha. The following is the XML output of the analysis.
<sentence>
<chunk id="0" link="1" rel="D" score="0.328005" head="0" func="1">
<tok id="0" feature="noun,General,*,*,*,*,*,*,*,Wikipedia">Courier</tok>
<tok id="1" feature="Particle,Attributive,*,*,*,*,of,No,No">of</tok>
</chunk>
<chunk id="1" link="4" rel="D" score="0.018011" head="3" func="3">
<tok id="2" feature="noun,Adverbs possible,*,*,*,*,June,Rokugatsu,Rokugatsu">June</tok>
<tok id="3" feature="noun,suffix,General,*,*,*,issue,Go,Go">issue</tok>
</chunk>
<chunk id="2" link="3" rel="D" score="1.140087" head="5" func="5">
<tok id="4" feature="symbol,Open parentheses,*,*,*,*,「,「,「">「</tok>
<tok id="5" feature="adjective,Independence,*,*,adjective (i-row),Continuous connection,pleasant,Tanoshiku,Tanoshiku">Happily</tok>
</chunk>
<chunk id="3" link="4" rel="D" score="2.891449" head="6" func="6">
<tok id="6" feature="verb,Independence,*,*,Five steps / ba line,Uninflected word,learn,Manab,Manab">learn</tok>
</chunk>
<chunk id="4" link="6" rel="D" score="1.654218" head="10" func="12">
<tok id="7" feature="symbol,Open parentheses,*,*,*,*,『,『,『">『</tok>
<tok id="8" feature="noun,General,*,*,*,*,Liberal arts,Kyoyo,Kyoyo">Liberal arts</tok>
<tok id="9" feature="symbol,Parentheses closed,*,*,*,*,』,』,』">』</tok>
<tok id="10" feature="noun,Change connection,*,*,*,*,getting started,Newmon,Newmon">getting started</tok>
<tok id="11" feature="symbol,Parentheses closed,*,*,*,*,」,」,」">」</tok>
<tok id="12" feature="Particle,Case particle,General,*,*,*,so,De,De">so</tok>
</chunk>
<chunk id="5" link="6" rel="D" score="2.796273" head="13" func="13">
<tok id="13" feature="noun,Adverbs possible,*,*,*,*,Ichiban,Ichiban,Ichiban">Ichiban</tok>
</chunk>
<chunk id="6" link="8" rel="D" score="1.346988" head="16" func="17">
<tok id="14" feature="adjective,Independence,*,*,adjective (a/u/o-row),Continuous connection,Interesting,Funny,Funny">Interesting</tok>
<tok id="15" feature="Auxiliary verb,*,*,*,Special,Uninflected word,Ta,Ta,Ta">Ta</tok>
<tok id="16" feature="noun,Non-independent,General,*,*,*,of,No,No">of</tok>
<tok id="17" feature="Particle,Binding particle,*,*,*,*,Is,Ha,Wa">Is</tok>
<tok id="18" feature="symbol,Comma,*,*,*,*,、,、,、">、</tok>
</chunk>
<chunk id="7" link="8" rel="D" score="3.189624" head="21" func="22">
<tok id="19" feature="noun,Proper noun,Personal name,Surname,*,*,Ikegami,Ikegami,Ikegami">Ikegami</tok>
<tok id="20" feature="noun,Proper noun,Personal name,Name,*,*,Hidehiro,Hidehiro,Hidehiro">Hidehiro</tok>
<tok id="21" feature="noun,suffix,Personal name,*,*,*,Mr.,San,San">Mr.</tok>
<tok id="22" feature="Particle,Attributive,*,*,*,*,of,No,No">of</tok>
</chunk>
<chunk id="8" link="11" rel="D" score="-0.885691" head="23" func="23">
<tok id="23" feature="noun,General,*,*,*,*,article,Kiji,Kiji">article</tok>
<tok id="24" feature="symbol,Kuten,*,*,*,*,。,。,。">。</tok>
</chunk>
<chunk id="9" link="10" rel="D" score="4.381583" head="26" func="27">
<tok id="25" feature="symbol,Open parentheses,*,*,*,*,「,「,「">「</tok>
<tok id="26" feature="noun,General,*,*,*,*,Painting,Kaiga,Kaiga">Painting</tok>
<tok id="27" feature="Particle,Attributive,*,*,*,*,of,No,No">of</tok>
</chunk>
<chunk id="10" link="11" rel="D" score="-0.885691" head="29" func="31">
<tok id="28" feature="verb,Independence,*,*,One step,Continuous form,to see,Mi,Mi">You see</tok>
<tok id="29" feature="noun,suffix,General,*,*,*,How,Kata,Kata">How</tok>
<tok id="30" feature="symbol,Parentheses closed,*,*,*,*,」,」,」">」</tok>
<tok id="31" feature="Particle,Case particle,Collocation,*,*,*,What,Itte,Itte">What</tok>
</chunk>
<chunk id="11" link="-1" rel="D" score="0.000000" head="33" func="36">
<tok id="32" feature="noun,Change connection,*,*,*,*,Transcendence,Chozetsu,Chozetsu">Transcendence</tok>
<tok id="33" feature="noun,Adjectival noun stem,*,*,*,*,Important,Daiji,Daiji">Important</tok>
<tok id="34" feature="Auxiliary verb,*,*,*,Special,Uninflected word,Is,Da,Da">Is</tok>
<tok id="35" feature="Particle,Sentence-final particle,*,*,*,*,Yo,Yo,Yo">Yo</tok>
<tok id="36" feature="Particle,Sentence-final particle,*,*,*,*,Ne,Ne,Ne">Ne</tok>
<tok id="37" feature="symbol,Kuten,*,*,*,*,。,。,。">。</tok>
</chunk>
</sentence>
The link attribute of each chunk holds the id of the chunk it depends on. A link of "-1" marks the final chunk, which depends on nothing further.
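As an illustration of how the link attribute chains chunks together, the sketch below follows the links from a starting chunk until it reaches link="-1". The sample XML is hand-written, and the function name `link_path` is an assumption for this example only.

```python
import xml.etree.ElementTree as ET

# Hand-written sample: chunks 0 -> 1 -> 3 and 2 -> 3, with 3 as the final chunk.
SAMPLE_XML = """
<sentence>
 <chunk id="0" link="1"><tok id="0" feature="noun,General">A</tok></chunk>
 <chunk id="1" link="3"><tok id="1" feature="noun,General">B</tok></chunk>
 <chunk id="2" link="3"><tok id="2" feature="noun,General">C</tok></chunk>
 <chunk id="3" link="-1"><tok id="3" feature="noun,General">D</tok></chunk>
</sentence>
"""


def link_path(root, start_id):
    """Return the list of chunk ids visited by following link from start_id."""
    chunks = {c.attrib["id"]: c for c in root.findall(".//chunk")}
    path = [start_id]
    current = chunks[start_id]
    while current.attrib["link"] != "-1":
        next_id = current.attrib["link"]
        path.append(next_id)
        current = chunks[next_id]
    return path


root = ET.fromstring(SAMPLE_XML.strip())
print(link_path(root, "0"))
print(link_path(root, "2"))
```

Building the id-to-chunk dictionary once avoids repeating an XPath search for every step of the chain.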
I was able to extract the relationship "Courier" <- "interesting". However, the code above can only return one relationship, so handling a sentence that contains several would need more work.
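One possible direction for collecting several relationships is sketched below. It is not the original approach: instead of following a whole chain, it only checks each chunk against the chunk it links to directly, and records a pair whenever one side starts with a general or proper noun and the other with an adjective. The sample XML and the names `head_word`, `is_target_noun`, and `noun_adjective_pairs` are all assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written sample shaped like CaboCha's XML output (illustrative only).
SAMPLE_XML = """
<sentence>
 <chunk id="0" link="1">
  <tok id="0" feature="noun,General,*,*,*,*,article">article</tok>
 </chunk>
 <chunk id="1" link="3">
  <tok id="1" feature="adjective,Independence,*,*,*,*,interesting">interesting</tok>
 </chunk>
 <chunk id="2" link="3">
  <tok id="2" feature="noun,General,*,*,*,*,painting">painting</tok>
 </chunk>
 <chunk id="3" link="-1">
  <tok id="3" feature="adjective,Independence,*,*,*,*,important">important</tok>
 </chunk>
</sentence>
"""


def head_word(chunk):
    """Return (part, type, word) for the first tok of a chunk."""
    tok = chunk.find("tok")
    feature = tok.attrib["feature"].split(",")
    part, typ = feature[0], feature[1]
    # Nouns use the surface form; other parts of speech use the base form.
    word = tok.text if part == "noun" else feature[6]
    return part, typ, word


def is_target_noun(part, typ):
    return part == "noun" and typ in ("General", "Proper noun")


def noun_adjective_pairs(root):
    """Collect (noun, adjective) pairs from every directly linked chunk pair."""
    chunks = {c.attrib["id"]: c for c in root.findall(".//chunk")}
    pairs = []
    for chunk in chunks.values():
        link = chunk.attrib["link"]
        if link == "-1":
            continue
        part, typ, word = head_word(chunk)
        h_part, h_typ, h_word = head_word(chunks[link])
        if is_target_noun(part, typ) and h_part == "adjective":
            pairs.append((word, h_word))
        elif part == "adjective" and is_target_noun(h_part, h_typ):
            pairs.append((h_word, word))
    return pairs


root = ET.fromstring(SAMPLE_XML.strip())
print(noun_adjective_pairs(root))
```

Because it inspects only direct links, this sketch would miss pairs that are connected through intermediate chunks, so it trades recall for simplicity.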
This is a rough write-up, so if you notice any mistakes, I would appreciate it if you could point them out.