This is a record of my attempt at the Language Processing 100 Knocks 2015. The environment is Ubuntu 16.04 LTS + Python 3.5.2 :: Anaconda 4.1.1 (64-bit). A list of past knocks is available here (http://qiita.com/segavvy/items/fb50ba8097d59475f760).
Use CaboCha to run dependency parsing on the text (neko.txt) of Natsume Soseki's novel "I Am a Cat" and save the result in a file called neko.txt.cabocha. Use this file to implement programs that address the following questions.
In addition to problem 40, implement the clause class Chunk. This class has a list of morphemes (Morph objects) (morphs), the index number of the destination clause (dst), and a list of the index numbers of the source clauses (srcs) as member variables. In addition, read CaboCha's analysis result for the input text, express each sentence as a list of Chunk objects, and display the text and the dependency destination of each clause in the eighth sentence. For the rest of the problems in Chapter 5, use the program created here.
main.py
# coding: utf-8
import CaboCha
import re
fname = 'neko.txt'
fname_parsed = 'neko.txt.cabocha'
def parse_neko():
    '''Run dependency parsing on "I Am a Cat"
    Parse "I Am a Cat" (neko.txt) with CaboCha and save the result to neko.txt.cabocha
    '''
    with open(fname) as data_file, \
            open(fname_parsed, mode='w') as out_file:
        cabocha = CaboCha.Parser()
        for line in data_file:
            out_file.write(
                cabocha.parse(line).toString(CaboCha.FORMAT_LATTICE)
            )
class Morph:
    '''
    Morpheme class
    Holds the surface form (surface), base form (base), part of speech (pos),
    and part-of-speech subcategory 1 (pos1) as member variables
    '''
    def __init__(self, surface, base, pos, pos1):
        '''Initialization'''
        self.surface = surface
        self.base = base
        self.pos = pos
        self.pos1 = pos1
    def __str__(self):
        '''String representation of the object'''
        return 'surface[{}]\tbase[{}]\tpos[{}]\tpos1[{}]'\
            .format(self.surface, self.base, self.pos, self.pos1)
class Chunk:
    '''
    Clause class
    Holds a list of morphemes (Morph objects) (morphs), the index number of the
    destination clause (dst), and a list of the index numbers of the source
    clauses (srcs) as member variables
    '''
    def __init__(self):
        '''Initialization'''
        self.morphs = []
        self.srcs = []
        self.dst = -1
    def __str__(self):
        '''String representation of the object'''
        surface = ''
        for morph in self.morphs:
            surface += morph.surface
        return '{}\tsrcs{}\tdst[{}]'.format(surface, self.srcs, self.dst)
def neco_lines():
    '''Generator over the dependency analysis results of "I Am a Cat"
    Reads the dependency analysis results of "I Am a Cat" in sequence
    and yields the Chunk objects one sentence at a time
    Return value:
    Sequence of the Chunk objects in one sentence
    '''
    with open(fname_parsed) as file_parsed:
        chunks = dict()     # Stores each Chunk with idx as the key
        idx = -1
        for line in file_parsed:
            # Detect the end of a sentence
            if line == 'EOS\n':
                # Yield the Chunks of this sentence
                if len(chunks) > 0:
                    # Sort chunks by key and take only the values
                    sorted_tuple = sorted(chunks.items(), key=lambda x: x[0])
                    yield list(zip(*sorted_tuple))[1]
                    chunks.clear()
                else:
                    yield []
            # A line starting with * is a dependency analysis result, so create a Chunk
            elif line[0] == '*':
                # Get the Chunk index number and the destination index number
                cols = line.split(' ')
                idx = int(cols[1])
                dst = int(re.search(r'(.*?)D', cols[2]).group(1))
                # Create the Chunk (if it does not exist yet) and set its destination index
                if idx not in chunks:
                    chunks[idx] = Chunk()
                chunks[idx].dst = dst
                # Create the destination Chunk (if it does not exist yet) and record this Chunk as one of its sources
                if dst != -1:
                    if dst not in chunks:
                        chunks[dst] = Chunk()
                    chunks[dst].srcs.append(idx)
            # Any other line is a morphological analysis result, so create a Morph and add it to the Chunk
            else:
                # The surface form is tab-delimited; the remaining fields are comma-delimited
                cols = line.split('\t')
                res_cols = cols[1].split(',')
                # Create the Morph and add it to the list
                chunks[idx].morphs.append(
                    Morph(
                        cols[0],        # surface
                        res_cols[6],    # base
                        res_cols[0],    # pos
                        res_cols[1]     # pos1
                    )
                )
        # The generator simply ends here. An explicit "raise StopIteration"
        # is unnecessary and becomes a RuntimeError from Python 3.7 (PEP 479).
# Run the dependency analysis
parse_neko()
# Build the list one sentence at a time
for i, chunks in enumerate(neco_lines(), 1):
    # Display the 8th sentence
    if i == 8:
        for j, chunk in enumerate(chunks):
            print('[{}]{}'.format(j, chunk))
        break
The problem only asks to display the dependency destination, but the dependency sources are displayed as well to confirm the implementation of the Chunk class.
Terminal
[0]I'm srcs[] dst[5]
[1]Here srcs[] dst[2]
[2]For the first time srcs[1] dst[3]
[3]Human srcs[2] dst[4]
[4]Things srcs[3] dst[5]
[5]saw. srcs[0, 4] dst[-1]
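As a quick sanity check on the output above, the srcs lists should be exactly the inverse of the dst values. The following is a minimal sketch that assumes only the six dst values shown in the terminal output; the variable names are just for illustration.

```python
# dst values copied from the terminal output above (clauses 0 to 5)
dst = [5, 2, 3, 4, 5, -1]

# Invert the mapping: every clause registers itself in the srcs of its destination
srcs = [[] for _ in dst]
for idx, d in enumerate(dst):
    if d != -1:          # -1 means the clause has no destination
        srcs[d].append(idx)

for idx, s in enumerate(srcs):
    print('[{}] srcs{} dst[{}]'.format(idx, s, dst[idx]))
```

If the rebuilt srcs lists match the ones printed by the program, the two member variables are consistent with each other.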
CaboCha's output inserts lines starting with * into the morphological analysis result, and the dependency analysis result for each clause is written on those lines.
Example of dependency analysis results
* 3 5D 1/2 0.656580
This line is delimited by whitespace and has the following content:
Column | Meaning |
---|---|
1 | Always `*`. Indicates that this line is a dependency analysis result. |
2 | Clause number (integer starting from 0) |
3 | Destination clause number followed by `D` |
4 | Positions of the head word / function word, and any number of feature sequences |
5 | Dependency score. In general, the larger the value, the more likely the dependency. |
Only columns 2 and 3 are used in this problem. For the details of the analysis results, please refer to the official site, CaboCha/南瓜: Yet Another Japanese Dependency Structure Analyzer.
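To make the column layout concrete, here is a minimal sketch that splits the example line into the fields described in the table. The function name `parse_chunk_header` and the dictionary keys are hypothetical names chosen for this illustration, and the sketch assumes column 4 contains only the head/function positions, as in the example.

```python
def parse_chunk_header(line):
    '''Split a "*" line into its fields (hypothetical helper for illustration)'''
    cols = line.rstrip('\n').split(' ')
    if cols[0] != '*':
        raise ValueError('not a dependency line: {!r}'.format(line))
    head, func = cols[3].split('/')
    return {
        'idx': int(cols[1]),              # column 2: clause number
        'dst': int(cols[2].rstrip('D')),  # column 3: destination number without the trailing D
        'head': int(head),                # column 4: head word position
        'func': int(func),                # column 4: function word position
        'score': float(cols[4]),          # column 5: dependency score
    }


info = parse_chunk_header('* 3 5D 1/2 0.656580')
print(info['idx'], info['dst'])
```

The program in this post only needs `idx` and `dst`; the other fields are parsed here just to show where they sit in the line.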
The tricky part this time was the order in which the Chunk objects are created. For now, I read neko.txt.cabocha line by line, create the corresponding Chunk object as soon as any piece of information to store in it becomes available, and add the information to the existing object if it has already been created.

As a result, the Chunk objects are not created in order of appearance, and because they are kept in a dictionary their order is not guaranteed, so at the end they are sorted by clause number and extracted.

In hindsight, it might have been better to first create the Chunk objects in clause-number order without the dependency-source information, and then fill that information in afterwards.

That's all for the 42nd knock. If you find any mistakes, I would appreciate it if you could point them out.
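For reference, the alternative mentioned in the reflection above (create the Chunk objects in clause order first, then set the source information afterwards) could be sketched roughly as follows. This is not the implementation used in this post. It assumes that the * lines appear in clause-number order starting from 0, it stores only surface strings instead of Morph objects to stay short, and both `build_chunks` and the sample lines are hand-written for illustration (the sample is not real CaboCha output).

```python
class Chunk:
    '''Simplified clause class for this sketch (morphs holds surface strings only)'''
    def __init__(self, dst):
        self.morphs = []
        self.srcs = []
        self.dst = dst


def build_chunks(sentence_lines):
    '''Build the Chunk list of one sentence in two passes (hypothetical helper)'''
    chunks = []
    # Pass 1: create the Chunks in order of appearance and attach the morphemes
    for line in sentence_lines:
        if line.startswith('* '):
            cols = line.split(' ')
            chunks.append(Chunk(int(cols[2].rstrip('D'))))
        elif line.strip() != 'EOS':
            chunks[-1].morphs.append(line.split('\t')[0])
    # Pass 2: every Chunk now exists, so srcs can be filled in from dst
    for idx, chunk in enumerate(chunks):
        if chunk.dst != -1:
            chunks[chunk.dst].srcs.append(idx)
    return chunks


# Hand-written sample in the same line layout; "X" stands in for the feature string
sample = [
    '* 0 2D 0/1 0.0', '吾輩\tX', 'は\tX',
    '* 1 2D 0/1 0.0', 'ここ\tX', 'で\tX',
    '* 2 -1D 0/1 0.0', '見\tX', 'た\tX', '。\tX',
    'EOS',
]
for i, c in enumerate(build_chunks(sample)):
    print('[{}]{}\tsrcs{}\tdst[{}]'.format(i, ''.join(c.morphs), c.srcs, c.dst))
```

With this arrangement a list is enough, and neither the dictionary nor the final sort by clause number is needed.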