A record of solving the problems in the second half of Chapter 5. The target file is neko.txt, as given on the course page.
Use CaboCha to parse the text (neko.txt) of Natsume Soseki's novel "I Am a Cat" and save the result in a file called neko.txt.cabocha. Use this file to implement programs that address the following problems.
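If CaboCha is installed, the parsed file can be generated with a one-liner such as the following; the -f1 option selects the lattice output format that the parsers below assume.

cabocha -f1 neko.txt > neko.txt.cabocha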
We treat the text used here as a corpus and investigate the cases that Japanese predicates can take. Think of a verb as a predicate and the particles of the clauses that depend on the verb as its cases, and output each predicate and its cases in tab-delimited format, subject to the following specification:

- In a clause containing verbs, the base form of the leftmost verb is used as the predicate.
- A case is a particle attached to a clause that depends on the predicate.
- If multiple particles (clauses) depend on the predicate, list all the particles in lexicographic order, separated by spaces.

Save the output of this program to a file and use UNIX commands to check the following:

- Combinations of predicates and case patterns that appear frequently in the corpus
- The case patterns of the verbs する (do), 見る (see), and 与える (give), sorted by frequency of appearance in the corpus
# -*- coding: utf-8 -*-
__author__ = 'todoroki'

import problem41

def extractVerbPatern(sentence):
    """Collect (verb chunk, dependent chunks that contain a case particle) pairs."""
    lst = []
    for chunk in sentence:
        if chunk.include_pos('動詞'):  # the chunk contains a verb
            src_chunks = [sentence[src] for src in chunk.srcs]
            # keep only the dependent chunks that contain a case particle (格助詞)
            src_chunks_case = list(filter(lambda src_chunk: src_chunk.morphs_of_pos1('格助詞'), src_chunks))
            if src_chunks_case:
                lst.append((chunk, src_chunks_case))
    return lst

if __name__ == "__main__":
    f = open("neko.txt.cabocha", "r")
    sentences = problem41.read_chunk(f)
    verbPatterns = []
    for sentence in sentences:
        verbPatterns.append(extractVerbPatern(sentence))
    for verbPattern in verbPatterns:
        for verb, src_chunks in verbPattern:
            # predicate: base form of the leftmost verb in the chunk, per the spec
            v = verb.morphs_of_pos('動詞')[0].base
            # cases: the (last) case particle of each dependent chunk, sorted lexicographically
            ps = [src_chunk.morphs_of_pos1('格助詞')[-1].base for src_chunk in src_chunks]
            p = " ".join(sorted(ps))
            print "%s\t%s" % (v, p)
    f.close()
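problem41 itself is not reproduced in this post. The scripts here rely only on its read_chunk function, which is assumed to return the parsed text as a list of sentences, each a list of Chunk objects with srcs filled in, and on a few Morph/Chunk methods. For reference, here is a minimal sketch of that assumed interface; the attribute and method names come from the calls above, while the bodies are my assumptions about their behavior:

# -*- coding: utf-8 -*-
class Morph(object):
    """One morpheme from a CaboCha lattice line (assumed problem41 interface)."""
    def __init__(self, surface, base, pos, pos1):
        self.surface = surface  # surface form
        self.base = base        # base (dictionary) form
        self.pos = pos          # part of speech, e.g. 動詞 (verb)
        self.pos1 = pos1        # POS subcategory, e.g. 格助詞 (case particle)

class Chunk(object):
    """One dependency clause: a list of morphemes plus dependency links."""
    def __init__(self, morphs, dst):
        self.morphs = morphs  # list of Morph
        self.dst = dst        # index of the chunk this one depends on (-1 = root)
        self.srcs = []        # indices of the chunks that depend on this one

    def __str__(self):
        # surface string of the clause with punctuation (記号) removed
        return "".join(m.surface for m in self.morphs if m.pos != '記号')

    def include_pos(self, pos):
        return any(m.pos == pos for m in self.morphs)

    def morphs_of_pos(self, pos):
        return [m for m in self.morphs if m.pos == pos]

    def morphs_of_pos1(self, pos1):
        return [m for m in self.morphs if m.pos1 == pos1]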
The following commands sort the output of the above program and print it in order of frequency of appearance: first for all verbs, then restricted to する (do), 見る (see), and 与える (give).
python problem45.py | sort | uniq -c | sort -nr
python problem45.py | sort | awk '$1=="する"{print $0}' | uniq -c | sort -nr
python problem45.py | sort | awk '$1=="見る"{print $0}' | uniq -c | sort -nr
python problem45.py | sort | awk '$1=="与える"{print $0}' | uniq -c | sort -nr
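uniq -c prepends the frequency to each distinct predicate/case-pattern line, so the result of the first command looks like the excerpt below (the counts and lines here are invented purely to illustrate the format):

    562 云う	と
    431 する	を
    264 思う	と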
Modify the program from problem 45 so that each predicate and case pattern is followed by the terms (the clauses themselves that depend on the predicate), in tab-delimited format. In addition to the specification of problem 45, satisfy the following:

- A term is the word sequence of a clause that depends on the predicate (there is no need to strip the trailing particle).
- If multiple clauses depend on the predicate, list them separated by spaces, using the same criterion and order as the particles.
# -*- coding: utf-8 -*-
__author__ = 'todoroki'

import problem41
import problem45

if __name__ == "__main__":
    f = open("neko.txt.cabocha", "r")
    sentences = problem41.read_chunk(f)
    f.close()
    verbPatterns = []
    for sentence in sentences:
        verbPatterns.append(problem45.extractVerbPatern(sentence))
    for verbPattern in verbPatterns:
        for verb, src_chunks in verbPattern:
            # predicate: base form of the leftmost verb in the chunk
            col1 = verb.morphs_of_pos('動詞')[0].base
            # (case particle, term) pairs for every dependent clause, sorted by particle
            tmp = [(src_chunk.morphs_of_pos1('格助詞')[-1].base, str(src_chunk)) for src_chunk in src_chunks]
            tmp = sorted(tmp, key=lambda x: x[0])
            col2 = " ".join([col[0] for col in tmp])
            col3 = " ".join([col[1] for col in tmp])
            print "%s\t%s\t%s" % (col1, col2, col3)
We now restrict attention to the case where a verb takes a サ変接続 (sahen, i.e. verbal) noun in the ヲ case. Modify the program from problem 46 to satisfy the following specification:

- Consider only the cases where a clause consisting of "サ変接続 noun + を (particle)" depends on a verb.
- The predicate is "サ変接続 noun + を + base form of the verb"; when a clause contains several verbs, use the leftmost one.
- If multiple particles (clauses) depend on the predicate, list all the particles in lexicographic order, separated by spaces.
- If multiple clauses depend on the predicate, list all the terms separated by spaces (aligned with the order of the particles).

Save the output of this program to a file and use UNIX commands to check the following:

- Predicates (サ変接続 noun + を + verb) that appear frequently in the corpus
- Predicates and particle patterns that appear frequently in the corpus
# -*- coding: utf-8 -*-
__author__ = 'todoroki'

import problem41
import problem45

def extractSahen(src_chunks):
    """Find a "サ変接続 noun + を" chunk among the dependents; return it and the rest."""
    for i, src_chunk in enumerate(src_chunks):
        morphs = src_chunk.morphs
        if len(morphs) > 1:
            # the chunk ends with a sahen-connecting noun followed by the particle を
            if morphs[-2].pos1 == "サ変接続" and morphs[-1].pos == "助詞" and morphs[-1].base == "を":
                src_chunks.pop(i)
                return src_chunk, src_chunks
    return None

if __name__ == "__main__":
    f = open("neko.txt.cabocha", "r")
    sentences = problem41.read_chunk(f)
    f.close()
    verbPatterns = []
    for sentence in sentences:
        verbPatterns.append(problem45.extractVerbPatern(sentence))
    for verbPattern in verbPatterns:
        for verb, src_chunks in verbPattern:
            sahen_chunks_set = extractSahen(src_chunks)
            if sahen_chunks_set:
                sahen_chunk, other_chunks = sahen_chunks_set
                # predicate: "サ変接続 noun + を" + base form of the leftmost verb
                col1 = str(sahen_chunk) + verb.morphs_of_pos('動詞')[0].base
                tmp = [(other_chunk.morphs_of_pos1('格助詞')[-1].base, str(other_chunk)) for other_chunk in other_chunks]
                tmp = sorted(tmp, key=lambda x: x[0])
                col2 = " ".join([col[0] for col in tmp])
                col3 = " ".join([col[1] for col in tmp])
                print "%s\t%s\t%s" % (col1, col2, col3)
A command that outputs the predicates (サ変接続 noun + を + verb) that appear frequently in the corpus:
python problem47.py | cut -f 1 | sort | uniq -c | sort -nr
A command that outputs predicates and particle patterns that frequently appear in the corpus.
python problem47.py | cut -f 1,2 | sort | uniq -c | sort -nr
For every clause in the text that contains a noun, extract the path from that clause to the root of the syntax tree. A path on the syntax tree must satisfy the following specification:

- Each clause is represented by its (surface) morpheme sequence.
- The expressions of the clauses on the path, from the start clause to the end clause, are concatenated with " -> ".
# -*- coding: utf-8 -*-
__author__ = 'todoroki'

import problem41

def extractPath(chunk, sentence):
    """Follow dst links from the given chunk up to the root, collecting chunks on the way."""
    path = [chunk]
    dst = chunk.dst
    while dst != -1:
        path.append(sentence[dst])
        dst = sentence[dst].dst
    return path

if __name__ == "__main__":
    f = open("neko.txt.cabocha", "r")
    sentences = problem41.read_chunk(f)
    f.close()
    paths = []
    for sentence in sentences:
        for chunk in sentence:
            # start from every clause that contains a noun (and is not the root itself)
            if chunk.include_pos('名詞') and chunk.dst != -1:
                paths.append(extractPath(chunk, sentence))
    for path in paths:
        print " -> ".join([str(chunk) for chunk in path])
Extract the shortest dependency path connecting each pair of noun clauses in a sentence. When the clause numbers of the pair are i and j (i < j), the path must satisfy the following specification:

- As in problem 48, a path is expressed by concatenating the expressions (surface morpheme sequences) of the clauses from the start clause to the end clause with " -> ".
- The noun phrases in clauses i and j are replaced with X and Y, respectively.

The shape of a dependency path falls into one of two cases:

- If clause j is on the path from clause i to the root of the syntax tree: show the path from clause i to clause j.
- Otherwise, the paths from clause i and clause j to the root meet at a common clause k: show the path from clause i to just before clause k, the path from clause j to just before clause k, and the contents of clause k, concatenated with " | ".
# -*- coding: utf-8 -*-
__author__ = 'todoroki'

from collections import namedtuple
from itertools import combinations

import problem41

def extractPathIndex(i_chunk, sentence):
    """Return the chunk indices on the path from the given chunk to the root."""
    i, chunk = i_chunk
    path_index = [i]
    dst = chunk.dst
    while dst != -1:
        path_index.append(dst)
        dst = sentence[dst].dst
    return path_index

def posReplace(chunks, pos, repl, k=1):
    """Replace the first k morphemes of the given POS in the first chunk with repl."""
    replaced_str = ""
    for morph in chunks[0].morphs:
        if morph.pos == pos and k > 0:
            replaced_str += repl
            k -= 1
        else:
            if morph.pos != '記号':  # drop punctuation
                replaced_str += morph.surface
    return [replaced_str] + [str(chunk) for chunk in chunks[1:]]

if __name__ == "__main__":
    f = open("neko.txt.cabocha", "r")
    sentences = problem41.read_chunk(f)
    f.close()
    paths = []
    N2Npath = namedtuple('N2Npath', ['X', 'Y', 'is_linear'])
    for sentence in sentences:
        noun_chunks = [(i, chunk) for i, chunk in enumerate(sentence) if chunk.include_pos('名詞')]
        if len(noun_chunks) > 1:
            for former, latter in combinations(noun_chunks, 2):
                f_index = extractPathIndex(former, sentence)
                l_index = extractPathIndex(latter, sentence)
                # deepest chunk shared by both paths to the root (the common clause k);
                # both paths end at the root, so walk the reversed lists while they agree
                common = None
                for f_i, l_i in zip(reversed(f_index), reversed(l_index)):
                    if f_i != l_i:
                        break
                    common = f_i
                if common is None:  # no shared root (e.g. a broken parse); skip the pair
                    continue
                # linear case: clause j itself lies on the path from clause i to the root
                linear_flag = (common == l_index[0])
                if linear_flag:
                    f_index2 = f_index[:f_index.index(common) + 1]  # path i .. j
                    l_index2 = [common]
                else:
                    f_index2 = f_index[:f_index.index(common) + 1]  # path i .. k
                    l_index2 = l_index[:l_index.index(common) + 1]  # path j .. k
                X = [sentence[k] for k in f_index2]
                Y = [sentence[k] for k in l_index2]
                paths.append(N2Npath(X=X, Y=Y, is_linear=linear_flag))
    for path in paths:
        x = posReplace(path.X, '名詞', 'X')
        y = posReplace(path.Y, '名詞', 'Y')
        if path.is_linear:
            x[-1] = "Y"  # the whole final clause (clause j) is rendered as Y
            print " -> ".join(x)
        else:
            print "%s | %s | %s" % (" -> ".join(x[:-1]), " -> ".join(y[:-1]), path.X[-1])