This article is a sequel to my series Introduction to Python with 100 Knocks of Language Processing. Using Chapter 5 of the 100 knocks, I will mainly explain classes.
First, let's install CaboCha and parse Aozora Bunko's "I Am a Cat".
!cabocha -f1 < neko.txt > neko.txt.cabocha
!head -n15 neko.txt.cabocha
* 0 -1D 0/0 0.000000
one	noun, number, *, *, *, *, one, ichi, ichi
EOS
EOS
* 0 2D 0/0 -0.764522
　	symbol, blank, *, *, *, *, 　, 　, 　
* 1 2D 0/1 -0.764522
I	noun, pronoun, general, *, *, *, I, wagahai, wagahai
wa	particle, binding particle, *, *, *, *, wa, ha, wa
* 2 -1D 0/2 0.000000
cat	noun, general, *, *, *, *, cat, neko, neko
de	auxiliary verb, *, *, *, special/da, continuative form, da, de, de
aru	auxiliary verb, *, *, *, godan/ra-row aru, basic form, aru, aru, aru
.	symbol, kuten, *, *, *, *, ., ., .
EOS
Lines of the form `* clause-number destination-clause-number D ...` have been added to the morpheme lines.
In the first place, what is object-oriented programming? Going down that road leads to controversy, so I will focus on Python classes. It may be easier to follow if you know a little Java.
In terms of this chapter's problems, if you try to handle the hierarchical structure of sentence -> clause -> morpheme without classes, the variables and functions for sentences, clauses, and morphemes get mixed together and things become a mess, and the nested for statements become confusing. Classes let you group functions by the type of their argument and narrow the scope of variables, which makes coding a little easier.
Let's look at an example with the date type `datetime.date`.
import datetime

# Instantiation
a_day = datetime.date(2022, 2, 22)
print(a_day)
print(type(a_day))
# Instance variables
print(a_day.year, a_day.month, a_day.day)
# Class variables
print(datetime.date.max, a_day.max)
# Instance method
print(a_day.weekday(), datetime.date.weekday(a_day))
# Class method
print(datetime.date.today())
2022-02-22
<class 'datetime.date'>
2022 2 22
9999-12-31 9999-12-31
1 1
2020-05-06
`datetime.date()` creates an entity (instance) of type `datetime.date` and sets (initializes) its values based on the date passed as arguments.

The instance `a_day` has values specific to it (attributes) representing the year, month, and day; these are called instance variables. On the other hand, a value common to all instances of the type `datetime.date` is called a class variable. In this example, the maximum representable date is a class variable. Functions that require an instance are called instance methods, and functions that do not are called class methods.

Note that `a_day.weekday()` and `datetime.date.weekday(a_day)` are equivalent for instance methods; you can think of Python as converting the former into the latter before executing it. By the way, the return value 0 means Monday and 1 means Tuesday.
A *class* is the blueprint of a data type. Let's actually define a class. A class that imitates a certain famous monster is probably the easiest to understand.
class Nezumi:
    # Class variables
    types = ('Denki',)
    learnable_moves = {'Denkou Sekka', 'Kaminari', 'Denji'}
    base_hp = 35

    # Initialization method
    def __init__(self, name, level):
        # Instance variables
        self.name = name
        self.level = level
        self.learned_moves = []

    # Instance method
    def learn(self, move):
        if move in self.learnable_moves:
            self.learned_moves.append(move)
            print(f'{self.name} learned {move}!')

    # Class method
    @classmethod
    def hatch(cls):
        nezumi = cls('Nezumi', 1)
        nezumi.learned_moves.append('Denkou Sekka')
        print('A Nezumi hatched from the egg!')
        return nezumi
# Instantiation
reo = Nezumi('Leo', 44)
# Check the member variables
print(reo.name, reo.level, reo.learned_moves, reo.types)
# Instance method call
reo.learn('Kaminari')
Leo 44 [] ('Denki',)
Leo learned Kaminari!
# Class method call
nezu = Nezumi.hatch()
# Check the instance variables
print(nezu.name, nezu.level, nezu.learned_moves)
A Nezumi hatched from the egg!
Nezumi 1 ['Denkou Sekka']
`__init__()` is the method that initializes instance variables. It is called the initialization method or the initializer. Its first argument, `self`, represents the instance, and anything defined inside a method in the form `self.variable_name` becomes an instance variable.
When you call the defined `Nezumi` class as `Nezumi()`, an instance is created internally (like `new` in other languages), and then `__init__()` is executed on the created instance `self`.
The instance `reo` is assigned to the first argument `self` of an instance method. A call like `reo.learn('Kaminari')` executes `Nezumi.learn(reo, 'Kaminari')`. That is why `self` is needed.
Class methods are defined with the *decorator* `@classmethod`. A typical (orthodox) use of a class method is to describe a special initialization procedure. The class object `Nezumi` is assigned to the first argument `cls`, so `cls('Nezumi', 1)` is the same as `Nezumi('Nezumi', 1)`.
By the way, in the example above the instance variables were checked one by one, but the built-in function `vars()` returns them all at once in `dict` format.
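For example, this one-liner (assuming the `reo` instance from the example above) shows them all at once:

```python
print(vars(reo))
# -> {'name': 'Leo', 'level': 44, 'learned_moves': ['Kaminari']}
```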
A method that is invoked automatically when a particular operation is performed on an object is called a special method. `__init__()` is one of them. There are [many special methods](https://docs.python.org/ja/3/reference/datamodel.html#special-method-names), so I won't go into detail, but here are a few:
- `__str__()`: defines the string displayed when the object is passed to `print()` etc.
- `__repr__()`: defines the string displayed for debugging. You can see it by evaluating the object in interactive mode or passing it to `repr()`.
- `__len__()`: defines the return value when the object is passed to `len()`.
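As a minimal sketch, here is a hypothetical `Team` class (not part of the knocks) defining these three methods:

```python
class Team:
    def __init__(self, members):
        self.members = members

    def __str__(self):
        return 'Team of {}'.format(', '.join(self.members))

    def __repr__(self):
        return 'Team({!r})'.format(self.members)

    def __len__(self):
        return len(self.members)

t = Team(['Leo', 'Nezu'])
print(t)        # Team of Leo, Nezu      <- __str__()
print(repr(t))  # Team(['Leo', 'Nezu'])  <- __repr__()
print(len(t))   # 2                      <- __len__()
```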
By the way, Python has no private variables like some other languages. The convention is to prefix a name intended as non-public with a single underscore, like `_spam`.
Inheritance itself does not take long to describe, and it sometimes appears in deep-learning code. However, it is unnecessary for this chapter, and it comes with many accompanying topics (how to use it, `super()`, namespaces, static methods, etc.), so I will omit it. ~~I had even brought an example that seemed good for explaining inheritance.~~
Python 3.7 added the `dataclasses` module, which makes it easy to define classes for holding data. It defines `__init__()` and `__repr__()` automatically. I feel it could be used for this chapter's problems too, but I omit it because it would also mean learning type annotations along the way.
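For reference, a minimal sketch of `dataclasses` with a hypothetical `Point` class (not part of the knocks):

```python
from dataclasses import dataclass

@dataclass
class Point:
    x: int
    y: int

p = Point(1, 2)
print(p)                 # Point(x=1, y=2)  <- the generated __repr__()
print(p == Point(1, 2))  # True             <- __eq__() is also generated
```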
Implement the class `Morph` that represents a morpheme. This class has the surface form (`surface`), base form (`base`), part of speech (`pos`), and part-of-speech subdivision 1 (`pos1`) as member variables. In addition, read the CaboCha analysis result (neko.txt.cabocha), represent each sentence as a list of `Morph` objects, and display the morpheme sequence of the third sentence.
Below is an example of the answer.
q40.py
import argparse
from itertools import groupby
import sys


class Morph:
    """Read a morpheme line from the cabocha lattice format file"""
    __slots__ = ('surface', 'pos', 'pos1', 'base')

    # Example line: I	noun,pronoun,general,*,*,*,I,wagahai,wagahai
    def __init__(self, line):
        self.surface, temp = line.rstrip().split('\t')
        info = temp.split(',')
        self.pos = info[0]
        self.pos1 = info[1]
        self.base = info[6]

    @classmethod
    def load_cabocha(cls, fi):
        """Generate lists of Morph instances from a cabocha lattice format file"""
        for is_eos, sentence in groupby(fi, key=lambda x: x == 'EOS\n'):
            if not is_eos:
                yield [cls(line) for line in sentence if not line.startswith('* ')]

    def __str__(self):
        return self.surface

    def __repr__(self):
        return 'q40.Morph({})'.format(', '.join((self.surface, self.pos, self.pos1, self.base)))


def main():
    sent_idx = arg_int()
    for i, sent_lis in enumerate(Morph.load_cabocha(sys.stdin), start=1):
        if i == sent_idx:
            # print(*sent_lis)
            print(repr(sent_lis))
            break


def arg_int():
    parser = argparse.ArgumentParser()
    parser.add_argument('-n', '--number', default='1', type=int)
    args = parser.parse_args()
    return args.number


if __name__ == '__main__':
    main()
!python q40.py -n2 < neko.txt.cabocha
[q40.Morph(　, symbol, blank, 　), q40.Morph(I, noun, pronoun, I), q40.Morph(wa, particle, binding particle, wa), q40.Morph(cat, noun, general, cat), q40.Morph(de, auxiliary verb, *, da), q40.Morph(aru, auxiliary verb, *, aru), q40.Morph(., symbol, punctuation, .)]
Passing the instance-variable names to the special class variable `__slots__` saves memory and speeds up attribute lookup. In exchange, you can no longer add new instance variables from outside, and you can no longer get the list of instance variables with `vars()`.
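Here is a minimal sketch of that trade-off, using a hypothetical `P` class:

```python
class P:
    __slots__ = ('x',)

    def __init__(self, x):
        self.x = x

p = P(1)
p.x = 2      # fine: 'x' is declared in __slots__
# p.y = 3    # AttributeError: 'P' object has no attribute 'y'
# vars(p)    # TypeError: vars() argument must have __dict__ attribute
```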
The design is such that a `Morph` cannot be instantiated without passing a line of morpheme information. Whether you like that is a matter of taste.
A string literal written at the beginning of a function definition (or class, module, etc.) is treated as a *docstring*. String literals can span multiple lines with triple quotes. Docstrings can be viewed with the `help()` function, in Jupyter, and in editors. They may also be used together with the doctest and pydoc modules.
`groupby()` lets you write elegant code for files where the end of a sentence is marked by `EOS`. On the other hand, in this chapter's data there are runs like `EOS\nEOS` caused by blank lines, so be careful: the sentence numbering in the problem statement differs from the numbering you get with `groupby()`.
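Here is a minimal sketch of this behavior with toy input lines (not the real file):

```python
from itertools import groupby

lines = ['a\n', 'b\n', 'EOS\n', 'EOS\n', 'c\n', 'EOS\n']
for is_eos, group in groupby(lines, key=lambda x: x == 'EOS\n'):
    print(is_eos, list(group))
# False ['a\n', 'b\n']
# True ['EOS\n', 'EOS\n']  <- consecutive EOS lines form one group,
# False ['c\n']            #    so empty "sentences" are never yielded
# True ['EOS\n']
```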
In addition to problem 40, implement the clause class `Chunk`. This class has a list of morphemes (`Morph` objects) (`morphs`), the index number of the destination clause (`dst`), and a list of index numbers of the source clauses (`srcs`) as member variables. In addition, read the CaboCha analysis result of the input text, represent one sentence as a list of `Chunk` objects, and display the strings of the clauses and their destinations for the eighth sentence. For the rest of the problems in Chapter 5, use the program created here.
In the spirit of object orientation, since there is a has-a relationship sentence -> clause -> morpheme, it is tempting to also create a `Sentence` class. Admittedly, even if you create one, its only instance variable is `self.chunks`, so it is not essential from the standpoint of variable management.
However, `srcs` cannot be filled in until the analysis result of a whole sentence has been read, so it feels natural to do that when initializing a `Sentence` object. I also think it is an advantage to be able to separate document-level processing (reading the CaboCha file, reading the analysis result of the n-th sentence, receiving n as a command-line argument) from sentence-level processing (the subsequent problems, especially those asked in 43 and later).
Below is an example answer, but it includes the answers to the subsequent problems as well. Each method's purpose is noted in its *docstring*, so skip the code unrelated to problem 41 and refer back to it later.
q41.py
from itertools import groupby, combinations
import sys

from q40 import Morph, arg_int


class Chunk:
    """Read a clause from the cabocha lattice format file"""
    __slots__ = ('idx', 'dst', 'morphs', 'srcs')

    # Example line: * 0 2D 0/0 -0.764522
    def __init__(self, line):
        info = line.rstrip().split()
        self.idx = int(info[1])
        self.dst = int(info[2].rstrip("D"))
        self.morphs = []
        self.srcs = []

    def __str__(self):
        return ''.join([morph.surface for morph in self.morphs])

    def __repr__(self):
        return 'q41.Chunk({}, {})'.format(self.idx, self.dst)

    def srcs_append(self, src_idx):
        """Add a source clause index. Used in Sentence.__init__()."""
        self.srcs.append(src_idx)

    def morphs_append(self, line):
        """Add a morpheme. Used in Sentence.__init__()."""
        self.morphs.append(Morph(line))

    def tostr(self):
        """Return the surface form of the clause with symbols removed. Used in q42 and later."""
        return ''.join([morph.surface for morph in self.morphs if morph.pos != 'symbol'])

    def contain_pos(self, pos):
        """Return whether the given part of speech appears in the clause. Used in q43 and later."""
        return pos in (morph.pos for morph in self.morphs)

    def replace_np(self, symbol):
        """Replace the noun phrase in the clause with a symbol. For q49."""
        morph_lis = []
        for pos, morphs in groupby(self.morphs, key=lambda x: x.pos):
            if pos == 'noun':
                # replace a run of nouns with a single symbol
                for morph in morphs:
                    morph_lis.append(symbol)
                    break
            elif pos != 'symbol':
                for morph in morphs:
                    morph_lis.append(morph.surface)
        return ''.join(morph_lis)


class Sentence:
    """Read a sentence from the cabocha lattice format file."""
    __slots__ = ('chunks', 'idx')

    def __init__(self, sent_lines):
        self.chunks = []
        for line in sent_lines:
            if line.startswith('* '):
                self.chunks.append(Chunk(line))
            else:
                self.chunks[-1].morphs_append(line)
        for chunk in self.chunks:
            if chunk.dst != -1:
                self.chunks[chunk.dst].srcs_append(chunk.idx)

    def __str__(self):
        return ' '.join([morph.surface for chunk in self.chunks for morph in chunk.morphs])

    @classmethod
    def load_cabocha(cls, fi):
        """Generate Sentence instances from a cabocha lattice format file"""
        for is_eos, sentence in groupby(fi, key=lambda x: x == 'EOS\n'):
            if not is_eos:
                yield cls(sentence)

    def print_dep_idx(self):
        """q41. Display each clause's index and the index of its destination clause"""
        for chunk in self.chunks:
            print('{}:{} => {}'.format(chunk.idx, chunk, chunk.dst))

    def print_dep(self):
        """q42. Display the surface forms of source and destination clauses, tab-separated"""
        for chunk in self.chunks:
            if chunk.dst != -1:
                print('{}\t{}'.format(chunk.tostr(), self.chunks[chunk.dst].tostr()))

    def print_noun_verb_dep(self):
        """q43. Extract clauses containing nouns that modify clauses containing verbs"""
        for chunk in self.chunks:
            if chunk.contain_pos('noun') and self.chunks[chunk.dst].contain_pos('verb'):
                print('{}\t{}'.format(chunk.tostr(), self.chunks[chunk.dst].tostr()))

    def dep_edge(self):
        """Edges for drawing the dependency tree with pydot in q44"""
        return [(f"{i}: {chunk.tostr()}", f"{chunk.dst}: {self.chunks[chunk.dst].tostr()}")
                for i, chunk in enumerate(self.chunks) if chunk.dst != -1]

    def case_pattern(self):
        """q45. Extract verb case patterns"""
        for chunk in self.chunks:
            for morph in chunk.morphs:
                if morph.pos == 'verb':
                    verb = morph.base
                    particles = []  # list of particles
                    for src in chunk.srcs:
                        # add the rightmost particle of each source clause
                        particles.extend([word.base for word in self.chunks[src].morphs
                                          if word.pos == 'particle'][-1:])
                    particles.sort()
                    print('{}\t{}'.format(verb, ' '.join(particles)))
                    # only the leftmost verb is used, so leave the loop early
                    break

    def pred_case_arg(self):
        """q46. Extract verb case frame information"""
        for chunk in self.chunks:
            for morph in chunk.morphs:
                if morph.pos == 'verb':
                    verb = morph.base
                    particle_chunks = []
                    for src in chunk.srcs:
                        # (particle, surface form of the source clause)
                        particle_chunks.extend([(word.base, self.chunks[src].tostr())
                                                for word in self.chunks[src].morphs
                                                if word.pos == 'particle'][-1:])
                    if particle_chunks:
                        particle_chunks.sort()
                        particles, chunks = zip(*particle_chunks)
                    else:
                        particles, chunks = [], []
                    print('{}\t{}\t{}'.format(verb, ' '.join(particles), ' '.join(chunks)))
                    break

    def sahen_case_arg(self):
        """q47. Mining of functional verb constructions"""
        # state flag for extracting "sahen noun + wo + verb"
        sahen_flag = 0
        for chunk in self.chunks:
            for morph in chunk.morphs:
                if sahen_flag == 0 and morph.pos1 == 'sahen connection':
                    sahen_flag = 1
                    sahen = morph.surface
                elif sahen_flag == 1 and morph.base == 'wo' and morph.pos == 'particle':
                    sahen_flag = 2
                elif sahen_flag == 2 and morph.pos == 'verb':
                    sahen_wo = sahen + 'wo'
                    verb = morph.base
                    particle_chunks = []
                    for src in chunk.srcs:
                        # (particle, surface form of the source clause)
                        particle_chunks.extend([(word.base, self.chunks[src].tostr())
                                                for word in self.chunks[src].morphs
                                                if word.pos == 'particle'][-1:])
                    # remove the "sahen noun + wo" clause itself from the arguments
                    for j, part_chunk in enumerate(particle_chunks[:]):
                        if sahen_wo in part_chunk[1]:
                            del particle_chunks[j]
                    if particle_chunks:
                        particle_chunks.sort()
                        particles, chunks = zip(*particle_chunks)
                    else:
                        particles, chunks = [], []
                    print('{}\t{}\t{}'.format(sahen_wo + verb, ' '.join(particles), ' '.join(chunks)))
                    sahen_flag = 0
                    break
                else:
                    sahen_flag = 0

    def trace_dep_path(self):
        """q48. Trace dependency paths from clauses containing nouns to the root"""
        path = []
        for chunk in self.chunks:
            if chunk.contain_pos('noun'):
                path.append(chunk)
                d = chunk.dst
                while d != -1:
                    path.append(self.chunks[d])
                    d = self.chunks[d].dst
                yield path
                path = []

    def print_noun2noun_path(self):
        """q49. Extract dependency paths between nouns"""
        # list of Chunk lists, one path from each noun clause to the root (one sentence)
        all_paths = list(self.trace_dep_path())
        arrow = ' -> '
        # a set of clause ids for each path
        all_paths_set = [{chunk.idx for chunk in chunks} for chunks in all_paths]
        # choose a pair of paths from all_paths
        for p1, p2 in combinations(range(len(all_paths)), 2):
            # find the common clause k
            intersec = all_paths_set[p1] & all_paths_set[p2]
            len_intersec = len(intersec)
            len_smaller = min(len(all_paths_set[p1]), len(all_paths_set[p2]))
            # the intersection is non-empty and neither path is a subset of the other
            if 0 < len_intersec < len_smaller:
                # display the path
                k = min(intersec)
                path1_lis = []
                path1_lis.append(all_paths[p1][0].replace_np('X'))
                for chunk in all_paths[p1][1:]:
                    if chunk.idx < k:
                        path1_lis.append(chunk.tostr())
                    else:
                        break
                path2_lis = []
                rest_lis = []
                path2_lis.append(all_paths[p2][0].replace_np('Y'))
                for chunk in all_paths[p2][1:]:
                    if chunk.idx < k:
                        path2_lis.append(chunk.tostr())
                    else:
                        rest_lis.append(chunk.tostr())
                print(' | '.join([arrow.join(path1_lis), arrow.join(path2_lis),
                                  arrow.join(rest_lis)]))
        # find and display patterns where a noun clause leads to another noun clause
        for chunks in all_paths:
            for j in range(1, len(chunks)):
                if chunks[j].contain_pos('noun'):
                    outstr = []
                    outstr.append(chunks[0].replace_np('X'))
                    outstr.extend(chunk.tostr() for chunk in chunks[1:j])
                    outstr.append(chunks[j].replace_np('Y'))
                    print(arrow.join(outstr))


def main():
    sent_id = arg_int()
    for i, sent in enumerate(Sentence.load_cabocha(sys.stdin), start=1):
        if i == sent_id:
            sent.print_dep_idx()
            break


if __name__ == '__main__':
    main()
!python q41.py -n8 < neko.txt.cabocha
0: This => 1
1: a student is => 7
2: sometimes => 4
3: we => 4
4: catch => 5
5: boil => 6
6: eat => 7
7: it's a story. => -1
Since the aim of this chapter is to learn class definitions, I will skip the explanations for the remaining problems.

The next part is a slightly advanced topic, so feel free to skip it.
What bothers me about the `Sentence` class code is that expressions such as `for chunk in self.chunks` and `self.chunks[i]` appear so frequently. If you define the following special methods, you can instead write `for chunk in self` and `self[i]`.
    def __iter__(self):
        return iter(self.chunks)

    def __getitem__(self, key):
        return self.chunks[key]
Actually, when a list is iterated with a for statement, it is internally converted to an iterator by the `iter()` function, and `iter()` calls the object's `__iter__()` method. So with this definition, you can iterate over a `Sentence` instance itself with a for statement. Index access becomes possible by defining `__getitem__()`. Furthermore, if you define the `Chunk` class in the same way, you can run the following for statement over the instances.
for chunk in sentence:
    for morph in chunk:
        ...
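Roughly, the outer for statement corresponds to the following sketch of the iterator protocol (assuming a `sentence` instance with the `__iter__()` defined above):

```python
it = iter(sentence)        # calls sentence.__iter__()
while True:
    try:
        chunk = next(it)   # calls it.__next__()
    except StopIteration:  # raised when the iterator is exhausted
        break
    print(chunk)
```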
Perhaps the Python wrappers for dependency parsers out in the wild are built like this too. Also, I think it would be more natural to define the q41 and q42 methods in the `Chunk` class, write the above for statement on the outside, and call them inside it.
Problems 42 and 43 are omitted because there is nothing special to say about them.
Visualize the dependency tree of a given sentence as a directed graph. For visualization, convert the dependency tree to DOT language and use Graphviz. Also, to visualize directed graphs directly from Python, use pydot.
Install Graphviz and pydot as best you can. Many people say you should use pydot-ng because pydot went unmaintained for a long time, but pydot worked in my environment, so I use it. I have only ever used pydot for these 100 knocks, so I will not explain it in detail. For the implementation, I referred to this blog.
q44.py
import sys

from q40 import arg_int
from q41 import Sentence
import pydot


def main():
    sent_id = arg_int()
    for i, sent in enumerate(Sentence.load_cabocha(sys.stdin), start=1):
        if i == sent_id:
            edges = sent.dep_edge()
            # default node settings (a Japanese-capable font) applied to all nodes
            n = pydot.Node('node', fontname="MS Gothic", fontsize="9")
            graph = pydot.graph_from_edges(edges, directed=True)
            graph.add_node(n)
            graph.write_jpeg(f"dep_tree_neko{i}.jpg")
            break


if __name__ == "__main__":
    main()
!python q44.py -n8 < neko.txt.cabocha
from IPython.display import Image
Image("dep_tree_neko8.jpg")
There is also a style that reverses the direction of the dependency arrows (especially one based on Universal Dependencies), but I think either is fine for this problem.
(Due to the specification of `graph_from_edges()`, if a sentence contains multiple clauses with the same surface form (but different ids), they are treated as the same node. To avoid that, you would need to change the implementation of `graph_from_edges()` or attach an id to the surface form.)

(The font is MS Gothic because my environment is Ubuntu on WSL and no Japanese font could be found, so I used the trick of referencing the Windows fonts.)
I would like to treat the text used here as a corpus and investigate the cases that Japanese predicates can take. Think of a verb as a predicate and the particles of the clauses related to the verb as cases, and output the predicates and cases in tab-delimited format. However, make sure the output meets the following specifications.

- In a clause containing a verb, use the base form of the leftmost verb as the predicate.
- The cases are the particles related to the predicate.
- If there are multiple particles (clauses) related to the predicate, arrange all the particles in lexicographic order, separated by spaces.

Consider the example sentence "I saw a human being for the first time here" (the 8th sentence of neko.txt.cabocha). This sentence contains two verbs, "begin" and "see"; the clause related to "begin" is analyzed as "here", and the clauses related to "see" are analyzed as "I" and "thing". In this case, the program should produce the following output.

begin	de
see	ha wo

Save the output of this program to a file and check the following items using UNIX commands.

- Combinations of predicates and case patterns that appear frequently in the corpus
- The case patterns of the verbs "do", "see", and "give" (arranged in order of frequency of appearance in the corpus)
If you really wanted to investigate grammatical case, you should focus on case particles, but I reluctantly follow the problem statement. The particles are arranged in lexicographic order. The problem statement does not say whether to output predicates that have no related particle, but since those are cases where the case element is omitted, I decided to aggregate them too.
Python's `sorted()` and `list.sort()` can sort strings as well, but the order is the lexicographic order of code points, which differs from the a-i-u-e-o order used in Japanese dictionaries. Keep that in mind.
>>> sorted(['Science', 'Scarecrow'])
['Scarecrow', 'Science']
The example solution above uses `extend()`, a method that concatenates one list onto another, with the same result as `+=`. Be careful not to confuse `list.append()` with `list.extend()`.
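A minimal comparison:

```python
a = [1, 2]
a.append([3, 4])  # adds the list itself as a single element
print(a)          # [1, 2, [3, 4]]

b = [1, 2]
b.extend([3, 4])  # concatenates the elements, same result as b += [3, 4]
print(b)          # [1, 2, 3, 4]
```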
!python q45.py < neko.txt.cabocha | head
Be born
Tsukuka
To do
By crying
To do
At the beginning
To see
Listen
To catch
Boil
!python q45.py < neko.txt.cabocha | sort | uniq -c | sort -rn | head -20
704
452
435
333 I think
202 To become
199
188
175 Look
159
122 say
117
113
108
98 see
97 When you see
94
90
89
85
80 see
!python q45.py < neko.txt.cabocha | grep -E "^(To do|to see|give)\s" | sort | uniq -c | sort -nr | head -20
452
435
188
175 Look
159
117
113
98 see
90
85
80 see
61
60
60
51
51 from
46
40
39 What is
37
Since the example answer for problem 46 uses an expression like `if list:`, let me explain truth-value testing. You can put any object, not just a comparison, in an if statement's condition; what happens then is explained in the [documentation](https://docs.python.org/ja/3/library/stdtypes.html#truth), so please refer to it. In short, `None`, zero-like values, and any `x` with `len(x) == 0` are all treated as False, and everything else is treated as True. Knowing this makes it easy to write conditions like "when the list is not empty". If you are unsure, try applying the built-in function `bool()` to various objects.
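For example:

```python
print(bool(None), bool(0), bool(''), bool([]), bool({}))  # all False
print(bool(-1), bool('a'), bool([0]))                     # all True

particles = []
if particles:  # for a list, equivalent to len(particles) > 0
    print(' '.join(particles))
else:
    print('(no particles)')
```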
There is nothing else to say about problem 46, so it is omitted.
I would like to focus only on cases where the wo-case of a verb contains a sahen-connection noun. Modify the program of problem 46 to meet the following specifications.

- Target only cases where a clause consisting of "sahen-connection noun + wo (particle)" relates to a verb.
- The predicate is "sahen-connection noun + wo + base form of the verb"; when there are multiple verbs in the clause, use the leftmost one.
- If there are multiple particles (clauses) related to the predicate, arrange all the particles in lexicographic order, separated by spaces.
- If there are multiple clauses related to the predicate, arrange all their strings separated by spaces (aligned with the order of the particles).

For example, the following output should be obtained from the sentence "The master will reply to the letter, even if it comes to another place."

When you reply, the owner says

Save the output of this program to a file and check the following items using UNIX commands.

- Predicates (sahen-connection noun + wo + verb) that appear frequently in the corpus
- Predicates and particle patterns that appear frequently in the corpus
A sahen-connection noun is a noun, such as "reply", that becomes a sahen (irregular) verb when "suru" is appended. Incidentally, in school grammar "reply-suru" is one word, but in morphological analysis it is usually split into "reply / suru".
Also, please forgive the awful code in the example answer.
!python q47.py < neko.txt.cabocha | cut -f1 | sort | uniq -c | sort -nr | head
30 reply
21 Say hello
14 imitate
13 talk
13 quarrel
6 Take a nap
5 Exercise
5 Ask a question
5 Ask a question
5 Listen to the story
!python q47.py < neko.txt.cabocha | cut -f 1,2 | sort | uniq -c | sort -nr | head
8 imitate
6 When you reply
6 quarrel
4 exercise
4 To reply
4 Reply
4 Listen to the story
4 When you say hello
4 I'll say hello
3 To ask a question
For every clause containing a noun in the sentence, extract the path from that clause to the root of the syntax tree. However, a path on the syntax tree shall satisfy the following specifications.

- Each clause is represented by its (surface) morpheme sequence.
- Concatenate the representations of the clauses, from the start clause to the end clause of the path, with `" -> "`.
From the sentence "I saw a human being for the first time here" (8th sentence of neko.txt.cabocha), the following output should be obtained.
I am -> saw
here -> Start with -> Human -> Things -> saw
Human -> Things -> saw
Things -> saw
All you have to do is follow the destinations with a while loop or recursion. With the next problem in mind, I made the method do only the processing up to one step before the final output.
q48.py
import sys

from q40 import arg_int
from q41 import Sentence


def main():
    sent_id = arg_int()
    for i, sent in enumerate(Sentence.load_cabocha(sys.stdin), start=1):
        if i == sent_id:
            for chunks in sent.trace_dep_path():
                print(' -> '.join([chunk.tostr() for chunk in chunks]))
            break


if __name__ == '__main__':
    main()
!python q48.py -n6 < neko.txt.cabocha
I am -> saw
Here -> for the first time -> human -> something -> saw
human -> something -> saw
something -> saw
(Added 2020/5/18) A comment made me notice that the dependency-analysis example in the problem statement is actually a parsing error. According to the issue, the example sentence is planned to be revised. The main purpose of Chapter 5 of the 100 knocks is practice in class definition, and I think CaboCha is specified simply because it is easy to use. On the other hand, rewording the problems so that a wider range of dependency parsers can be used is also being considered. GiNZA is also easy to use these days.
Extract the shortest dependency path that connects each pair of noun phrases in a sentence. When the clause numbers of the noun-phrase pair are i and j (i < j), the dependency path shall satisfy the following specifications.

- As in problem 48, a path is expressed by concatenating the representations (surface morpheme sequences) of the clauses from the start clause to the end clause with "->".
- Replace the noun phrases in clauses i and j with X and Y, respectively.

In addition, the shape of a dependency path can be one of the following two types.

- If clause j exists on the path from clause i to the root of the syntax tree: display the path from clause i to clause j.
- Otherwise, when clause i and clause j meet at a common clause k on the path from each to the root of the syntax tree: display the path from clause i to just before clause k, the path from clause j to just before clause k, and the contents of clause k, concatenated with "|".
For example, from the sentence "I saw a human being for the first time here" (8th sentence of neko.txt.cabocha), the following output should be obtained.
X is | In Y -> Start with -> Human -> Things | saw
X is | Called Y -> Things | saw
X is | Y | saw
In X -> Start with -> Y
In X -> Start with -> Human -> Y
Called X -> Y
By the way, this is the most difficult problem in the 2020 edition of the 100 knocks. First of all, the problem statement is very hard to understand, and even after you understand it, it is still difficult. For now, let's just work out what the problem statement means. In short, the task is to convert the output of problem 48 into something like the output above.
First, the bottom three lines of the output example correspond to "if clause j exists on the path from clause i to the root of the syntax tree: display the path from clause i to clause j". Fortunately, this part seems easy to implement. Since the paths in problem 48 all start at a clause containing a noun, the first clause of each of the four paths is a candidate for i. All that remains is to search for j from there, and finally replace the noun phrases of i and j with X and Y (I think the last line should be `X -> Y` rather than `Called X -> Y`, but let's not quibble).
The part that looks hard is the "otherwise" case. It means: given paths like "I am -> saw" and "Human -> Things -> saw", combine them into "I am | Human -> Things | saw", and then replace the noun phrases of clauses i and j to get "X is | Called Y -> Things | saw".
Since problem 48 yields four paths, we need to choose two of them at a time and run the loop 4C2 = 6 times. That is painful if you don't know `itertools.combinations()`. Then, unless you exclude pairs that are in a subset relationship, such as "Human -> Things -> saw" and "Things -> saw", things go wrong; this handling is a nuisance. Also, for clauses i and j, the first clauses of the two paths are the candidates, but finding the common clause k from there, and then building the output string once it is found, are both tedious. ~~This is neither an introduction to natural language processing nor an introduction to Python, so I don't think you need to force yourself to solve it.~~
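For reference, a minimal sketch of generating the 4C2 pairs with stand-in paths (not the real `Chunk` lists):

```python
from itertools import combinations

paths = ['path0', 'path1', 'path2', 'path3']  # stand-ins for the four q48 paths
for p1, p2 in combinations(range(len(paths)), 2):
    print(p1, p2)  # 4C2 = 6 pairs: (0,1) (0,2) (0,3) (1,2) (1,3) (2,3)
```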
- Classes

And with that, the Introduction to Python with 100 Knocks of Language Processing series is complete, perhaps.