[PYTHON] [Language processing 100 knocks 2020] Chapter 5: Dependency analysis

Introduction

The 2020 edition of 100 Language Processing Knock, a well-known collection of natural language processing problems, has been released. This article summarizes my solutions to Chapter 5: Dependency Analysis, one of the ten chapters listed below.

- Chapter 1: Preparatory Movement
- Chapter 2: UNIX Commands
- Chapter 3: Regular Expressions
- Chapter 4: Morphological Analysis
- Chapter 5: Dependency Analysis
- Chapter 6: Machine Learning
- Chapter 7: Word Vector
- Chapter 8: Neural Net
- Chapter 9: RNN, CNN
- Chapter 10: Machine Translation

Advance preparation

We use Google Colaboratory for the answers. For details on how to set up and use Google Colaboratory, see this article. The notebook with the execution results of the answers below is available on GitHub.

Chapter 5: Dependency Analysis

Use CaboCha to parse the text of Natsume Soseki's novel "I Am a Cat" (neko.txt) and save the result in a file called neko.txt.cabocha. Use this file to implement programs that address the following questions.

First, download the specified data. If you run the following command in a Google Colaboratory cell, the target file is downloaded to the current directory.

!wget https://nlp100.github.io/data/neko.txt

Next, install CaboCha, along with MeCab and CRF++, which CaboCha requires.

#Install MeCab
!apt install mecab libmecab-dev mecab-ipadic-utf8
# Download, extract, and install the CRF++ source files
FILE_ID = "0B4y35FiV1wh7QVR6VXJ5dWExSTQ"
FILE_NAME = "crfpp.tar.gz"
!wget 'https://docs.google.com/uc?export=download&id=$FILE_ID' -O $FILE_NAME
!tar xvf crfpp.tar.gz
%cd CRF++-0.58
!./configure && make && make install && ldconfig
%cd ..
#Download / decompress / install CaboCha source files
FILE_ID = "0B4y35FiV1wh7SDd1Q1dUQkZQaUU"
FILE_NAME = "cabocha-0.69.tar.bz2"
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=$FILE_ID' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=$FILE_ID" -O $FILE_NAME && rm -rf /tmp/cookies.txt
!tar -xvf cabocha-0.69.tar.bz2
%cd cabocha-0.69
!./configure --with-charset=utf-8 && make && make check && make install && ldconfig
%cd ..

Once the installation is complete, we can run a dependency analysis right away. Executing the following command outputs the result of the dependency analysis of `neko.txt` to `neko.txt.cabocha`.

!cabocha -f1 -o neko.txt.cabocha neko.txt

CaboCha (Official)

Check the output result.

#Check the number of lines
!wc -l ./neko.txt.cabocha

output


297283 ./neko.txt.cabocha
#Check the first 15 lines
!head -15 ./neko.txt.cabocha

output


* 0 -1D 0/0 0.000000
One noun,number,*,*,*,*,one,Ichi,Ichi
EOS
EOS
* 0 2D 0/0 -0.764522
　 symbol,Blank,*,*,*,*,　,　,　
* 1 2D 0/1 -0.764522
I noun,Pronoun,General,*,*,*,I,Wagamama,Wagamama
Is Particle,Binding particle,*,*,*,*,Is,C,Wow
* 2 -1D 0/2 0.000000
Cat noun,General,*,*,*,*,Cat,cat,cat
so Auxiliary verb,*,*,*,Special,Continuous form,Is,De,De
is there Auxiliary verb,*,*,*,Five steps・La line Al,Uninflected word,is there,Al,Al
.. symbol,Kuten,*,*,*,*,。,。,。
EOS
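
Each line beginning with an asterisk is a clause (dependency) header: the fields are the clause index within the sentence, the index of the destination clause followed by D (-1 for the root), the positions of the head word and the function word within the clause separated by /, and the dependency score. A minimal parsing sketch of one such header line (illustrative only, not part of the answers below):

# Illustrative sketch: unpack one clause header line of the CaboCha lattice (-f1) output
line = '* 1 2D 0/1 -0.764522'
cols = line.split(' ')
clause_idx = int(cols[1])                  # clause index within the sentence
dst = int(cols[2].rstrip('D'))             # index of the destination clause (-1 = root)
head, func = map(int, cols[3].split('/'))  # positions of the head word / function word
score = float(cols[4])                     # dependency score
print(clause_idx, dst, head, func, score)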

40. Reading the dependency analysis result (morpheme)

Implement a class Morph that represents a morpheme. This class has the surface form (surface), base form (base), part of speech (pos), and part-of-speech subcategory 1 (pos1) as member variables. Then read the CaboCha analysis result (neko.txt.cabocha), represent each sentence as a list of Morph objects, and display the morpheme sequence of the third sentence.

class Morph:
  def __init__(self, morph):
    surface, attr = morph.split('\t')
    attr = attr.split(',')
    self.surface = surface
    self.base = attr[6]
    self.pos = attr[0]
    self.pos1 = attr[1]

filename = './neko.txt.cabocha'

sentences = []
morphs = []
with open(filename, mode='r') as f:
  for line in f:
    if line[0] == '*':  #Lines representing dependency relationships: Skip
      continue
    elif line != 'EOS\n':  #Other than the end of the sentence: Apply Morph and add to the morpheme list
      morphs.append(Morph(line))
    else:  #End of sentence: Add morpheme list to sentence list
      sentences.append(morphs)
      morphs = []

#Verification
for m in sentences[2]:
  print(vars(m))

output


{'surface': '\u3000', 'base': '\u3000', 'pos': 'symbol', 'pos1': 'Blank'}
{'surface': 'I', 'base': 'I', 'pos': 'noun', 'pos1': 'Pronoun'}
{'surface': 'Is', 'base': 'Is', 'pos': 'Particle', 'pos1': 'Binding particle'}
{'surface': 'Cat', 'base': 'Cat', 'pos': 'noun', 'pos1': 'General'}
{'surface': 'so', 'base': 'Is', 'pos': 'Auxiliary verb', 'pos1': '*'}
{'surface': 'is there', 'base': 'is there', 'pos': 'Auxiliary verb', 'pos1': '*'}
{'surface': '。', 'base': '。', 'pos': 'symbol', 'pos1': 'Kuten'}
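
Note that the CaboCha output contains consecutive EOS lines (see the head output above, where blank lines in the source text also end with EOS), so empty morpheme lists end up in sentences as well. A quick check (illustrative only, not part of the original answer):

# Illustrative check: count how many parsed "sentences" are empty, i.e. produced by consecutive EOS lines
empty = sum(1 for morphs in sentences if len(morphs) == 0)
print('sentences:', len(sentences), 'empty:', empty)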

41. Reading the dependency analysis result (phrase / dependency)

In addition to 40, implement a class Chunk that represents a clause. This class has a list of morphemes (Morph objects) (morphs), the index number of the destination clause (dst), and a list of the index numbers of the source clauses (srcs) as member variables. Then read the CaboCha analysis result of the input text, represent each sentence as a list of Chunk objects, and display the text and the destination of each clause of the eighth sentence. Use the program created here for the remaining problems of Chapter 5.

The text is represented as a list of Sentence objects, each Sentence object holds a list of Chunk objects, and each Chunk object holds a list of Morph objects, forming a hierarchical structure. Therefore, in addition to the Chunk class specified above, a Sentence class is implemented here. Note that building the list of source clause index numbers (srcs), a member variable of Chunk, requires the information of all clauses in a sentence, so it is filled in when the Sentence object is initialized.

class Chunk():
  def __init__(self, morphs, dst):
    self.morphs = morphs
    self.dst = dst
    self.srcs = []


class Sentence():
  def __init__(self, chunks):
    self.chunks = chunks
    for i, chunk in enumerate(self.chunks):
      if chunk.dst not in (None, -1):  # skip the root and clauses with no destination (produced by blank lines)
        self.chunks[chunk.dst].srcs.append(i)

filename = './neko.txt.cabocha'

sentences = []
chunks = []
morphs = []
with open(filename, mode='r') as f:
  for line in f:
    if line[0] == '*':  # Dependency header line: wrap the morphemes of the previous clause in a Chunk and add it to the clause list, then read the destination of the clause that starts here
      if len(morphs) > 0:
        chunks.append(Chunk(morphs, dst))
        morphs = []
      dst = int(line.split(' ')[2].rstrip('D'))
    elif line != 'EOS\n':  #Other than the end of the sentence: Apply Morph and add to the morpheme list
      morphs.append(Morph(line))
    else:  # End of sentence: wrap the last clause's morphemes in a Chunk and add it to the clause list, then wrap the clause list in a Sentence and add it to the sentence list
      chunks.append(Chunk(morphs, dst))
      sentences.append(Sentence(chunks))
      morphs = []
      chunks = []
      dst = None

#Verification
for chunk in sentences[7].chunks:
  print([morph.surface for morph in chunk.morphs], chunk.dst, chunk.srcs)

output


['I', 'Is'] 5 []
['here', 'so'] 2 []
['start', 'hand'] 3 [1]
['Human', 'That'] 4 [2]
['thing', 'To'] 5 [3]
['You see', 'Ta', '。'] -1 [0, 4]
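
With this structure, dependency relations can be traversed in both directions. As a small usage sketch (illustrative only, assuming the sentences list built above), the root clause of the eighth sentence and the clauses that modify it can be listed as follows:

# Usage sketch (not part of the original answer): find the root clause and list its modifiers via srcs
sentence = sentences[7]
root = next(chunk for chunk in sentence.chunks if chunk.dst == -1)
print('root:', ''.join(morph.surface for morph in root.morphs))
for i in root.srcs:
  print('modifier:', ''.join(morph.surface for morph in sentence.chunks[i].morphs))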

42. Display of modifier and modifiee clauses

Extract all pairs of the text of the modifier clause and the modifiee clause in tab-separated format. However, do not output symbols such as punctuation marks.

From here on, unless otherwise specified, the output is checked against the eighth sentence, as in 41.

sentence = sentences[7]
for chunk in sentence.chunks:
  if int(chunk.dst) != -1:
    modifier = ''.join([morph.surface if morph.pos != 'symbol' else '' for morph in chunk.morphs])
    modifiee = ''.join([morph.surface if morph.pos != 'symbol' else '' for morph in sentence.chunks[int(chunk.dst)].morphs])
    print(modifier, modifiee, sep='\t')

output


I saw
For the first time here
For the first time called human
Human beings
I saw something

43. Extract the clauses containing nouns related to the clauses containing verbs

When a clause containing a noun relates to a clause containing a verb, extract them in tab-separated format. However, do not output symbols such as punctuation marks.

sentence = sentences[7]
for chunk in sentence.chunks:
  if int(chunk.dst) != -1:
    modifier = ''.join([morph.surface if morph.pos != 'symbol' else '' for morph in chunk.morphs])
    modifier_pos = [morph.pos for morph in chunk.morphs]
    modifiee = ''.join([morph.surface if morph.pos != 'symbol' else '' for morph in sentence.chunks[int(chunk.dst)].morphs])
    modifiee_pos = [morph.pos for morph in sentence.chunks[int(chunk.dst)].morphs]
    if 'noun' in modifier_pos and 'verb' in modifiee_pos:
      print(modifier, modifiee, sep='\t')

output


I saw
For the first time here
I saw something

44. Visualization of dependency trees

Visualize the dependency tree of a given sentence as a directed graph. For visualization, convert the dependency tree to DOT language and use Graphviz. Also, to visualize directed graphs directly from Python, use pydot.

A graph is created by making pairs of modifier and modifiee clauses and passing them to pydot's graph_from_edges. Also, since clauses with the same surface string cannot be distinguished when they appear more than once in a sentence, each clause is displayed with its index appended as an ID.

#Installation of Japanese display font
!apt install fonts-ipafont-gothic
import pydot
from IPython.display import Image,display_png
from graphviz import Digraph
 
sentence = sentences[7]
edges = []
for id, chunk in enumerate(sentence.chunks):
  if int(chunk.dst) != -1:
    modifier = ''.join([morph.surface if morph.pos != 'symbol' else '' for morph in chunk.morphs] + ['(' + str(id) + ')'])
    modifiee = ''.join([morph.surface if morph.pos != 'symbol' else '' for morph in sentence.chunks[int(chunk.dst)].morphs] + ['(' + str(chunk.dst) + ')'])
    edges.append([modifier, modifiee])
n = pydot.Node('node')
n.set_fontname('IPAGothic')  # set the default node font so Japanese labels render correctly
g = pydot.graph_from_edges(edges, directed=True)
g.add_node(n)
g.write_png('./ans44.png')
display_png(Image('./ans44.png'))

[Output image: dependency tree of the eighth sentence as a directed graph]

45. Extraction of verb case patterns

Treating the text used here as a corpus, we want to investigate the cases that Japanese predicates can take. Consider verbs as predicates and the particles of the clauses related to a verb as cases, and output predicates and cases in tab-separated format. However, the output should satisfy the following specifications.

- In a clause containing a verb, use the base form of the leftmost verb as the predicate
- The cases are the particles related to the predicate
- If there are multiple particles (clauses) related to the predicate, arrange all the particles in lexicographic order, separated by spaces

Save the output of this program to a file and check the following items using UNIX commands.

- Combinations of predicates and case patterns that appear frequently in the corpus
- The case patterns of the verbs "do", "see", and "give" (in order of frequency of appearance in the corpus)

with open('./ans45.txt', 'w') as f:
  for sentence in sentences:
    for chunk in sentence.chunks:
      for morph in chunk.morphs:
        if morph.pos == 'verb':  # look for the leftmost verb in the chunk
          cases = []
          for src in chunk.srcs:  # collect particles from the clauses that modify the verb found
            cases = cases + [morph.surface for morph in sentence.chunks[src].morphs if morph.pos == 'Particle']
          if len(cases) > 0:  # if any particles were found, deduplicate them, sort in lexicographic order, and output
            cases = sorted(list(set(cases)))
            line = '{}\t{}'.format(morph.base, ' '.join(cases))
            print(line, file=f)
          break
#Verification
!cat ./ans45.txt | sort | uniq -c | sort -nr | head -n 10

output


2414
1395 Tsukuka
676
608
330 grab
319 see
270 I think
260
238 Therefore
237
!cat ./ans45.txt | grep 'To do' | sort | uniq -c | sort -nr | head -n 5

output


1151
751
308
130
109
!cat ./ans45.txt | grep 'to see' | sort | uniq -c | sort -nr | head -n 5

output


344 See
107 See
35 to see
27 Seeing
22 from seeing
!cat ./ans45.txt | grep 'give' | sort | uniq -c | sort -nr | head -n 5

output


5 give
3 to give
3 to give
2 Just give
1 to give only to
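
As a side note, the same counts can also be obtained directly in Python with collections.Counter instead of the UNIX pipeline. A sketch (illustrative only, assuming the ans45.txt created above):

from collections import Counter

# Alternative check (sketch): count predicate/case-pattern lines, equivalent to sort | uniq -c | sort -nr | head
with open('./ans45.txt') as f:
  counts = Counter(line.rstrip('\n') for line in f)
for pattern, freq in counts.most_common(10):
  print(freq, pattern)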

46. Extraction of verb case frame information

Modify the program from 45 so that it outputs the predicate and case pattern followed by the terms (the clauses themselves that relate to the predicate) in tab-separated format. In addition to the specifications of 45, satisfy the following.

- A term is the word sequence of a clause related to the predicate (there is no need to remove the trailing particle)
- If there are multiple clauses related to the predicate, arrange them separated by spaces, using the same criteria and order as the particles

with open('./ans46.txt', 'w') as f:
  for sentence in sentences:
    for chunk in sentence.chunks:
      for morph in chunk.morphs:
        if morph.pos == 'verb':  # look for the leftmost verb in the chunk
          cases = []
          modi_chunks = []
          for src in chunk.srcs:  # look at the clauses that modify the verb found
            case = [morph.surface for morph in sentence.chunks[src].morphs if morph.pos == 'Particle']
            if len(case) > 0:  # for clauses containing particles, collect the particles and the term
              cases = cases + case
              modi_chunks.append(''.join(morph.surface for morph in sentence.chunks[src].morphs if morph.pos != 'symbol'))
          if len(cases) > 0:  # if any particles were found, deduplicate, sort them lexicographically, and output them with the terms
            cases = sorted(list(set(cases)))
            line = '{}\t{}\t{}'.format(morph.base, ' '.join(cases), ' '.join(modi_chunks))
            print(line, file=f)
          break
#Verification
!cat ./ans46.txt | head -n 10

output


Where to be born
I have no idea if it was born
Where to cry
The only thing I was crying
Get started here
See what I see
Listen later
Catch us
Boil and catch
Eat and boil

47. Functional verb syntax mining

We want to focus only on cases where the を (wo) case of a verb is filled by a sahen-connecting noun (a noun that forms a verb with する). Modify the program from 46 to satisfy the following specifications.

- Consider only cases where a clause consisting of "sahen-connecting noun + を (particle)" relates to a clause containing a verb
- The predicate is "sahen-connecting noun + を + base form of the verb"; when a clause contains multiple verbs, use the leftmost verb
- If there are multiple particles (clauses) related to the predicate, arrange all the particles in lexicographic order, separated by spaces
- If there are multiple clauses related to the predicate, arrange all the terms separated by spaces (aligned with the order of the particles)

Save the output of this program to a file and check the following items using UNIX commands.

- Predicates that frequently appear in the corpus ("sahen-connecting noun + を + verb")
- Predicates and case patterns that frequently appear in the corpus

with open('./ans47.txt', 'w') as f:
  for sentence in sentences:
    for chunk in sentence.chunks:
      for morph in chunk.morphs:
        if morph.pos == 'verb':  # look for the leftmost verb in the chunk
          for i, src in enumerate(chunk.srcs):  # check whether a modifying clause consists of a sahen-connecting noun followed by を
            if len(sentence.chunks[src].morphs) == 2 and sentence.chunks[src].morphs[0].pos1 == 'Change connection' and sentence.chunks[src].morphs[1].surface == 'To':
              predicate = ''.join([sentence.chunks[src].morphs[0].surface, sentence.chunks[src].morphs[1].surface, morph.base])
              cases = []
              modi_chunks = []
              for src_r in chunk.srcs[:i] + chunk.srcs[i + 1:]:  # collect particles from the remaining modifying clauses
                case = [morph.surface for morph in sentence.chunks[src_r].morphs if morph.pos == 'Particle']
                if len(case) > 0:  # for clauses containing particles, collect the particles and the term
                  cases = cases + case
                  modi_chunks.append(''.join(morph.surface for morph in sentence.chunks[src_r].morphs if morph.pos != 'symbol'))
              if len(cases) > 0:  # if any particles were found, deduplicate, sort them lexicographically, and output them with the terms
                cases = sorted(list(set(cases)))
                line = '{}\t{}\t{}'.format(predicate, ' '.join(cases), ' '.join(modi_chunks))
                print(line, file=f)
              break
#Verification
!cat ./ans47.txt | cut -f 1 | sort | uniq -c | sort -nr | head -n 10

output


29 reply
21 Say hello
18 talk
9 Ask a question
8 Take a nap
8 quarrel
7 imitate
7 Ask questions
6 Consult
5 Ask a question
!cat ./ans47.txt | cut -f 1,2 | sort | uniq -c | sort -nr | head -n 10

output


8 I'll reply
8 Reply
6 to talk
6 to talk
6 Say hello
5 I take a nap
5 I'll say hello
5 To say hello
5 In a quarrel
4 Replying to

48. Extracting paths from nouns to roots

For every clause in a sentence that contains a noun, extract the path from that clause to the root of the syntax tree. The path on the syntax tree should satisfy the following specifications.

- Each clause is represented by its (surface-form) morpheme sequence
- Concatenate the representations of the clauses from the start clause to the end clause of the path with " -> "

sentence = sentences[7]
for chunk in sentence.chunks:
  if 'noun' in [morph.pos for morph in chunk.morphs]:  # check whether the chunk contains a noun
    path = [''.join(morph.surface for morph in chunk.morphs if morph.pos != 'symbol')]
    while chunk.dst != -1:  # starting from the clause containing the noun, follow dst up to the root, appending each clause
      path.append(''.join(morph.surface for morph in sentence.chunks[chunk.dst].morphs if morph.pos != 'symbol'))
      chunk = sentence.chunks[chunk.dst]
    print(' -> '.join(path))

output


I am -> saw
here -> Start with -> Human -> Things -> saw
Human -> Things -> saw
Things -> saw

49. Extraction of dependency paths between nouns

Extract the shortest dependency path that connects each pair of noun phrases in the sentence. When the clause numbers of the noun phrase pair are $i$ and $j$ ($i < j$), the dependency path should satisfy the following specifications.

- As in problem 48, a path is expressed by concatenating the representations (surface-form morpheme sequences) of the clauses from the start clause to the end clause with " -> "
- Replace the noun phrases contained in clauses $i$ and $j$ with X and Y, respectively

In addition, the shape of the dependency path can be considered in the following two ways.

- If clause $j$ exists on the path from clause $i$ to the root of the syntax tree: show the path from clause $i$ to clause $j$
- Otherwise, when clause $i$ and clause $j$ meet at a common clause $k$ on their paths to the root of the syntax tree: show the path from clause $i$ to just before clause $k$, the path from clause $j$ to just before clause $k$, and the contents of clause $k$, joined with "|"

For example, if the path from clause $i$ to the root is `i -> a -> b -> j -> ...`, the output is `i -> a -> b -> j`.

In both cases, the noun phrases of clauses $i$ and $j$ are converted to X and Y, respectively, before being displayed.

from itertools import combinations

sentence = sentences[7]
nouns = []
for i, chunk in enumerate(sentence.chunks):
  if 'noun' in [morph.pos for morph in chunk.morphs]:  # collect the indices of clauses that contain a noun
    nouns.append(i)
for i, j in combinations(nouns, 2):  # create a path for each pair of clauses containing nouns
  path_i = []
  path_j = []
  while i != j:  # follow dst from i and from j until both reach the same clause
    if i < j:
      path_i.append(i)
      i = sentence.chunks[i].dst
    else:
      path_j.append(j)
      j = sentence.chunks[j].dst
  if len(path_j) == 0:  #First case
    chunk_X = ''.join([morph.surface if morph.pos != 'noun' else 'X' for morph in sentence.chunks[path_i[0]].morphs])
    chunk_Y = ''.join([morph.surface if morph.pos != 'noun' else 'Y' for morph in sentence.chunks[i].morphs])
    path_XtoY = [chunk_X] + [''.join(morph.surface for morph in sentence.chunks[n].morphs) for n in path_i[1:]] + [chunk_Y]
    print(' -> '.join(path_XtoY))
  else:  #Second case
    chunk_X = ''.join([morph.surface if morph.pos != 'noun' else 'X' for morph in sentence.chunks[path_i[0]].morphs])
    chunk_Y = ''.join([morph.surface if morph.pos != 'noun' else 'Y' for morph in sentence.chunks[path_j[0]].morphs])
    chunk_k = ''.join([morph.surface for morph in sentence.chunks[i].morphs])
    path_X = [chunk_X] + [''.join(morph.surface for morph in sentence.chunks[n].morphs) for n in path_i[1:]]
    path_Y = [chunk_Y] + [''.join(morph.surface for morph in sentence.chunks[n].morphs) for n in path_j[1:]]
    print(' | '.join([' -> '.join(path_X), ' -> '.join(path_Y), chunk_k]))

output


X is | In Y -> Start with -> Human -> Things | saw.
X is | Called Y -> Things | saw.
X is | Y | saw.
In X -> Start with -> Called Y
In X -> Start with -> Human -> Y
Called X -> Y

in conclusion

100 Language Processing Knock is designed so that you can learn not only natural language processing itself but also basic data handling and general-purpose machine learning. Even those studying machine learning through online courses will find it very good practice, so please give it a try.
