[PYTHON] 100 Language Processing Knocks, Chapter 5: Dependency Parsing (second half)

A record of solving the problems in the second half of Chapter 5. The target file is neko.txt, as given on the problem page.

Apply CaboCha to the text of Natsume Soseki's novel "I Am a Cat" (neko.txt) and save the parse result to a file named neko.txt.cabocha. Use this file to implement programs that solve the following problems.
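Assuming CaboCha (with MeCab and the IPAdic dictionary) is installed, the parsed file can be produced with CaboCha's lattice output format (-f1); a one-liner along these lines should work:

cabocha -f1 < neko.txt > neko.txt.cabocha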

45. Extraction of verb case patterns

Treating the text used here as a corpus, we want to investigate the cases that Japanese predicates can take. Consider a verb to be a predicate and the particles of the chunks (bunsetsu) that depend on the verb to be its cases, and output each predicate and its cases in tab-separated format, subject to the following specifications:

- In a chunk containing verbs, use the base form of the leftmost verb as the predicate.
- The cases are the particles of the chunks that depend on the predicate.
- If two or more particles (chunks) depend on the predicate, list all the particles in dictionary order, separated by spaces.

Save the output of this program to a file and use UNIX commands to check:

- the combinations of predicate and case pattern that appear most frequently in the corpus;
- the case patterns of the verbs する (do), 見る (see), and 与える (give), ordered by frequency of appearance in the corpus.
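Every program below imports problem41 from the first half of the chapter, which is not reproduced in this post. For reference, here is a minimal sketch of the Morph/Chunk interface the solutions rely on; the semantics are inferred from how the methods are used below, so treat this as an assumption rather than the original code.

# Minimal sketch of the interface assumed from problem41 (inferred, not original).

class Morph:
    def __init__(self, surface, base, pos, pos1):
        self.surface = surface  # surface form
        self.base = base        # base (dictionary) form
        self.pos = pos          # part of speech (IPAdic tag), e.g. 動詞, 助詞, 名詞
        self.pos1 = pos1        # POS subcategory, e.g. 格助詞, サ変接続

class Chunk:
    def __init__(self, morphs, dst):
        self.morphs = morphs  # list of Morph in this bunsetsu
        self.dst = dst        # index of the chunk this one depends on (-1 = root)
        self.srcs = []        # indices of the chunks that depend on this one

    def __str__(self):
        # surface string of the chunk with symbols stripped
        return "".join(m.surface for m in self.morphs if m.pos != "記号")

    def include_pos(self, pos):
        # True if any morpheme in the chunk has the given part of speech
        return any(m.pos == pos for m in self.morphs)

    def morphs_of_pos(self, pos):
        # all morphemes with the given part of speech
        return [m for m in self.morphs if m.pos == pos]

    def morphs_of_pos1(self, pos1):
        # all morphemes with the given POS subcategory
        return [m for m in self.morphs if m.pos1 == pos1]

# problem41.read_chunk(f) is assumed to parse CaboCha's -f1 output and
# return the text as a list of sentences, each a list of Chunk objects.

With that interface assumed, here is the solution.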

# -*- coding: utf-8 -*-
__author__ = 'todoroki'

import problem41


def extractVerbPattern(sentence):
    # For each chunk containing a verb, collect the depending chunks
    # that contain a case particle.
    lst = []
    for chunk in sentence:
        if chunk.include_pos('動詞'):
            src_chunks = [sentence[src] for src in chunk.srcs]
            src_chunks_case = [src for src in src_chunks if src.morphs_of_pos1('格助詞')]
            if src_chunks_case:
                lst.append((chunk, src_chunks_case))
    return lst


if __name__ == "__main__":
    f = open("neko.txt.cabocha", "r")
    sentences = problem41.read_chunk(f)
    verbPatterns = []
    for sentence in sentences:
        verbPatterns.append(extractVerbPattern(sentence))

    for verbPattern in verbPatterns:
        for verb, src_chunks in verbPattern:
            # predicate: base form of the leftmost verb in the chunk
            v = verb.morphs_of_pos('動詞')[0].base
            # cases: the case particle of each depending chunk, in dictionary order
            ps = [src_chunk.morphs_of_pos1('格助詞')[-1].base for src_chunk in src_chunks]
            p = " ".join(sorted(ps))
            print "%s\t%s" % (v, p)
    f.close()

The following commands sort the output of the program above by frequency of appearance: first over all verbs, then restricted to する (do), 見る (see), and 与える (give).

python problem45.py | sort | uniq -c | sort -nr
python problem45.py | sort | awk '$1=="する"{print $0}' | uniq -c | sort -nr
python problem45.py | sort | awk '$1=="見る"{print $0}' | uniq -c | sort -nr
python problem45.py | sort | awk '$1=="与える"{print $0}' | uniq -c | sort -nr
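For the opening sentence 「吾輩はここで始めて人間というものを見た」, the example given in the original problem, the program should output lines like:

始める	で
見る	は を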

46. Extraction of verb case frame information

Modify the program of problem 45 so that each predicate and case pattern is followed by its terms (the chunks themselves that depend on the predicate), in tab-separated format. In addition to the specifications of problem 45, satisfy the following:

- A term is the word sequence of a chunk that depends on the predicate (the particle at the end need not be removed).
- If two or more chunks depend on the predicate, list them by the same criterion and in the same order as the particles, separated by spaces.
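For the same example sentence as in problem 45, the original problem expects output like:

始める	で	ここで
見る	は を	吾輩は ものを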

# -*- coding: utf-8 -*-
__author__ = 'todoroki'

import problem41
import problem45

if __name__ == "__main__":
    f = open("neko.txt.cabocha", "r")
    sentences = problem41.read_chunk(f)
    f.close()
    verbPatterns = []
    for sentence in sentences:
        verbPatterns.append(problem45.extractVerbPattern(sentence))

    for verbPattern in verbPatterns:
        for verb, src_chunks in verbPattern:
            # predicate: base form of the leftmost verb in the chunk
            col1 = verb.morphs_of_pos('動詞')[0].base
            # (particle, term) pairs from the depending chunks, sorted by particle
            tmp = [(src_chunk.morphs_of_pos1('格助詞')[-1].base, str(src_chunk)) for src_chunk in src_chunks]
            tmp = sorted(tmp, key=lambda x: x[0])
            col2 = " ".join([col[0] for col in tmp])
            col3 = " ".join([col[1] for col in tmp])
            print "%s\t%s\t%s" % (col1, col2, col3)

47. Mining of functional verb constructions

We now want to focus only on the case where the を (wo) case of a verb is filled by a sahen-connective noun (サ変接続名詞). Modify the program of problem 46 to satisfy the following specifications:

- Consider only cases where a chunk consisting of "sahen-connective noun + を (particle)" depends on a verb.
- The predicate is "sahen-connective noun + を + base form of the verb"; when a chunk contains several verbs, use the leftmost one.
- If two or more particles (chunks) depend on the predicate, list all the particles in dictionary order, separated by spaces.
- If two or more chunks depend on the predicate, list all the terms separated by spaces (aligned with the order of the particles).

For example, if the chunk 返事を depends on the verb する, the predicate becomes 返事をする.

Save the output of this program to a file and use UNIX commands to check:

- the predicates (sahen-connective noun + を + verb) that appear most frequently in the corpus;
- the predicates and particle patterns that appear most frequently in the corpus.

# -*- coding: utf-8 -*-
__author__ = 'todoroki'

import problem41
import problem45

def extractSahen(src_chunks):
    # Look for a chunk of the form "sahen-connective noun + を" among the
    # depending chunks; return (that chunk, the remaining chunks), or None.
    for i, src_chunk in enumerate(src_chunks):
        morphs = src_chunk.morphs
        if len(morphs) > 1:
            if morphs[-2].pos1 == "サ変接続" and morphs[-1].pos == "助詞" and morphs[-1].base == "を":
                src_chunks.pop(i)
                return src_chunk, src_chunks
    return None

if __name__ == "__main__":
    f = open("neko.txt.cabocha", "r")
    sentences = problem41.read_chunk(f)
    f.close()
    verbPatterns = []
    for sentence in sentences:
        verbPatterns.append(problem45.extractVerbPattern(sentence))

    for verbPattern in verbPatterns:
        for verb, src_chunks in verbPattern:
            sahen_chunks_set = extractSahen(src_chunks)
            if sahen_chunks_set:
                sahen_chunk, other_chunks = sahen_chunks_set
                # predicate: "sahen-connective noun + を" + base form of the leftmost verb
                col1 = str(sahen_chunk) + verb.morphs_of_pos('動詞')[0].base
                tmp = [(other_chunk.morphs_of_pos1('格助詞')[-1].base, str(other_chunk)) for other_chunk in other_chunks]
                tmp = sorted(tmp, key=lambda x: x[0])
                col2 = " ".join([col[0] for col in tmp])
                col3 = " ".join([col[1] for col in tmp])
                print "%s\t%s\t%s" % (col1, col2, col3)

A command that outputs predicates (sa-variant noun + + verb) that frequently appear in the corpus.

python problem47.py | cut -f 1 | sort | uniq -c | sort -nr

A command that outputs the predicates and particle patterns that appear frequently in the corpus:

python problem47.py | cut -f 1,2 | sort | uniq -c | sort -nr

48. Extraction of paths from nouns to the root

For every chunk in a sentence that contains a noun, extract the path from that chunk to the root of the syntax tree. The path on the syntax tree must satisfy the following specifications:

- Each chunk is represented by its (surface-form) morpheme sequence.
- The expressions of the chunks on the path are concatenated with " -> ", from the start chunk to the end chunk.
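The original problem illustrates this with the sentence 「吾輩はここで始めて人間というものを見た」, whose noun chunks should yield:

吾輩は -> 見た
ここで -> 始めて -> 人間という -> ものを -> 見た
人間という -> ものを -> 見た
ものを -> 見た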

# -*- coding: utf-8 -*-
__author__ = 'todoroki'

import problem41

def extractPath(chunk, sentence):
    path = [chunk]
    dst = chunk.dst
    while dst != -1:
        path.append(sentence[dst])
        dst = sentence[dst].dst
    return path

if __name__ == "__main__":
    f = open("neko.txt.cabocha", "r")
    sentences = problem41.read_chunk(f)
    f.close()
    paths = []
    for sentence in sentences:
        for chunk in sentence:
            if chunk.include_pos('名詞') and chunk.dst != -1:
                paths.append(extractPath(chunk, sentence))

    for path in paths:
        print " -> ".join([str(chunk) for chunk in path])

49. Extraction of dependency paths between nouns

Extract the shortest dependency path that connects every pair of noun chunks in a sentence. When the chunk numbers of the pair are i and j (i < j), the path must satisfy the following specifications:

- As in problem 48, a path is expressed by concatenating the expressions (surface-form morpheme sequences) of the chunks with " -> ", from the start chunk to the end chunk.
- The noun phrases in chunks i and j are replaced with X and Y, respectively.

The shape of a dependency path falls into one of two cases:

- If chunk j appears on the path from chunk i to the root of the syntax tree: show the path from chunk i to chunk j.
- Otherwise, when chunk i and chunk j meet at a common chunk k on their paths to the root: show the path from chunk i to just before k, the path from chunk j to just before k, and the content of chunk k itself, concatenated with " | ".
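For the example sentence from problem 48, the original problem expects output like:

Xは | Yで -> 始めて -> 人間という -> ものを | 見た
Xは | Yという -> ものを | 見た
Xは | Yを | 見た
Xで -> 始めて -> Yという
Xで -> 始めて -> 人間という -> Yを
Xという -> Yを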

# -*- coding: utf-8 -*-
__author__ = 'todoroki'

from collections import namedtuple
from itertools import combinations
import problem41

def extractPathIndex(i_chunk, sentence):
    # Return the list of chunk indices from the given (index, chunk) pair
    # up to the root of the syntax tree.
    i, chunk = i_chunk
    path_index = [i]
    dst = chunk.dst
    while dst != -1:
        path_index.append(dst)
        dst = sentence[dst].dst
    return path_index

def posReplace(chunks, pos, repl, limit=1):
    # In the first chunk of the path, replace up to `limit` morphemes of the
    # given POS with `repl`, dropping symbols; the remaining chunks are kept
    # as their plain surface strings.
    replaced_str = ""
    for morph in chunks[0].morphs:
        if morph.pos == pos and limit > 0:
            replaced_str += repl
            limit -= 1
        else:
            if morph.pos != '記号':
                replaced_str += morph.surface
    return [replaced_str] + [str(chunk) for chunk in chunks[1:]]


if __name__ == "__main__":
    f = open("neko.txt.cabocha", "r")
    sentences = problem41.read_chunk(f)
    f.close()
    paths = []
    N2Npath = namedtuple('N2Npath', ['X', 'Y', 'is_linear'])
    for sentence in sentences:
        noun_chunks = [(i, chunk) for i, chunk in enumerate(sentence) if chunk.include_pos('名詞')]
        if len(noun_chunks) > 1:
            for former, latter in combinations(noun_chunks, 2):
                f_index = extractPathIndex(former, sentence)
                l_index = extractPathIndex(latter, sentence)
                # Align both paths from the root; the last index they share is
                # the nearest common chunk k.
                common = [f_i for f_i, l_i in zip(reversed(f_index), reversed(l_index)) if f_i == l_i]
                k = common[-1]
                # Linear case: chunk j lies on the path from chunk i to the root.
                linear_flag = l_index[0] in f_index
                if linear_flag:
                    f_index2 = f_index[:f_index.index(l_index[0]) + 1]
                    l_index2 = l_index[:1]
                else:
                    # Keep each path up to and including the common chunk k.
                    f_index2 = f_index[:f_index.index(k) + 1]
                    l_index2 = l_index[:l_index.index(k) + 1]
                X = [sentence[n] for n in f_index2]
                Y = [sentence[n] for n in l_index2]
                paths.append(N2Npath(X=X, Y=Y, is_linear=linear_flag))

    for path in paths:
        x = posReplace(path.X, "名詞", "X")
        y = posReplace(path.Y, "名詞", "Y")
        if path.is_linear:
            # chunk j ends the path; use its Y-substituted form
            x[-1] = y[0]
            print " -> ".join(x)
        else:
            print "%s | %s | %s" % (" -> ".join(x[:-1]), " -> ".join(y[:-1]), path.X[-1])
