[Python] Answers and impressions for the 100 Language Processing Knocks - Part 2

This is my entry for day 17 of the Advent Calendar. I'm late with it...

This is the second part: since I solved the 100 Language Processing Knocks, I will write up my answer and impressions for each problem. It took much longer than last time...

Prerequisites

For the environment and other setup, see the previous article (previous link).

Main part

Chapter 5: Dependency Analysis

Use CaboCha to dependency-parse the text (neko.txt) of Natsume Soseki's novel "I Am a Cat" and save the result in a file called neko.txt.cabocha. Use this file to implement programs that address the following questions.

import CaboCha
c = CaboCha.Parser()
with open('./neko.txt') as f:
    text = f.read()
with open('./neko.txt.cabocha', mode='w') as f:
    for se in [s + '。' for s in text.split('。')]:
        f.write(c.parse(se).toString(CaboCha.FORMAT_LATTICE))

Actually, this was the first time I had done dependency parsing, so even this step alone took a lot of searching.

40 Reading of dependency analysis results (morpheme)

Implement the class Morph that represents a morpheme. This class has the surface form (surface), base form (base), part of speech (pos), and part-of-speech subclassification 1 (pos1) as member variables. In addition, read the CaboCha analysis result (neko.txt.cabocha), represent each sentence as a list of Morph objects, and display the morpheme sequence of the third sentence.

# 40
class Morph:
    def __init__(self, surface, base, pos, pos1):
        self.surface = surface
        self.base = base
        self.pos = pos
        self.pos1 = pos1

doc = []
with open('./neko.txt.cabocha') as f:
    sentence = []
    line = f.readline()
    while(line):
        while('EOS' not in line):
            if not line.startswith('*'):
                cols = line.split('\t')
                m = Morph(
                    surface=cols[0],
                    base=cols[1].split(',')[-3],
                    pos=cols[1].split(',')[0],
                    pos1=cols[1].split(',')[1],
                )
                sentence.append(m)
            line = f.readline()
        doc.append(sentence)
        sentence = []
        line = f.readline()
print([t.surface for t in doc[2]])

I worked through this while checking CaboCha's output format.
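For reference, each morpheme line in the lattice output is a tab-separated surface form followed by a comma-separated feature string. A minimal sketch of pulling out the fields used by Morph (the sample line below is illustrative, not taken from the actual file):

# A minimal sketch of splitting one CaboCha/MeCab lattice line (illustrative sample line).
sample = "吾輩\t名詞,代名詞,一般,*,*,*,吾輩,ワガハイ,ワガハイ"
surface, feature = sample.split('\t')
cols = feature.split(',')
print(surface, cols[0], cols[1], cols[-3])  # surface, pos, pos1, base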

41 Reading the dependency analysis result (phrase / dependency)

In addition to problem 40, implement the clause class Chunk. This class has a list of morphemes (Morph objects) (morphs), the index number of the destination clause (dst), and a list of index numbers of the source clauses (srcs) as member variables. In addition, read the CaboCha analysis result of the input text, represent each sentence as a list of Chunk objects, and display the text and destination of each clause of the eighth sentence. For the rest of the problems in Chapter 5, use the program created here.

# 41
class Chunk:
    def __init__(self, morphs, dst, srcs):
        self.morphs = morphs
        self.dst = dst
        self.srcs = srcs
        
doc = []
with open('./neko.txt.cabocha') as f:
    sentence = []
    line = f.readline()
    while(line):
        if line.startswith('*'):
            cols = line.split(' ')
            # if this is the first chunk of the sentence (right after the previous EOS), there is no pending chunk to append
            if cols[1] != '0':
                sentence.append(c)
            c = Chunk(
                morphs=[],
                dst=int(cols[2].split('D')[0]),
                srcs=[]
            )
        elif 'EOS' in line:
            sentence.append(c)
            # fill in srcs: find the chunks whose destination is this chunk
            for i, c in enumerate(sentence):
                c.srcs = [idx for idx, chk in enumerate(sentence) if chk.dst == i]
                
            doc.append(sentence)
            sentence = []
        else:
            cols = line.split('\t')
            if cols[1].split(',')[0] != "記号":  # skip symbol morphemes (IPADIC POS names are Japanese)
                m = Morph(
                    surface=cols[0],
                    base=cols[1].split(',')[-3],
                    pos=cols[1].split(',')[0],
                    pos1=cols[1].split(',')[1],
                )
                c.morphs.append(m)
        line = f.readline()
for c in doc[7]:
    print(c.dst, end=', ')
    for m in c.morphs:
        print(m.surface, end="")
    print()
for c in doc[0]:
    print(c.dst, end=', ')
    for m in c.morphs:
        print(m.surface)
        print(m.pos)
    print()

I thought about how to find the clauses that depend on a given clause, but nothing clever came to mind, so I simply looped over the sentence and scanned it.
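As an alternative sketch of my own (not the answer above): the srcs lists can be filled in a single pass by grouping chunk indexes by their dst with a defaultdict, instead of rescanning the sentence for every chunk.

from collections import defaultdict

def fill_srcs(sentence):
    # group chunk indexes by their destination, then assign srcs in one pass
    by_dst = defaultdict(list)
    for idx, chunk in enumerate(sentence):
        by_dst[chunk.dst].append(idx)
    for i, chunk in enumerate(sentence):
        chunk.srcs = by_dst.get(i, [])
    return sentence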

42 Display of the source and destination clauses

Extract all the text of the source clauses and their destination clauses in tab-delimited format. However, do not output symbols such as punctuation marks.

# 42
#Doing all of them freezes Jupyter (Chrome), so limit to the first 50 sentences
for i, d in enumerate(doc[:50]):
    for c in d:
        if int(c.dst) == -1:
            continue
        for m in c.morphs:
            if m.pos == '記号':
                continue
            print(m.surface, end="")
        print('\t', end="")
        for m in d[c.dst].morphs:
            if m.pos == '記号':
                continue
            print(m.surface, end="")
        print()

I did this while keeping the sentence / clause / morpheme hierarchy in mind.
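A small helper along these lines (my own addition, assuming the Chunk/Morph classes above and the IPADIC POS name 記号 for symbols) would shorten 42 and the following problems:

def chunk_text(chunk):
    # surface string of a clause, with symbol morphemes (記号) removed
    return ''.join(m.surface for m in chunk.morphs if m.pos != '記号')

# usage sketch for problem 42:
# for d in doc[:50]:
#     for c in d:
#         if c.dst != -1:
#             print(chunk_text(c) + '\t' + chunk_text(d[c.dst]))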

43 Extract clauses containing nouns related to clauses containing verbs

When clauses containing nouns relate to clauses containing verbs, extract them in tab-delimited format. However, do not output symbols such as punctuation marks.

# 43
#Doing all of them freezes Jupyter (Chrome), so limit to the first 50 sentences
for i, d in enumerate(doc[:50]):
    for c in d:
        if int(c.dst) == -1:
            continue
        contain_noun = '名詞' in [m.pos for m in c.morphs]
        contain_verb = '動詞' in [m.pos for m in d[c.dst].morphs]
        if contain_noun and contain_verb:
            for m in c.morphs:
                if m.pos == '記号':
                    continue
                print(m.surface, end="")
            print('\t', end="")
            for m in d[int(c.dst)].morphs:
                if m.pos == '記号':
                    continue
                print(m.surface, end="")
            print()

I simply filtered with `if`.

44 Visualization of dependency trees

Visualize the dependency tree of a given sentence as a directed graph. For visualization, convert the dependency tree to DOT language and use Graphviz. Also, to visualize directed graphs directly from Python, use pydot.

# 44
import random, pathlib
from graphviz import Digraph

f = pathlib.Path('nekocabocha.png')
fmt = f.suffix.lstrip('.')
fname = f.stem
target_doc = random.choice(doc)
target_doc = doc[8]  # overrides the random choice to fix on the example sentence
idx = doc.index(target_doc)

dot = Digraph(format=fmt)
dot.attr("node", shape="circle")

N = len(target_doc)
#Add node
for i in range(N):
    dot.node(str(i), ''.join([m.surface for m in target_doc[i].morphs]))

#Add edge
for i in range(N):
    if target_doc[i].dst >= 0:
        dot.edge(str(i), str(target_doc[i].dst))

# dot.engine = "circo"
dot.filename = fname
dot.render()

print(''.join([m.surface for c in target_doc for m in c.morphs]))
print(dot)
from IPython.display import Image, display_png
display_png(Image(str(f)))

I hadn't done much visualization before and thought it would be fun, so this was my first time using Graphviz; Wikipedia was enough to get a rough idea of the DOT language.
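For reference, the DOT text that Digraph builds is just plain text; a hand-written equivalent for a three-clause sentence could be rendered the same way with graphviz.Source (the sentence and labels below are illustrative, not taken from neko.txt):

from graphviz import Source

# A minimal hand-written DOT graph like the one problem 44 generates (illustrative labels).
src = Source(
    'digraph { node [shape=circle]; '
    '0 [label="吾輩は"]; 1 [label="ここで"]; 2 [label="見た"]; '
    '0 -> 2; 1 -> 2; }',
    format="png",
)
# src.render("sample_tree")  # would write sample_tree.png next to the notebook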

45 Extraction of verb case patterns

I would like to treat the text used this time as a corpus and investigate the cases that Japanese predicates can take. Think of a verb as the predicate and the particles of the clauses related to the verb as its cases, and output the predicate and cases in tab-delimited format. However, make sure the output meets the following specifications.

In a clause containing a verb, the base form of the leftmost verb is used as the predicate.
The cases are the particles related to the predicate.
If there are multiple particles (clauses) related to the predicate, arrange all the particles in lexicographic order, separated by spaces.

Consider the example sentence "吾輩はここで始めて人間というものを見た" ("Here I saw a human being for the first time", the 8th sentence of neko.txt.cabocha). It contains two verbs, 始める (begin) and 見る (see); the clause related to 始める is "ここで", and the clauses related to 見る are "吾輩は" and "ものを", so the program should produce the following output.

始める	で
見る	は を

Save the output of this program to a file and check the following items using UNIX commands.

Combinations of predicates and case patterns that frequently appear in the corpus
The case patterns of the verbs する (do), 見る (see), and 与える (give), in order of frequency of appearance in the corpus

# 45
with open("neko_verb.txt", mode="w") as f:
    for s in doc:
        for c in s:
            if '動詞' in [m.pos for m in c.morphs]:
                row = c.morphs[0].base
                j_list = []
                for i in c.srcs:
                    if len(s[i].morphs) < 2:
                        continue
                    srclast = s[i].morphs[-1]
                    if srclast.pos == '助詞':
                        j_list.append(srclast.surface)
                if len(j_list) > 0:
                    j_list.sort()
                    row += "\t" +  " ".join(j_list)
                    f.write(row + "\n")
$ cat neko_verb.txt | sort  | uniq -c  | sort -rn -k 3
$ cat neko_verb.txt | grep "^する" | sort | uniq -c | sort -rn -k 3
$ cat neko_verb.txt | grep "^見る" | sort | uniq -c | sort -rn -k 3
$ cat neko_verb.txt | grep "^与える" | sort | uniq -c | sort -rn -k 3

This combines the earlier "part of speech" and "dependency" work.
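As a rough in-Python check equivalent to the UNIX pipeline (my own sketch; it assumes neko_verb.txt was written by the code above):

from collections import Counter

with open("neko_verb.txt") as f:
    rows = [line.rstrip("\n") for line in f]

# frequent predicate / case-pattern combinations
print(Counter(rows).most_common(10))
# case patterns of a particular verb, e.g. する
print(Counter(r.split("\t")[1] for r in rows if r.split("\t")[0] == "する").most_common(5))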

46 Extraction of verb case frame information

Modify the program from 45 so that the predicate and case pattern are followed by the terms (the clauses themselves that relate to the predicate) in tab-delimited format. In addition to the specifications of 45, meet the following.

A term is the word string of a clause related to the predicate (the trailing particle does not need to be removed).
If there are multiple clauses related to the predicate, arrange them with the same criterion and order as the particles, separated by spaces.

For the example sentence "吾輩はここで始めて人間というものを見た" (the 8th sentence of neko.txt.cabocha), with the two verbs 始める and 見る and their related clauses as in 45, the program should produce the following output.

始める	で	ここで
見る	は を	吾輩は ものを

# 46
#The output differs from the example, but this follows the wording of the specification
for s in doc:
    for c in s:
        if '動詞' in [m.pos for m in c.morphs]:
            row = c.morphs[0].base
            j_list = []
            c_list = []
            for i in c.srcs:
                if len(s[i].morphs) < 2:
                    continue
                srclast = s[i].morphs[-1]
                if srclast.pos == '助詞':
                    j_list.append(srclast.surface)
                    c_list.append(''.join([m.surface for m in s[i].morphs]))
            if len(j_list) > 0:
                j_list.sort()
                c_list.sort()
                row += "\t" +  " ".join(j_list) + "\t"+  " ".join(c_list)
                print(row)

"If there are multiple clauses related to the predicate, arrange them with the same criterion and order as the particles, separated by spaces."

Because of this wording my output differs from the example: I sorted the clauses independently as well.
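If instead the terms should follow the order of the sorted particles (the other reading of the spec), sorting the (particle, clause) pairs together keeps them aligned. A small sketch of my own with made-up data:

# keep particle/clause pairs aligned while sorting (illustrative data)
j_list = ["を", "は"]            # particles
c_list = ["ものを", "吾輩は"]      # the clauses they came from, in the same order
pairs = sorted(zip(j_list, c_list))
print("見る" + "\t" + " ".join(j for j, _ in pairs) + "\t" + " ".join(c for _, c in pairs))
# -> 見る	は を	吾輩は ものを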

47 Mining of functional verb constructions

I would like to focus only on cases where the を case of a verb is filled by a sa-hen connection noun (サ変接続名詞). Modify the program of 46 to meet the following specifications.

Only target cases where a clause consisting of "sa-hen connection noun + を (particle)" relates to a verb.
The predicate is "sa-hen connection noun + を + base form of the verb"; when the clause contains multiple verbs, use the leftmost one.
If there are multiple particles (clauses) related to the predicate, arrange all the particles in lexicographic order, separated by spaces.
If there are multiple clauses related to the predicate, arrange all the terms separated by spaces (aligned with the order of the particles).

For example, the following output should be obtained from the sentence "別段くるにも及ばんさと、主人は手紙に返事をする。" ("Saying there is no particular need to come, the master writes a reply to the letter.").

返事をする	と に は	及ばんさと 手紙に 主人は

Save the output of this program to a file and check the following items using UNIX commands.

Predicates (sa-hen connection noun + を + verb) that frequently appear in the corpus
Predicate and particle patterns that frequently appear in the corpus

# 47
with open("neko_func_verb.txt", mode="w") as f:
    for s in doc:
        for c in s:
            if '動詞' in [m.pos for m in c.morphs]:
                verb = c.morphs[0].base
                for i in c.srcs:
                    v_head = s[i].morphs[-2:]
                    if len(v_head) < 2:
                        continue
                    if v_head[0].pos1 == "サ変接続" and v_head[1].surface == "を":
                        verb = ''.join([m.surface for m in v_head]) + verb
                        joshi_dic = {}

                        for j in c.srcs:
                            if len(s[j].morphs) < 2:
                                continue
                            srclast = s[j].morphs[-1]
                            if srclast.pos == '助詞' and srclast.surface != "を":
                                joshi_dic[srclast.surface] =  ''.join([m.surface for m in s[j].morphs])

                        if len(joshi_dic.keys()) > 0:
                            joshi_list = list(joshi_dic.keys())
                            joshi_list.sort()
                            row = verb + "\t" +  " ".join(joshi_list) + "\t" + " ".join([joshi_dic[joshi] for joshi in joshi_list])
                            f.write(row + "\n")
$ cat neko_func_verb.txt | sed "s/\t/ /g"| cut -f 1 -d " " | sort | uniq -c  | sort -rn -k 3
$ cat neko_func_verb.txt | sed "s/\t/+/g"| cut -f 1,2 -d "+" | sed "s/+/\t/g" | sort | uniq -c  | sort -rn -k 3

"If there are multiple clauses related to the predicate, arrange all the terms separated by spaces (aligned with the order of the particles)."

Since the particles ended up as dictionary keys here, keeping them in correspondence with the terms was straightforward. From this point on, though, no matter how many times I read the problems I struggled to understand them.

48 Extracting paths from nouns to roots

For every clause in a sentence that contains a noun, extract the path from that clause to the root of the syntax tree. A path on the syntax tree shall satisfy the following specifications.

Each clause is represented by its (surface-form) morpheme sequence.
From the start clause to the end clause of the path, concatenate the representations of the clauses with " -> ".

From the sentence "吾輩はここで始めて人間というものを見た" ("Here I saw a human being for the first time", the 8th sentence of neko.txt.cabocha), the following output should be obtained.

吾輩は -> 見た
ここで -> 始めて -> 人間という -> ものを -> 見た
始めて -> 人間という -> ものを -> 見た
人間という -> ものを -> 見た
ものを -> 見た

# 48
for s in doc:
    for c in s:
        if "noun" in [m.pos for m in c.morphs]:
            row = "".join([m.surface for m in c.morphs])
            chunk_to = c.dst
            if chunk_to == -1:
                continue
            while(chunk_to != -1):
                row += " -> " + "".join([m.surface for m in s[chunk_to].morphs])
                chunk_to = s[chunk_to].dst
            print(row)

Compared with the previous problems, this one went smoothly.

49 Extraction of dependency paths between nouns

Extract the shortest dependency path that connects every pair of noun phrases in a sentence. When the clause numbers of the noun phrase pair are i and j (i < j), the dependency path shall satisfy the following specifications.

As in problem 48, a path is expressed by concatenating the representations (surface-form morpheme sequences) of the clauses from the start clause to the end clause with " -> ".
The noun phrases contained in clauses i and j are replaced with X and Y, respectively.

The shape of a dependency path can be one of the following two kinds.

If clause j exists on the path from clause i to the root of the syntax tree: show the path from clause i to clause j.
Otherwise, if clause i and clause j meet at a common clause k on their paths to the root of the syntax tree: show the path from clause i up to just before clause k, the path from clause j up to just before clause k, and the contents of clause k, joined with " | ".

For example, from the sentence "吾輩はここで始めて人間というものを見た" (the 8th sentence of neko.txt.cabocha), the following output should be obtained.

Xは | Yで -> 始めて -> 人間という -> ものを | 見た
Xは | Yという -> ものを | 見た
Xは | Yを | 見た
Xで -> 始めて -> Y
Xで -> 始めて -> 人間という -> Y
Xという -> Y

# 49
for s in doc:
    # i < j, so the last clause never needs to be considered as i
    for i, c in enumerate(s[:-1]):
        if "noun" in [m.pos for m in c.morphs] and c.morphs[-1].pos == "Particle":
            #Find j
            for c_rest in s[i+1:]:
                if "noun" in [m.pos for m in c_rest.morphs] and c_rest.morphs[-1].pos == "Particle":
                    i_clause =  "".join([m.surface if m.pos != "名詞" else "X" for m in c.morphs])
                    j_clause =  "".join([m.surface if m.pos != "名詞" else "Y" for m in c_rest.morphs])
                    
                    row = i_clause
                    chunk_to = c.dst
                    #compute the path to the root to check whether j lies on it
                    kkr_path = [chunk_to]
                    while(kkr_path[-1] != -1):
                        kkr_path.append(s[chunk_to].dst)
                        chunk_to = s[chunk_to].dst
                    
                    if s.index(c_rest) in kkr_path:
                        chunk_to = c.dst
                        while(chunk_to != s.index(c_rest)):
                            row += " -> " + "".join([m.surface for m in s[chunk_to].morphs])
                            chunk_to = s[chunk_to].dst
                        row += " -> " + j_clause
                    else:
                        row += " | " + j_clause
                        chunk_to = c_rest.dst
                        while(s[chunk_to].dst != -1):
                            row += " -> " + "".join([m.surface for m in s[chunk_to].morphs])
                            chunk_to = s[chunk_to].dst
                        row += " | " + "".join([m.surface for m in s[chunk_to].morphs])
                        
                    print(row)

"Replace the noun phrases contained in clauses i and j with X and Y, respectively."

My reading of the specification differs slightly from the output example here as well: I replaced the nouns themselves, producing clauses such as "Xは".

Chapter 6: Processing English Text

Perform the following processing on the English text (nlp.txt).

$ wget http://www.cl.ecei.tohoku.ac.jp/nlp100/data/nlp.txt

50 Sentence breaks

Treat the pattern (. or ; or : or ? or !) → whitespace → uppercase letter as a sentence boundary, and output the input document in the form of one sentence per line.

# 50
import re
sentence_sep = re.compile(r'(\.|;|:|\?|!) ([A-Z])')

with open("./nlp.txt") as f:
    txt = f.read()
txt = re.sub(sentence_sep, r'\1\n\2', txt)
print(txt)

I solved this one just by thinking it through.
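A quick check of the boundary regex on a toy English string (my own example, not from nlp.txt):

import re

sentence_sep = re.compile(r'(\.|;|:|\?|!) ([A-Z])')
sample = "This is one. This is two! And three? Yes."
print(re.sub(sentence_sep, r'\1\n\2', sample))
# This is one.
# This is two!
# And three?
# Yes.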

51 Word cutout

Treat whitespace as word boundaries, take the output of 50 as input, and output one word per line. However, output a blank line at the end of each sentence.

# 51
def space2return(txt):
    sentence_sep = re.compile(r'(\.|;|:|\?|!)\n([A-Z])')
    txt = re.sub(sentence_sep, r'\1\n\n\2', txt)
    return re.sub(r' ', r'\n', txt)

txt = space2return(txt)
print(txt)

It takes the output of 50 as input, but because line breaks inside sentences behaved a little unexpectedly, I handled them ad hoc.

52 Stemming

Take the output of 51 as input, apply Porter's stemming algorithm, and output the word and its stem in tab-delimited format. In Python, use the stemming module as the implementation of Porter's stemming algorithm.

# 52
from nltk.stem import PorterStemmer
ps = PorterStemmer()

def stem_text(txt):
    for l in txt.split('\n'):
        yield l + '\t' + ps.stem(l)
    
for line in stem_text(txt):
    print(line)

The stemming module wasn't available at the provided link, so I substituted NLTK's PorterStemmer. I also tried yield, which I usually avoid, and was reminded that a function containing yield returns an iterator.
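A minimal reminder of that point (my own example): a function containing yield returns a generator object, and its body only runs as the generator is consumed.

def gen():
    yield "word\tstem"

g = gen()
print(g)        # <generator object gen at 0x...>
print(next(g))  # word	stem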

53 Tokenization

Use Stanford Core NLP to get the analysis result of the input text in XML format. Also, read this XML file and output the input text in the form of one word per line.

$ wget http://nlp.stanford.edu/software/stanford-corenlp-full-2018-10-05.zip
$ unzip stanford-corenlp-full-2018-10-05.zip
$ java -cp "./stanford-corenlp-full-2018-10-05/*" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,parse,lemma,ner,coref -file ./nlp.txt

For this one, the work was less about struggling with the XML parsing and more about consulting the reference and adjusting -annotators at Java run time as needed.

# 53
import xml.etree.ElementTree as ET

tree = ET.parse("./nlp.txt.xml")
root = tree.getroot()
for token in root.iter("token"):
    print(token.find("word").text)

I suppose the way to go is to dump the whole thing and check which tags appear?
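One rough way to do that inspection (my own sketch; it just lists the element tags that appear in nlp.txt.xml):

import xml.etree.ElementTree as ET

root = ET.parse("./nlp.txt.xml").getroot()
print(sorted({elem.tag for elem in root.iter()}))  # unique element tags in the file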

54 Part-of-speech tagging

Read the analysis result XML of Stanford Core NLP and output words, lemmas, and part of speech in tab-delimited format.

# 54
for token in root.iter("token"):
    print(token.find("word").text + "\t" + token.find("lemma").text + "\t" + token.find("POS").text)

Very hard to read

55 Named entity recognition

Extract all personal names in the input text.

# 55
for token in root.iter("token"):
    NERtag = token.find("NER").text
    if NERtag == "PERSON":
        print(token.find("word").text)

I thought I would have to implement NER myself, but it was already tagged.

56 Co-reference analysis

Replace the reference expression (mention) in the sentence with the representative reference expression (representative mention) based on the result of the co-reference analysis of Stanford Core NLP. However, when replacing, be careful so that the original reference expression can be understood, such as "representative reference expression (reference expression)".

# 56
rep_dic_list = []

#Making a dictionary
for coreferences in root.findall("document/coreference"):
    for mentions in coreferences:
        for m in mentions:
            if "representative" in m.attrib:
                rep_txt = m.find("text").text
            else:
                tmp_dic = {}
                tmp_dic["sentence"] = m.find("sentence").text
                tmp_dic["start"] = m.find("start").text
                tmp_dic["end"] = m.find("end").text
                tmp_dic["rep_txt"] = rep_txt
                rep_dic_list.append(tmp_dic)
                
#output
for s in root.iter("sentence"):
    rep_sent_list = [rd for rd in rep_dic_list if rd["sentence"] == s.attrib["id"]]
    #Whether the statement needs to be replaced
    if len(rep_sent_list) == 0:
            print(" ".join([token.find("word").text for token in s.iter("token")]), end=" ")
    else:
        for token in s.iter("token"):
            tid = token.attrib["id"]
            rep_token_list = [rd for rd in rep_sent_list if rd["start"] == tid or rd["end"] == tid]
            
            if len(rep_token_list) > 0:
                #Since there is only one, take it out
                rep_dic = rep_token_list[0]
                
                #Decoration
                if tid == rep_dic["start"]:
                    print("「" + rep_dic["rep_txt"] + " (", end=" ")
                if tid == rep_dic["end"]:
                    print(")」", end=" ")
                    
            print(token.find("word").text, end=" ")

I couldn't understand the problem statement and was stuck here for a long time. I build a dictionary of mentions and decorate them (prepend the representative mention) rather than replacing them outright.

57 Dependency analysis

Visualize the collapsed-dependencies of Stanford Core NLP as a directed graph. For visualization, convert the dependency tree to DOT language and use Graphviz. Also, to visualize directed graphs directly from Python, use pydot.

# 57
import random, pathlib
from graphviz import Digraph

f = pathlib.Path('nlp.png')
fmt = f.suffix.lstrip('.')
fname = f.stem

dot = Digraph(format=fmt)
dot.attr("node", shape="circle")

sent_id = 3

for sents in root.findall(f"document/sentences/sentence[@id='{sent_id}']"):
    for deps in sents:
        for dep in deps.findall("[@type='collapsed-dependencies']"):
            for token in dep:
                gvnr = token.find("governor")
                dpnt = token.find("dependent")
                dot.node(gvnr.attrib["idx"], gvnr.text)
                dot.node(dpnt.attrib["idx"], dpnt.text)
                dot.edge(gvnr.attrib["idx"], dpnt.attrib["idx"])


dot.filename = fname
dot.render()

# print(dot)
from IPython.display import Image, display_png
display_png(Image(str(f)))

At first I skimmed over governor and dependent, so I didn't understand this at all.
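For reference, roughly the shape of one collapsed-dependencies entry that the loop above consumes (the values here are illustrative, not taken from nlp.txt.xml):

import xml.etree.ElementTree as ET

sample = ('<dep type="nsubj">'
          '<governor idx="2">flourished</governor>'
          '<dependent idx="1">research</dependent>'
          '</dep>')
d = ET.fromstring(sample)
# governor = head word, dependent = modifier; the idx attributes become the graphviz node ids above
print(d.attrib["type"], d.find("governor").text, "<-", d.find("dependent").text)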

58 Tuple extraction

Based on the result of the dependency analysis (collapsed-dependencies) of Stanford Core NLP, output sets of "subject predicate object" in tab-delimited format. Use the following definitions of subject, predicate, and object.

Predicate: a word that has children (dependents) in both nsubj and dobj relations
Subject: the child (dependent) related to the predicate by nsubj
Object: the child (dependent) related to the predicate by dobj

# 58
for sents in root.findall(f"document/sentences/sentence"):
    for deps in sents:
        for dep in deps.findall("[@type='collapsed-dependencies']"):
            nsubj_list = []
            for token in dep.findall("./dep[@type='nsubj']"):
                gvnr = token.find("governor")
                dpnt = token.find("dependent")
                nsubj_list.append( {
                    (gvnr.attrib["idx"], gvnr.text): (dpnt.attrib["idx"], dpnt.text)
                })
            for token in dep.findall("./dep[@type='dobj']"):
                gvnr = token.find("governor")
                dpnt = token.find("dependent")
                dobj_tuple = (gvnr.attrib["idx"], gvnr.text)
                
                if dobj_tuple in [list(nsubj.keys())[0] for nsubj in nsubj_list]:
                    idx =  [list(nsubj.keys())[0] for nsubj in nsubj_list].index( dobj_tuple )
                    jutugo = gvnr.text                       # predicate
                    shugo = nsubj_list[idx][dobj_tuple][1]   # subject
                    mokutekigo = dpnt.text                   # object
                    print(shugo + "\t" + jutugo + "\t" + mokutekigo)

My approach: build the dictionary (the list of nsubj pairs) once, then search it for entries that meet the condition.
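A compact variant of the same policy (my own sketch; it assumes the CoreNLP XML uses dependencies elements with governor/dependent children as the code above does, and reuses the root parsed earlier), keying the nsubj pairs directly by the governor's index:

for dep in root.findall("document/sentences/sentence/dependencies[@type='collapsed-dependencies']"):
    # predicate index -> (predicate text, subject text), taken from the nsubj relations
    nsubj = {d.find("governor").attrib["idx"]: (d.find("governor").text, d.find("dependent").text)
             for d in dep.findall("./dep[@type='nsubj']")}
    for d in dep.findall("./dep[@type='dobj']"):
        gi = d.find("governor").attrib["idx"]
        if gi in nsubj:
            pred, subj = nsubj[gi]
            print(subj + "\t" + pred + "\t" + d.find("dependent").text)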

59 S-expression analysis

Read the result of phrase structure analysis (S-expression) of Stanford Core NLP and display all noun phrases (NP) in the sentence. Display all nested noun phrases as well.

# 59 
import xml.etree.ElementTree as ET
import re

def search_nest(t):
    if isinstance(t[0], str):
        if isinstance(t[1], str):
            if t[0] == "NP":
                print(t[1])
            return t[1]
        else:
            if t[0] == "NP":
                np_list = []
                for i in t[1:]:
                    res = search_nest(i)
                    if isinstance(res, str):
                        np_list.append(search_nest(i))
                if len(np_list) > 0:
                    print(' '.join(np_list))
            else:
                for i in t[1:]:
                    search_nest(i)
    else:
        for i in t:
            search_nest(i)


tree = ET.parse("./nlp.txt.xml")
root = tree.getroot()
sent_id = 30

for parse in root.findall(f"document/sentences/sentence[@id='{sent_id}']/parse"):
    S_str = parse.text
    S_str = S_str.replace("(", "('")
    S_str = S_str.replace(")", "')")
    S_str = S_str.replace(" ", "', '")
    S_str = S_str.replace("'(", "(")
    S_str = S_str.replace(")'", ")")
    exec(f"S_tuple = {S_str[:-2]}")
    search_nest(S_tuple)
    

I couldn't parse the nesting of parentheses and ended up with the dirtiest implementation of my life: I forced the text into nested tuples with exec and extracted the noun phrases recursively. I couldn't solve it with regular expressions alone either, so I prioritized getting an answer out.
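For reference, a cleaner route would have been a small recursive S-expression parser instead of exec. A sketch of my own, demonstrated on a made-up parse string rather than the actual CoreNLP output:

def tokenize(s):
    return s.replace("(", " ( ").replace(")", " ) ").split()

def parse_sexp(tokens):
    # build nested lists from a flat token stream
    token = tokens.pop(0)
    if token == "(":
        node = []
        while tokens[0] != ")":
            node.append(parse_sexp(tokens))
        tokens.pop(0)  # drop the closing ")"
        return node
    return token

def words(node):
    # yield only the terminal words, e.g. ['NN', 'cat'] -> 'cat'
    if isinstance(node, list):
        if len(node) == 2 and isinstance(node[1], str):
            yield node[1]
        else:
            for child in node[1:]:
                yield from words(child)

def print_nps(node):
    # print the word string of every NP node, including nested ones
    if isinstance(node, list):
        if node and node[0] == "NP":
            print(" ".join(words(node)))
        for child in node[1:]:
            print_nps(child)

sample = "(ROOT (S (NP (DT The) (NN cat)) (VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat))))))"
print_nps(parse_sexp(tokenize(sample)))
# The cat
# the mat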

In closing

It took about twice as long as Chapters 1 to 4, because both chapters used libraries that were new to me and the number of language-processing terms suddenly increased. The remaining chapters cover databases, machine learning, and word vectors, so I won't let my guard down.
