[PYTHON] 100 language processing knock 2020 "for Google Colaboratory"

I came from a hard-core science faculty as an undergraduate and only moved to a language processing laboratory in graduate school, so I am a beginner at string processing. If you notice any mistakes, please let me know. You can use the code by copying it as is.

It also covers how to install MeCab on Google Colaboratory, how to display Japanese in Matplotlib, and so on. Updated from time to time.

Section 1 [Preparatory Exercise]

00. Reversed string

Get a string in which the characters of the string "stressed" are arranged in reverse (from the end to the beginning)

qiita.rb



target='stressed'
def reverse(target):
  length_target = len(target)
  w=target
  rve=''    
  for i in range(length_target):
    rve += w[length_target - i-1]
  return rve

print(reverse(target))
print(target[::-1])

I tried it in two ways. I had a hard time with Python because I had only ever used it for numerical data processing. Reviewing the slice notation afterwards was well worth it!

2. "Patatokukashi"

qiita.rb


#Take out the 1st, 3rd, 5th, and 7th characters of the string "Patatokukashii" and concatenate them.
w='Patatoku Kashii'
w_even = ""
w_odd = ''
j=0
for i in w:
  if ( j % 2 == 0):
    w_even += w[j]
  if (j % 2 == 1):
    w_odd += w[j]
  j+=1

print(w_even,w_odd)
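
For comparison, the same split can be done with slicing; a minimal sketch using the same (translated) string — in the original problem the string is 「パタトクカシーー」:

w = 'Patatoku Kashii'
print(w[::2])    #1st, 3rd, 5th, 7th, ... characters (odd positions, counting from 1)
print(w[1::2])   #2nd, 4th, 6th, ... characters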

3. "Police car" + "Taxi" = "Patatokukashi" Permalink

qiita.rb


#Get the string "Patatokukashii" by alternately concatenating the characters of "Police car" and "Taxi" from the beginning.

w=''
for i in range(max(len(w_even),len(w_odd))):
  w += w_even[i]
  w += w_odd[i]
w 
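
An alternative sketch using zip and join; note that the loop above indexes up to the longer of the two strings and would raise IndexError if their lengths differed, whereas zip simply stops at the shorter one:

print(''.join(a + b for a, b in zip(w_even, w_odd)))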

04. Element symbol

qiita.rb



s = 'Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can.'
#Split the sentence into words. Take the first letter of the 1st, 5th, 6th, 7th, 8th, 9th, 15th, 16th and 19th words, and the first two letters of all other words.
#Create an associative array (dictionary or map type) from the extracted string to the word's position (its number counted from the beginning).

def Permalink(s):
  s_list = s.split()
  num=len(s_list)
  w={}
  for i in range(num):
    if (i==1-1 or i== 5-1 or i== 6-1 or i== 7-1 or i== 8-1 or i== 9-1 or i== 15-1 or i== 16-1 or i== 19-1):
 #     w.append(s_list[i][0])
       w[i]=(s_list[i][0])
    else:
      w[i]=(s_list[i][0]+s_list[i][1])
  print(w)
  
Permalink(s)
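
Strictly, the statement asks for a map from the extracted string to the word's position, while the function above maps position to string. A compact sketch of the requested direction, assuming the same sentence s:

one_letter = {1, 5, 6, 7, 8, 9, 15, 16, 19}   #1-based positions that keep only the first letter
symbols = {word[:1] if i in one_letter else word[:2]: i
           for i, word in enumerate(s.split(), start=1)}
print(symbols)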
05. n-gram

Create a function that creates an n-gram from a given sequence (string, list, etc.). Use this function to get the word bi-grams and the character bi-grams from the sentence "I am an NLPer".

qiita.rb




def n_gram_word(sentense,n):
  sen=sentense.split()
  w={}
  w=set(w)
  num=len(sen)
  for i in range(num - n + 1):
    w0 = ''
    for j in range(n):
      w0 += sen[i+j]
      w0 += ' '
    w.add(w0)
  return w

def n_gram_moji(sentense,n):
  sentense=sentense.replace(' ','')
  sen=list(sentense)
  w={}
  w=set(w)
  num=len(sen)
  for i in range(num - n + 1):
    w0 = ''
    for j in range(n):
      w0 += sen[i+j]
    w.add(w0)
  return w

s='I am an NLPer'
n_gram_word(s,2)
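
The character bi-grams come from the second function:

print(n_gram_moji(s, 2))
#e.g. {'Ia', 'am', 'ma', 'an', 'nN', 'NL', 'LP', 'Pe', 'er'} (set order varies)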

06. Sets

Let X and Y be the sets of character bi-grams contained in "paraparaparadise" and "paragraph" respectively, and find the union, intersection, and difference of X and Y. In addition, find out whether the bi-gram 'se' is included in X and in Y.

qiita.rb




s1 = "paraparaparadise"
s2 = "paragraph"

w1=n_gram_moji(s1,2)
w2=n_gram_moji(s2,2)
print(type(w1))
print('The union is', w1 | w2)
print('The intersection is', w1 & w2)
print('The difference set is', w1 - w2)

W={'se'}
print(W<=w1)
print(W<=w2)
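
Membership of a single bi-gram can also be checked directly with the in operator:

print('se' in w1)   #True:  'se' appears in "paraparaparadise"
print('se' in w2)   #False: it does not appear in "paragraph"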

07. Sentence generation by template

Implement a function that takes arguments x, y, and z and returns the string "y at x is z". Set x = 12, y = "temperature", z = 22.4 and check the execution result.

qiita.rb

def temple(x,y,z):
  print(x,'of time',y,'Is',z)

temple(12,'temperature',22.4)
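
The same template can be written as an f-string. The original problem expects the Japanese format 「x時のyはz」 (e.g. 「12時の気温は22.4」), which the translated print above only approximates; a minimal sketch:

def temple_f(x, y, z):
  return f'{x}時の{y}は{z}'

print(temple_f(12, '気温', 22.4))   #=> 12時の気温は22.4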

08. Ciphertext

Implement a function cipher that converts each character of a given string as follows: replace each lowercase letter with the character whose character code is (219 - original code), and output all other characters unchanged. Use this function to encrypt and decrypt English messages.

qiita.rb



def chipher (s):
#  s=s.lower()
#  s=s.replace(' ', '')
  c=''
  for w in s:
    a = chr(219-ord(w))
    c+=a
  return c

def dechipher(s):
  c=''
  for w in s:
    a=chr(219-ord(w))
    c += a
  return c

s='Neuro Linguistic Programming  has two main definitions: while it began as a set of techniques to understand and codify the underlying elements of genius by modeling the conscious and unconscious behaviors of brilliant communicators and therapists, over the years, it has evolved into a set of frameworks, processes and protocols (the results of modeling) that qualified NLP Practitioners currently use to help evoke effective behavioral changes in clients.'
print(s)
print(chipher(s))
print(dechipher(chipher(s)))
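
The functions above convert every character, but the specification says to convert only lowercase letters and to pass everything else through unchanged. A sketch that follows the specification; since the mapping is its own inverse, one function serves for both encryption and decryption:

def cipher(s):
  #only lowercase letters are mapped to chr(219 - ord(c)); other characters are left as they are
  return ''.join(chr(219 - ord(c)) if c.islower() else c for c in s)

print(cipher('Neuro Linguistic Programming'))
print(cipher(cipher('Neuro Linguistic Programming')))   #applying it twice restores the original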
09. Typoglycemia

Create a program that, for a word string separated by spaces, keeps the first and last letter of each word and randomly rearranges the order of the other letters. However, words with a length of 4 or less are not rearranged. Give it an appropriate English sentence (for example, "I couldn't believe that I could actually understand what I was reading: the phenomenal power of the human mind.") and check the execution result.

qiita.rb


import random
def wordc (s):
  s=s.split()
  word=''
  for w in s:
    if len(w)>4:
      a=list(w)
      a.pop(0)
      num=len(w)
      a.pop(num-2)
      aa=list(w)
      random.shuffle(a)
      a.insert(0,aa[0])
      a.insert(len(w),aa[len(w)-1])
      w=''.join(a)
    word += w
    word += ' '
  return word
#s='volcano'
s='Neuro Linguistic Programming  has two main definitions: while it began as a set of techniques to understand and codify the underlying elements of genius by modeling the conscious and unconscious behaviors of brilliant communicators and therapists, over the years, it has evolved into a set of frameworks, processes and protocols (the results of modeling) that qualified NLP Practitioners currently use to help evoke effective behavioral changes in clients.'

print(wordc(s))
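
A slightly shorter sketch of the same idea using slicing and random.sample:

import random

def typoglycemia(sentence):
  words = []
  for w in sentence.split():
    if len(w) > 4:
      #keep the first and last letters, shuffle everything in between
      w = w[0] + ''.join(random.sample(w[1:-1], len(w) - 2)) + w[-1]
    words.append(w)
  return ' '.join(words)

print(typoglycemia("I couldn't believe that I could actually understand what I was reading"))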

The part that really matters is how to install MeCab for the later chapters, so let's move on to that first.

Section 2 [UNIX command]

I wrote the answers myself, but with reference to other people's write-ups:

Answer and impression of 100 language processing knocks-Part 1

Section 3 [Regular Expression]

There is a file jawiki-country.json.gz that exports Wikipedia articles in the following format.

20. Read JSON data

Read the JSON file of Wikipedia articles and display the text of the article about the UK. For problems 21-29, operate on the article text extracted here.

qiita.rb


!wget http://www.cl.ecei.tohoku.ac.jp/nlp100/data/jawiki-country.json.gz

qiita.rb


## 20
import json, gzip
with gzip.open('jawiki-country.json.gz', 'rt') as f:
    for line in f:
        country = json.loads(line)
        if country['title'] == 'イギリス':   #the article titles in this dump are Japanese; 'イギリス' is the UK article
            break

print(country)


My code for the rest of this chapter (problems 21-29) was lost ...

Section 4 [Morphological Analysis]

Use MeCab to morphologically analyze the text (neko.txt) of Natsume Soseki's novel "I am a cat" and save the result in a file called neko.txt.mecab. Use this file to implement a program that addresses the following questions.

For problems 37, 38, and 39, use matplotlib or Gnuplot.

qiita.rb


!wget http://www.cl.ecei.tohoku.ac.jp/nlp100/data/neko.txt
#MeCab must already be installed (the install commands are in the next block)
import MeCab
tagger = MeCab.Tagger('-Ochasen')   #ChaSen output format is assumed here, since the parser in problem 30 expects tab-separated fields
with open ('./neko.txt') as f:
  t = tagger.parse(f.read())
  with open ('./neko.txt.mecab', mode = 'w') as ff:
    ff.write(t)

qiita.rb


!apt install aptitude
!aptitude install mecab libmecab-dev mecab-ipadic-utf8 git make curl xz-utils file -y
!pip install mecab-python3==0.7

With this, MeCab is installed. Is !apt the generic UNIX-style package tool and !aptitude the Ubuntu one? I'm not sure about the distinction; I followed other sites, in particular "Enable MeCab in Colaboratory" (see the references).

30. Reading morphological analysis results

Implement a program that reads the morphological analysis result (neko.txt.mecab). Each morpheme is stored in a mapping type with the surface form (surface), base form (base), part of speech (pos), and part-of-speech subdivision 1 (pos1) as keys, and one sentence is expressed as a list of such morpheme mappings. For the rest of the problems in Chapter 4, use the program created here.

qiita.rb


with open ('./neko.txt.mecab') as f:
  worddict = dict()
  surface = list()
  pronaunce = list()
  base = list()
  pos = list()
  pos1 = list()
  #read the whole file; each sentence ends with an EOS line, which is skipped
  for line in f:
    if line.startswith('EOS') or line == '\n':
      continue
    t = line.rstrip('\n').split('\t')
    if len(t) < 5:   #guard against malformed lines
      continue
    surface.append(t[0])
    pronaunce.append(t[1])
    base.append(t[2])
    pos.append(t[3])
    pos1.append(t[4])
worddict['surface'] = surface
worddict['pronaunce'] = pronaunce
worddict['base'] = base
worddict['pos'] = pos
worddict['pos1'] = pos1

Much of my data for problems 30 through 36 disappeared, which was really discouraging. They're easy though, so I'll skip re-doing the missing ones!

31. Verbs

Extract all surface forms of verbs.
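
Since my code for this one was lost, here is a minimal sketch, assuming the worddict built in problem 30. The pos labels are written in English to match the rest of this article; in an actual ChaSen dump they are Japanese (e.g. '動詞' for verb):

verb_surfaces = [worddict['surface'][i]
                 for i, p in enumerate(worddict['pos'])
                 if p.startswith('verb')]
print(verb_surfaces[:10])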

32. Base forms of verbs

Extract all base forms of verbs.
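
Same idea as 31, but collecting the base forms instead of the surface forms (again a sketch over worddict):

verb_bases = [worddict['base'][i]
              for i, p in enumerate(worddict['pos'])
              if p.startswith('verb')]
print(verb_bases[:10])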

33. "A's B" Permalink

Extract a noun phrase in which two nouns are connected by "no".
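
A sketch that scans consecutive morphemes for the pattern noun + 「の」 + noun (the particle is written as 'の' here; in the translated output above it shows up as 'of'):

a_no_b = []
for i in range(1, len(worddict['surface']) - 1):
  if (worddict['surface'][i] == 'の'
      and worddict['pos'][i - 1].startswith('noun')
      and worddict['pos'][i + 1].startswith('noun')):
    a_no_b.append(worddict['surface'][i - 1] + 'の' + worddict['surface'][i + 1])
print(a_no_b[:10])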

34. Noun concatenation

Extract concatenations of nouns (nouns that appear consecutively) with the longest match. A lot of my data disappeared here, so I'll come back to it if I have time; a minimal sketch follows.
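
A sketch that collects maximal runs of two or more consecutive nouns from worddict:

runs, current = [], []
for s_, p in zip(worddict['surface'], worddict['pos']):
  if p.startswith('noun'):
    current.append(s_)
  else:
    if len(current) >= 2:
      runs.append(''.join(current))
    current = []
if len(current) >= 2:
  runs.append(''.join(current))
print(runs[:10])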

35. Frequency of word occurrence

Find the words that appear in the sentence and their frequency of appearance, and arrange them in descending order of frequency of appearance.

qiita.rb


tango = list()
for i in range(len(worddict['pos'])):
  t = worddict['pos'][i]
  if t.startswith('noun'):   #pos labels appear translated here; in the actual run they are Japanese (e.g. '名詞')
    tango.append(worddict['surface'][i])
word_d = dict()
for t in set(tango):
  word_d[t] = tango.count(t)
wordsort = sorted(word_d.items(), key = lambda  x:-x[1])
print (wordsort)

#[('of', 1611), ('Thing', 1207), ('もof', 981), ('You', 973), ('master', 932), ('Hmm', 704), ('Yo', 697), ('Man', 602), ('one', 554), ('what', 539), ('I', 481), ('this', 414),
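
Counting with list.count inside a loop over set(tango) is quadratic; collections.Counter produces the same ranking in a single pass:

from collections import Counter
wordsort = Counter(tango).most_common()
print(wordsort[:10])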

36. Top 10 most frequent words

Display the 10 most frequently appearing words and their frequencies in a graph (for example, a bar graph). First, install a Japanese font; otherwise the labels all render as tofu (empty boxes) in the graph.

qiita.rb


!apt-get -y  install fonts-ipafont-gothic
#Delete the cache.
!rm /root/.cache/matplotlib/fontlist-v310.json #Cache to be erased

Delete this cached font list. The file name (fontlist-v310.json here) differs between matplotlib versions, and other sites show a different value, so check what it is in your own environment.

qiita.rb


!ls  /root/.cache/matplotlib/
#The cache directory can also be found with the following code.
import matplotlib
matplotlib.get_cachedir()

qiita.rb



wordhead = dict(wordsort[:10])
print(type(wordhead))
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(font='IPAGothic')

plt.bar(wordhead.keys(),wordhead.values())
plt.show()

I was quite pleased when this worked: the labels really do come out in Japanese. I wish I had known this back in my undergraduate days...

(Figure: bar chart of the ten most frequent words, with Japanese labels)

37. Top 10 words that frequently co-occur with "cat"

Display the 10 words that most often co-occur with "cat" (highest co-occurrence frequency) and their frequencies in a graph (for example, a bar graph).

qiita.rb


cooc = dict()
coocdict = dict()
cooclist = list()
#the string literals below appear translated; in the actual run they are Japanese ('猫', '助詞', '記号', '助動詞')
for i in range(len(worddict['surface'])):
  t = worddict['pos'][i]
  if (not t.startswith('Particle') and not t.startswith('symbol') and not t.startswith('Auxiliary verb')):
    cooclist.append(worddict['surface'][i])
for t in set(cooclist) :
  coocdict[t] = 0
#print(cooclist)
for i in range(len(cooclist)):
  if cooclist[i] == 'Cat':
    if i > 0:
      coocdict[cooclist[i-1]] = coocdict[cooclist[i-1]] + 1
    if i < len(cooclist) - 1:
      coocdict[cooclist[i+1]] = coocdict[cooclist[i+1]] + 1
coocsort = sorted (coocdict.items(), key = lambda x: -x[1])
coochead = dict(coocsort[:10])
plt.bar(coochead.keys(), coochead.values())

Messy code; there are certainly better ways. Even though I belong to a language processing laboratory I didn't know the term co-occurrence, so this became a bottleneck. It would be interesting to capture the nuance better with distributed representations such as word2vec. (Figure: bar chart of the ten words that most often co-occur with "cat".)

39. Zipf's Law

Plot a log-log graph with the frequency rank of each word on the horizontal axis and its frequency of occurrence on the vertical axis.

qiita.rb


ziplow = dict()
for t in set(worddict['base']):
  ziplow[t] = worddict['base'].count(t)
import math
zipsort = sorted(ziplow.items(), key = lambda x:-x[1])
zipsort[:100]
zipplot= list()
logi=list()
for i in range(len(set(worddict['base']))):
  logi.append(math.log10(i+1))
  zipplot.append(math.log10(zipsort[i][1]))
print(zipplot[:][0])
plt.scatter(logi, zipplot)

I don't think this came up in the natural-language-processing class I sat in on as an undergraduate. The figure comes out the same as the one on the project's page (when the base forms are extracted).

Section 5 [Dependency analysis]

First, install CaboCha.

qiita.rb


#Install Mecab
!apt install aptitude
!aptitude install mecab libmecab-dev mecab-ipadic-utf8 git make curl xz-utils file -y
!pip install mecab-python3==0.7

qiita.rb


import os
#Download CRF++ to the file name below
filename_crfpp = 'crfpp.tar.gz'
!wget "https://drive.google.com/uc?export=download&id=0B4y35FiV1wh7QVR6VXJ5dWExSTQ" -O $filename_crfpp
#Extract the archive
!tar zxvf $filename_crfpp
#Move into the extracted CRF++ directory (check with !ls ./ that it exists in the current directory)
%cd CRF++-0.58
#Run the configure script inside it (it is a script in the directory, not an installed command)
!./configure
#configure checks whether the build environment is ready; if it passes, a Makefile is generated, which is built with make (no options needed)
!make
!make install
%cd ..
os.environ['LD_LIBRARY_PATH'] += ':/usr/local/lib' 

qiita.rb


FILE_ID = "0B4y35FiV1wh7SDd1Q1dUQkZQaUU"
FILE_NAME = "cabocha.tar.bz2"
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=$FILE_ID' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=$FILE_ID" -O $FILE_NAME && rm -rf /tmp/cookies.txt
!tar -xvf cabocha.tar.bz2

%cd cabocha-0.69
!./configure --with-mecab-config=`which mecab-config` --with-charset=UTF8
!make
!make check
!make install
%cd ..
!cabocha --version

Commands needed to use CaboCha from Python

qiita.rb


%cd cabocha-0.69/python
!python setup.py build_ext
!python setup.py install
!ldconfig
%cd ../..

Use CaboCha to parse the text (neko.txt) of Natsume Soseki's novel "I am a cat" and save the result in a file called neko.txt.cabocha. Use this file to implement a program that addresses the following questions.

qiita.rb


!wget http://www.cl.ecei.tohoku.ac.jp/nlp100/data/neko.txt
import CaboCha
c = CaboCha.Parser()
with open ('neko.txt') as f:
  with open ('neko.txt.cabocha', mode = 'w') as ff:
    line = f.readline()
    while line:
      ff.write(c.parse(line).toString(CaboCha.FORMAT_LATTICE))
      line = f.readline()

40. Reading the dependency analysis result (morpheme)

Implement a class Morph that represents a morpheme. This class has the surface form (surface), base form (base), part of speech (pos), and part-of-speech subdivision 1 (pos1) as member variables. In addition, read the CaboCha analysis result (neko.txt.cabocha), express each sentence as a list of Morph objects, and display the morpheme sequence of the third sentence.

41. Reading the dependency analysis result (phrase / dependency)

In addition to 40, implement a clause class Chunk. This class has a list of morphemes (Morph objects) (morphs), the index number of the clause it depends on (dst), and a list of the index numbers of the clauses that depend on it (srcs) as member variables. In addition, read the CaboCha analysis result of the input text, express one sentence as a list of Chunk objects, and display the strings of the clauses of the eighth sentence together with their dependency destinations. For the rest of the problems in Chapter 5, use the program created here.

I did 40 and 41 in one go, since Chunk inherits from Morph.

qiita.rb


class Morph:
  def __init__(self, line):
    line = line.replace('\t',',').split(',')
    self.surface =  line[0]
    #some tokens (e.g. ニャーニャー) have no usable base form in the output, so reuse the surface form
    self.base = line[0]
    self.pos = line[1]
    self.pos1 = line[2]

  def out_morph(self):
    print(type(self.surface))

  def listmaker (self):
    t = [self.surface, self.base, self.pos, self.pos1]
    return t

class Chunk(Morph):
  def __init__(self, line):
    #line[0] is the '* ...' header line of the chunk
    l_sp = line[0].split(' ')
    self.srcs = l_sp[1]   #the chunk's own index
    self.dst = l_sp[2]    #the index of the chunk it depends on (e.g. '2D')
    m = []
    for i in range(len(line)):
      if i != 0:
        m.append (Morph(line[i]).listmaker())
    self.morph = m

  def __str__(self):
    #returns a plain list [srcs, dst, morphs]; it is called directly below rather than via str()
    c = [self.srcs, self.dst, self.morph]
    return c

with open ('neko.txt.cabocha') as f:
  text = f.read()
  t = [ r for r in text.split('EOS') if r != '\n']

  for text_EOS in t:
    line_chunk = list()
    a = 0
    b = 0
    num = 2
    t = [ r for r in text_EOS.split('\n') if r != '']
    for t in text_EOS.split('\n'):
      if (a == 1):
        num = 3
      if t != '':
        if a == len(text_EOS.split('\n'))-num:
          line_chunk.append(line)

        elif (t[0] == '*'):
          #Chunk is completed and stored
          if line_chunk != '':
            line_chunk.append(line)
          line = list()
          line.append(t)

        elif(t[0] != '*'):
          line.append(t)

        a = a +1  
    b = b +1
    chunk = [Chunk(c).__str__() for c in line_chunk]
    print(chunk)

Afterwards I converted the result into plain lists to make analysis easier. Each chunk is represented as [source clause index, destination clause index, [chunk morphemes]], where each morpheme is [surface, base, pos, pos1]; so a chunk looks like [index, destination, [[surface, base, pos, pos1], [surface, base, pos, pos1], ...]].

qiita.rb


[['1', '-1D', [['Thank you', 'Thank you', 'adjective', 'Independence']]]]
[['0', '-1D', [['one', 'one', 'noun', 'number']]], ['0', '2D', [['\u3000', '\u3000', 'symbol', 'Blank']]], ['1', '2D', [['I', 'I', 'noun', '代noun'], ['Is', 'Is', 'Particle', '係Particle']]], ['2', '-1D', [['Cat', 'Cat', 'noun', 'one般'], ['so', 'so', 'Auxiliary verb', '*'], ['is there', 'is there', 'Auxiliary verb', '*']]]]
[['2', '-1D', [['Cat', 'Cat', 'noun', 'General'], ['so', 'so', 'Auxiliary verb', '*'], ['is there', 'is there', 'Auxiliary verb', '*']]], ['0', '2D', [['name', 'name', 'noun', 'General'], ['Is', 'Is', 'Particle', '係Particle']]], ['1', '2D', [['yet', 'yet', 'adverb', 'Particle類接続']]], ['2', '-1D', [['No', 'No', 'adjective', 'Independence']]]]
[['2', '-1D', [['No', 'No', 'adjective', 'Independence']]], ['0', '1D', [['\u3000', '\u3000', 'symbol', 'Blank'], ['Where', 'Where', 'noun', '代noun'], ['so', 'so', 'Particle', '格Particle']]], ['1', '4D', [['Born', 'Born', 'verb', 'Independence'], ['Ta', 'Ta', '助verb', '*'], ['Or', 'Or', 'Particle', '副Particle/並立Particle/終Particle']]], ['2', '4D', [['Tonto', 'Tonto', 'adverb', 'General']]], ['3', '4D', [['Register', 'Register', 'noun', 'Change connection'], ['But', 'But', 'Particle', '格Particle']]], ['4', '-1D', [['つOr', 'つOr', 'verb', 'Independence'], ['Nu', 'Nu', '助verb', '*']]]]
[['4', '-1D', [['Tsuka', 'Tsuka', 'verb', 'Independence'], ['Nu', 'Nu', '助verb', '*']]], ['0', '1D', [['what', 'what', 'noun', '代noun'], ['But', 'But', 'Particle', '副Particle']]], ['1', '3D', [['dim', 'dim', 'adjective', 'Independence']]], ['2', '3D', [['Damp', 'Damp', 'adverb', 'General'], ['Shi', 'Shi', 'verb', 'Independence'], ['Ta', 'Ta', '助verb', '*']]], ['3', '5D', [['Place', 'Place', 'noun', '非Independence'], ['so', 'so', 'Particle', '格Particle']]], ['4', '5D', [['Meow meow', 'Meow meow', 'noun', 'General']]], ['5', '7D', [['Crying', 'Crying', 'verb', 'Independence'], ['hand', 'hand', 'Particle', '接続Particle']]], ['6', '7D', [['いTa事', 'いTa事', 'noun', 'General'], ['Only', 'Only', 'Particle', '副Particle'], ['Is', 'Is', 'Particle', '係Particle']]], ['7', '-1D', [['Memory', 'Memory', 'noun', 'Change connection'], ['Shi', 'Shi', 'verb', 'Independence'], ['hand', 'hand', 'Particle', '接続Particle'], ['Is', 'Is', 'verb', '非Independence']]]]
[['7', '-1D', [['Memory', 'Memory', 'noun', 'Change connection'], ['Shi', 'Shi', 'verb', 'Independence'], ['hand', 'hand', 'Particle', '接続Particle'], ['Is', 'Is', 'verb', '非Independence']]], ['0', '5D', [['I', 'I', 'noun', '代noun'], ['Is', 'Is', 'Particle', '係Particle']]], ['1', '2D', [['here', 'here', 'noun', '代noun'], ['so', 'so', 'Particle', '格Particle']]], ['2', '3D', [['start', 'start', 'verb', 'Independence'], ['hand', 'hand', 'Particle', '接続Particle']]], ['3', '4D', [['Human', 'Human', 'noun', 'General'], ['That', 'That', 'Particle', '格Particle']]], ['4', '5D', [['thing', 'thing', 'noun', '非Independence'], ['To', 'To', 'Particle', '格Particle']]], ['5', '-1D', [['You see', 'You see', 'verb', 'Independence'], ['Ta', 'Ta', '助verb', '*']]]]
[['5', '-1D', [['You see', 'You see', 'verb', 'Independence'], ['Ta', 'Ta', '助verb', '*']]], ['0', '8D', [['Moreover', 'Moreover', 'conjunction', '*']]], ['1', '2D', [['after', 'after', 'noun', 'General'], ['so', 'so', 'Particle', '格Particle']]], ['2', '8D', [['listen', 'listen', 'verb', 'Independence'], ['When', 'When', 'Particle', '接続Particle']]], ['3', '8D', [['It', 'It', 'noun', '代noun'], ['Is', 'Is', 'Particle', '係Particle']]], ['4', '5D', [['Student', 'Student', 'noun', 'General'], ['Whenいう', 'Whenいう', 'Particle', '格Particle']]], ['5', '8D', [['Human', 'Human', 'noun', 'General'], ['During ~', 'During ~', 'noun', 'suffix'], ['so', 'so', 'Particle', '格Particle']]], ['6', '7D', [['Ichiban', 'Ichiban', 'noun', 'Adverbs possible']]], ['7', '8D', [['Evil', 'Evil', 'noun', '形容verb語幹'], ['Nana', 'Nana', '助verb', '*']]], ['8', '-1D', [['Race', 'Race', 'noun', 'General'], ['so', 'so', '助verb', '*'], ['Ah', 'Ah', '助verb', '*'], ['Ta', 'Ta', '助verb', '*'], ['so', 'so', 'noun', 'Special'], ['Is', 'Is', '助verb', '*']]]]

It took me several hours to clear this one. I wasn't familiar with how to write classes, so it was good practice. I still don't really see the point of classes, though; in Python can't you usually just define a function inside another function?
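
The problems from 42 onward iterate over a variable chunk_transed, a list that holds this list representation for every sentence; the cell that built it is not shown above (it seems to be part of the lost data). A minimal sketch of how it could be reconstructed, assuming the Chunk class above and stripping the trailing 'D' from the dependency destination so that the comparisons with '-1' in the later code work:

chunk_transed = []
with open('neko.txt.cabocha') as f:
  for sent in [s for s in f.read().split('EOS') if s.strip()]:
    line_chunk = []
    line = None
    for t in sent.split('\n'):
      if t == '':
        continue
      if t[0] == '*':            #a '*' line starts a new chunk
        if line is not None:
          line_chunk.append(line)
        line = [t]
      elif line is not None:     #morpheme line belonging to the current chunk
        line.append(t)
    if line is not None:
      line_chunk.append(line)
    sentence = []
    for c in line_chunk:
      idx, dst, morphs = Chunk(c).__str__()
      sentence.append([idx, dst.rstrip('D'), morphs])
    chunk_transed.append(sentence)

With this reconstruction the first element of each sentence is a real chunk, so the sentence[1:] slices in the code below may need to become plain sentence.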

42. Display of source and destination clauses

Extract all pairs of a source clause and the clause it depends on, in tab-delimited format. However, do not output symbols such as punctuation marks.

qiita.rb


for sentence in chunk_transed[1:100]:
  setu = []
  print()   #blank line between sentences
  for chunk in sentence[1:]:
    surface = str()
    
    for se in chunk[2]:
      if (se[2] != 'symbol'):
        surface += se [0]

    if surface != '':
      setu.append([chunk[0], chunk[1],surface])


  for s in (setu):

    if (s[1] != '-1'):
      saki = s[1]

      for ss in setu:
        if ss[0] == saki:
          print(s[2] +'\t\t'+ ss[2])
#Output (translated; each line is a source clause followed by its destination clause):

I am a cat

No name
Not yet

Where were you born
Born or not
I don't get it
I have no idea

Anything dim
In a dim place
In a damp place
Crying at the place
Meow meow crying
I cry and remember
I remember only what I was


43. Extract the clauses containing nouns related to the clauses containing verbs

When clauses containing nouns relate to clauses containing verbs, extract them in tab-delimited format. However, do not output symbols such as punctuation marks.

qiita.rb


for sentence in chunk_transed[1:100]:
  setu = []
  print()   #blank line between sentences
  for chunk in sentence[1:]:
    surface = str()
    hanteiki_meisi = 0 
    hanteiki_dousi = 0 
    for se in chunk[2]:
      if (se[2] != 'symbol'):
        surface += se [0]
      if se [2] == 'noun':
        hanteiki_meisi = 1 #The state where the noun exists in the clause
      if se [2] == 'verb':
        hanteiki_dousi = 1 #The state in which the verb exists in the clause

    if surface != '':
      setu.append([chunk[0], chunk[1],surface, hanteiki_meisi, hanteiki_dousi])


  for s in (setu):

    if (s[1] != '-1') and (s[3] == 1):
      saki = s[1]
      for ss in setu:
        if (ss[0] == saki) and ss[4] == 1:
          print(s[2] +'\t\t'+ ss[2])

44. Visualization of dependent trees

Visualize the dependency tree of a given sentence as a directed graph. For visualization, convert the dependency tree to DOT language and use Graphviz. Also, to visualize directed graphs directly from Python, use pydot.

qiita.rb



#coding:utf-8
import numpy as np
from PIL import Image
import pydot_ng as pydot

#chunk_sentence is a tuple of (source clause, destination clause) edges
def graph_maker(chunk_sentence):
  graph = pydot.graph_from_edges(chunk_sentence, directed=True)
  graph.write_png("./result.png")
  #Load the rendered image
  im = Image.open("./result.png")
  #Convert the image to an array
  im_list = np.asarray(im)
  #Paste it into matplotlib
  plt.imshow(im_list)
  #Display
  plt.show()

#Unused stub for building the graph node by node with pydot.Dot
def grapf_maker_dot (chunk_sentence):
  graph = pydot.Dot(graph_type='digraph')

qiita.rb


all_edge = []
for sentence in chunk_transed[1:100]:
  setu = []
  for chunk in sentence[1:]:
    surface = str()
    
    for se in chunk[2]:
      if (se[2] != 'symbol'):
        surface += se [0]

    if surface != '':
      setu.append([chunk[0], chunk[1],surface])
  #setu holds [source index, destination index, surface string] for every clause of the sentence
  all_edge_sentence = []
  for s in (setu):
    
    if (s[1] != '-1'):
      saki = s[1]

      for ss in setu:
        if ss[0] == saki:
          edge_sentense = ((s[2] , ss[2]))
          all_edge_sentence.append(edge_sentense)
  all_edge.append(all_edge_sentence)

graph_maker(tuple(all_edge[22]))

45. Extraction of verb case patterns

I would like to consider the sentence used this time as a corpus and investigate the cases that Japanese predicates can take. Think of the verb as a predicate and the particle of the phrase related to the verb as a case, and output the predicate and case in tab-delimited format. However, make sure that the output meets the following specifications.

Consider the example sentence "I saw a human being for the first time here" (the 8th sentence of neko.txt.cabocha). It contains two verbs, "begin" (始める) and "see" (見る); the clause related to "begin" is "here" (ここで), and the clauses related to "see" are "I" (吾輩は) and "thing" (ものを). In that case the program should produce the following output:

始める  で
見る    は を

Save the output of this program to a file and check the following items using UNIX commands.

Combinations of predicate and case pattern that frequently appear in the corpus. The case patterns of the verbs "do" (する), "see" (見る), and "give" (与える), arranged in order of frequency of appearance in the corpus.

qiita.rb


kaku = dict()
for sentence in chunk_transed[1:]:
  setu = []
  for chunk in sentence[1:]:
    josi = ''
    dousi = ''
    hanteiki_meisi = 0 
    hanteiki_dousi = 0 
    for se in chunk[2]:
      if (se[2] != 'symbol'):
        if se [2] == 'Particle':
          hanteiki_meisi = 1 #a particle exists in this clause
          josi = se[0]
        if se [2] == 'verb':
          hanteiki_dousi = 1 #a verb exists in this clause
          dousi = se[1]

    #store every clause of the sentence
    setu.append([chunk[0], chunk[1],josi, dousi, hanteiki_meisi, hanteiki_dousi])


  for s in (setu):

    if (s[1] != '-1') and (s[4] == 1):
      saki = s[1]
      for ss in setu:
        if (ss[0] == saki) and ss[5] == 1:
          if ss[3] in set(kaku.keys()):
            kaku[ss[3]].append(s[2])
          else:
            kaku[ss[3]] = [s[2]]

text = str()
for keys, values in kaku.items():
  values_sort = sorted(set(values))
  josi = keys + '\t'
  for v in values_sort:
    josi += v + ' '
  josi = josi + '\n'
  text += josi
  print(josi)
with open ('./kaku.txt' , mode ='w') as f:
  f.write(text)

47. Functional verb syntax mining

I would like to focus only on the case where the ヲ case of the verb contains a サ変接続 (suru-verb) noun. Modify the program from 46 to meet the following specifications.

(The expected output example involves the functional verb 返事をする, "reply", from the sentence where the master replies to a letter.)

Save the output of this program to a file and check the following items using UNIX commands.


48. Extracting paths from nouns to roots

For a clause that contains all the nouns in the sentence, extract the path from that clause to the root of the syntax tree. However, the path on the syntax tree shall satisfy the following specifications.

Each clause is represented by its (surface) morpheme sequence. From the start clause to the end clause of the path, concatenate the expression of each clause with "->". From the sentence "I saw a human being for the first time here" (the 8th sentence of neko.txt.cabocha), the following output should be obtained.

I am->saw
here->Start with->Human->Things->saw
Human->Things->saw
Things->saw

qiita.rb


i= 0
for sentence in chunk_transed[1:10]:
  i = i + 1
  setu = []
  print(i, 'Line')
  for chunk in sentence[1:]:
    surface = str()
    
    for se in chunk[2]:
      if (se[2] != 'symbol'):
        surface += se [0]

    if surface != '':
      setu.append([chunk[0], chunk[1],surface])

# setu = (Dependency number, contact number, expression)
  koubunki = ''
  for s in (setu):
    if (s[1] != '-1'):
      saki = s[1]
      koubunki = s[2]
      for ss in setu:

        if ss[0] == saki:
          koubunki += ' --> ' + ss[2]
          saki = ss[1]
          
      print(koubunki)
Output (translated):

Line 1
I am --> Be a cat
Line 2
Name is --> No
yet --> No
Line 3
where --> Was born --> Do not use
Was born --> Do not use
Tonto --> Do not use
I have a clue --> Do not use

49. Extraction of dependency paths between nouns

Extract the shortest dependency path that connects all noun phrase pairs in a sentence. However, when the phrase number of the noun phrase pair is i and j (i <j), the dependency path shall satisfy the following specifications.

In addition, the shape of the dependency path can be considered in the following two ways.

For example, from the sentence "I saw a human being for the first time here" (8th sentence of neko.txt.cabocha), the following output should be obtained.

X is|In Y->Start with->Human->Things|saw
X is|Called Y->Things|saw
X is|Y|saw
In X->Start with-> Y
In X->Start with->Human-> Y
Called X-> Y

Section 6 [Machine Learning]

In this chapter, we use the News Aggregator Data Set published by Fabio Gasparetti to classify news article headlines into the categories of "business," "science and technology," "entertainment," and "health."

50. Obtaining and shaping data

Download the News Aggregator Data Set and follow the procedure below to create training data (train.txt), validation data (valid.txt), and evaluation data (test.txt).

  1. Unzip the downloaded zip file and read the explanation of readme.txt.
  2. Extract only cases (articles) whose information sources (publishers) are “Reuters”, “Huffington Post”, “Businessweek”, “Contactmusic.com”, and “Daily Mail”.
  3. Randomly sort the extracted cases.
  4. Split the extracted cases into 80% training data and 10% each of validation and evaluation data, and save them with the file names train.txt, valid.txt, and test.txt, respectively. Write one case per line in each file, in tab-delimited format with the category name and the article headline (this file will be reused later in problem 70). After creating the data, check the number of cases in each category, as done right after the splitting code below.

qiita.rb


!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00359/NewsAggregatorDataset.zip
import zipfile
with zipfile.ZipFile('./NewsAggregatorDataset.zip') as existing_zip:
    existing_zip.extractall('./')
with open('./readme.txt') as f:
  text = f.read()
  print(text)

qiita.rb


from sklearn.model_selection import train_test_split
import random
#Data to array
with open('./newsCorpora.csv') as f:
  text = f.readline()
  allinfo = []
  print([r for r in text.replace('\n', '').split('\t') if r != ''])
  while text:
    allinfo.append([r for r in text.replace('\n', '').split('\t') if r != ''])
    text = f.readline()
#Select only the specified publishers
selectinfo = []
for info in allinfo:
  if info[3] == 'Reuters'or info[3] == 'Huffington Post'or info[3] =='Businessweek'or info[3] =='Contactmusic.com'or info[3] =='Daily Mail':
    selectinfo.append(info)
#Randomize and split data
random.shuffle(selectinfo)
print(selectinfo[:50])
train , testandaccess = train_test_split(selectinfo,train_size = 0.8)
valid, test = train_test_split(testandaccess, train_size = 0.5)
#Write each split out to a tab-delimited file
with open ('./train.txt', mode = 'w') as f:
  train_txt = str()
  for t in train:
    for i in range(len(t)):
      if i == len(t) -1:
        train_txt += t[i] + '\n'
      else:
        train_txt += t[i] + '\t'
  f.write(train_txt)
with open ('./valid.txt', mode = 'w') as f:
  valid_txt = str()
  for t in valid:
    for i in range(len(t)):
      if i == len(t) -1:
        valid_txt += t[i] + '\n'
      else:
        valid_txt += t[i] + '\t'
  f.write(valid_txt)
with open ('./test.txt', mode = 'w') as f:
  test_txt = str()
  for t in test:
    for i in range(len(t)):
      if i == len(t) -1:
        test_txt += t[i] + '\n'
      else:
        test_txt += t[i] + '\t'
  f.write(test_txt)
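
The problem also asks to check the number of cases in each category after splitting; a quick check, assuming the category code sits at index 4 of each row as described in readme.txt:

from collections import Counter
for split_name, split_data in [('train', train), ('valid', valid), ('test', test)]:
  print(split_name, Counter(row[4] for row in split_data))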

51. Feature extraction

Extract the features from the training data, verification data, and evaluation data, and save them with the file names train.feature.txt, valid.feature.txt, and test.feature.txt, respectively. Feel free to design the features that are likely to be useful for categorization. The minimum baseline would be an article headline converted to a word string.

qiita.rb


with open('./train.txt') as f:
  with open('./valid.txt') as ff:
    with open('./test.txt') as fff:
      i = 0
      name = ['train', 'valid', 'test']
      FF = [f,ff,fff]
      data_dict = dict() 
      for F in FF:
        text = F.readline()
        a = 'feature_' + name[i]
        array = []
        while text:
          t = ([r for r in text.replace('\n', '').split('\t') if r != ''])
          tt = [t[i] for i in range(len(t)) if i == 1 or i == 3 or i == 4 ]   #keep TITLE (1), PUBLISHER (3) and CATEGORY (4)
          array.append(tt)
          text = F.readline()
        data_dict[a] = array
        i = i + 1

print(len(valid))
print(len(data_dict['feature_valid']))
print('The numbers match → The above work was done normally')

with open('./train.feature.txt', mode = 'w') as f:
  with open('./valid.feature.txt', mode = 'w') as ff:
    with open('./test.feature.txt', mode = 'w') as fff:
      name = ['feature_train', 'feature_valid', 'feature_test']
      FF = [f,ff,fff]

      for i in range(len(FF)):
          txt = str()
          for t in data_dict[name[i]]:

            for l in range(len(t)):
              if l == len(t) -1:
                txt += t[l] + '\n'
              else:
                txt += t[l] + '\t'
          FF[i].write(txt)
with open('./valid.feature.txt') as f:
  t = f.read()
  print('[Part of the Valid file]')
  print(t[:100])

52. Learning

Train the logistic regression model using the training data constructed in 51.

Review of logistic regression => it is a classification problem, not a regression. The log-likelihood to maximize is

L(\theta) = \sum_{i=1}^{N} \log p(y_i \mid x_i)

This time the problem reduces to an optimization problem: with the sigmoid model (the softmax function is its multi-class generalization)

$p(y_i = 1 \mid x) = f_{\theta}(x) = \frac{1}{1 + \exp(-\theta^{T} {\bf x})}$

the parameters are updated by gradient ascent:

$\theta_i \leftarrow \theta_i + \eta \frac{\partial L}{\partial \theta_i}$

here,

\frac{\partial L}{\partial \theta_i} = \frac{\partial L}{\partial f_{\theta}(x)}\frac{\partial f_{\theta}(x)}{\partial \theta_i}

Calculating the first term gives

$\frac{\partial L}{\partial \theta_i} = \frac{y_i - f_{\theta}(x)}{f_{\theta}(x)(1-f_{\theta}(x))}\frac{\partial f_{\theta}(x)}{\partial \theta_i} $

A new function u is defined to differentiate the second term.

$u = 1 + e^{-{\bf \theta}^{\mathrm{T}}{\bf x}} $

then

\frac {\partial u}{\partial \theta_i} = \frac {\partial}{\partial \theta_i} \exp(-(x_1\theta_1 + \cdots + x_N\theta_N)) = -x_i e^{-{\bf \theta}^{\mathrm{T}}{\bf x}}

With this, the derivative of $f$ can be written as below; summing over the data then gives the gradient used for the update.

$ \frac{\partial f_{\theta}(x)} {\partial u} \frac {\partial u}{\partial \theta_i} = f_{\theta}(1-f_{\theta})x_i $

\frac{\partial L}{\partial \theta_j} = \sum_{i=1}^{N} (y^{(i)} - f_{\theta}(x^{(i)}))\, x_j^{(i)}

Written in a slightly more neural-network style, with the elementwise sigmoid $f({\bf x}) = \frac{1} {1 + \exp( - {\bf x} ) } $:

$ \delta = {\bf y} - {\bf t} $

$\nabla_{\bf W} E = \frac{1}{N}\delta {\bf x}^{\mathrm{T}} $

$\nabla_{\bf b} E = \frac{1}{N}\delta \mathbb{1}_N $

${\bf W} \leftarrow {\bf W} - \epsilon \nabla_{\bf W} E $

${\bf b} \leftarrow {\bf b} - \epsilon \nabla_{\bf b} E $

This time I use the latter, vectorized form.

Preparing the data:

qiita.rb


with open('./train.feature.txt') as f:
  with open('./valid.feature.txt') as ff:
    with open('./test.feature.txt') as fff:
      i = 0
      name = ['train', 'valid', 'test']
      FF = [f,ff,fff]
      data_dict = dict() 
      for F in FF:
        text = F.readline()
        a = 'feature_' + name[i]
        array = []
        while text:
          tt = ([r for r in text.replace('\n', '').split('\t') if r != ''])
          #tt = [t[i] for i in range(len(t)) if i == 1 or i == 3 or i == 4 ]
          array.append(tt)
          text = F.readline()
        data_dict[a] = array
        i = i + 1
print(data_dict['feature_train'])
x, t = dict(), dict()   #t holds the category labels referenced below
for n in name:
  a = 'feature_' + n
  data = data_dict[a]
  x[a] = [[r[0], r[1]] for r in data ]
  t[a] = [r[2] for r in data ]
print(x['feature_train'])
print(t['feature_train'])

The text has to be converted into vectors, so here is the code for that.

qiita.rb


#Character vectorization
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

#Extract only the headline text
def summary (data):
  text = list()
  for xx in data:
    text.append((xx[0]))
  return text

#Convert the category labels into one-hot answer vectors
def ans_to_vector (text):
  vec = list()
  for tt in text:
    if tt == 'b':
      vec.append(0)
    if tt == 'e':
      vec.append(1)
    if tt == 't':
      vec.append(2)
    if tt == 'm':
      vec.append(3)
  return np.eye(len(np.unique(vec)))[vec]

#The argument is x['feature_...'], i.e. [headline, publisher] pairs
def publisher_to_vector (text):
  vec = list()
  for tt in text:
    if tt[1] == 'Reuters':
      vec.append(0)
    if tt[1] == 'Huffington Post':
      vec.append(1)
    if tt[1] == 'Businessweek':
      vec.append(2)
    if tt[1] == 'Contactmusic.com':
      vec.append(3)
    if tt[1] == 'Daily Mail':
      vec.append(4)
  
  onehotvec = np.eye(len(np.unique(vec)))[vec]
  return np.array(onehotvec)

#TfidfVectorizer
def texttovec(textlist):
  vec = TfidfVectorizer().fit_transform(textlist)
  return vec

#Vector addition (not used)
def vecplusvec (vec1, vec2):
  v = list()
  for v1, v2 in zip(vec1, vec2):
    print(v1[None,:].shape, v2[None,:].shape )
    v.append(v1 + v2[None, :])
    print('A')
  return v
#Build the TF-IDF vectors (fit on train + valid together so that both splits share one vocabulary)
text_train = summary(x['feature_train'])
text_valid = summary(x['feature_valid'])
text = text_train + text_valid
vec_text = TfidfVectorizer().fit_transform(text).toarray()
#Split the matrix back into its train and valid parts as Numpy arrays
vec_train = np.array(vec_text[:len(text_train)])
vec_valid = np.array(vec_text[len(text_train):])
#Answers and publisher vectorization
vec_train_ans = ans_to_vector((t['feature_train']))
vec_valid_ans = ans_to_vector((t['feature_valid']))
vec_train_publisher = publisher_to_vector(x['feature_train'])
vec_valid_publisher = publisher_to_vector(x['feature_valid'])

#Input vector combination
vec_train = np.concatenate([vec_train, vec_train_publisher],axis = 1)
vec_valid = np.concatenate([vec_valid, vec_valid_publisher],1)

print("Input dimension",vec_train.shape)
print('Label (answer) dimension',vec_train_ans.shape)
print("Publisher name vector",vec_train_publisher.shape)

#Input dimension(10684, 12783)
#Label (answer) dimension(10684, 4)
#Publisher name vector(10684, 5)

The code for training:

qiita.rb


import numpy as np
from sklearn.metrics import accuracy_score

#Prevent the contents of log from becoming 0
def np_log(x):
    return np.log(np.clip(a=x, a_min=1e-10, a_max=1e+10))

def sigmoid(x):
#     return 1 / (1 + np.exp(- x))
    return np.tanh(x * 0.5) * 0.5 + 0.5 #Use numpy built-in tanh(Prevent overflow of exp)

W, b = np.random.uniform(-0.08, 0.08, size = ( vec_train.shape[1],4)), np.zeros(shape = (4,)).astype('float32')
#Learning
def train (x, t, eps = 1):
    global W , b
    batch_size = x.shape[0]
    y = sigmoid(np.matmul(x, W) + b) # shape: (batch_size, number of output dimensions); matmul: matrix product
    #Backpropagation
    cost = (- t * np_log(y) - (1 - t) * np_log(1 - y)).mean()
    delta = y - t # shape: (batch_size,Number of dimensions of output) 

    #Parameter update
    dW = np.matmul(x.T, delta) / batch_size # shape: (Number of input dimensions,Number of dimensions of output)
    db = np.matmul(np.ones(shape=(batch_size,)), delta) / batch_size # shape: (Number of dimensions of output,)
    W -= eps * dW
    b -= eps * db

    return cost
#Verification
def valid(x, t):
    y = sigmoid(np.matmul(x, W) + b)
    cost = (- t * np_log(y) - (1 - t) * np_log(1 - y)).mean()
    return cost, y
#Implementation
for epoch in range(3):
    for x, t in zip(vec_train, vec_train_ans):
        cost = train(x[None, :], t[None, :])
    cost, y_pred = valid(vec_valid, vec_valid_ans)
    print('EPOCH: {}, Valid Cost: {:.3f}, Valid Accuracy: {:.3f}'.format(
        epoch + 1,
        cost,
        accuracy_score(vec_valid_ans.argmax(axis=1), y_pred.argmax(axis=1))
    ))

qiita.rb



#EPOCH: 1, Valid Cost: 0.477, Valid Accuracy: 0.647
#EPOCH: 2, Valid Cost: 0.573, Valid Accuracy: 0.598
#EPOCH: 3, Valid Cost: 0.638, Valid Accuracy: 0.570

Since there are four labels, chance accuracy would be 25%; is 65% a generally good value? (I didn't really understand the text vectorization, and the vectors became quite sparse. You can see from the rising validation cost that the model is overfitting.) I had just covered logistic regression in a graduate-school class given by a certain deep-learning teacher, so I implemented it by adapting what I learned there.

54. Measurement of correct answer rate

Measure the correct answer rate of the logistic regression model learned in 52 on the training data and evaluation data.

qiita.rb


y = np.round(y_pred, 2)
j = 0
for i in range(y.shape[0]):  
  if (y_pred.argmax(axis = 1)[i] == vec_valid_ans.argmax( axis = 1)[i]):
    j = j +1
print('The correct answer rate is', j/(y.shape[0]))  
print('The correct answer rate is (sklearn) = ', accuracy_score(y_pred.argmax(axis = 1), vec_valid_ans.argmax( axis = 1)))

55. Creating a confusion matrix

Create a confusion matrix of the logistic regression model learned in 52 on the training data and evaluation data.

qiita.rb


from sklearn.metrics import confusion_matrix
#note: sklearn's convention is confusion_matrix(y_true, y_pred); the arguments here are (pred, true), so the matrix comes out transposed
cm = confusion_matrix(y_pred.argmax(axis = 1), vec_valid_ans.argmax( axis = 1))
print(cm)

56. Measurement of precision, recall, F1 score

Measure the precision, recall, and F1 score of the logistic regression model learned in 52 on the evaluation data. Find the precision, recall, and F1 score for each category, and integrate the performance for each category with micro-average and macro-average.

Review

Recall: the proportion of actually positive samples that were correctly identified.

recall = \frac{TP}{TP+FN}

Specificity: the proportion of actually negative samples that were correctly identified as negative (of the data other than cats, how many were correctly rejected?). Precision: the proportion of samples predicted positive that really are positive (of the images identified as cats, the fraction that actually are cats).

precision = \frac{TP}{TP+FP}

Negative predictive value: the proportion of samples predicted negative ("not a cat") that really are negative.

F1 score: the F1-measure is the harmonic mean of precision and recall.

$F1 = \frac{2TP}{2TP + FP + FN} $

qiita.rb


from sklearn.metrics import classification_report
print(classification_report(vec_valid_ans.argmax( axis = 1), y_pred.argmax(axis = 1)))

qiita.rb


              precision    recall  f1-score   support

           0       0.63      0.77      0.69       588
           1       0.71      0.63      0.67       518
           2       0.13      0.12      0.13       145
           3       0.17      0.07      0.10        85

    accuracy                           0.60      1336
   macro avg       0.41      0.40      0.40      1336
weighted avg       0.58      0.60      0.58      1336

Reference URL

Answer and impression of 100 language processing knocks - Part 1
Enable MeCab in Colaboratory (https://qiita.com/pytry3g/items/897ae738b8fbd3ae7893)
Deep learning class
Machine learning ~ Text features (CountVectorizer, TfidfVectorizer) ~ Python
