100 Language Processing Knock 2020 [00-39 answers]

The 2020 version of 100 language processing knocks has been released.

https://nlp100.github.io/ja/

In this article, I simply list the answers for easy reference. Explanations are given in separate articles.

Continued

100 Language Processing Knock 2020 [Chapter 5: Dependency parsing answers]

Chapter 1: Warm-up

https://kakedashi-engineer.appspot.com/2020/04/15/nlp100-00-09/

00. Reverse order of strings

s = 'stressed'
print (s[::-1])

01. "Patatokukashi"

s = 'パタトクカシーー'
print (s[::2])  # -> 'パトカー' (police car)

02. "Police car" + "Taxi" = "Patatokukashi"

s1 = 'パトカー'  # 'police car'
s2 = 'タクシー'  # 'taxi'
print (''.join([a+b for a,b in zip(s1,s2)]))  # -> 'パタトクカシーー'

03. Pi

s = "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics."
s = s.replace(',','').replace('.','')
print ([len(w) for w in s.split()])
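
As a sanity check, the word lengths spell out the first digits of pi (a quick assertion, reusing s from above):

lengths = [len(w) for w in s.split()]
assert lengths == [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9]  # 3.14159265358979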

04. Element symbol

s = "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can."
s = s.replace('.','')
idx = [1, 5, 6, 7, 8, 9, 15, 16, 19]
mp = {}
for i,w in enumerate(s.split()):
    if (i+1) in idx:
        v = w[:1]
    else:
        v = w[:2]
    mp[v] = i+1
print (mp)
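
A quick check on the mapping (note that this knock's sentence famously yields 'Mi' for element 12 rather than the real symbol 'Mg'):

assert mp['H'] == 1 and mp['He'] == 2 and mp['Ca'] == 20
assert 'Mi' in mp  # 'Might'[:2] == 'Mi', not 'Mg'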

05. n-gram

def ngram(S, n):
    r = []
    for i in range(len(S) - n + 1):
        r.append(S[i:i+n])
    return r
s = 'I am an NLPer'
print (ngram(s.split(),2))
print (ngram(s,2))
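
For reference, the expected outputs (word bi-grams first, then character bi-grams):

# [['I', 'am'], ['am', 'an'], ['an', 'NLPer']]
# ['I ', ' a', 'am', 'm ', ' a', 'an', 'n ', ' N', 'NL', 'LP', 'Pe', 'er']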

06. Sets

def ngram(S, n):
    r = []
    for i in range(len(S) - n + 1):
        r.append(S[i:i+n])
    return r
s1 = 'paraparaparadise'
s2 = 'paragraph'
st1 = set(ngram(s1, 2))
st2 = set(ngram(s2, 2))
print(st1 | st2)
print(st1 & st2)
print(st1 - st2)
print('se' in st1)
print('se' in st2)
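
The membership checks should come out as follows:

assert 'se' in st1      # 'paradise' contains the bi-gram 'se'
assert 'se' not in st2  # 'paragraph' does not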

07. Sentence generation by template

def temperature(x,y,z):
    return str(x)+'時の'+str(y)+'は'+str(z)  # template: "the y at x o'clock is z"
x = 12
y = '気温'  # 'temperature'
z = 22.4
print (temperature(x,y,z))  # -> 12時の気温は22.4

08. Ciphertext

def cipher(S):
    new = []
    for s in S:
        if 97 <= ord(s) <= 122:    # lowercase a-z
            s = chr(219 - ord(s))  # 219 = ord('a') + ord('z'), so a<->z, b<->y, ...
        new.append(s)
    return ''.join(new)
        
s = 'I am an NLPer'
new = cipher(s)
print (new)

print (cipher(new))
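
Because 219 - (219 - x) = x, the cipher is an involution; applying it twice restores the original:

assert cipher(cipher('I am an NLPer')) == 'I am an NLPer'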

09. Typoglycemia

import random
s = 'I couldn’t believe that I could actually understand what I was reading : the phenomenal power of the human mind .'
ans = []
text = s.split()
for word in text:
    if len(word) > 4:
        mid = list(word[1:-1])  # shuffle only the interior characters
        random.shuffle(mid)
        word = word[0] + ''.join(mid) + word[-1]
    ans.append(word)
print (' '.join(ans))
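
The shuffle is random, so the output differs on every run; seeding the RNG first makes it reproducible (a minimal sketch):

random.seed(0)  # any fixed seed gives a deterministic shuffle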

Chapter 2: UNIX Commands

https://kakedashi-engineer.appspot.com/2020/04/16/nlp100-10-14/
https://kakedashi-engineer.appspot.com/2020/04/17/nlp100-15-19/

10. Counting the number of lines

import pandas as pd
df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
print (len(df))
wc -l popular-names.txt

11. Replace tabs with spaces

import pandas as pd
df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
df.to_csv('space.txt', sep=' ',header=False, index=False)
sed 's/\t/ /g' popular-names.txt > replaced.txt

12. Save the first column in col1.txt and the second column in col2.txt

import pandas as pd
df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
df.iloc[:,0].to_csv('col1.txt', sep=' ',header=False, index=False)
df.iloc[:,1].to_csv('col2.txt', sep=' ',header=False, index=False)
cut -f 1  popular-names.txt > col1.txt
cut -f 2  popular-names.txt > col2.txt

13. Merge col1.txt and col2.txt

import pandas as pd
df1 = pd.read_csv('col1.txt', delimiter='\t', header=None)
df2 = pd.read_csv('col2.txt', delimiter='\t', header=None)
df = pd.concat([df1, df2], axis=1)
df.to_csv('col1_2.txt', sep='\t',header=False, index=False)
paste col1.txt col2.txt > col1_2.txt

14. Output N lines from the beginning

import pandas as pd
df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
print (df.head(5))
head -n 5 popular-names.txt
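
The task asks to receive N from the command line; a minimal sketch, assuming the script is invoked as python head.py 5 (the script name is just an example):

import sys
import pandas as pd

N = int(sys.argv[1])  # number of lines to show, taken from the command line
df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
print (df.head(N))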

15. Output the last N lines

import pandas as pd
df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
print (df.tail(5))
tail -n 5 popular-names.txt

16. Split the file into N parts

N = 3
import pandas as pd
df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
step = - (-len(df) // N)  # ceiling division, so every row lands in some chunk
for n in range(N):
    df_split = df.iloc[n*step:(n+1)*step]
    df_split.to_csv('popular-names'+str(n)+'.txt', sep='\t',header=False, index=False)
split -n l/3 popular-names.txt  # GNU split; l/3 splits by lines rather than bytes

17. Distinct strings in the first column

import pandas as pd
df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
new = df[0].unique()
new.sort()
print (new)
cut -f 1  popular-names.txt | sort | uniq

18. Sort each row in descending order of the numbers in the third column

import pandas as pd
df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
new = df.sort_values(2, ascending=False)  # sort whole rows by the third column
print (new)
sort -n -r -k 3 popular-names.txt

19. Frequency of strings in the first column, in descending order of frequency

import pandas as pd
df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
vc = df[0].value_counts()
vc = pd.DataFrame(vc)
vc = vc.reset_index()
vc.columns = ['name','count']
vc = vc.sort_values(['count','name'],ascending=[False,False])
print (vc)
cut -f 1  popular-names.txt | sort | uniq -c | sort -n -r
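
An equivalent pure-Python version using collections.Counter (a sketch over the same file):

from collections import Counter

with open('popular-names.txt') as f:
    names = [line.split('\t')[0] for line in f]
for name, count in Counter(names).most_common():
    print (count, name)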

Chapter 3: Regular Expressions

https://kakedashi-engineer.appspot.com/2020/04/18/nlp100-20-24/
https://kakedashi-engineer.appspot.com/2020/04/19/nlp100-25-26/
https://kakedashi-engineer.appspot.com/2020/04/20/nlp100-27-28/
https://kakedashi-engineer.appspot.com/2020/04/21/nlp100-29-30/

20. Read JSON data

import pandas as pd
wiki = pd.read_json('jawiki-country.json.gz', lines = True)
uk = wiki[wiki['title']=='イギリス'].text.values  # the article title in jawiki is 'イギリス' (the UK)
print (uk)

21. Extract rows containing category names

import pandas as pd
import re
pattern = re.compile('Category')
wiki = pd.read_json('jawiki-country.json.gz', lines = True)
uk = wiki[wiki['title']=='イギリス'].text.values
ls = uk[0].split('\n')
for line in ls:
    if re.search(pattern, line):
        print (line)

22. Extraction of category name

import pandas as pd
import re
pattern = re.compile('Category')  # same pattern as in 21; it was missing from this block
wiki = pd.read_json('jawiki-country.json.gz', lines = True)
uk = wiki[wiki['title']=='イギリス'].text.values
ls = uk[0].split('\n')
for line in ls:
    if re.search(pattern, line):
        line = line.replace('[[','').replace('Category:','').replace(']]','').replace('|*','').replace('|元','')
        print (line)

23. Section structure

import pandas as pd
import re
pattern = re.compile(r'^=+.*=+$')  # lines that start and end with one or more '='
wiki = pd.read_json('jawiki-country.json.gz', lines = True)
uk = wiki[wiki['title']=='イギリス'].text.values
ls = uk[0].split('\n')
for line in ls:
    if re.search(pattern, line):
        level = line.count('=') // 2 - 1
        print(line.replace('=',''), level )

24. Extracting file references

import pandas as pd
import re
pattern = re.compile(r'(?:ファイル|File):(.+?)\|')  # match both the Japanese and English file prefixes
wiki = pd.read_json('jawiki-country.json.gz', lines = True)
uk = wiki[wiki['title']=='イギリス'].text.values
ls = uk[0].split('\n')
for line in ls:
    r = re.findall(pattern, line)
    if r:
        print (r[0])

25. Template extraction

import pandas as pd
import re
pattern = re.compile(r'\|(.+?)\s=\s*(.+)')
wiki = pd.read_json('jawiki-country.json.gz', lines = True)
uk = wiki[wiki['title']=='イギリス'].text.values
ls = uk[0].split('\n')
d = {}
for line in ls:
    r = re.search(pattern, line)
    if r:
        d[r[1]]=r[2]
print (d)

26. Removal of emphasis markup

import pandas as pd
import re
pattern = re.compile(r'\|(.+?)\s=\s*(.+)')
p_emp = re.compile(r"'{2,}(.+?)'{2,}")
wiki = pd.read_json('jawiki-country.json.gz', lines = True)
uk = wiki[wiki['title']=='イギリス'].text.values
ls = uk[0].split('\n')
d = {}
for line in ls:
    r = re.search(pattern, line)
    if r:
        d[r[1]]=r[2]
    r = re.sub(p_emp,'\\1', line)
    print (r)
print (d)

27. Removal of internal links

import pandas as pd
import re
pattern = re.compile(r'\|(.+?)\s=\s*(.+)')
p_emp = re.compile(r"'{2,}(.+?)'{2,}")
p_link = re.compile(r'\[\[(.+?)\]\]')
wiki = pd.read_json('jawiki-country.json.gz', lines = True)
uk = wiki[wiki['title']=='イギリス'].text.values
lines = uk[0]
lines = re.sub(p_emp,'\\1', lines)
lines = re.sub(p_link,'\\1', lines)
ls = lines.split('\n')
d = {}
for line in ls:
    r = re.search(pattern, line)
    if r:
        d[r[1]]=r[2]
print (d)

28. Removal of MediaWiki markup

import pandas as pd
import re
pattern = re.compile(r'\|(.+?)\s=\s*(.+)')
p_emp = re.compile(r"'{2,}(.+?)'{2,}")
p_link = re.compile(r'\[\[(.+?)\]\]')
p_refbr = re.compile(r'<(?:br|ref)[^>]*?>.+?</(?:br|ref)[^>]*?>')  # (?:br|ref) is a group, not the character class [br|ref]
wiki = pd.read_json('jawiki-country.json.gz', lines = True)
uk = wiki[wiki['title']=='イギリス'].text.values
lines = uk[0]
lines = re.sub(p_emp,'\\1', lines)
lines = re.sub(p_link,'\\1', lines)
lines = re.sub(p_refbr,'', lines)
ls = lines.split('\n')
d = {}
for line in ls:
    r = re.search(pattern, line)
    if r:
        d[r[1]]=r[2]
print (d)

29. Get the URL of the national flag image

import pandas as pd
import re
import requests
pattern = re.compile(r'\|(.+?)\s=\s*(.+)')
wiki = pd.read_json('jawiki-country.json.gz', lines = True)
uk = wiki[wiki['title']=='イギリス'].text.values
ls = uk[0].split('\n')
d = {}
for line in ls:
    r = re.search(pattern, line)
    if r:
        d[r[1]]=r[2]
        
S = requests.Session()
URL = "https://commons.wikimedia.org/w/api.php"
PARAMS = {
    "action": "query",
    "format": "json",
    "titles": "File:" + d['National flag image'],
    "prop": "imageinfo",
    "iiprop":"url"
}
R = S.get(url=URL, params=PARAMS)
DATA = R.json()
PAGES = DATA['query']['pages']
for k, v in PAGES.items():
    print (v['imageinfo'][0]['url'])

Chapter 4: Morphological analysis

https://kakedashi-engineer.appspot.com/2020/04/22/nlp100-31-34/
https://kakedashi-engineer.appspot.com/2020/04/22/nlp100-35-39/

30. Reading morphological analysis results

import MeCab
path = 'neko.txt.mecab'
with open(path) as f:
    text = f.read().split('\n')
result = []
for line in text[:-1]:  # drop the trailing empty line left by split('\n')
    if line == 'EOS':
        continue
    ls = line.split('\t')
    d = {}
    tmp = ls[1].split(',')
    d = {'surface':ls[0], 'base':tmp[6], 'pos':tmp[0], 'pos1':tmp[1]}
    result.append(d)
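
Strictly, knock 30 asks for a list of sentences, each a list of morpheme dicts; a minimal variant that groups morphemes on each EOS line (same parsed file assumed):

sentences = []
sentence = []
for line in text[:-1]:
    if line == 'EOS':
        if sentence:
            sentences.append(sentence)
        sentence = []
        continue
    surface, feats = line.split('\t')
    tmp = feats.split(',')
    sentence.append({'surface': surface, 'base': tmp[6], 'pos': tmp[0], 'pos1': tmp[1]})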

31. Verb

import MeCab
path = 'neko.txt.mecab'
with open(path) as f:
    text = f.read().split('\n')
result = []
for line in text[:-1]:
    if line == 'EOS':
        continue
    ls = line.split('\t')
    d = {}
    tmp = ls[1].split(',')
    d = {'surface':ls[0], 'base':tmp[6], 'pos':tmp[0], 'pos1':tmp[1]}
    result.append(d)
print ([d['surface'] for d in result if d['pos'] == '動詞'])  # '動詞' = verb

32. Base form of verbs

import MeCab
path = 'neko.txt.mecab'
with open(path) as f:
    text = f.read().split('\n')
result = []
for line in text[:-1]:
    if line == 'EOS':
        continue
    ls = line.split('\t')
    d = {}
    tmp = ls[1].split(',')
    d = {'surface':ls[0], 'base':tmp[6], 'pos':tmp[0], 'pos1':tmp[1]}
    result.append(d)
print ([d['base'] for d in result if d['pos'] == '動詞'])  # '動詞' = verb

33. "B of A"

import MeCab
path = 'neko.txt.mecab'
with open(path) as f:
    text = f.read().split('\n')
result = []
for line in text[:-1]:
    if line == 'EOS':
        continue
    ls = line.split('\t')
    d = {}
    tmp = ls[1].split(',')
    d = {'surface':ls[0], 'base':tmp[6], 'pos':tmp[0], 'pos1':tmp[1]}
    result.append(d)
noun_phrase = []
for i in range(len(result)-2):
    if (result[i]['pos'] == '名詞' and result[i+1]['surface'] == 'の' and result[i+2]['pos'] == '名詞'):  # noun + の ('of') + noun
        noun_phrase.append(result[i]['surface']+result[i+1]['surface']+result[i+2]['surface'])
print (noun_phrase)

34. Concatenation of nouns

import MeCab
path = 'neko.txt.mecab'
with open(path) as f:
    text = f.read().split('\n')
result = []
for line in text[:-1]:
    if line == 'EOS':
        continue
    ls = line.split('\t')
    d = {}
    tmp = ls[1].split(',')
    d = {'surface':ls[0], 'base':tmp[6], 'pos':tmp[0], 'pos1':tmp[1]}
    result.append(d)

ls_noun = []
noun = ''
for d in result:
    if d['pos'] == '名詞':  # '名詞' = noun: extend the current run
        noun += d['surface']
    else:
        if noun != '':
            ls_noun.append(noun)
            noun = ''
else:
    # for-else: runs once after the loop, flushing a noun run that ends at the last token
    if noun != '':
        ls_noun.append(noun)
        noun = ''
print (ls_noun)

35. Frequency of word occurrence

import MeCab
from collections import Counter
path = 'neko.txt.mecab'
with open(path) as f:
    text = f.read().split('\n')
result = []
for line in text[:-1]:
    if line == 'EOS':
        continue
    ls = line.split('\t')
    d = {}
    tmp = ls[1].split(',')
    d = {'surface':ls[0], 'base':tmp[6], 'pos':tmp[0], 'pos1':tmp[1]}
    result.append(d)


surface =  [d['surface'] for d in result]
c = Counter(surface)
print (c.most_common())

36. Top 10 most frequent words

import MeCab
from collections import Counter
import matplotlib.pyplot as plt
plt.rcParams['font.family'] = 'AppleGothic'
path = 'neko.txt.mecab'
with open(path) as f:
    text = f.read().split('\n')
result = []
for line in text[:-1]:
    if line == 'EOS':
        continue
    ls = line.split('\t')
    d = {}
    tmp = ls[1].split(',')
    d = {'surface':ls[0], 'base':tmp[6], 'pos':tmp[0], 'pos1':tmp[1]}
    result.append(d)


surface =  [d['surface'] for d in result]
c = Counter(surface)
target = list(zip(*c.most_common(10)))
plt.bar(*target)
plt.show()

37. Top 10 words that frequently co-occur with "cat"

import MeCab
from collections import Counter
import matplotlib.pyplot as plt
plt.rcParams['font.family'] = 'AppleGothic'
path = 'neko.txt.mecab'
with open(path) as f:
    text = f.read().split('\n')
result = []
tmp_cooccurrence = []
cooccurrence = []
inCat = False

for line in text[:-1]:
    if line == 'EOS':
        if inCat:
            cooccurrence.extend(tmp_cooccurrence)  # keep words from sentences that contain '猫'
        tmp_cooccurrence = []
        inCat = False
        continue
            
    ls = line.split('\t')
    d = {}
    tmp = ls[1].split(',')
    d = {'surface':ls[0], 'base':tmp[6], 'pos':tmp[0], 'pos1':tmp[1]}
    result.append(d)
    if ls[0]!='猫':  # '猫' = cat
        tmp_cooccurrence.append(ls[0])
    else:
        inCat = True


c = Counter(cooccurrence)
target = list(zip(*c.most_common(10)))
plt.bar(*target)
plt.show()

38. Histogram

import MeCab
from collections import Counter
import matplotlib.pyplot as plt
plt.rcParams['font.family'] = 'AppleGothic'
path = 'neko.txt.mecab'
with open(path) as f:
    text = f.read().split('\n')
result = []
for line in text[:-1]:
    if line == 'EOS':
        continue
    ls = line.split('\t')
    d = {}
    tmp = ls[1].split(',')
    d = {'surface':ls[0], 'base':tmp[6], 'pos':tmp[0], 'pos1':tmp[1]}
    result.append(d)


surface =  [d['surface'] for d in result]
c = Counter(surface)
plt.hist(list(c.values()), range = (1,10))
plt.show()

39. Zipf's Law

import MeCab
from collections import Counter
import matplotlib.pyplot as plt
import numpy as np
plt.rcParams['font.family'] = 'AppleGothic'
path = 'neko.txt.mecab'
with open(path) as f:
    text = f.read().split('\n')
result = []
for line in text[:-1]:
    if line == 'EOS':
        continue
    ls = line.split('\t')
    d = {}
    tmp = ls[1].split(',')
    d = {'surface':ls[0], 'base':tmp[6], 'pos':tmp[0], 'pos1':tmp[1]}
    result.append(d)
surface =  [d['surface'] for d in result]
c = Counter(surface)
v = [kv[1] for kv in c.most_common()]
plt.scatter(np.log(range(1, len(v) + 1)), np.log(v))  # ranks start at 1 to avoid log(0)
plt.show()
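
Equivalently, you could keep the raw ranks and counts and log-scale the axes instead (one possible variant):

plt.scatter(range(1, len(v) + 1), v)
plt.xscale('log')
plt.yscale('log')
plt.show()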

Continued

100 Language Processing Knock 2020 [00-49 answers]
