100 Language Processing Knock 2020 [00-39 answers]

The 2020 version of 100 language processing knocks has been released.

https://nlp100.github.io/ja/

In this article, I simply list the answers for easy reference. Explanations are given in separate articles.

Continued

100 Language Processing Knock 2020 [Chapter 5: Dependency parsing answers]

Chapter 1: Warm-up

https://kakedashi-engineer.appspot.com/2020/04/15/nlp100-00-09/

00. Reverse order of strings

s = 'stressed'
print (s[::-1])

01. "Patatokukashi"

s = 'パタトクカシーー'
print (s[::2])  # -> 'パトカー' (police car)

02. "Police car" + "Taxi" = "Patatokukashi"

s1 = 'パトカー'  # 'police car'
s2 = 'タクシー'  # 'taxi'
print (''.join([a+b for a,b in zip(s1,s2)]))  # -> 'パタトクカシーー'

03. Pi

s = "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics."
s = s.replace(',','').replace('.','')
print ([len(w) for w in s.split()])
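
As a sanity check, the word lengths spell out the first digits of pi (a quick assertion, reusing s from above):

lengths = [len(w) for w in s.split()]
assert lengths == [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9]  # 3.14159265358979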

04. Element symbol

s = "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can."
s = s.replace('.','')
idx = [1, 5, 6, 7, 8, 9, 15, 16, 19]
mp = {}
for i,w in enumerate(s.split()):
    if (i+1) in idx:
        v = w[:1]
    else:
        v = w[:2]
    mp[v] = i+1
print (mp)
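
A quick check on the mapping (note that this knock's sentence famously yields 'Mi' for element 12 rather than the real symbol 'Mg'):

assert mp['H'] == 1 and mp['He'] == 2 and mp['Ca'] == 20
assert 'Mi' in mp  # 'Might'[:2] == 'Mi', not 'Mg'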

05. n-gram

def ngram(S, n):
    r = []
    for i in range(len(S) - n + 1):
        r.append(S[i:i+n])
    return r
s = 'I am an NLPer'
print (ngram(s.split(),2))
print (ngram(s,2))
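
For reference, the expected outputs (word bi-grams first, then character bi-grams):

# [['I', 'am'], ['am', 'an'], ['an', 'NLPer']]
# ['I ', ' a', 'am', 'm ', ' a', 'an', 'n ', ' N', 'NL', 'LP', 'Pe', 'er']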

06. Sets

def ngram(S, n):
    r = []
    for i in range(len(S) - n + 1):
        r.append(S[i:i+n])
    return r
s1 = 'paraparaparadise'
s2 = 'paragraph'
st1 = set(ngram(s1, 2))
st2 = set(ngram(s2, 2))
print(st1 | st2)
print(st1 & st2)
print(st1 - st2)
print('se' in st1)
print('se' in st2)
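
The membership checks should come out as follows:

assert 'se' in st1      # 'paradise' contains the bi-gram 'se'
assert 'se' not in st2  # 'paragraph' does not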

07. Sentence generation by template

def temperature(x,y,z):
    return str(x)+'時の'+str(y)+'は'+str(z)  # template: "the y at x o'clock is z"
x = 12
y = '気温'  # 'temperature'
z = 22.4
print (temperature(x,y,z))  # -> 12時の気温は22.4

08. Ciphertext

def cipher(S):
    new = []
    for s in S:
        if 97 <= ord(s) <= 122:    # lowercase a-z
            s = chr(219 - ord(s))  # 219 = ord('a') + ord('z'), so a<->z, b<->y, ...
        new.append(s)
    return ''.join(new)
        
s = 'I am an NLPer'
new = cipher(s)
print (new)

print (cipher(new))
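
Because 219 - (219 - x) = x, the cipher is an involution; applying it twice restores the original:

assert cipher(cipher('I am an NLPer')) == 'I am an NLPer'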

09. Typoglycemia

import random
s = 'I couldn’t believe that I could actually understand what I was reading : the phenomenal power of the human mind .'
ans = []
text = s.split()
for word in text:
    if len(word) > 4:
        mid = list(word[1:-1])  # shuffle only the interior characters
        random.shuffle(mid)
        word = word[0] + ''.join(mid) + word[-1]
    ans.append(word)
print (' '.join(ans))
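
The shuffle is random, so the output differs on every run; seeding the RNG first makes it reproducible (a minimal sketch):

random.seed(0)  # any fixed seed gives a deterministic shuffle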

Chapter 2: UNIX Commands

https://kakedashi-engineer.appspot.com/2020/04/16/nlp100-10-14/
https://kakedashi-engineer.appspot.com/2020/04/17/nlp100-15-19/

10. Counting the number of lines

import pandas as pd
df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
print (len(df))
wc -l popular-names.txt

11. Replace tabs with spaces

import pandas as pd
df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
df.to_csv('space.txt', sep=' ',header=False, index=False)
sed 's/\t/ /g' popular-names.txt > replaced.txt

12. Save the first column in col1.txt and the second column in col2.txt

import pandas as pd
df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
df.iloc[:,0].to_csv('col1.txt', sep=' ',header=False, index=False)
df.iloc[:,1].to_csv('col2.txt', sep=' ',header=False, index=False)
cut -f 1  popular-names.txt > col1.txt
cut -f 2  popular-names.txt > col2.txt

13. Merge col1.txt and col2.txt

import pandas as pd
df1 = pd.read_csv('col1.txt', delimiter='\t', header=None)
df2 = pd.read_csv('col2.txt', delimiter='\t', header=None)
df = pd.concat([df1, df2], axis=1)
df.to_csv('col1_2.txt', sep='\t',header=False, index=False)
paste col1.txt col2.txt > col1_2.txt

14. Output N lines from the beginning

import pandas as pd
df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
print (df.head(5))
head -n 5 popular-names.txt
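
The task asks to receive N from the command line; a minimal sketch, assuming the script is invoked as python head.py 5 (the script name is just an example):

import sys
import pandas as pd

N = int(sys.argv[1])  # number of lines to show, taken from the command line
df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
print (df.head(N))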

15. Output the last N lines

import pandas as pd
df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
print (df.tail(5))
tail -n 5 popular-names.txt

16. Split the file into N parts

N = 3
import pandas as pd
df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
step = - (-len(df) // N)  # ceiling division, so every row lands in some chunk
for n in range(N):
    df_split = df.iloc[n*step:(n+1)*step]
    df_split.to_csv('popular-names'+str(n)+'.txt', sep='\t',header=False, index=False)
split -n l/3 popular-names.txt  # GNU split; l/3 splits by lines rather than bytes

17. Distinct strings in the first column

import pandas as pd
df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
new = df[0].unique()
new.sort()
print (new)
cut -f 1  popular-names.txt | sort | uniq

18. Sort each row in descending order of the numbers in the third column

import pandas as pd
df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
new = df.sort_values(2, ascending=False)  # sort whole rows by the third column
print (new)
sort -n -r -k 3 popular-names.txt

19. Frequency of strings in the first column, in descending order of frequency

import pandas as pd
df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
vc = df[0].value_counts()
vc = pd.DataFrame(vc)
vc = vc.reset_index()
vc.columns = ['name','count']
vc = vc.sort_values(['count','name'],ascending=[False,False])
print (vc)
cut -f 1  popular-names.txt | sort | uniq -c | sort -n -r
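
An equivalent pure-Python version using collections.Counter (a sketch over the same file):

from collections import Counter

with open('popular-names.txt') as f:
    names = [line.split('\t')[0] for line in f]
for name, count in Counter(names).most_common():
    print (count, name)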

Chapter 3: Regular Expressions

https://kakedashi-engineer.appspot.com/2020/04/18/nlp100-20-24/
https://kakedashi-engineer.appspot.com/2020/04/19/nlp100-25-26/
https://kakedashi-engineer.appspot.com/2020/04/20/nlp100-27-28/
https://kakedashi-engineer.appspot.com/2020/04/21/nlp100-29-30/

20. Read JSON data

import pandas as pd
wiki = pd.read_json('jawiki-country.json.gz', lines = True)
uk = wiki[wiki['title']=='イギリス'].text.values  # the article title in jawiki is 'イギリス' (the UK)
print (uk)

21. Extract rows containing category names

import pandas as pd
import re
pattern = re.compile('Category')
wiki = pd.read_json('jawiki-country.json.gz', lines = True)
uk = wiki[wiki['title']=='イギリス'].text.values
ls = uk[0].split('\n')
for line in ls:
    if re.search(pattern, line):
        print (line)

22. Extraction of category name

import pandas as pd
import re
pattern = re.compile('Category')  # same pattern as in 21; it was missing from this block
wiki = pd.read_json('jawiki-country.json.gz', lines = True)
uk = wiki[wiki['title']=='イギリス'].text.values
ls = uk[0].split('\n')
for line in ls:
    if re.search(pattern, line):
        line = line.replace('[[','').replace('Category:','').replace(']]','').replace('|*','').replace('|元','')
        print (line)

23. Section structure

import pandas as pd
import re
pattern = re.compile(r'^=+.*=+$')  # lines that start and end with one or more '='
wiki = pd.read_json('jawiki-country.json.gz', lines = True)
uk = wiki[wiki['title']=='イギリス'].text.values
ls = uk[0].split('\n')
for line in ls:
    if re.search(pattern, line):
        level = line.count('=') // 2 - 1
        print(line.replace('=',''), level )

24. Extracting file references

import pandas as pd
import re
pattern = re.compile(r'(?:ファイル|File):(.+?)\|')  # match both the Japanese and English file prefixes
wiki = pd.read_json('jawiki-country.json.gz', lines = True)
uk = wiki[wiki['title']=='イギリス'].text.values
ls = uk[0].split('\n')
for line in ls:
    r = re.findall(pattern, line)
    if r:
        print (r[0])

25. Template extraction

import pandas as pd
import re
pattern = re.compile(r'\|(.+?)\s=\s*(.+)')
wiki = pd.read_json('jawiki-country.json.gz', lines = True)
uk = wiki[wiki['title']=='イギリス'].text.values
ls = uk[0].split('\n')
d = {}
for line in ls:
    r = re.search(pattern, line)
    if r:
        d[r[1]]=r[2]
print (d)

26. Removal of emphasis markup

import pandas as pd
import re
pattern = re.compile(r'\|(.+?)\s=\s*(.+)')
p_emp = re.compile(r"'{2,}(.+?)'{2,}")
wiki = pd.read_json('jawiki-country.json.gz', lines = True)
uk = wiki[wiki['title']=='イギリス'].text.values
ls = uk[0].split('\n')
d = {}
for line in ls:
    r = re.search(pattern, line)
    if r:
        d[r[1]]=r[2]
    r = re.sub(p_emp,'\\1', line)
    print (r)
print (d)

27. Removal of internal links

import pandas as pd
import re
pattern = re.compile(r'\|(.+?)\s=\s*(.+)')
p_emp = re.compile(r"'{2,}(.+?)'{2,}")
p_link = re.compile(r'\[\[(.+?)\]\]')
wiki = pd.read_json('jawiki-country.json.gz', lines = True)
uk = wiki[wiki['title']=='イギリス'].text.values
lines = uk[0]
lines = re.sub(p_emp,'\\1', lines)
lines = re.sub(p_link,'\\1', lines)
ls = lines.split('\n')
d = {}
for line in ls:
    r = re.search(pattern, line)
    if r:
        d[r[1]]=r[2]
print (d)

28. Removal of MediaWiki markup

import pandas as pd
import re
pattern = re.compile(r'\|(.+?)\s=\s*(.+)')
p_emp = re.compile(r"'{2,}(.+?)'{2,}")
p_link = re.compile(r'\[\[(.+?)\]\]')
p_refbr = re.compile(r'<(?:br|ref)[^>]*?>.+?</(?:br|ref)[^>]*?>')  # (?:br|ref) is a group, not the character class [br|ref]
wiki = pd.read_json('jawiki-country.json.gz', lines = True)
uk = wiki[wiki['title']=='イギリス'].text.values
lines = uk[0]
lines = re.sub(p_emp,'\\1', lines)
lines = re.sub(p_link,'\\1', lines)
lines = re.sub(p_refbr,'', lines)
ls = lines.split('\n')
d = {}
for line in ls:
    r = re.search(pattern, line)
    if r:
        d[r[1]]=r[2]
print (d)

29. Get the URL of the national flag image

import pandas as pd
import re
import requests
pattern = re.compile(r'\|(.+?)\s=\s*(.+)')
wiki = pd.read_json('jawiki-country.json.gz', lines = True)
uk = wiki[wiki['title']=='イギリス'].text.values
ls = uk[0].split('\n')
d = {}
for line in ls:
    r = re.search(pattern, line)
    if r:
        d[r[1]]=r[2]
        
S = requests.Session()
URL = "https://commons.wikimedia.org/w/api.php"
PARAMS = {
    "action": "query",
    "format": "json",
    "titles": "File:" + d['National flag image'],
    "prop": "imageinfo",
    "iiprop":"url"
}
R = S.get(url=URL, params=PARAMS)
DATA = R.json()
PAGES = DATA['query']['pages']
for k, v in PAGES.items():
    print (v['imageinfo'][0]['url'])

Chapter 4: Morphological analysis

https://kakedashi-engineer.appspot.com/2020/04/22/nlp100-31-34/
https://kakedashi-engineer.appspot.com/2020/04/22/nlp100-35-39/

30. Reading morphological analysis results

import MeCab
path = 'neko.txt.mecab'
with open(path) as f:
    text = f.read().split('\n')
result = []
for line in text[:-1]:  # drop the trailing empty line left by split('\n')
    if line == 'EOS':
        continue
    ls = line.split('\t')
    d = {}
    tmp = ls[1].split(',')
    d = {'surface':ls[0], 'base':tmp[6], 'pos':tmp[0], 'pos1':tmp[1]}
    result.append(d)
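
Strictly, knock 30 asks for a list of sentences, each a list of morpheme dicts; a minimal variant that groups morphemes on each EOS line (same parsed file assumed):

sentences = []
sentence = []
for line in text[:-1]:
    if line == 'EOS':
        if sentence:
            sentences.append(sentence)
        sentence = []
        continue
    surface, feats = line.split('\t')
    tmp = feats.split(',')
    sentence.append({'surface': surface, 'base': tmp[6], 'pos': tmp[0], 'pos1': tmp[1]})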

31. Verb

import MeCab
path = 'neko.txt.mecab'
with open(path) as f:
    text = f.read().split('\n')
result = []
for line in text[:-1]:
    if line == 'EOS':
        continue
    ls = line.split('\t')
    d = {}
    tmp = ls[1].split(',')
    d = {'surface':ls[0], 'base':tmp[6], 'pos':tmp[0], 'pos1':tmp[1]}
    result.append(d)
print ([d['surface'] for d in result if d['pos'] == '動詞'])  # '動詞' = verb

32. Base form of verbs

import MeCab
path = 'neko.txt.mecab'
with open(path) as f:
    text = f.read().split('\n')
result = []
for line in text[:-1]:
    if line == 'EOS':
        continue
    ls = line.split('\t')
    d = {}
    tmp = ls[1].split(',')
    d = {'surface':ls[0], 'base':tmp[6], 'pos':tmp[0], 'pos1':tmp[1]}
    result.append(d)
print ([d['base'] for d in result if d['pos'] == '動詞'])  # '動詞' = verb

33. "B of A"

import MeCab
path = 'neko.txt.mecab'
with open(path) as f:
    text = f.read().split('\n')
result = []
for line in text[:-1]:
    if line == 'EOS':
        continue
    ls = line.split('\t')
    d = {}
    tmp = ls[1].split(',')
    d = {'surface':ls[0], 'base':tmp[6], 'pos':tmp[0], 'pos1':tmp[1]}
    result.append(d)
noun_phrase = []
for i in range(len(result)-2):
    if (result[i]['pos'] == '名詞' and result[i+1]['surface'] == 'の' and result[i+2]['pos'] == '名詞'):  # noun + の ('of') + noun
        noun_phrase.append(result[i]['surface']+result[i+1]['surface']+result[i+2]['surface'])
print (noun_phrase)

34. Concatenation of nouns

import MeCab
path = 'neko.txt.mecab'
with open(path) as f:
    text = f.read().split('\n')
result = []
for line in text[:-1]:
    if line == 'EOS':
        continue
    ls = line.split('\t')
    d = {}
    tmp = ls[1].split(',')
    d = {'surface':ls[0], 'base':tmp[6], 'pos':tmp[0], 'pos1':tmp[1]}
    result.append(d)

ls_noun = []
noun = ''
for d in result:
    if d['pos'] == '名詞':  # '名詞' = noun: extend the current run
        noun += d['surface']
    else:
        if noun != '':
            ls_noun.append(noun)
            noun = ''
else:
    # for-else: runs once after the loop, flushing a noun run that ends at the last token
    if noun != '':
        ls_noun.append(noun)
        noun = ''
print (ls_noun)

35. Frequency of word occurrence

import MeCab
from collections import Counter
path = 'neko.txt.mecab'
with open(path) as f:
    text = f.read().split('\n')
result = []
for line in text[:-1]:
    if line == 'EOS':
        continue
    ls = line.split('\t')
    d = {}
    tmp = ls[1].split(',')
    d = {'surface':ls[0], 'base':tmp[6], 'pos':tmp[0], 'pos1':tmp[1]}
    result.append(d)


surface =  [d['surface'] for d in result]
c = Counter(surface)
print (c.most_common())

36. Top 10 most frequent words

import MeCab
from collections import Counter
import matplotlib.pyplot as plt
plt.rcParams['font.family'] = 'AppleGothic'
path = 'neko.txt.mecab'
with open(path) as f:
    text = f.read().split('\n')
result = []
for line in text[:-1]:
    if line == 'EOS':
        continue
    ls = line.split('\t')
    d = {}
    tmp = ls[1].split(',')
    d = {'surface':ls[0], 'base':tmp[6], 'pos':tmp[0], 'pos1':tmp[1]}
    result.append(d)


surface =  [d['surface'] for d in result]
c = Counter(surface)
target = list(zip(*c.most_common(10)))
plt.bar(*target)
plt.show()

37. Top 10 words that frequently co-occur with "cat"

import MeCab
from collections import Counter
import matplotlib.pyplot as plt
plt.rcParams['font.family'] = 'AppleGothic'
path = 'neko.txt.mecab'
with open(path) as f:
    text = f.read().split('\n')
result = []
tmp_cooccurrence = []
cooccurrence = []
inCat = False

for line in text[:-1]:
    if line == 'EOS':
        if inCat:
            cooccurrence.extend(tmp_cooccurrence)  # keep words from sentences that contain '猫'
        tmp_cooccurrence = []
        inCat = False
        continue
            
    ls = line.split('\t')
    d = {}
    tmp = ls[1].split(',')
    d = {'surface':ls[0], 'base':tmp[6], 'pos':tmp[0], 'pos1':tmp[1]}
    result.append(d)
    if ls[0]!='猫':  # '猫' = cat
        tmp_cooccurrence.append(ls[0])
    else:
        inCat = True


c = Counter(cooccurrence)
target = list(zip(*c.most_common(10)))
plt.bar(*target)
plt.show()

38. Histogram

import MeCab
from collections import Counter
import matplotlib.pyplot as plt
plt.rcParams['font.family'] = 'AppleGothic'
path = 'neko.txt.mecab'
with open(path) as f:
    text = f.read().split('\n')
result = []
for line in text[:-1]:
    if line == 'EOS':
        continue
    ls = line.split('\t')
    d = {}
    tmp = ls[1].split(',')
    d = {'surface':ls[0], 'base':tmp[6], 'pos':tmp[0], 'pos1':tmp[1]}
    result.append(d)


surface =  [d['surface'] for d in result]
c = Counter(surface)
plt.hist(list(c.values()), range = (1,10))
plt.show()

39. Zipf's Law

import MeCab
from collections import Counter
import matplotlib.pyplot as plt
import numpy as np
plt.rcParams['font.family'] = 'AppleGothic'
path = 'neko.txt.mecab'
with open(path) as f:
    text = f.read().split('\n')
result = []
for line in text[:-1]:
    if line == 'EOS':
        continue
    ls = line.split('\t')
    d = {}
    tmp = ls[1].split(',')
    d = {'surface':ls[0], 'base':tmp[6], 'pos':tmp[0], 'pos1':tmp[1]}
    result.append(d)
surface =  [d['surface'] for d in result]
c = Counter(surface)
v = [kv[1] for kv in c.most_common()]
plt.scatter(np.log(range(1, len(v) + 1)), np.log(v))  # ranks start at 1 to avoid log(0)
plt.show()
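
Equivalently, you could keep the raw ranks and counts and log-scale the axes instead (one possible variant):

plt.scatter(range(1, len(v) + 1), v)
plt.xscale('log')
plt.yscale('log')
plt.show()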

Continued

100 Language Processing Knock 2020 [00-49 answers]
