The 2020 edition of 100 Language Processing Knocks (言語処理100本ノック) has been released.

https://nlp100.github.io/ja/

In this article I post only the answers, in compact form, so that they are easy to look up. The explanations are written in separate articles.

Continued in:

100 Language Processing Knocks 2020 [Chapter 5: Dependency parsing answers]

Chapter 1: Warm-up

https://kakedashi-engineer.appspot.com/2020/04/15/nlp100-00-09/

00. Reversed string

s = 'stressed'
print (s[::-1])

01. "パタトクカシーー"

s = 'パタトクカシーー'
print (s[::2])

02. "パトカー" + "タクシー" = "パタトクカシーー"

s1 = 'パトカー'
s2 = 'タクシー'
print (''.join([a+b for a,b in zip(s1,s2)]))

03. Pi

s = "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics."
s = s.replace(',','').replace('.','')
[len(w) for w in s.split()]
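
As a cross-check, here is a small sketch that extracts the words with a regular expression instead of chained `replace` calls. The pattern `[A-Za-z]+` is my own choice for this sentence, not part of the answer above.

```python
import re

s = "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics."

# Take every run of ASCII letters as a word, which skips commas and periods.
digits = [len(w) for w in re.findall(r'[A-Za-z]+', s)]
print(digits)
```

The word lengths spell out the leading digits of pi (3.14159265358979).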

04. Element symbols

s = "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can."
s = s.replace('.','')
idx = [1, 5, 6, 7, 8, 9, 15, 16, 19]
mp = {}
for i,w in enumerate(s.split()):
    if (i+1) in idx:
        v = w[:1]
    else:
        v = w[:2]
    mp[v] = i+1
print (mp)

05. n-gram

def ngram(S, n):
    r = []
    for i in range(len(S) - n + 1):
        r.append(S[i:i+n])
    return r
s = 'I am an NLPer'
print (ngram(s.split(),2))
print (ngram(s,2))
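
The same sliding window can also be written with `zip`. This is only an alternative sketch; the name `ngram_zip` is made up for this example.

```python
def ngram_zip(seq, n):
    # Zip n shifted views of the sequence. zip stops at the shortest view,
    # which yields exactly len(seq) - n + 1 windows (as tuples).
    return list(zip(*(seq[i:] for i in range(n))))

text = 'I am an NLPer'
word_bigrams = ngram_zip(text.split(), 2)
char_bigrams = [''.join(t) for t in ngram_zip(text, 2)]
print(word_bigrams)
print(char_bigrams)
```

Unlike the slicing version, this returns tuples, so the character n-grams are joined back into strings here.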

06. Sets

def ngram(S, n):
    r = []
    for i in range(len(S) - n + 1):
        r.append(S[i:i+n])
    return r
s1 = 'paraparaparadise'
s2 = 'paragraph'
st1 = set(ngram(s1, 2))
st2 = set(ngram(s2, 2))
print(st1 | st2)
print(st1 & st2)
print(st1 - st2)
print('se' in st1)
print('se' in st2)

07. Sentence generation from a template

def temperature(x,y,z):
    return str(x)+'時の'+str(y)+'は'+str(z)
x = 12
y = '気温'
z = 22.4
print (temperature(x,y,z))

08. Cipher text

def cipher(S):
    new = []
    for s in S:
        if 97 <= ord(s) <= 122:
            s = chr(219 - ord(s))
        new.append(s)
    return ''.join(new)
        
s = 'I am an NLPer'
new = cipher(s)
print (new)

print (cipher(new))
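
The expression `219 - ord(s)` works because `ord('a') + ord('z') == 219`, so every lowercase letter is mapped to its mirror image. A sketch of the same mapping with `str.translate` (the names `TABLE` and `cipher_translate` are made up here):

```python
import string

# a<->z, b<->y, ...; everything else is left untouched.
TABLE = str.maketrans(string.ascii_lowercase, string.ascii_lowercase[::-1])

def cipher_translate(text):
    return text.translate(TABLE)

encoded = cipher_translate('I am an NLPer')
decoded = cipher_translate(encoded)
print(encoded)
print(decoded)
```

Applying the function twice returns the original string, which is why the same function serves for both encryption and decryption.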

09. Typoglycemia

import random
s = 'I couldn’t believe that I could actually understand what I was reading : the phenomenal power of the human mind .'
ans = []
text = s.split()
for word in text:
    if (len(word)>4):
        mid = list(word[1:-1])
        random.shuffle(mid)
        word = word[0] + ''.join(mid) + word[-1]
        ans.append(word)
    else:
        ans.append(word)
print (' '.join(ans))
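
To make the shuffle repeatable, the same logic can be wrapped in a function that takes its own random generator. This is a sketch only; the function name and the seed value 0 are arbitrary choices of mine.

```python
import random

def typoglycemia(sentence, rng):
    out = []
    for word in sentence.split():
        if len(word) > 4:
            # Shuffle only the inner characters; first and last stay in place.
            mid = list(word[1:-1])
            rng.shuffle(mid)
            word = word[0] + ''.join(mid) + word[-1]
        out.append(word)
    return ' '.join(out)

original = 'the phenomenal power of the human mind .'
shuffled = typoglycemia(original, random.Random(0))
print(shuffled)
```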

Chapter 2: UNIX commands

https://kakedashi-engineer.appspot.com/2020/04/16/nlp100-10-14/ https://kakedashi-engineer.appspot.com/2020/04/17/nlp100-15-19/

10. Counting lines

import pandas as pd
df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
print (len(df))

wc -l popular-names.txt

11. Replace tabs with spaces

import pandas as pd
df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
df.to_csv('space.txt', sep=' ',header=False, index=False)

sed 's/\t/ /g' popular-names.txt > replaced.txt

12. Save the first column to col1.txt and the second column to col2.txt

import pandas as pd
df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
df.iloc[:,0].to_csv('col1.txt', sep=' ',header=False, index=False)
df.iloc[:,1].to_csv('col2.txt', sep=' ',header=False, index=False)

cut -f 1  popular-names.txt > col1.txt
cut -f 2  popular-names.txt > col2.txt

13. Merge col1.txt and col2.txt

import pandas as pd
df1 = pd.read_csv('col1.txt', delimiter='\t', header=None)
df2 = pd.read_csv('col2.txt', delimiter='\t', header=None)
df = pd.concat([df1, df2], axis=1)
df.to_csv('col1_2.txt', sep='\t',header=False, index=False)

paste col1.txt col2.txt > col1_2.txt

14. Output the first N lines

import pandas as pd
df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
print (df.head(5))

head -n 5 popu

15. Output the last N lines

import pandas as pd
df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
print (df.tail(5))

tail -n 5 popular-names.txt

16. Split the file into N pieces

N = 3
import pandas as pd
df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
step = - (-len(df) // N)
for n in range(N):
    df_split = df.iloc[n*step:(n+1)*step]
    df_split.to_csv('popular-names'+str(n)+'.txt', sep='\t',header=False, index=False)

split -n 3 popular-names.txt
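
The expression `- (-len(df) // N)` in the Python answer is ceiling division. A sketch on a plain list shows how the chunk boundaries fall (`split_rows` is a made-up helper name):

```python
def split_rows(rows, n):
    # -(-a // b) rounds a / b up without importing math.
    step = -(-len(rows) // n)
    return [rows[i * step:(i + 1) * step] for i in range(n)]

chunks = split_rows(list(range(10)), 3)
print(chunks)
```

Note that, as far as I know, GNU `split -n 3` divides by byte size and may cut in the middle of a line, whereas `split -n l/3` keeps lines intact, so the shell and pandas outputs are not guaranteed to match.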

17. Distinct strings in the first column

import pandas as pd
df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
new = df[0].unique()
new.sort()
print (new)

cut -f 1  popular-names.txt | sort | uniq

18. Sort the rows in descending order of the number in the third column

import pandas as pd
df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
new = df.sort_values(2, ascending=False) # sort whole rows, not just the third column
print (new)

sort -n -r -k 3 popular-names.txt

19. Count how often each string occurs in the first column and list them in descending order of frequency

import pandas as pd
df = pd.read_csv('popular-names.txt', delimiter='\t', header=None)
vc = df[0].value_counts()
vc = pd.DataFrame(vc)
vc = vc.reset_index()
vc.columns = ['name','count']
vc = vc.sort_values(['count','name'],ascending=[False,False])
print (vc)

cut -f 1  popular-names.txt | sort | uniq -c | sort -n -r
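
For comparison, the same count can be done with `collections.Counter` and no pandas. The rows below are made-up sample data in the same tab-separated layout, not the contents of popular-names.txt.

```python
from collections import Counter

# Hypothetical sample rows: name, sex, count, year
sample = (
    "Mary\tF\t10\t1880\n"
    "Anna\tF\t8\t1880\n"
    "Mary\tF\t9\t1881\n"
    "John\tM\t7\t1881\n"
    "Mary\tF\t6\t1882\n"
    "John\tM\t5\t1882\n"
)

names = [line.split('\t')[0] for line in sample.splitlines()]
freq = Counter(names).most_common()  # sorted by descending count
print(freq)
```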

Chapter 3: Regular expressions

https://kakedashi-engineer.appspot.com/2020/04/18/nlp100-20-24/ https://kakedashi-engineer.appspot.com/2020/04/19/nlp100-25-26/ https://kakedashi-engineer.appspot.com/2020/04/20/nlp100-27-28/ https://kakedashi-engineer.appspot.com/2020/04/21/nlp100-29-30/

20. Read JSON data

import pandas as pd
wiki = pd.read_json('jawiki-country.json.gz', lines = True)
uk = wiki[wiki['title']=='イギリス'].text.values
print (uk)

21. Extract lines containing category names

import pandas as pd
import re
pattern = re.compile('Category')
wiki = pd.read_json('jawiki-country.json.gz', lines = True)
uk = wiki[wiki['title']=='イギリス'].text.values
ls = uk[0].split('\n')
for line in ls:
    if re.search(pattern, line):
        print (line)

22. Extract category names

import pandas as pd
import re
pattern = re.compile('Category') # same pattern as in 21
wiki = pd.read_json('jawiki-country.json.gz', lines = True)
uk = wiki[wiki['title']=='イギリス'].text.values
ls = uk[0].split('\n')
for line in ls:
    if re.search(pattern, line):
        line = line.replace('[[','').replace('Category:','').replace(']]','').replace('|*','').replace('|元','')
        print (line)

23. Section structure

import pandas as pd
import re
pattern = re.compile(r'^=+.*=+$') # lines that start with one or more '=' and end with one or more '='
wiki = pd.read_json('jawiki-country.json.gz', lines = True)
uk = wiki[wiki['title']=='イギリス'].text.values
ls = uk[0].split('\n')
for line in ls:
    if re.search(pattern, line):
        level = line.count('=') // 2 - 1
        print(line.replace('=',''), level )
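
A variant with capture groups reads the level directly from the length of the `=` run instead of counting every `=` in the line. This is a sketch on made-up sample text; `p_section` is a name I chose.

```python
import re

# Group 1: the leading run of '='. Group 2: the title.
# \1 requires the same run of '=' at the end of the line.
p_section = re.compile(r'^(={2,})\s*(.+?)\s*\1$')

sample = "==History==\ntext\n===Early period===\n== Politics ==\n"
sections = []
for line in sample.split('\n'):
    m = p_section.match(line)
    if m:
        sections.append((m.group(2), len(m.group(1)) - 1))
print(sections)
```

The level convention matches the answer above: `==` is level 1, `===` is level 2.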

24. Extract file references

import pandas as pd
import re
pattern = re.compile(r'(?:File|ファイル):(.+?)\|') # group the alternation so both prefixes capture the file name
wiki = pd.read_json('jawiki-country.json.gz', lines = True)
uk = wiki[wiki['title']=='イギリス'].text.values
ls = uk[0].split('\n')
for line in ls:
    r = re.findall(pattern, line)
    if r:
        print (r[0])
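
One thing to watch with this kind of pattern is alternation precedence: without a group, `File|X:(.+?)` means "`File`, or `X:` followed by a capture". A small sketch on a made-up line, with a pattern of my own that also handles links without a `|`:

```python
import re

# (?:...) keeps the two prefixes together; the name runs up to '|' or ']'.
p_file = re.compile(r'\[\[(?:File|ファイル):([^|\]]+)')

sample = "[[ファイル:Example one.jpg|thumb|caption]] text [[File:Example two.png]]"
files = p_file.findall(sample)
print(files)
```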

25. Extract the template

import pandas as pd
import re
pattern = re.compile(r'\|(.+?)\s=\s*(.+)')
wiki = pd.read_json('jawiki-country.json.gz', lines = True)
uk = wiki[wiki['title']=='イギリス'].text.values
ls = uk[0].split('\n')
d = {}
for line in ls:
    r = re.search(pattern, line)
    if r:
        d[r[1]]=r[2]
print (d)

26. Remove emphasis markup

import pandas as pd
import re
pattern = re.compile(r'\|(.+?)\s=\s*(.+)')
p_emp = re.compile(r"'{2,}(.+?)'{2,}")
wiki = pd.read_json('jawiki-country.json.gz', lines = True)
uk = wiki[wiki['title']=='イギリス'].text.values
ls = uk[0].split('\n')
d = {}
for line in ls:
    line = re.sub(p_emp, r'\1', line) # strip emphasis markup before extracting the field
    r = re.search(pattern, line)
    if r:
        d[r[1]]=r[2]
print (d)

27. Remove internal links

import pandas as pd
import re
pattern = re.compile(r'\|(.+?)\s=\s*(.+)')
p_emp = re.compile(r"'{2,}(.+?)'{2,}")
p_link = re.compile(r'\[\[(.+?)\]\]')
wiki = pd.read_json('jawiki-country.json.gz', lines = True)
uk = wiki[wiki['title']=='イギリス'].text.values
lines = uk[0]
lines = re.sub(p_emp,'\\1', lines)
lines = re.sub(p_link,'\\1', lines)
ls = lines.split('\n')
d = {}
for line in ls:
    r = re.search(pattern, line)
    if r:
        d[r[1]]=r[2]
print (d)
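
The link pattern above keeps everything between the brackets, so a piped link such as `[[target|label]]` leaves `target|label` behind. A sketch of a pattern that keeps only the label (`p_link_piped` and the sample sentence are made up here):

```python
import re

# [[target]] -> target, [[target|label]] -> label
p_link_piped = re.compile(r'\[\[(?:[^|\]]*\|)?([^|\]]+)\]\]')

sample = "[[London]] is the capital of the [[United Kingdom|UK]]."
cleaned = p_link_piped.sub(r'\1', sample)
print(cleaned)
```

Links with more than one `|`, such as file links with options, do not match this pattern and are left untouched.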

28. Remove MediaWiki markup

import pandas as pd
import re
pattern = re.compile(r'\|(.+?)\s=\s*(.+)')
p_emp = re.compile(r"'{2,}(.+?)'{2,}")
p_link = re.compile(r'\[\[(.+?)\]\]')
p_refbr = re.compile(r'<ref[^>]*?/>|<ref[^>]*?>.+?</ref>|<br\s*/?>') # self-closing <ref/>, paired <ref>...</ref>, and <br>
wiki = pd.read_json('jawiki-country.json.gz', lines = True)
uk = wiki[wiki['title']=='イギリス'].text.values
lines = uk[0]
lines = re.sub(p_emp,'\\1', lines)
lines = re.sub(p_link,'\\1', lines)
lines = re.sub(p_refbr,'', lines)
ls = lines.split('\n')
d = {}
for line in ls:
    r = re.search(pattern, line)
    if r:
        d[r[1]]=r[2]
print (d)

29. Get the URL of the national flag image

import pandas as pd
import re
import requests
pattern = re.compile(r'\|(.+?)\s=\s*(.+)')
wiki = pd.read_json('jawiki-country.json.gz', lines = True)
uk = wiki[wiki['title']=='イギリス'].text.values
ls = uk[0].split('\n')
d = {}
for line in ls:
    r = re.search(pattern, line)
    if r:
        d[r[1]]=r[2]
        
S = requests.Session()
URL = "https://commons.wikimedia.org/w/api.php"
PARAMS = {
    "action": "query",
    "format": "json",
    "titles": "File:" + d['国旗画像'],
    "prop": "imageinfo",
    "iiprop":"url"
}
R = S.get(url=URL, params=PARAMS)
DATA = R.json()
PAGES = DATA['query']['pages']
for k, v in PAGES.items():
    print (v['imageinfo'][0]['url'])

Chapter 4: Morphological analysis

https://kakedashi-engineer.appspot.com/2020/04/22/nlp100-31-34/ https://kakedashi-engineer.appspot.com/2020/04/22/nlp100-35-39/

30. Read the morphological analysis results

import MeCab
path = 'neko.txt.mecab'
with open(path) as f:
    text = f.read().split('\n')
result = []
for line in text[:-1]: # skip the empty string after the final newline
    if line == 'EOS':
        continue
    ls = line.split('\t')
    d = {}
    tmp = ls[1].split(',')
    d = {'surface':ls[0], 'base':tmp[6], 'pos':tmp[0], 'pos1':tmp[1]}
    result.append(d)
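
The indices used above assume that each line has the form `surface<TAB>feature`, with the part of speech at index 0, the subcategory at index 1, and the base form at index 6 of the comma-separated feature. A sketch on a single illustrative line (the helper name `parse_mecab_line` is made up):

```python
def parse_mecab_line(line):
    # Split into surface form and the comma-separated feature string.
    surface, feature = line.split('\t')
    f = feature.split(',')
    return {'surface': surface, 'base': f[6], 'pos': f[0], 'pos1': f[1]}

sample = "吾輩\t名詞,代名詞,一般,*,*,*,吾輩,ワガハイ,ワガハイ"
morph = parse_mecab_line(sample)
print(morph)
```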

31. Verbs

import MeCab
path = 'neko.txt.mecab'
with open(path) as f:
    text = f.read().split('\n')
result = []
for line in text[:-1]:
    if line == 'EOS':
        continue
    ls = line.split('\t')
    d = {}
    tmp = ls[1].split(',')
    d = {'surface':ls[0], 'base':tmp[6], 'pos':tmp[0], 'pos1':tmp[1]}
    result.append(d)
[d['surface'] for d in result if d['pos'] == '動詞' ]

32. Base forms of verbs

import MeCab
path = 'neko.txt.mecab'
with open(path) as f:
    text = f.read().split('\n')
result = []
for line in text[:-1]:
    if line == 'EOS':
        continue
    ls = line.split('\t')
    d = {}
    tmp = ls[1].split(',')
    d = {'surface':ls[0], 'base':tmp[6], 'pos':tmp[0], 'pos1':tmp[1]}
    result.append(d)
[d['base'] for d in result if d['pos'] == '動詞' ]

33. "A の B"

import MeCab
path = 'neko.txt.mecab'
with open(path) as f:
    text = f.read().split('\n')
result = []
for line in text[:-1]:
    if line == 'EOS':
        continue
    ls = line.split('\t')
    d = {}
    tmp = ls[1].split(',')
    d = {'surface':ls[0], 'base':tmp[6], 'pos':tmp[0], 'pos1':tmp[1]}
    result.append(d)
noun_phrase = []
for i in range(len(result)-2):
    if (result[i]['pos'] == '名詞' and result[i+1]['surface'] == 'の' and result[i+2]['pos'] == '名詞'):
        noun_phrase.append(result[i]['surface']+result[i+1]['surface']+result[i+2]['surface'])
print (noun_phrase)

34. Noun concatenation

import MeCab
path = 'neko.txt.mecab'
with open(path) as f:
    text = f.read().split('\n')
result = []
for line in text[:-1]:
    if line == 'EOS':
        continue
    ls = line.split('\t')
    d = {}
    tmp = ls[1].split(',')
    d = {'surface':ls[0], 'base':tmp[6], 'pos':tmp[0], 'pos1':tmp[1]}
    result.append(d)

ls_noun = []
noun = ''
for d in result:
    if d['pos']=='名詞':
        noun += d['surface']
    else:
        if noun != '':
            ls_noun.append(noun)
            noun = ''
# flush a noun run that reaches the end of the text
if noun != '':
    ls_noun.append(noun)
print (ls_noun)
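
The same run detection can be sketched with `itertools.groupby`, which groups consecutive tokens by whether they are nouns. The token list below is a hand-made sample with assumed part-of-speech tags, not real MeCab output.

```python
from itertools import groupby

# Hypothetical tokens with assumed tags
tokens = [
    {'surface': '人間', 'pos': '名詞'},
    {'surface': '中', 'pos': '名詞'},
    {'surface': 'で', 'pos': '助詞'},
    {'surface': '一番', 'pos': '名詞'},
    {'surface': '獰悪', 'pos': '名詞'},
    {'surface': 'な', 'pos': '助動詞'},
]

# groupby yields maximal runs of consecutive tokens sharing the same key.
noun_runs = [
    ''.join(t['surface'] for t in group)
    for is_noun, group in groupby(tokens, key=lambda t: t['pos'] == '名詞')
    if is_noun
]
print(noun_runs)
```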

35. Word frequency

import MeCab
from collections import Counter
path = 'neko.txt.mecab'
with open(path) as f:
    text = f.read().split('\n')
result = []
for line in text[:-1]:
    if line == 'EOS':
        continue
    ls = line.split('\t')
    d = {}
    tmp = ls[1].split(',')
    d = {'surface':ls[0], 'base':tmp[6], 'pos':tmp[0], 'pos1':tmp[1]}
    result.append(d)


surface =  [d['surface'] for d in result]
c = Counter(surface)
print (c.most_common())

36. Top 10 most frequent words

import MeCab
from collections import Counter
import matplotlib.pyplot as plt
plt.rcParams['font.family'] = 'AppleGothic'
path = 'neko.txt.mecab'
with open(path) as f:
    text = f.read().split('\n')
result = []
for line in text[:-1]:
    if line == 'EOS':
        continue
    ls = line.split('\t')
    d = {}
    tmp = ls[1].split(',')
    d = {'surface':ls[0], 'base':tmp[6], 'pos':tmp[0], 'pos1':tmp[1]}
    result.append(d)


surface =  [d['surface'] for d in result]
c = Counter(surface)
target = list(zip(*c.most_common(10)))
plt.bar(*target)
plt.show()

37. Top 10 words that frequently co-occur with "猫" (cat)

import MeCab
from collections import Counter
import matplotlib.pyplot as plt
plt.rcParams['font.family'] = 'AppleGothic'
path = 'neko.txt.mecab'
with open(path) as f:
    text = f.read().split('\n')
result = []
tmp_cooccurrence = []
cooccurrence = []
inCat = False

for line in text[:-1]:
    if line == 'EOS':
        if inCat:
            cooccurrence.extend(tmp_cooccurrence)
        else:
            pass
        tmp_cooccurrence = []
        inCat = False
        continue
            
    ls = line.split('\t')
    d = {}
    tmp = ls[1].split(',')
    d = {'surface':ls[0], 'base':tmp[6], 'pos':tmp[0], 'pos1':tmp[1]}
    result.append(d)
    if ls[0]!='猫':
        tmp_cooccurrence.append(ls[0])
    else:
        inCat = True


c = Counter(cooccurrence)
target = list(zip(*c.most_common(10)))
plt.bar(*target)
plt.show()

38. Histogram

import MeCab
from collections import Counter
import matplotlib.pyplot as plt
plt.rcParams['font.family'] = 'AppleGothic'
path = 'neko.txt.mecab'
with open(path) as f:
    text = f.read().split('\n')
result = []
for line in text[:-1]:
    if line == 'EOS':
        continue
    ls = line.split('\t')
    d = {}
    tmp = ls[1].split(',')
    d = {'surface':ls[0], 'base':tmp[6], 'pos':tmp[0], 'pos1':tmp[1]}
    result.append(d)


surface =  [d['surface'] for d in result]
c = Counter(surface)
plt.hist(list(c.values()),  range = (1,10))
plt.show()

39. Zipf's law

import MeCab
from collections import Counter
import matplotlib.pyplot as plt
import numpy as np
plt.rcParams['font.family'] = 'AppleGothic'
path = 'neko.txt.mecab'
with open(path) as f:
    text = f.read().split('\n')
result = []
for line in text[:-1]:
    if line == 'EOS':
        continue
    ls = line.split('\t')
    d = {}
    tmp = ls[1].split(',')
    d = {'surface':ls[0], 'base':tmp[6], 'pos':tmp[0], 'pos1':tmp[1]}
    result.append(d)
surface =  [d['surface'] for d in result]
c = Counter(surface)
v = [kv[1] for kv in c.most_common()]
plt.scatter(np.log(range(1, len(v)+1)),np.log(v)) # ranks start at 1 so that log(rank) is defined
plt.show()
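
For a log-log plot the ranks have to start at 1, because log(0) is undefined. A plotting-free sketch on a toy word list (made-up data) shows the points that would be drawn:

```python
import math
from collections import Counter

words = 'a a a a b b c'.split()  # toy data
freqs = [count for _, count in Counter(words).most_common()]

# enumerate(..., start=1) gives rank 1 to the most frequent word.
points = [(math.log(rank), math.log(freq)) for rank, freq in enumerate(freqs, start=1)]
print(points)
```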

Continued in:

100 Language Processing Knocks 2020 [00 ~ 49 answers]
