[PYTHON] 100 language processing knock 2020 [00 ~ 69 answer]

This article is a continuation of Language Processing 100 Knock 2020 [Chapter 6: Machine Learning Answers].

This article deals with machine learning in Chapter 7 (60-69).

Link

I've included only the code in this article. Please refer to the link below for supplements on problem sentences and how to solve them.

Language Processing 100 Knock 2020 Chapter 7: Word Vector

Chapter 7: Word Vector

60. Reading and displaying word vectors

import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
model['United_States']

61. Word similarity

model.similarity('United_States','U.S.')

62. 10 words with high similarity

model.most_similar('United_States',topn=10)

63. Analogy by additive construct

model.most_similar(positive=['Spain','Athens'], negative=['Madrid'],topn=10)

64. Experiments with analogy data

with open('questions-words.txt') as f:
    questions = f.readlines()
with open('64.txt','w') as f:
    for i,question in enumerate(questions):
        words = question.split()
        if len(words)==4:
            ans = model.most_similar(positive=[words[1],words[2]], negative=[words[0]],topn=1)[0]
            words += [ans[0], str(ans[1])]
            output = ' '.join(words)+'\n'
        else:
            output = question
        f.write(output)
        if (i%100==0):
            print (i)

65. Correct answer rate in analogy tasks

cnt = 0
ok = 0
with open('64.txt','r') as f:
    questions = f.readlines()
for question in questions:
    words = question.split()
    if len(words)==6:
        cnt += 1
        if (words[3]==words[4]):
            ok +=1
print (ok/cnt)

66. Evaluation by WordSimilarity-353

import pandas as pd
df = pd.read_csv('wordsim353/combined.csv')
sim = []
for i in range(len(df)):
    line = df.iloc[i]
    sim.append(model.similarity(line['Word 1'],line['Word 2']))
df['w2v'] = sim 
df[['Human (mean)', 'w2v']].corr(method='spearman')

67. k-means clustering

from sklearn.cluster import KMeans
with open('country.txt','r') as f:
    lines = f.readlines()
countries = []
for line in lines:
    country = line.split(' ')[-1].replace('\n','')
    countries.append(country)
dic = {'United States of America':'United_States', 'Russian Federation':'Russia'}
ng = 0
vec = []
target_countries = []
for c in countries:
    for k,v in dic.items():
        c = c.replace(k,v)
    c = c.replace(' ','_').replace('-','_').replace('_and_','_')
    try:
        
        vec.append(model[c])
        target_countries.append(c)
    except:
        ng += 1
kmeans = KMeans(n_clusters=5, random_state=0)
kmeans.fit(vec)
for c,l in zip(target_countries, kmeans.labels_):
    print (c,l)

68. Ward's method clustering

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
plt.figure(figsize=(32.0, 24.0))
link = linkage(vec, method='ward')
dendrogram(link, labels=target_countries,leaf_rotation=90,leaf_font_size=10)
plt.show()

69. Visualization by t-SNE

from sklearn.manifold import TSNE
vec_embedded = TSNE(n_components=2).fit_transform(vec)
vec_embedded_t = list(zip(*vec_embedded)) #Transpose
fig, ax = plt.subplots(figsize=(16, 12))
plt.scatter(*vec_embedded_t)
for i, c in enumerate(target_countries):
    ax.annotate(c, (vec_embedded[i][0],vec_embedded[i][1]))

Recommended Posts

100 language processing knock 2020 [00 ~ 39 answer]
100 language processing knock 2020 [00-79 answer]
100 language processing knock 2020 [00 ~ 69 answer]
100 language processing knock 2020 [00 ~ 49 answer]
100 language processing knock 2020 [00 ~ 59 answer]
100 Language Processing Knock (2020): 28
100 Language Processing Knock (2020): 38
100 language processing knock 00 ~ 02
100 Language Processing Knock 2020 Chapter 1
100 Amateur Language Processing Knock: 17
100 Language Processing Knock-52: Stemming
100 Language Processing Knock Chapter 1
100 language processing knocks 2020 [00 ~ 89 answer]
100 Amateur Language Processing Knock: 07
Language processing 100 knocks 00 ~ 09 Answer
100 Language Processing Knock 2020 Chapter 3
100 Language Processing Knock 2020 Chapter 2
100 Amateur Language Processing Knock: 09
100 Amateur Language Processing Knock: 47
100 Language Processing Knock-53: Tokenization
100 Amateur Language Processing Knock: 97
100 Amateur Language Processing Knock: 67
100 Language Processing with Python Knock 2015
100 Language Processing Knock-51: Word Clipping
100 Language Processing Knock-58: Tuple Extraction
100 Language Processing Knock-57: Dependency Analysis
100 language processing knock-50: sentence break
100 Language Processing Knock Chapter 1 (Python)
100 Language Processing Knock Chapter 2 (Python)
100 Language Processing Knock-25: Template Extraction
100 Language Processing Knock-87: Word Similarity
I tried 100 language processing knock 2020
100 language processing knock-56: co-reference analysis
Solving 100 Language Processing Knock 2020 (01. "Patatokukashi")
100 Amateur Language Processing Knock: Summary
100 Language Processing Knock 2020 Chapter 2: UNIX Commands
100 Language Processing Knock 2015 Chapter 5 Dependency Analysis (40-49)
100 Language Processing Knock with Python (Chapter 1)
100 Language Processing Knock Chapter 1 in Python
100 Language Processing Knock 2020 Chapter 4: Morphological Analysis
100 Language Processing Knock 2020 Chapter 9: RNN, CNN
100 language processing knock-76 (using scikit-learn): labeling
100 language processing knock-55: named entity extraction
I tried 100 language processing knock 2020: Chapter 3
100 Language Processing Knock-82 (Context Word): Context Extraction
100 Language Processing Knock with Python (Chapter 3)
100 Language Processing Knock: Chapter 1 Preparatory Movement
100 Language Processing Knock 2020 Chapter 6: Machine Learning
100 Language Processing Knock Chapter 4: Morphological Analysis
Language processing 100 knock-86: Word vector display
100 Language Processing Knock 2020 Chapter 10: Machine Translation (90-98)
100 Language Processing Knock 2020 Chapter 5: Dependency Analysis
100 Language Processing Knock-28: MediaWiki Markup Removal
100 Language Processing Knock 2020 Chapter 7: Word Vector
100 Language Processing Knock 2020 Chapter 8: Neural Net
100 Language Processing Knock-59: Analysis of S-expressions
Python beginner tried 100 language processing knock 2015 (05 ~ 09)
100 Language Processing Knock-31 (using pandas): Verb
100 language processing knock 2020 "for Google Colaboratory"
I tried 100 language processing knock 2020: Chapter 1
100 Language Processing Knock 2020 Chapter 1: Preparatory Movement