[PYTHON] 100 language processing knock 2020 [00 ~ 69 answer]

This article is a continuation of Language Processing 100 Knock 2020 [Chapter 6: Machine Learning Answers].

This article covers Chapter 7, word vectors (problems 60-69).


This article contains only the code. For the problem statements and notes on how to solve them, see the link below.

Language Processing 100 Knock 2020 Chapter 7: Word Vector

Chapter 7: Word Vector

60. Reading and displaying word vectors

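import gensim
# Load the pretrained Google News word2vec vectors (binary format)
model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
# Display the vector for "United States" (stored as 'United_States' in this model)
model['United_States']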

61. Word similarity
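
The code for this problem is missing from this copy of the article. The task is to compute the cosine similarity between "United States" and "U.S."; with the model loaded in problem 60 this would presumably be a call such as `model.similarity('United_States', 'U.S.')`, the same method used in problem 66. As a minimal sketch of what that number means, the following computes cosine similarity by hand on toy vectors. The function name and the vector values are hypothetical and are not taken from the model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional vectors (hypothetical values, not real word vectors)
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]   # same direction as vec_a, so similarity is close to 1
vec_c = [0.0, 0.0, 3.0]   # orthogonal to vec_a, so similarity is 0

print(cosine_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c))
```

Because only the direction of the vectors matters, scaling a vector does not change the result.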


62. 10 words with high similarity
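
The code for this problem is also missing here. With the model from problem 60 it would presumably be something like `model.most_similar('United_States', topn=10)`, using the same `most_similar` method as problem 63. The sketch below shows the underlying idea on a toy vocabulary: score every other word by cosine similarity and keep the top entries. The helper names and all vector values are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(vectors, word, topn=10):
    """Rank every other word in `vectors` by cosine similarity to `word`."""
    query = vectors[word]
    scored = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Toy vocabulary (hypothetical values, not real word vectors)
toy_vectors = {
    'cat': [1.0, 0.9, 0.0],
    'dog': [0.9, 1.0, 0.1],
    'car': [0.0, 0.1, 1.0],
    'bus': [0.1, 0.0, 0.9],
}

for word, score in most_similar(toy_vectors, 'cat', topn=2):
    print(word, round(score, 3))
```

A linear scan like this is fine for a toy dictionary; a real model with millions of words relies on vectorized matrix operations instead.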


63. Analogy by additive construct

model.most_similar(positive=['Spain','Athens'], negative=['Madrid'],topn=10)
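
The call above asks for words close to vec("Spain") - vec("Madrid") + vec("Athens"). The following is a minimal sketch of that arithmetic on toy vectors. It is a simplification: gensim normalizes the input vectors before combining them, whereas this sketch uses the raw values. The function names and the vectors are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(vectors, a, b, c, topn=1):
    """Answer "a is to b as c is to ?" by ranking words near vec(b) - vec(a) + vec(c)."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words are excluded from the candidates
    candidates = [(word, cosine_similarity(target, vec))
                  for word, vec in vectors.items() if word not in (a, b, c)]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:topn]

# Toy vectors (hypothetical): axis 0 = "is a country", axes 1-2 = region
toy_vectors = {
    'Madrid': [0.0, 1.0, 0.0],
    'Spain':  [1.0, 1.0, 0.0],
    'Athens': [0.0, 0.0, 1.0],
    'Greece': [1.0, 0.0, 1.0],
    'Italy':  [1.0, 0.5, 0.5],
    'Rome':   [0.0, 0.5, 0.5],
}

print(analogy(toy_vectors, 'Madrid', 'Spain', 'Athens', topn=2))
```

In this toy space the top answer is 'Greece', because the offset from 'Madrid' to 'Spain' is the same as the offset from 'Athens' to 'Greece'.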

64. Experiments with analogy data

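with open('questions-words.txt') as f:
    questions = f.readlines()
with open('64.txt','w') as f:
    for i,question in enumerate(questions):
        words = question.split()
        if len(words)==4:
            # Analogy question: find the word closest to vec(w1) - vec(w0) + vec(w2)
            ans = model.most_similar(positive=[words[1],words[2]], negative=[words[0]],topn=1)[0]
            # Append the predicted word and its similarity to the line
            words += [ans[0], str(ans[1])]
            output = ' '.join(words)+'\n'
        else:
            # Section header lines (e.g. ': capital-common-countries') are copied unchanged
            output = question
        f.write(output)
        # Progress indicator
        if (i%100==0):
            print (i)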

65. Correct answer rate in analogy tasks

cnt = 0
ok = 0
with open('64.txt','r') as f:
    questions = f.readlines()
for question in questions:
    words = question.split()
    if len(words)==6:
        cnt += 1
        if (words[3]==words[4]):
            ok +=1
print (ok/cnt)
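
The code above reports a single overall accuracy. The problem asks for separate rates for semantic and syntactic analogies, so here is a minimal sketch of that split. It assumes the section headers of questions-words.txt (lines starting with `:`) were copied into 64.txt and that sections whose names start with `gram` are the syntactic ones. The function name is hypothetical, and the sample lines use made-up predictions and scores.

```python
def accuracy_by_type(lines):
    """Return accuracy for semantic and syntactic analogy questions separately."""
    counts = {'semantic': [0, 0], 'syntactic': [0, 0]}  # [correct, total]
    kind = None
    for line in lines:
        words = line.split()
        if len(words) == 2 and words[0] == ':':
            # Section header: 'gram...' sections are treated as syntactic (assumption)
            kind = 'syntactic' if words[1].startswith('gram') else 'semantic'
        elif len(words) == 6 and kind is not None:
            counts[kind][1] += 1
            if words[3] == words[4]:   # gold answer vs. predicted word
                counts[kind][0] += 1
    return {k: (ok / total if total else None)
            for k, (ok, total) in counts.items()}

# Sample lines in the 64.txt layout (predictions and scores are made up)
sample = [
    ': capital-common-countries',
    'Athens Greece Baghdad Iraq Iraq 0.63',
    'Athens Greece Bangkok Thailand Thailand 0.71',
    ': gram1-adjective-to-adverb',
    'amazing amazingly apparent apparently apparent 0.55',
    'amazing amazingly calm calmly calmly 0.60',
]

print(accuracy_by_type(sample))
```

To apply it to the real output, pass the list returned by `f.readlines()` for 64.txt.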

66. Evaluation by WordSimilarity-353

import pandas as pd
df = pd.read_csv('wordsim353/combined.csv')
sim = []
for i in range(len(df)):
    line = df.iloc[i]
    sim.append(model.similarity(line['Word 1'],line['Word 2']))
df['w2v'] = sim 
df[['Human (mean)', 'w2v']].corr(method='spearman')
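
Spearman correlation is the Pearson correlation computed on ranks rather than on raw values, with tied values sharing their average rank. The following is a minimal pure-Python sketch of that definition, meant only to illustrate what `corr(method='spearman')` measures. The function names and the sample scores are hypothetical.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical human ratings and model similarities for four word pairs
human = [7.35, 10.0, 6.77, 1.5]
w2v = [0.5, 0.9, 0.6, 0.1]
print(round(spearman(human, w2v), 3))
```

Because only the ordering matters, the two score scales (0-10 for humans, cosine similarity for the model) do not need to match.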

67. k-means clustering

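from sklearn.cluster import KMeans
with open('country.txt','r') as f:
    lines = f.readlines()
countries = []
for line in lines:
    # The country name is taken from the last field of each line
    country = line.split(' ')[-1].replace('\n','')
    countries.append(country)
# Names that differ from the keys used in the word2vec vocabulary
dic = {'United States of America':'United_States', 'Russian Federation':'Russia'}
ng = 0  # number of countries not found in the model
vec = []
target_countries = []
for c in countries:
    for k,v in dic.items():
        c = c.replace(k,v)
    c = c.replace(' ','_').replace('-','_').replace('_and_','_')
    try:
        vec.append(model[c])
        target_countries.append(c)
    except KeyError:
        # The country is not in the model's vocabulary; skip it
        ng += 1
kmeans = KMeans(n_clusters=5, random_state=0)
kmeans.fit(vec)
for c,l in zip(target_countries, kmeans.labels_):
    print (c,l)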
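
The name clean-up in the loop above can be checked in isolation before touching the model. The sketch below wraps the same replacement chain in a helper and tests it against a small stand-in vocabulary. The helper name, the toy vocabulary, and the sample names are hypothetical; the real vocabulary is whatever the loaded model contains.

```python
# Same mapping as in the listing above
ALIASES = {'United States of America': 'United_States',
           'Russian Federation': 'Russia'}

def to_vocab_key(name, aliases=ALIASES):
    """Convert a country name into the underscore-joined form used as a lookup key."""
    for long_name, short_name in aliases.items():
        name = name.replace(long_name, short_name)
    return name.replace(' ', '_').replace('-', '_').replace('_and_', '_')

# Stand-in for the model vocabulary (hypothetical)
toy_vocab = {'United_States', 'Russia', 'Trinidad_Tobago', 'Japan'}

names = ['United States of America', 'Russian Federation',
         'Trinidad and Tobago', 'Japan', 'Atlantis']

found, missing = [], []
for name in names:
    key = to_vocab_key(name)
    (found if key in toy_vocab else missing).append(key)

print(found)
print(missing)
```

Printing the missing keys makes it easy to see which names still need an entry in the alias dictionary.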

68. Ward's method clustering

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
plt.figure(figsize=(32.0, 24.0))
# Hierarchical clustering of the country vectors from problem 67 using Ward's method
link = linkage(vec, method='ward')
dendrogram(link, labels=target_countries,leaf_rotation=90,leaf_font_size=10)
plt.show()

69. Visualization by t-SNE

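import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
# Project the country vectors from problem 67 down to two dimensions
vec_embedded = TSNE(n_components=2).fit_transform(vec)
vec_embedded_t = list(zip(*vec_embedded)) #Transpose
fig, ax = plt.subplots(figsize=(16, 12))
# Plot the points so the axes are scaled to the data, then label each one
ax.scatter(*vec_embedded_t)
for i, c in enumerate(target_countries):
    ax.annotate(c, (vec_embedded[i][0],vec_embedded[i][1]))
plt.show()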
