[PYTHON] [Language processing 100 knocks 2020] Chapter 7: Word vector

Introduction

The 2020 edition of 100 Language Processing Knocks, a well-known collection of natural language processing problems, has been released. This article summarizes the results of solving "Chapter 7: Word Vectors" out of Chapters 1 to 10 listed below.

- Chapter 1: Preparatory Movement
- Chapter 2: UNIX Commands
- Chapter 3: Regular Expressions
- Chapter 4: Morphological Analysis
- Chapter 5: Dependency Analysis
- Chapter 6: Machine Learning
- Chapter 7: Word Vectors
- Chapter 8: Neural Networks
- Chapter 9: RNN, CNN
- Chapter 10: Machine Translation

Advance preparation

We use Google Colaboratory for the answers. For details on how to set up and use Google Colaboratory, see this article. A notebook containing the execution results of the answers below is available on GitHub.

Chapter 7: Word Vector

Create programs that perform the following operations on word vectors (word embeddings), which represent the meaning of a word as a real-valued vector.

60. Reading and displaying word vectors

Download the pretrained word vectors (3 million words/phrases, 300 dimensions) trained on the Google News dataset (about 100 billion words), and display the word vector for "United States". Note that "United States" is internally represented as "United_States".

First, download the specified pretrained word vectors.

FILE_ID = "0B7XkCwpI5KDYNlNUTTlSS21pQmM"
FILE_NAME = "GoogleNews-vectors-negative300.bin.gz"
!wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=$FILE_ID' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=$FILE_ID" -O $FILE_NAME && rm -rf /tmp/cookies.txt
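
Note that this Google Drive download trick is fragile and may stop working. As an alternative (an assumption on my part, not part of the original answer), the same pretrained vectors are also registered in gensim-data and can be fetched through Gensim's downloader API, at the cost of an approximately 1.6 GB download:

import gensim.downloader as api

# Loads the same GoogleNews vectors as a KeyedVectors object
model = api.load('word2vec-google-news-300')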

Next, we load the word vectors using Gensim, a library used in a variety of natural language processing tasks.

from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin.gz', binary=True)

Once loaded, you can obtain a word vector simply by specifying the word you want to vectorize.

model['United_States']

output


array([-3.61328125e-02, -4.83398438e-02,  2.35351562e-01,  1.74804688e-01,
       -1.46484375e-01, -7.42187500e-02, -1.01562500e-01, -7.71484375e-02,
        1.09375000e-01, -5.71289062e-02, -1.48437500e-01, -6.00585938e-02,
        1.74804688e-01, -7.71484375e-02,  2.58789062e-02, -7.66601562e-02,
       -3.80859375e-02,  1.35742188e-01,  3.75976562e-02, -4.19921875e-02,
...
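
As a quick sanity check, the returned value is simply a 300-dimensional NumPy array:

us_vec = model['United_States']
print(type(us_vec), us_vec.shape)  # <class 'numpy.ndarray'> (300,)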

61. Word similarity

Calculate the cosine similarity between "United States" and "U.S.".

Here we use the `similarity` method. Given two words, it calculates the cosine similarity between them.

model.similarity('United_States', 'U.S.')

output


0.73107743
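
For reference, similarity here is plain cosine similarity, so the same value can be reproduced directly from the vectors; a minimal sketch using NumPy:

import numpy as np

v1, v2 = model['United_States'], model['U.S.']
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))  # ~0.7311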

62. 10 words with high similarity

Output 10 words with high cosine similarity to "United States" and their similarity.

Here we use the `most_similar` method. Given a word, it returns the `topn` most similar words along with their similarities.

model.most_similar('United_States', topn=10)

output


[('Unites_States', 0.7877248525619507),
 ('Untied_States', 0.7541370391845703),
 ('United_Sates', 0.74007248878479),
 ('U.S.', 0.7310774326324463),
 ('theUnited_States', 0.6404393911361694),
 ('America', 0.6178410053253174),
 ('UnitedStates', 0.6167312264442444),
 ('Europe', 0.6132988929748535),
 ('countries', 0.6044804453849792),
 ('Canada', 0.6019070148468018)]

63. Analogy by additive composition

Subtract the "Madrid" vector from the "Spain" word vector, calculate the vector by adding the "Athens" vector, and output 10 words with high similarity to that vector and their similarity.

The `most_similar` method used in the previous question can also take the vectors to add and subtract as its `positive` and `negative` arguments, and then returns words with high similarity to the resulting vector. Here, following the problem statement, we display words highly similar to the vector `Spain - Madrid + Athens`. Greece comes in first place.

vec = model['Spain'] - model['Madrid'] + model['Athens']
model.most_similar(positive=['Spain', 'Athens'], negative=['Madrid'], topn=10)

output


[('Greece', 0.6898481249809265),
 ('Aristeidis_Grigoriadis', 0.5606848001480103),
 ('Ioannis_Drymonakos', 0.5552908778190613),
 ('Greeks', 0.545068621635437),
 ('Ioannis_Christou', 0.5400862693786621),
 ('Hrysopiyi_Devetzi', 0.5248444676399231),
 ('Heraklio', 0.5207759737968445),
 ('Athens_Greece', 0.516880989074707),
 ('Lithuania', 0.5166866183280945),
 ('Iraklion', 0.5146791934967041)]
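
The manually computed vec can also be passed to most_similar directly. Note that in this form the input words are not excluded from the candidates, so 'Spain' itself tends to appear near the top of the ranking:

model.most_similar(positive=[vec], topn=10)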

64. Experiments with analogy data

Download the word analogy evaluation data. For each example, calculate vec(word in the second column) - vec(word in the first column) + vec(word in the third column), find the word with the highest similarity to that vector, and obtain its similarity. Append the obtained word and similarity to the end of each line.

Download the specified data.

!wget http://download.tensorflow.org/data/questions-words.txt
# Check the first 10 lines
!head -10 questions-words.txt

output


: capital-common-countries
Athens Greece Baghdad Iraq
Athens Greece Bangkok Thailand
Athens Greece Beijing China
Athens Greece Berlin Germany
Athens Greece Bern Switzerland
Athens Greece Cairo Egypt
Athens Greece Canberra Australia
Athens Greece Hanoi Vietnam
Athens Greece Havana Cuba

This data includes a set for evaluating semantic analogies, such as (Athens - Greece, Tokyo - Japan), and a set for evaluating syntactic analogies, such as (walk - walks, write - writes). It consists of the following 14 categories in total; the first 5 above correspond to the former and the rest to the latter.

No. category
1 capital-common-countries
2 capital-world
3 currency
4 city-in-state
5 family
6 gram1-adjective-to-adverb
7 gram2-opposite
8 gram3-comparative
9 gram4-superlative
10 gram5-present-participle
11 gram6-nationality-adjective
12 gram7-past-tense
13 gram8-plural
14 gram9-plural-verbs

Read the data line by line, calculate the analogy word and its similarity for each example, and write out the formatted result.

with open('./questions-words.txt', 'r') as f1, open('./questions-words-add.txt', 'w') as f2:
  for line in f1:  # read each line from f1, append the predicted word and its similarity, and write to f2
    line = line.split()
    if line[0] == ':':
      category = line[1]
    else:
      word, cos = model.most_similar(positive=[line[1], line[2]], negative=[line[0]], topn=1)[0]
      f2.write(' '.join([category] + line + [word, str(cos) + '\n']))
!head -10 questions-words-add.txt

output


capital-common-countries Athens Greece Baghdad Iraq Iraqi 0.6351870894432068
capital-common-countries Athens Greece Bangkok Thailand Thailand 0.7137669324874878
capital-common-countries Athens Greece Beijing China China 0.7235777974128723
capital-common-countries Athens Greece Berlin Germany Germany 0.6734622120857239
capital-common-countries Athens Greece Bern Switzerland Switzerland 0.4919748306274414
capital-common-countries Athens Greece Cairo Egypt Egypt 0.7527809739112854
capital-common-countries Athens Greece Canberra Australia Australia 0.583732545375824
capital-common-countries Athens Greece Hanoi Vietnam Viet_Nam 0.6276341676712036
capital-common-countries Athens Greece Havana Cuba Cuba 0.6460992097854614
capital-common-countries Athens Greece Helsinki Finland Finland 0.6899983882904053

65. Accuracy on the analogy task

Measure the accuracy of the semantic analogies and the syntactic analogies using the execution results of problem 64.

We compute the accuracy for each group of categories: those whose names start with gram are syntactic, the rest are semantic.

with open('./questions-words-add.txt', 'r') as f:
  sem_cnt = 0
  sem_cor = 0
  syn_cnt = 0
  syn_cor = 0
  for line in f:
    line = line.split()
    if not line[0].startswith('gram'):
      sem_cnt += 1
      if line[4] == line[5]:
        sem_cor += 1
    else:
      syn_cnt += 1
      if line[4] == line[5]:
        syn_cor += 1

print(f'Semantic analogy accuracy: {sem_cor/sem_cnt:.3f}')
print(f'Syntactic analogy accuracy: {syn_cor/syn_cnt:.3f}')

output


Semantic analogy accuracy: 0.731
Syntactic analogy accuracy: 0.740
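
Incidentally, Gensim has a built-in analogy evaluator that reads questions-words.txt directly. A minimal sketch; the numbers may differ slightly from the manual count above because the evaluator matches case-insensitively and, by default, restricts the search to the 300,000 most frequent words (restrict_vocab):

# Overall accuracy plus a per-category breakdown in sections
score, sections = model.evaluate_word_analogies('./questions-words.txt')
print(f'Overall accuracy: {score:.3f}')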

66. Evaluation by WordSimilarity-353

Download the evaluation data of The WordSimilarity-353 Test Collection, and calculate the Spearman correlation coefficient between the ranking of similarities computed from the word vectors and the ranking of human similarity judgments.

This data provides human-rated similarity scores for pairs of words. We calculate the word-vector similarity for each pair, then compute Spearman's rank correlation coefficient between the two sets of rankings.

!wget http://www.gabrilovich.com/resources/data/wordsim353/wordsim353.zip
!unzip wordsim353.zip

output


Archive:  wordsim353.zip
  inflating: combined.csv            
  inflating: set1.csv                
  inflating: set2.csv                
  inflating: combined.tab            
  inflating: set1.tab                
  inflating: set2.tab                
  inflating: instructions.txt  

!head -10 './combined.csv'

output


Word 1,Word 2,Human (mean)
love,sex,6.77
tiger,cat,7.35
tiger,tiger,10.00
book,paper,7.46
computer,keyboard,7.62
computer,internet,7.58
plane,car,5.77
train,car,6.31
telephone,communication,7.50

ws353 = []
with open('./combined.csv', 'r') as f:
  next(f)  # skip the header line
  for line in f:  # read each pair and append the word-vector similarity
    line = [s.strip() for s in line.split(',')]
    line.append(model.similarity(line[0], line[1]))
    ws353.append(line)

# Check the first few entries
for i in range(5):
  print(ws353[i])

output


['love', 'sex', '6.77', 0.2639377]
['tiger', 'cat', '7.35', 0.5172962]
['tiger', 'tiger', '10.00', 0.99999994]
['book', 'paper', '7.46', 0.3634626]
['computer', 'keyboard', '7.62', 0.39639163]

import numpy as np
from scipy.stats import spearmanr

# Calculate Spearman's rank correlation coefficient
# (cast the scores to float so they are ranked numerically rather than as strings)
human = np.array(ws353).T[2].astype(float)
w2v = np.array(ws353).T[3].astype(float)
correlation, pvalue = spearmanr(human, w2v)

print(f'Spearman correlation coefficient: {correlation:.3f}')

output


Spearman correlation coefficient: 0.685
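
Gensim also has a built-in evaluator for word-pair similarity data that accepts the tab-separated combined.tab unpacked above. A minimal sketch; the result may differ slightly from the manual computation because the evaluator matches case-insensitively and skips unparseable lines such as the header:

# Returns (Pearson, Spearman, ratio of out-of-vocabulary pairs)
pearson, spearman, oov_ratio = model.evaluate_word_pairs('./combined.tab')
print(f'Spearman correlation coefficient: {spearman[0]:.3f}')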

67. k-means clustering

Extract the word vectors related to country names and execute k-means clustering with the number of clusters k = 5.

Since we could not find a suitable source for a list of country names, we collect them from the word analogy evaluation data.

# Collect the country names
countries = set()
with open('./questions-words-add.txt') as f:
  for line in f:
    line = line.split()
    if line[0] in ['capital-common-countries', 'capital-world']:
      countries.add(line[2])
    elif line[0] in ['currency', 'gram6-nationality-adjective']:
      countries.add(line[1])
countries = list(countries)

# Get the word vector for each country
countries_vec = [model[country] for country in countries]

from sklearn.cluster import KMeans

# k-means clustering
kmeans = KMeans(n_clusters=5)
kmeans.fit(countries_vec)
for i in range(5):
    cluster = np.where(kmeans.labels_ == i)[0]
    print('cluster', i)
    print(', '.join([countries[k] for k in cluster]))

output


cluster 0
Taiwan, Afghanistan, Iraq, Lebanon, Indonesia, Turkey, Egypt, Libya, Syria, Korea, China, Nepal, Cambodia, India, Bhutan, Qatar, Laos, Malaysia, Iran, Vietnam, Oman, Bahrain, Pakistan, Thailand, Bangladesh, Morocco, Jordan, Israel
cluster 1
Madagascar, Uganda, Botswana, Guinea, Malawi, Tunisia, Nigeria, Mauritania, Kenya, Zambia, Algeria, Mozambique, Ghana, Niger, Somalia, Angola, Mali, Senegal, Sudan, Zimbabwe, Gambia, Eritrea, Liberia, Burundi, Gabon, Rwanda, Namibia
cluster 2
Suriname, Uruguay, Tuvalu, Nicaragua, Colombia, Belize, Venezuela, Ecuador, Fiji, Peru, Guyana, Jamaica, Brazil, Honduras, Samoa, Bahamas, Dominica, Philippines, Cuba, Chile, Mexico, Argentina
cluster 3
Netherlands, Sweden, USA, Ireland, Canada, Spain, Malta, Greenland, Europe, Greece, France, Austria, Norway, Finland, Australia, Japan, Iceland, England, Italy, Denmark, Belgium, Switzerland, Germany, Portugal, Liechtenstein
cluster 4
Croatia, Belarus, Uzbekistan, Latvia, Tajikistan, Slovakia, Ukraine, Hungary, Albania, Poland, Montenegro, Georgia, Russia, Kyrgyzstan, Armenia, Romania, Cyprus, Lithuania, Azerbaijan, Serbia, Slovenia, Turkmenistan, Moldova, Bulgaria, Estonia, Kazakhstan, Macedonia
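
Note that k-means depends on random initialization, so the cluster numbering, and occasionally the membership, can change between runs. Fixing the seed makes the result reproducible; a minimal sketch:

kmeans = KMeans(n_clusters=5, random_state=123)
kmeans.fit(countries_vec)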

68. Ward's method clustering

Execute hierarchical clustering by Ward's method for word vectors related to country names. Furthermore, visualize the clustering result as a dendrogram.

from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

plt.figure(figsize=(15, 5))
Z = linkage(countries_vec, method='ward')
dendrogram(Z, labels=countries)
plt.show()

[68.png: dendrogram of the Ward's-method clustering of country-name word vectors]
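
The linkage matrix Z can also be cut into a fixed number of flat clusters, which makes the result directly comparable with the k-means clusters of problem 67; a minimal sketch using scipy's fcluster:

from scipy.cluster.hierarchy import fcluster

labels = fcluster(Z, t=5, criterion='maxclust')  # cluster labels 1..5
for i in range(1, 6):
    print('cluster', i)
    print(', '.join([countries[k] for k in np.where(labels == i)[0]]))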

69. Visualization by t-SNE

Visualize the vector space of word vectors related to country names with t-SNE.

Compress the word vectors into two dimensions with t-SNE and visualize them in a scatter plot.

!pip install bhtsne
import bhtsne

embedded = bhtsne.tsne(np.array(countries_vec).astype(np.float64), dimensions=2, rand_seed=123)
plt.figure(figsize=(10, 10))
plt.scatter(np.array(embedded).T[0], np.array(embedded).T[1])
for (x, y), name in zip(embedded, countries):
    plt.annotate(name, (x, y))
plt.show()

[69.png: scatter plot of country-name word vectors embedded in two dimensions by t-SNE]
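
If installing bhtsne is a problem, scikit-learn's TSNE gives a comparable visualization (an alternative, not the original answer; the exact layout will differ because t-SNE is stochastic and the two implementations are not identical):

from sklearn.manifold import TSNE

embedded = TSNE(n_components=2, random_state=123).fit_transform(np.array(countries_vec))
plt.figure(figsize=(10, 10))
plt.scatter(embedded[:, 0], embedded[:, 1])
for (x, y), name in zip(embedded, countries):
    plt.annotate(name, (x, y))
plt.show()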

In conclusion

100 Language Processing Knocks is designed so that you can learn not only natural language processing itself but also basic data handling and general-purpose machine learning. Even those studying machine learning through online courses will find it an excellent source of hands-on practice, so please give it a try.

To all 100 questions
