[PYTHON] 100 Language Processing Knock-88: 10 Words with High Similarity

This is a record of solving problem 88, "10 words with high similarity," from the 2015 edition of Language Processing 100 Knock. The task extracts words similar to a given word, the kind of processing you might also want to run over your mailbox or meeting minutes. Technically it is almost the same as the previous problem.

Reference links

| Link | Remarks |
| --- | --- |
| 088. 10 words with high similarity.ipynb | Answer program (GitHub link) |
| 100 amateur language processing knocks: 88 | A site I am always indebted to when working through the 100 knocks |

Environment

| Type | Version | Notes |
| --- | --- | --- |
| OS | Ubuntu 18.04.01 LTS | Running virtually |
| pyenv | 1.2.15 | I use pyenv because I sometimes use multiple Python environments |
| Python | 3.6.9 | Python 3.6.9 on pyenv; there is no deep reason for not using the 3.7 or 3.8 series. Packages are managed with venv |

In the above environment, I use the following additional Python packages, installed with plain pip.

| Package | Version |
| --- | --- |
| numpy | 1.17.4 |
| pandas | 0.25.3 |

Task

Chapter 9: Vector Space Method (I)

enwiki-20150112-400-r10-105752.txt.bz2 is a bzip2-compressed text of 105,752 articles, randomly sampled at a 1/10 rate from the English Wikipedia articles as of January 12, 2015 that consist of more than 400 words. Using this text as a corpus, we want to learn vectors (distributed representations) that express the meanings of words. In the first half of Chapter 9, the process of learning word vectors is implemented in several steps by applying principal component analysis to the word-context co-occurrence matrix built from the corpus. In the second half of Chapter 9, the learned word vectors (300 dimensions) are used to calculate word similarity and to solve analogies.

Note that a straightforward implementation of problem 83 requires a large amount (about 7 GB) of main memory. If you run out of memory, either devise a more economical process or use the 1/100 sampling corpus enwiki-20150112-400-r100-10576.txt.bz2.

This time, the *1/100 sampling corpus [enwiki-20150112-400-r100-10576.txt.bz2](http://www.cl.ecei.tohoku.ac.jp/nlp100/data/enwiki-20150112-400-r100-10576.txt.bz2)* is used.

88. 10 words with high similarity

Read the word meaning vectors obtained in problem 85, and output 10 words with high cosine similarity to "England", together with their similarities.

Answer

Answer program: [088. 10 words with high similarity.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/09.%E3%83%99%E3%82%AF%E3%83%88%E3%83%AB%E7%A9%BA%E9%96%93%E6%B3%95%20(I)/088.%E9%A1%9E%E4%BC%BC%E5%BA%A6%E3%81%AE%E9%AB%98%E3%81%84%E5%8D%98%E8%AA%9E10%E4%BB%B6.ipynb)

import numpy as np
import pandas as pd

# group_t (a DataFrame whose index is the vocabulary) is assumed to be
# loaded beforehand, as in the previous problems

# Since no keyword argument was given when saving, the array is stored under 'arr_0'
matrix_x300 = np.load('085.matrix_x300.npz')['arr_0']

print('matrix_x300 Shape:', matrix_x300.shape)

# Read the word vector for 'England' and compute its norm
v1 = matrix_x300[group_t.index.get_loc('England')]
v1_norm = np.linalg.norm(v1)


# Cosine similarity calculation
def get_cos_similarity(v2):

    # Return -1 if the vector is all zeros
    if np.count_nonzero(v2) == 0:
        return -1
    else:
        return np.dot(v1, v2) / (v1_norm * np.linalg.norm(v2))

cos_sim = [get_cos_similarity(matrix_x300[i]) for i in range(len(group_t))]
print('Cosine Similarity result length:', len(cos_sim))

# Sort while keeping track of the original indices
cos_sim_sorted = np.argsort(cos_sim)

# Output the last 11 entries of the ascending-sorted array, one by one
# (the top entry is 'England' itself)
for index in cos_sim_sorted[:-12:-1]:
    print('{}\t{}'.format(group_t.index[index], cos_sim[index]))
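The final loop relies on `np.argsort` plus a reversed slice. Since `argsort` sorts in ascending order, walking backwards from the end of the index array yields the most similar words first. A minimal sketch (not part of the original answer, using toy values) of how that slice extracts the top entries:

```python
import numpy as np

# Toy similarity list: index 2 holds the highest value
sims = [0.3, 0.9, 1.0, 0.5, 0.1]

# argsort returns indices in ascending order of similarity
order = np.argsort(sims)

# Reversed slice walks from the end, so the largest come first;
# the real task uses [:-12:-1] to take the top 11 (incl. 'England' itself)
top3 = [int(i) for i in order[:-4:-1]]
print(top3)  # [2, 1, 3]
```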

Answer commentary

The cosine similarity calculation is factored out into a function. Using the count_nonzero function, the function returns -1 when the vector is all zeros, since cosine similarity is undefined for a zero vector.

# Cosine similarity calculation
def get_cos_similarity(v2):

    # Return -1 if the vector is all zeros
    if np.count_nonzero(v2) == 0:
        return -1
    else:
        return np.dot(v1, v2) / (v1_norm * np.linalg.norm(v2))
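The behavior of this function, including the zero-vector guard, can be checked on toy vectors. This standalone sketch (my own, not from the original notebook) takes both vectors as arguments instead of closing over a global `v1`:

```python
import numpy as np

def cos_sim(v1, v2):
    # A zero vector has no direction, so return -1 instead of dividing by zero
    if np.count_nonzero(v2) == 0:
        return -1
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

a = np.array([1.0, 0.0])
print(cos_sim(a, np.array([1.0, 0.0])))  # same direction -> 1.0
print(cos_sim(a, np.array([0.0, 1.0])))  # orthogonal -> 0.0
print(cos_sim(a, np.array([0.0, 0.0])))  # all zeros -> -1
```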

The results for the whole array are obtained in one go with a list comprehension.

cos_sim = [get_cos_similarity(matrix_x300[i]) for i in range(len(group_t))]

For the above calculation, I thought NumPy's `apply_along_axis` would be faster, but it turned out to be slower, so I did not adopt it.

cos_sim = np.apply_along_axis(get_cos_similarity, 1, matrix_x300)
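`apply_along_axis` is slow because it still calls the Python function once per row. As an aside (my own sketch, not part of the original answer), the per-row loop can be replaced entirely by a single matrix-vector product, which NumPy vectorizes; the small toy matrix below stands in for `matrix_x300`:

```python
import numpy as np

# Hypothetical tiny matrix standing in for matrix_x300 (rows = word vectors)
matrix = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0],
                   [0.0, 0.0]])  # last row: an all-zero vector
v1 = matrix[0]

# Row norms; temporarily set zero norms to 1 to avoid division by zero
norms = np.linalg.norm(matrix, axis=1)
zero_mask = norms == 0
norms[zero_mask] = 1.0

# One matrix-vector product gives every dot product at once
cos_sim = (matrix @ v1) / (norms * np.linalg.norm(v1))
cos_sim[zero_mask] = -1  # match the original guard for all-zero rows
print(cos_sim)
```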

This is the final output. Scotland and Italy are at the top. It is surprising that Japan also appears; perhaps because it, too, is an island country?

England	1.0000000000000002
Scotland	0.6364961613062289
Italy	0.6033905306935802
Wales	0.5961887337227456
Australia	0.5953277272306978
Spain	0.5752511915429617
Japan	0.5611603300967408
France	0.5547284075334182
Germany	0.5539239745925412
United_Kingdom	0.5225684232409136
Cheshire	0.5125286144779688
