[PYTHON] Language processing 100 knock-86: Word vector display

This is the record of the 86th "Word vector display" of Language processing 100 knock 2015. This time, just display the word vector compressed to 300 dimensions with Last knock. It's very easy because you only see the results. Chapter 9: Vector Space Method (I) used to be a heavy knock, but after that it is basic. There is not much heavy processing because it only uses the result.

Reference link

Link Remarks
086.Display word vector.ipynb Answer program GitHub link
100 amateur language processing knocks:86 I am always indebted to you by knocking 100 language processing

environment

type version Contents
OS Ubuntu18.04.01 LTS It is running virtually
pyenv 1.2.15 I use pyenv because I sometimes use multiple Python environments
Python 3.6.9 python3 on pyenv.6.I'm using 9
3.7 or 3.There is no deep reason not to use 8 series
Packages are managed using venv

In the above environment, I am using the following additional Python packages. Just install with regular pip.

type version
numpy 1.17.4
pandas 0.25.3

Task

Chapter 9: Vector Space Method (I)

enwiki-20150112-400-r10-105752.txt.bz2 Is the text of 105,752 articles randomly sampled 1/10 from the English Wikipedia articles as of January 12, 2015, which consist of more than 400 words, compressed in bzip2 format. is there. Using this text as a corpus, I want to learn a vector (distributed expression) that expresses the meaning of a word. In the first half of Chapter 9, principal component analysis is applied to the word context co-occurrence matrix created from the corpus, and the process of learning word vectors is implemented by dividing it into several processes. In the latter half of Chapter 9, the word vector (300 dimensions) obtained by learning is used to calculate the similarity of words and perform analogy.

Note that if problem 83 is implemented obediently, a large amount (about 7GB) of main memory is required. If you run out of memory, devise a process or 1/100 sampling corpus enwiki-20150112-400-r100-10576.txt.bz2 Use /nlp100/data/enwiki-20150112-400-r100-10576.txt.bz2).

This time * "1/100 sampling corpus [enwiki-20150112-400-r100-10576.txt.bz2](http://www.cl.ecei.tohoku.ac.jp/nlp100/data/enwiki-20150112-" 400-r100-10576.txt.bz2) ”* is used.

86. Display of word vector

Read the word meaning vector obtained in> 85 and display the "United States" vector. However, note that "United States" is internally referred to as "United_States".

Answer

Answer program [086. Display word vector.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/09.%E3%83%99%E3%82%AF%E3%83%88% E3% 83% AB% E7% A9% BA% E9% 96% 93% E6% B3% 95% 20 (I) / 086.% E5% 8D% 98% E8% AA% 9E% E3% 83% 99% E3% 82% AF% E3% 83% 88% E3% 83% AB% E3% 81% AE% E8% A1% A8% E7% A4% BA.ipynb)

import numpy as np
import pandas as pd

#I didn't specify any arguments when saving'arr_0'Stored in
matrix_x300 = np.load('085.matrix_x300.npz')['arr_0']

print('matrix_x300 Shape:', matrix_x300.shape)

group_t = pd.read_pickle('./083_group_t.zip')

# 'United States'Word vector display
print(matrix_x300[group_t.index.get_loc('United_States')])

Answer commentary

Reads the npz format file saved by knocking last time. Unless otherwise specified when saving, it seems to be stored in ʻarr_0`. The reason for using the index is that the npz format can store multiple arrays together.

#I didn't specify any arguments when saving'arr_0'Stored in
matrix_x300 = np.load('085.matrix_x300.npz')['arr_0']

Since there is no word information in the array read above, ["Language processing 100 knocks-83 (using pandas): Word / context frequency measurement"](https://qiita.com/FukuharaYohei/items/ 9696afb342aa367ae5d1) Reads the target word (Target Word) information saved as a dictionary.

group_t = pd.read_pickle('./083_group_t.zip')

All you have to do is display the vector.

# 'United States'Word vector display
print(matrix_x300[group_t.index.get_loc('United_States')])

Since it is a list of numbers, it does not make much sense to write it, but the first few elements are output like this.

[ 3.54543797e+00 -7.83172862e-01  1.02182432e-01  6.22943904e+00
  2.48960832e+00 -1.19176940e+00 -2.23164453e+00  3.68785814e-01
Omitted thereafter

Recommended Posts

Language processing 100 knock-86: Word vector display
100 Language Processing Knock 2020 Chapter 7: Word Vector
100 Language Processing Knock-51: Word Clipping
100 Language Processing Knock-87: Word Similarity
100 Language Processing Knock (2020): 28
[Language processing 100 knocks 2020] Chapter 7: Word vector
100 Language Processing Knock (2020): 38
100 language processing knock 00 ~ 02
100 language processing knock 2020 [00 ~ 39 answer]
100 language processing knock 2020 [00-79 answer]
100 language processing knock 2020 [00 ~ 69 answer]
100 Language Processing Knock 2020 Chapter 1
100 Amateur Language Processing Knock: 17
100 Language Processing Knock-52: Stemming
100 Language Processing Knock Chapter 1
100 Amateur Language Processing Knock: 07
100 Language Processing Knock 2020 Chapter 3
100 Language Processing Knock 2020 Chapter 2
100 Amateur Language Processing Knock: 09
100 Amateur Language Processing Knock: 47
100 Language Processing Knock-53: Tokenization
100 Amateur Language Processing Knock: 97
100 language processing knock 2020 [00 ~ 59 answer]
100 Amateur Language Processing Knock: 67
100 Language Processing with Python Knock 2015
100 Language Processing Knock-57: Dependency Analysis
100 language processing knock-50: sentence break
100 Language Processing Knock Chapter 1 (Python)
100 Language Processing Knock Chapter 2 (Python)
Natural language processing 3 Word continuity
100 Language Processing Knock-25: Template Extraction
I tried 100 language processing knock 2020
100 language processing knock-56: co-reference analysis
Solving 100 Language Processing Knock 2020 (01. "Patatokukashi")
Natural language processing 2 Word similarity
100 Amateur Language Processing Knock: Summary
100 Language Processing Knock-36 (using pandas): Frequency of word occurrence
100 Language Processing Knock-83 (using pandas): Measuring word / context frequency
100 Language Processing Knock 2020 Chapter 2: UNIX Commands
100 Language Processing Knock 2015 Chapter 5 Dependency Analysis (40-49)
100 Language Processing Knock with Python (Chapter 1)
100 Language Processing Knock Chapter 1 in Python
100 Language Processing Knock-84 (using pandas): Creating a word context matrix
100 Language Processing Knock 2020 Chapter 4: Morphological Analysis
100 language processing knock-76 (using scikit-learn): labeling
100 Language Processing Knock with Python (Chapter 3)
100 Language Processing Knock: Chapter 1 Preparatory Movement
100 Language Processing Knock Chapter 4: Morphological Analysis
100 Language Processing Knock 2020 Chapter 10: Machine Translation (90-98)
100 Language Processing Knock 2020 Chapter 5: Dependency Analysis
100 language processing knock-42: Display of the phrase of the person concerned and the person concerned
100 Language Processing Knock-28: MediaWiki Markup Removal
100 Language Processing Knock 2020 Chapter 8: Neural Net
100 Language Processing Knock-59: Analysis of S-expressions
Python beginner tried 100 language processing knock 2015 (05 ~ 09)
100 Language Processing Knock-31 (using pandas): Verb
100 language processing knock 2020 "for Google Colaboratory"
I tried 100 language processing knock 2020: Chapter 1
100 Language Processing Knock 2020 Chapter 1: Preparatory Movement
100 language processing knock-73 (using scikit-learn): learning
100 Language Processing Knock Chapter 1 by Python