[PYTHON] 100 Language Processing Knock-84 (using pandas): Creating a word context matrix

This is the record of the 84th task, "Creating a word context matrix", from the 2015 edition of the 100 Language Processing Knock. The task builds a roughly 400,000 x 400,000 matrix, i.e. about 160 billion elements. That sounds enormous, but almost all elements are 0, so it is a sparse matrix, and even when saved to a file it takes only about 7 MB. The scipy package that handles sparse matrices is amazing.

Reference link

| Link | Remarks |
|---|---|
| 084. Creating a word context matrix.ipynb | Answer program (GitHub link) |
| 100 amateur language processing knocks: 84 | I am always indebted to this series of knock articles |
| 100 language processing knock 2015 version (83, 84) | I referred to it for Chapter 9 |

environment

| type | version | Contents |
|---|---|---|
| OS | Ubuntu 18.04.01 LTS | Running on a virtual machine |
| pyenv | 1.2.15 | I use pyenv because I sometimes use multiple Python environments |
| Python | 3.6.9 | Python 3.6.9 on pyenv. There is no deep reason for not using 3.7 or 3.8. Packages are managed with venv |

In the above environment, I am using the following additional Python packages. Just install with regular pip.

| type | version |
|---|---|
| pandas | 0.25.3 |
| scipy | 1.4.1 |

Task

Chapter 9: Vector Space Method (I)

enwiki-20150112-400-r10-105752.txt.bz2 is a bzip2-compressed text of 105,752 articles, randomly sampled at 1/10 from the English Wikipedia articles as of January 12, 2015 that consist of more than 400 words. Using this text as a corpus, we want to learn vectors (distributed representations) that express the meaning of words. In the first half of Chapter 9, the process of learning word vectors is implemented in several steps by applying principal component analysis to a word-context co-occurrence matrix built from the corpus. In the second half of Chapter 9, the learned word vectors (300 dimensions) are used to compute word similarities and solve analogies.

Note that a straightforward implementation of problem 83 requires a large amount (about 7 GB) of main memory. If you run out of memory, devise the processing or use the 1/100 sampling corpus enwiki-20150112-400-r100-10576.txt.bz2 (http://www.cl.ecei.tohoku.ac.jp/nlp100/data/enwiki-20150112-400-r100-10576.txt.bz2).

This time, the *1/100 sampling corpus [enwiki-20150112-400-r100-10576.txt.bz2](http://www.cl.ecei.tohoku.ac.jp/nlp100/data/enwiki-20150112-400-r100-10576.txt.bz2)* is used.

84. Creating a word context matrix

Create a word context matrix $X$ using the output of problem 83. Each element $X_{tc}$ of the matrix $X$ is defined as follows.

- If $f(t, c) \geq 10$, then $X_{tc} = \mathrm{PPMI}(t, c) = \max\left\{\log\dfrac{N \times f(t, c)}{f(t, *) \times f(*, c)},\ 0\right\}$
- If $f(t, c) < 10$, then $X_{tc} = 0$

Here, $\mathrm{PPMI}(t, c)$ is a statistic called Positive Pointwise Mutual Information. Note that the numbers of rows and columns of the matrix $X$ are on the order of millions, so it is impossible to hold all elements of the matrix in main memory. Fortunately, most of the elements of $X$ are 0, so we only need to write out the non-zero elements.

Problem supplement

For example, suppose you created the following file for the sentence "I am a boy" in the same way as the previous knock.

I	am
I	a
am	I
am	a
a	am
a	boy
boy	am
boy	a

From that file, as an image, we create the following matrix. In this toy example it is a 4 x 4 matrix and not sparse at all, but this time it is about 400,000 x 400,000, so it becomes a very sparse matrix.

|  | I | am | a | boy |
|---|---|---|---|---|
| I |  | 1 | 1 |  |
| am | 1 |  | 1 |  |
| a |  | 1 |  | 1 |
| boy |  | 1 | 1 |  |
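As a side note, for a toy file like the one above, such a count matrix could be built with pandas alone. The sketch below is a hypothetical illustration, not part of the answer program; the file name 'pairs.txt' is made up and stands for the tab-separated pair file shown above.

import pandas as pd

# Read the tab-separated (target word, context word) pairs; 'pairs.txt' is a hypothetical file name
pairs = pd.read_csv('pairs.txt', sep='\t', header=None, names=['t', 'c'])

# Cross-tabulate target words against context words to get the toy co-occurrence count matrix
print(pd.crosstab(pairs['t'], pairs['c']))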

The matrix above simply sets each element to 1, but here we set the following values instead. Regarding PPMI, the article ["Pointwise Mutual Information (PMI)"](https://camberbridge.github.io/2016/07/08/%E8%87%AA%E5%B7%B1%E7%9B%B8%E4%BA%92%E6%83%85%E5%A0%B1%E9%87%8F-Pointwise-Mutual-Information-PMI-%E3%81%AB%E3%81%A4%E3%81%84%E3%81%A6/) is easy to understand.

- If $f(t, c) \geq 10$, then $X_{tc} = \mathrm{PPMI}(t, c) = \max\left\{\log\dfrac{N \times f(t, c)}{f(t, *) \times f(*, c)},\ 0\right\}$
- If $f(t, c) < 10$, then $X_{tc} = 0$
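To make the definition concrete, here is a minimal sketch of the PPMI value as a standalone function. The function name, argument names, and the sample counts are my own; only N = 68,000,317 comes from the previous knock.

import math

def positive_pmi(n, f_tc, f_t, f_c):
    """max(log(N * f(t,c) / (f(t,*) * f(*,c))), 0)"""
    return max(math.log(n * f_tc / (f_t * f_c)), 0)

# Hypothetical counts: a pair seen 10 times, its target word 1,000 times, its context word 2,000 times
print(positive_pmi(68000317, 10, 1000, 2000))  # roughly 5.83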

Answer

Answer program [084. Creating a word context matrix.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/09.%E3%83%99%E3%82%AF%E3%83%88%E3%83%AB%E7%A9%BA%E9%96%93%E6%B3%95%20(I)/084.%E5%8D%98%E8%AA%9E%E6%96%87%E8%84%88%E8%A1%8C%E5%88%97%E3%81%AE%E4%BD%9C%E6%88%90.ipynb)

import math

import pandas as pd
from scipy import sparse, io

# Read the co-occurrence counts of target word t and context word c, and drop combinations that co-occur 9 times or fewer
def read_tc():
    group_tc = pd.read_pickle('./083_group_tc.zip')
    return group_tc[group_tc > 9]

group_tc = read_tc()

group_t = pd.read_pickle('./083_group_t.zip')
group_c = pd.read_pickle('./083_group_c.zip')

matrix_x = sparse.lil_matrix((len(group_t), len(group_c)))


for ind, v in group_tc.iteritems():
    ppmi = max(math.log((68000317 * v) / (group_t[ind[0]] * group_c[ind[1]])), 0)
    matrix_x[group_t.index.get_loc(ind[0]), group_c.index.get_loc(ind[1])] = ppmi

#Check sparse matrix
print('matrix_x Shape:', matrix_x.shape)
print('matrix_x Number of non-zero entries:', matrix_x.nnz)
print('matrix_x Format:', matrix_x.getformat())

io.savemat('084.matrix_x.mat', {'x': matrix_x})

Answer commentary

I'm using pandas' read_pickle function to read the file saved in the previous knock. It can read the zip-compressed file as it is; however, the decompression adds to the processing time (the whole function takes about 13 seconds). After reading, records with fewer than 10 co-occurrences are thrown away. I didn't want to keep the whole unfiltered data in memory alongside the filtered result, so I wrapped the reading in the function read_tc.

# Read the co-occurrence counts of target word t and context word c, and drop combinations that co-occur 9 times or fewer
def read_tc():
    group_tc = pd.read_pickle('./083_group_tc.zip')
    return group_tc[group_tc > 9]

The remaining two files are read as they are, since no rows need to be truncated by a count threshold.

group_t = pd.read_pickle('./083_group_t.zip')
group_c = pd.read_pickle('./083_group_c.zip')

Here scipy is used to create the sparse matrix variable matrix_x, sized as the number of target words times the number of context words. There are several sparse matrix formats; since the elements are filled in one by one, the LIL (list of lists) format is used, and the matrix is created with the lil_matrix function.

matrix_x = sparse.lil_matrix((len(group_t), len(group_c)))
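As a side note, LIL is convenient for element-by-element assignment, while formats such as CSR are better suited to arithmetic once the matrix is filled. A minimal sketch of the conversion, not part of the answer program:

# Hypothetical extra step: convert to CSR after assignment is finished, if arithmetic is needed later
matrix_x_csr = matrix_x.tocsr()
print(matrix_x_csr.getformat())  # -> csr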

The following is the main part of this knock: calculate the PPMI for each pair and set it in the sparse matrix. The literal 68000317 is the value of N obtained in the previous knock.

for ind, v in group_tc.iteritems():
    ppmi = max(math.log((68000317 * v) / (group_t[ind[0]] * group_c[ind[1]])), 0)
    matrix_x[group_t.index.get_loc(ind[0]), group_c.index.get_loc(ind[1])] = ppmi

I also tried expanding the PPMI calculation into a sum of logarithms, as in the expression below, referring to the article "100 Language Processing Knock 2015 Edition Chapter 9 Revisited (1)" (LOG_N was computed in advance). It did not get faster; it was actually rather slow, so I rejected it (is my approach wrong?).

ppmi = max(LOG_N + math.log(v) - math.log(group_t[ind[0]]) - math.log(group_c[ind[1]]), 0)
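Incidentally, if the Python-level loop itself turns out to be the bottleneck, a vectorized rework could look like the sketch below. This is my own sketch, assuming group_tc has a (t, c) MultiIndex as the loop above suggests, and it builds a CSR matrix directly instead of the LIL matrix used in the answer program.

import numpy as np
from scipy import sparse

# Labels of target and context words for every surviving (t, c) pair
t_labels = group_tc.index.get_level_values(0)
c_labels = group_tc.index.get_level_values(1)

# PPMI for all pairs at once (68000317 is N from the previous knock)
values = np.maximum(
    np.log(68000317 * group_tc.values / (group_t[t_labels].values * group_c[c_labels].values)), 0)

# Map word labels to row/column positions and assemble the sparse matrix in one shot
rows = group_t.index.get_indexer(t_labels)
cols = group_c.index.get_indexer(c_labels)
matrix_x = sparse.csr_matrix((values, (rows, cols)), shape=(len(group_t), len(group_c)))
matrix_x.eliminate_zeros()  # drop entries whose PPMI was clipped to 0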

Check the created sparse matrix.

print('matrix_x Shape:', matrix_x.shape)
print('matrix_x Number of non-zero entries:', matrix_x.nnz)
print('matrix_x Format:', matrix_x.getformat())

The following information is output. The second line is the number of non-zero entries: about 450,000 out of the roughly 160 billion (400,000 x 400,000) possible elements, so the density is far below 0.1%.

matrix_x Shape: (388836, 388836)
matrix_x Number of non-zero entries: 447875
matrix_x Format: lil
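For reference, the density could also be printed directly; this line is my own addition and not in the original notebook:

# Fraction of non-zero elements in the matrix
print('matrix_x Density:', matrix_x.nnz / (matrix_x.shape[0] * matrix_x.shape[1]))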

Finally, save the matrix. The extension is "mat", a format that can also be used from MATLAB / Octave. It is the format used in the exercises of the famous Coursera machine learning introductory online course. If you are interested, see ["Coursera Machine Learning Introductory Online Course Cheat Sheet (Recommended for Humanities)"](https://qiita.com/FukuharaYohei/items/b2143413063376e97948).

io.savemat('084.matrix_x.mat', {'x': matrix_x})
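For completeness, the saved file can be read back with scipy.io.loadmat. The check below is my own addition; note that, as far as I know, loadmat returns the sparse matrix in CSC format rather than LIL.

from scipy import io

# The key 'x' matches the dict passed to savemat above
loaded = io.loadmat('084.matrix_x.mat')['x']
print(loaded.shape, loaded.getformat())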

Tips: Memory used for variables

The memory used for each variable was like this.

| variable | memory | Compressed file size |
|---|---|---|
| group_c | 40MB | 3MB |
| group_t | 40MB | 3MB |
| group_tc | 64MB | 104MB |

The memory usage was checked by inserting the following code, copied from the article "Find and delete memory-hungry variables on Jupyter (IPython)".

print("{}{: >25}{}{: >10}{}".format('|','Variable Name','|','Memory','|'))
print(" ------------------------------------ ")
for var_name in dir():
    if not var_name.startswith("_") and sys.getsizeof(eval(var_name)) > 10000: #Arrange only here
        print("{}{: >25}{}{: >10}{}".format('|',var_name,'|',sys.getsizeof(eval(var_name)),'|'))
