[PYTHON] 100 Language Processing Knock-85 (Truncated SVD): Dimensional compression by principal component analysis

This is the record of task 85, "Dimensionality reduction by principal component analysis," from Language Processing 100 Knock 2015. It compresses a matrix of roughly 400,000 dimensions down to 300 dimensions. This time I use truncated singular value decomposition (SVD) instead of principal component analysis: scikit-learn's PCA cannot take a sparse matrix as input, so I compromised on the grounds that both are forms of dimensionality reduction. Principal component analysis is covered in week 8 of the well-known Coursera Machine Learning online course. If you are interested in the course, please see the article ["Coursera Machine Learning Introductory Online Course Cheat Sheet (Recommended for Humanities)"](https://qiita.com/FukuharaYohei/items/b2143413063376e97948).

Reference links

| Link | Remarks |
|:--|:--|
| 085.Dimensional compression by principal component analysis.ipynb | GitHub link to the answer program |
| 100 amateur language processing knocks: 85 | The blog I always rely on for the 100 language processing knocks |
| TruncatedSVD | Official help for TruncatedSVD |
| About the relationship between PCA and SVD | Difference between principal component analysis and singular value decomposition, part 1 |
| Show the relationship between PCA and SVD | Difference between principal component analysis and singular value decomposition, part 2 |

environment

| Type | Version | Contents |
|:--|:--|:--|
| OS | Ubuntu 18.04.01 LTS | Running in a virtual machine |
| pyenv | 1.2.15 | I use pyenv because I sometimes use multiple Python environments |
| Python | 3.6.9 | python 3.6.9 on pyenv; there is no deep reason for not using the 3.7 or 3.8 series. Packages are managed with venv |

In the above environment, I am using the following additional Python packages. Just install with regular pip.

| Type | Version |
|:--|:--|
| matplotlib | 3.1.1 |
| numpy | 1.17.4 |
| pandas | 0.25.3 |
| scipy | 1.4.1 |
| scikit-learn | 0.21.3 |

Task

Chapter 9: Vector Space Method (I)

enwiki-20150112-400-r10-105752.txt.bz2 is the text of 105,752 articles, sampled randomly at a rate of 1/10 from the English Wikipedia articles as of January 12, 2015 that consist of more than 400 words, compressed in bzip2 format. Using this text as a corpus, we want to learn vectors (distributed representations) that express the meaning of words. In the first half of Chapter 9, the process of learning word vectors is implemented as a series of steps, applying principal component analysis to a word-context co-occurrence matrix built from the corpus. In the second half of Chapter 9, the learned word vectors (300 dimensions) are used to compute word similarity and solve analogies.

Note that a straightforward implementation of problem 83 requires a large amount of main memory (about 7 GB). If you run out of memory, devise a workaround or use the 1/100 sampled corpus [enwiki-20150112-400-r100-10576.txt.bz2](http://www.cl.ecei.tohoku.ac.jp/nlp100/data/enwiki-20150112-400-r100-10576.txt.bz2).

This time * "1/100 sampling corpus [enwiki-20150112-400-r100-10576.txt.bz2](http://www.cl.ecei.tohoku.ac.jp/nlp100/data/enwiki-20150112-" 400-r100-10576.txt.bz2) ”* is used.

85. Dimensional compression by principal component analysis

Apply principal component analysis to the word-context matrix obtained in problem 84 and compress the word meaning vectors to 300 dimensions.

Answer

Answer program [085. Dimensional compression by principal component analysis.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/09.%E3%83%99%E3%82%AF%E3%83%88%E3%83%AB%E7%A9%BA%E9%96%93%E6%B3%95%20(I)/085.%E4%B8%BB%E6%88%90%E5%88%86%E5%88%86%E6%9E%90%E3%81%AB%E3%82%88%E3%82%8B%E6%AC%A1%E5%85%83%E5%9C%A7%E7%B8%AE.ipynb)

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy import io
from sklearn.decomposition import TruncatedSVD

# Load the sparse word-context matrix saved in problem 84
matrix_x = io.loadmat('084.matrix_x.mat')['x']

# Confirm the loaded matrix
print('matrix_x Shape:', matrix_x.shape)
print('matrix_x Number of non-zero entries:', matrix_x.nnz)
print('matrix_x Format:', matrix_x.getformat())

# Dimensionality reduction to 300 dimensions with truncated SVD
svd = TruncatedSVD(300)
matrix_x300 = svd.fit_transform(matrix_x)

print(type(matrix_x300))
print('matrix_x300 Shape:', matrix_x300.shape)

# Cumulative explained variance ratio
print('Explained Variance Ratio Sum:', svd.explained_variance_ratio_.sum())
ev_ratio = svd.explained_variance_ratio_
ev_ratio = np.hstack([0, ev_ratio.cumsum()])
plt.plot(ev_ratio)
plt.show()

# Save the dense 300-dimensional matrix in compressed form
np.savez_compressed('085.matrix_x300.npz', matrix_x300)
```

Answer commentary

Load the mat-format file saved in the previous knock.

```python
matrix_x = io.loadmat('084.matrix_x.mat')['x']

# Confirm the loaded matrix
print('matrix_x Shape:', matrix_x.shape)
print('matrix_x Number of non-zero entries:', matrix_x.nnz)
print('matrix_x Format:', matrix_x.getformat())
```

Looking at the output of the code above, both the shape and the number of non-zero elements are the same as last time. However, the format is csc even though I saved it as lil. Is that just how it works? I will proceed without worrying about it.

```
matrix_x Shape: (388836, 388836)
matrix_x Number of non-zero entries: 447875
matrix_x Format: csc
```
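MATLAB's .mat format stores sparse matrices in compressed sparse column layout, which is presumably why scipy.io returns a CSC matrix regardless of the format it was saved from. The conversion is not needed for this knock, but if the lil format were really required, a minimal sketch (continuing from the code above) would be:

```python
# Not needed for this knock: convert the loaded CSC matrix back to lil format
matrix_x_lil = matrix_x.tolil()
print('Converted format:', matrix_x_lil.getformat())  # -> lil
```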

This is the main dimensionality reduction step. Since it just calls the TruncatedSVD class, there is nothing difficult here. It took about 8 minutes.

```python
svd = TruncatedSVD(300)
matrix_x300 = svd.fit_transform(matrix_x)

print(type(matrix_x300))
print('matrix_x300 Shape:', matrix_x300.shape)
```

Checking the return value, it is a numpy.ndarray. That makes sense: after the dimensionality reduction the matrix is dense, so a dense array rather than a sparse matrix is correct.

```
<class 'numpy.ndarray'>
matrix_x300 Shape: (388836, 300)
```
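TruncatedSVD(300) above leaves everything except n_components at its default. As a minimal sketch (not part of the original answer), the randomized solver can be made reproducible with random_state, and n_iter can be raised if more accurate singular values are needed; the values below are illustrative only.

```python
# Sketch only: the same decomposition with the randomized solver pinned down
svd = TruncatedSVD(n_components=300, algorithm='randomized', n_iter=5, random_state=42)
matrix_x300 = svd.fit_transform(matrix_x)
```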

Now let's see how much of the variance is retained. I referred to the article "Principal component analysis with scikit-learn (calculate cumulative contribution rate)".

```python
print('Explained Variance Ratio Sum:', svd.explained_variance_ratio_.sum())
ev_ratio = svd.explained_variance_ratio_
ev_ratio = np.hstack([0, ev_ratio.cumsum()])
plt.plot(ev_ratio)
plt.show()
```

About 30%. That's low... Would it be better to increase the number of dimensions? I had learned that ["it is desirable to exceed 99%, though 95% or 90% are sometimes used as thresholds"](https://qiita.com/FukuharaYohei/items/7a71be58818719cdf73c#232-choosing-the-number-of-principal-components)...

```
Explained Variance Ratio Sum: 0.31949196039604355
```

This is a line graph of the cumulative variance retention ratio per principal component.

(Figure: cumulative explained variance ratio plot)
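If you wanted to know how many components a given retention target would require, a minimal sketch (my own addition, continuing from the code above and limited to the 300 components already computed) using the cumulative sum looks like this:

```python
# Sketch: how many of the 300 computed components reach a target retention ratio
target = 0.9  # illustrative threshold
cumulative = svd.explained_variance_ratio_.cumsum()
if cumulative[-1] >= target:
    needed = np.searchsorted(cumulative, target) + 1
    print(f'{needed} components retain {target:.0%} of the variance')
else:
    print(f'All 300 components retain only {cumulative[-1]:.1%}')
```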

The file is saved in compressed form with np.savez_compressed to keep it small. Even so, the file size is 118 MB. The sparse matrix saved last time was only 7 MB, so despite the dimensionality reduction the file grew because the matrix is now dense. Incidentally, the in-memory size before saving is 933 MB, so compression shrinks it considerably. On the other hand, the save time, which was 9 seconds, increased to 36 seconds. For saving, I referred to the article "Compare the difference in file size depending on the serialization method of numpy array".

```python
np.savez_compressed('085.matrix_x300.npz', matrix_x300)
```
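Because the array is passed positionally, np.savez_compressed stores it under the default key arr_0. A minimal sketch of reading it back in a later knock (an assumption on my part, not part of this answer):

```python
# Sketch: reload the compressed 300-dimensional matrix saved above
matrix_x300 = np.load('085.matrix_x300.npz')['arr_0']
print(matrix_x300.shape)  # (388836, 300)
```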

bonus

Tips: Can't principal component analysis be used?

I investigated whether "principal component analysis" could be used literally as the assignment describes. The bottleneck is whether a sparse matrix can be used as input. The Stack Overflow question "Performing PCA on large sparse matrix by using sklearn" says that sparse matrices are not supported. The pull request "[MRG] Implement randomized PCA #12841" appears to add support for sparse input, but it is still open. I also thought it might work to convert to a dense matrix and learn a little at a time to keep memory within bounds, but even if it could be done it would take an enormous amount of time, so I gave up...
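As a small illustration of that bottleneck (my own sketch, not taken from the references above), scikit-learn's PCA raises a TypeError when handed a scipy sparse matrix, while TruncatedSVD accepts it directly:

```python
# Sketch: PCA rejects sparse input, TruncatedSVD accepts it (toy-sized matrix)
import scipy.sparse
from sklearn.decomposition import PCA, TruncatedSVD

toy = scipy.sparse.random(100, 50, density=0.01, format='csc', random_state=0)

try:
    PCA(n_components=10).fit(toy)
except TypeError as error:
    print('PCA:', error)  # PCA does not support sparse input

reduced = TruncatedSVD(n_components=10).fit_transform(toy)
print('TruncatedSVD output shape:', reduced.shape)  # (100, 10)
```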

Tips: Trying 450 dimensions

Feeling that a variance retention of about 30% was low, I increased the number of dimensions. At first I tried 1000 dimensions, but it failed with an out-of-memory error... With 600 dimensions it still had not finished after 30 minutes, which was a nuisance, so I stopped partway through. 450 dimensions took 18 minutes, and the variance retention rose to 38%, noticeably higher. The comparison looks like this:

| Dimensions | Processing time | Memory of return value (matrix_x300) | File size |
|:--|:--|:--|:--|
| 300 | 8 minutes | 0.9 GB | 118 MB |
| 450 | 18 minutes | 1.40 GB | 178 MB |
