[PYTHON] I tried to summarize SparseMatrix

This article is the 15th day article of "Machine Learning Advent Calendar 2015". Here is a brief summary of the "Sparse Matrix" that is often used when implementing Machine Learning.

SparseMatrix

Unlike the general DenseMatrix, SparseMatrix is often used when the number of Non-zero Values (non-zero values) is extremely small relative to the length of the vector. SparseMatrix is superior to DenseMatrix in the following ways:

These benefits are very effective when the size of the matrix or vector is large enough and the number of Non-zero Values is small. For example, when multiple mass variables are represented by 1-of-K. This time, for the sake of simplicity, we will look at various libraries using vector operations instead of matrices.

scipy.sparse

Probably the most famous one described here. It is also available in scikit-learn, the most famous Python machine learning library. For example, in the case of sklearn.clustering.KMeans, it is written here. As you are,

X : array-like or sparse matrix

It is possible to use SparseMatrix for X like this. There are many ways to define a SparseMatrix. See the Official Documentation (http://docs.scipy.org/doc/scipy/reference/sparse.html) for more information.

Examples

Let's see how much difference there is between Dense and Sparse in the actual calculation.

>>> import scipy.sparse as sp
>>> import numpy as np
>>> #Length 1,000 vectors
>>> x1=np.zeros(10**3)
>>> x1[10]=5; x1[100]=10;
>>> y1=sp.lil_matrix(x1).tocsr()
>>> %time x1 * x1 %Element product
21 μs

>>> %time y1.multiply(y1)
249 μs

>>> #Length 1,000,000 vectors
>>> x2=np.zeros(10**6)
>>> x2[10]=5; x2[100]=10;
>>> y2=sp.lil_matrix(x2).tocsr()
>>> %time x2 * x2 %Element product
4.15 ms

>>> %time y2.multiply(y2)
250 μs

As shown in the above example, the calculation time is obviously different as the vector length increases. By the way, in the above example, it is an element product, but in the case of addition, it is faster than DenseVector.

important point

#In this case, Sparse Operation is not performed
%time y2 * y2.T
3.41 ms

#This is the Sparse Operatio
%time y2.multiply(y2).sum()
447 µs

Other libraries besides scipy

Theano You can load and use theano.sparse package. Csr and csc as they are called in scipy.sparse are prepared.

>>> from theano import sparse

It's easy, but I'll try the same example above with Theano.

>>> import theano
>>> from theano import sparse
>>> x = sparse.csr_matrix(name='x', dtype='float64')
>>> f = theano.function([x], sparse.basic.mul(x, x))
>>> %time f(y1)
312 µs

Of course, when operating with ordinary vectors,

>>> import theano.tensor as T
>>> x = T.dvector(name='x')
>>> f = theano.function([x], x * x)
>>> %time f(x1)
4.74 ms

It takes a long time to calculate. I think it will be useful to remember in a library like Theano where functions can be used flexibly and differentiation can be easily calculated.

However, as a reminder

TensorFlow We provide the much-talked-about TensorFlowSparseTensor.

>>> import tensorflow as tf
>>> tf.SparseTensor(values=[1, 2], indices=[[0, 0], [1, 2]], shape=[3, 4])
[[1, 0, 0, 0]
 [0, 0, 2, 0]
 [0, 0, 0, 0]]

I did some research on this, but I didn't know in detail how to use it. (I will update it at a later date)

in conclusion

It's easy, but I've summarized SparseMatrix. I'm still studying, so it's okay if I don't understand, so if you have any suggestions, I'd appreciate it if you could comment. When implementing machine learning, especially for online learning, it is effective, so why not consider implementing SparseMatrix as well as DenseMatrix when implementing it! ??

Recommended Posts

I tried to summarize SparseMatrix
I tried to summarize the umask command
Python3 standard input I tried to summarize
I tried to summarize the graphical modeling.
I tried to debug.
I tried to paste
I tried to summarize Ansible modules-Linux edition
LeetCode I tried to summarize the simple ones
I tried to learn PredNet
I tried to implement PCANet
I tried to introduce Pylint
I tried to touch jupyter
I tried to implement StarGAN (1)
I tried to summarize how to use matplotlib of python
I tried to summarize the string operations of Python
I tried to implement Deep VQE
I tried to create Quip API
I tried to touch Python (installation)
I tried to implement adversarial validation
I tried Watson Speech to Text
I tried to touch Tesla's API
I tried to implement hierarchical clustering
I tried to implement Realness GAN
I tried to move the ball
I tried to estimate the interval.
[First COTOHA API] I tried to summarize the old story
I tried to summarize the code often used in Pandas
I tried to summarize the commands often used in business
[Machine learning] I tried to summarize the theory of Adaboost
I tried to summarize SQLAlchemy briefly (There is also TIPS)
I tried to summarize how to use the EPEL repository again
I tried to create a linebot (implementation)
[Linux] I tried to summarize the command of resource confirmation system
I tried to implement Autoencoder with TensorFlow
I tried to implement permutation in Python
I tried to create a linebot (preparation)
I tried to visualize AutoEncoder with TensorFlow
I tried scraping
I tried to recognize the wake word
I tried to get started with Hy
I tried PyQ
I tried to implement PLSA in Python 2
I tried to classify text using TensorFlow
I tried adding post-increment to CPython Implementation
I tried to summarize what was output with Qiita with Word cloud
I tried to implement ADALINE in Python
I tried to let optuna solve Sudoku
I tried to estimate the pi stochastically
I tried to summarize the commands used by beginner engineers today
I tried to touch the COTOHA API
I tried papermill
I tried to implement CVAE with PyTorch
I tried to summarize everyone's remarks on slack with wordcloud (Python)
I tried to make a Web API
I tried to solve TSP with QAOA
[Python] I tried to calculate TF-IDF steadily
I tried to summarize the frequently used implementation method of pytest-mock
I tried to touch Python (basic syntax)
I tried django-slack
I tried Django
[Django-Extensions] Web development beginners tried to summarize Django-Extensions