This article is the day-15 entry of the "Machine Learning Advent Calendar 2015". Here is a brief summary of the sparse matrix (SparseMatrix), which is often used when implementing machine learning.
SparseMatrix
Unlike the usual DenseMatrix, a SparseMatrix is used when the number of non-zero values is extremely small relative to the length of the vector. Because only the non-zero entries are stored, a SparseMatrix beats a DenseMatrix in both memory usage and computation time.
These benefits really pay off when the matrix or vector is large and the number of non-zero values is small, for example when categorical variables are encoded in 1-of-K (one-hot) form, as sketched below. This time, for simplicity, we will look at the various libraries using vector operations rather than matrices.
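A minimal sketch of the 1-of-K case (toy data of my own, not from the article): a categorical feature with 100,000 possible values costs 100,000 stored floats per sample as a dense array, while the sparse representation stores only the single non-zero entry of each row.

import numpy as np
import scipy.sparse as sp

# Hypothetical example: three samples of a categorical feature with
# 100,000 possible values, encoded as 1-of-K (one-hot) rows.
n_categories = 100000
observed = [3, 42, 99999]              # category index of each sample
rows = np.arange(len(observed))
cols = np.array(observed)
data = np.ones(len(observed))
X = sp.csr_matrix((data, (rows, cols)), shape=(len(observed), n_categories))

print(X.shape)   # (3, 100000)
print(X.nnz)     # only 3 stored values instead of 300,000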
scipy.sparse
Probably the most famous of the libraries covered here. It can also be used with scikit-learn, the best-known Python machine learning library. For example, the documentation of sklearn.cluster.KMeans states:
X : array-like or sparse matrix
so a SparseMatrix can be passed as X. There are several ways to construct a SparseMatrix; see the official documentation (http://docs.scipy.org/doc/scipy/reference/sparse.html) for details, and the short sketch below.
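A minimal sketch of one way to do it (toy data of my own): build the matrix incrementally in LIL format, convert it to CSR, and pass it straight to sklearn.cluster.KMeans.

import scipy.sparse as sp
from sklearn.cluster import KMeans

# Build incrementally in LIL format, which is efficient for assignment.
X = sp.lil_matrix((6, 10000))
X[0, 1] = 1.0
X[1, 1] = 1.0
X[2, 2] = 1.0
X[3, 9999] = 1.0
X[4, 9998] = 1.0
X[5, 9999] = 1.0

# Convert to CSR, which is efficient for arithmetic, before fitting.
km = KMeans(n_clusters=2, random_state=0).fit(X.tocsr())
print(km.labels_)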
Examples
Let's see how much difference there is between Dense and Sparse in the actual calculation.
>>> import scipy.sparse as sp
>>> import numpy as np
>>> # Vector of length 1,000
>>> x1=np.zeros(10**3)
>>> x1[10]=5; x1[100]=10;
>>> y1=sp.lil_matrix(x1).tocsr()
>>> %time x1 * x1  # element-wise product
21 μs
>>> %time y1.multiply(y1)
249 μs
>>> # Vector of length 1,000,000
>>> x2=np.zeros(10**6)
>>> x2[10]=5; x2[100]=10;
>>> y2=sp.lil_matrix(x2).tocsr()
>>> %time x2 * x2  # element-wise product
4.15 ms
>>> %time y2.multiply(y2)
250 μs
As the example above shows, the difference in computation time becomes obvious as the vector length grows. The example used the element-wise product; taking the sum as well (i.e., the inner product) is also faster than with a DenseVector, as long as the operation is actually performed sparsely:
#In this case, Sparse Operation is not performed
%time y2 * y2.T
3.41 ms
#This is the Sparse Operation
%time y2.multiply(y2).sum()
447 µs
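To see the memory side of the benefit mentioned at the beginning, compare the footprint of the dense array with the three arrays a CSR matrix actually stores (a rough check; exact numbers depend on the index dtype):

print(x2.nbytes)   # 8,000,000 bytes: 10**6 float64 values
print(y2.data.nbytes + y2.indices.nbytes + y2.indptr.nbytes)
# a few dozen bytes: only the two non-zero entries and their index arrays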
Theano
The theano.sparse package can be loaded and used. The csr and csc formats, as they are called in scipy.sparse, are provided.
>>> from theano import sparse
It is simple, but let's try the same example as above with Theano.
>>> import theano
>>> from theano import sparse
>>> x = sparse.csr_matrix(name='x', dtype='float64')
>>> f = theano.function([x], sparse.basic.mul(x, x))
>>> %time f(y1)
312 µs
Of course, when operating with ordinary vectors,
>>> import theano.tensor as T
>>> x = T.dvector(name='x')
>>> f = theano.function([x], x * x)
>>> %time f(x1)
4.74 ms
It takes noticeably longer to compute. This is worth keeping in mind for a library like Theano, where functions can be composed flexibly and gradients are easy to compute; a small sketch of that follows.
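As a rough sketch of that point (my own example, assuming theano.sparse.dot handles the sparse-dense product and that Theano can differentiate through it), here is a cost defined on a sparse input together with its gradient with respect to a dense weight vector:

import theano
import theano.tensor as T
from theano import sparse

# Hypothetical example: X is a sparse design matrix, w a dense weight vector.
X = sparse.csr_matrix(name='X', dtype='float64')
w = T.dvector('w')

y_hat = sparse.dot(X, w)      # sparse-dense dot product -> dense vector
cost = T.sum(y_hat ** 2)
grad_w = T.grad(cost, w)      # gradient with respect to the dense weights

f = theano.function([X, w], [cost, grad_w])
# e.g. f(y1, np.ones(10**3)) with the CSR matrix from the earlier example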
TensorFlow
The much-talked-about TensorFlow also provides a SparseTensor.
>>> import tensorflow as tf
>>> tf.SparseTensor(values=[1, 2], indices=[[0, 0], [1, 2]], shape=[3, 4])
[[1, 0, 0, 0]
[0, 0, 2, 0]
[0, 0, 0, 0]]
I looked into this a little, but I have not yet worked out the details of how to use it. (I will update this article at a later date.)
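For reference, a minimal sketch of my own, assuming a recent TensorFlow (2.x), where the constructor argument is now named dense_shape and tf.sparse.to_dense materializes the dense tensor shown above:

import tensorflow as tf

# The same sparse tensor as above, written with the TF 2.x argument name.
st = tf.SparseTensor(indices=[[0, 0], [1, 2]], values=[1, 2], dense_shape=[3, 4])

# Materialize it as an ordinary dense tensor (zeros everywhere else).
print(tf.sparse.to_dense(st))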
This has been a brief overview of SparseMatrix. I am still studying, so I may well have gotten things wrong; if you have any suggestions, I would appreciate a comment. Sparse matrices are particularly effective for machine learning implementations such as online learning, so when you implement something, why not consider supporting a SparseMatrix alongside a DenseMatrix?