This article is the day-15 entry of the "Machine Learning Advent Calendar 2015". Here is a brief summary of the sparse matrix (SparseMatrix), which is often used when implementing machine learning.
SparseMatrix
Unlike the usual DenseMatrix, a SparseMatrix is used when the number of non-zero values is extremely small relative to the length of the vector. Because only the non-zero entries are stored, a SparseMatrix beats a DenseMatrix in both memory usage and computation time.
These benefits really pay off when the matrix or vector is large and the number of non-zero values is small, for example when categorical variables are encoded in 1-of-K (one-hot) form, as sketched below. This time, for simplicity, we will look at the various libraries using vector operations rather than matrices.
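A minimal sketch of the 1-of-K case (toy data of my own, not from the article): a categorical feature with 100,000 possible values costs 100,000 stored floats per sample as a dense array, while the sparse representation stores only the single non-zero entry of each row.

import numpy as np
import scipy.sparse as sp

# Hypothetical example: three samples of a categorical feature with
# 100,000 possible values, encoded as 1-of-K (one-hot) rows.
n_categories = 100000
observed = [3, 42, 99999]              # category index of each sample
rows = np.arange(len(observed))
cols = np.array(observed)
data = np.ones(len(observed))
X = sp.csr_matrix((data, (rows, cols)), shape=(len(observed), n_categories))

print(X.shape)   # (3, 100000)
print(X.nnz)     # only 3 stored values instead of 300,000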
scipy.sparse
Probably the most famous of the libraries covered here. It can also be used with scikit-learn, the best-known Python machine learning library. For example, the documentation of sklearn.cluster.KMeans states:
X : array-like or sparse matrix
so a SparseMatrix can be passed as X. There are several ways to construct a SparseMatrix; see the official documentation (http://docs.scipy.org/doc/scipy/reference/sparse.html) for details, and the short sketch below.
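A minimal sketch of one way to do it (toy data of my own): build the matrix incrementally in LIL format, convert it to CSR, and pass it straight to sklearn.cluster.KMeans.

import scipy.sparse as sp
from sklearn.cluster import KMeans

# Build incrementally in LIL format, which is efficient for assignment.
X = sp.lil_matrix((6, 10000))
X[0, 1] = 1.0
X[1, 1] = 1.0
X[2, 2] = 1.0
X[3, 9999] = 1.0
X[4, 9998] = 1.0
X[5, 9999] = 1.0

# Convert to CSR, which is efficient for arithmetic, before fitting.
km = KMeans(n_clusters=2, random_state=0).fit(X.tocsr())
print(km.labels_)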
Examples
Let's see how much difference there is between Dense and Sparse in the actual calculation.
>>> import scipy.sparse as sp
>>> import numpy as np
>>> # Vector of length 1,000
>>> x1=np.zeros(10**3)
>>> x1[10]=5; x1[100]=10;
>>> y1=sp.lil_matrix(x1).tocsr()
>>> %time x1 * x1  # element-wise product
21 μs
>>> %time y1.multiply(y1)
249 μs
>>> # Vector of length 1,000,000
>>> x2=np.zeros(10**6)
>>> x2[10]=5; x2[100]=10;
>>> y2=sp.lil_matrix(x2).tocsr()
>>> %time x2 * x2  # element-wise product
4.15 ms
>>> %time y2.multiply(y2)
250 μs
As the example above shows, the difference in computation time becomes obvious as the vector length grows. The example used the element-wise product; taking the sum as well (i.e., the inner product) is also faster than with a DenseVector, as long as the operation is actually performed sparsely:
#In this case, Sparse Operation is not performed
%time y2 * y2.T
3.41 ms
#This is the Sparse Operation
%time y2.multiply(y2).sum()
447 µs
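To see the memory side of the benefit mentioned at the beginning, compare the footprint of the dense array with the three arrays a CSR matrix actually stores (a rough check; exact numbers depend on the index dtype):

print(x2.nbytes)   # 8,000,000 bytes: 10**6 float64 values
print(y2.data.nbytes + y2.indices.nbytes + y2.indptr.nbytes)
# a few dozen bytes: only the two non-zero entries and their index arrays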
Theano
The theano.sparse package can be loaded and used. The csr and csc formats, as they are called in scipy.sparse, are provided.
>>> from theano import sparse
It is simple, but let's try the same example as above with Theano.
>>> import theano
>>> from theano import sparse
>>> x = sparse.csr_matrix(name='x', dtype='float64')
>>> f = theano.function([x], sparse.basic.mul(x, x))
>>> %time f(y1)
312 µs
Of course, when operating with ordinary vectors,
>>> import theano.tensor as T
>>> x = T.dvector(name='x')
>>> f = theano.function([x], x * x)
>>> %time f(x1)
4.74 ms
It takes noticeably longer to compute. This is worth keeping in mind for a library like Theano, where functions can be composed flexibly and gradients are easy to compute; a small sketch of that follows.
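As a rough sketch of that point (my own example, assuming theano.sparse.dot handles the sparse-dense product and that Theano can differentiate through it), here is a cost defined on a sparse input together with its gradient with respect to a dense weight vector:

import theano
import theano.tensor as T
from theano import sparse

# Hypothetical example: X is a sparse design matrix, w a dense weight vector.
X = sparse.csr_matrix(name='X', dtype='float64')
w = T.dvector('w')

y_hat = sparse.dot(X, w)      # sparse-dense dot product -> dense vector
cost = T.sum(y_hat ** 2)
grad_w = T.grad(cost, w)      # gradient with respect to the dense weights

f = theano.function([X, w], [cost, grad_w])
# e.g. f(y1, np.ones(10**3)) with the CSR matrix from the earlier example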
TensorFlow
The much-talked-about TensorFlow also provides a SparseTensor.
>>> import tensorflow as tf
>>> tf.SparseTensor(values=[1, 2], indices=[[0, 0], [1, 2]], shape=[3, 4])
[[1, 0, 0, 0]
[0, 0, 2, 0]
[0, 0, 0, 0]]
I looked into this a little, but I have not yet worked out the details of how to use it. (I will update this article at a later date.)
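For reference, a minimal sketch of my own, assuming a recent TensorFlow (2.x), where the constructor argument is now named dense_shape and tf.sparse.to_dense materializes the dense tensor shown above:

import tensorflow as tf

# The same sparse tensor as above, written with the TF 2.x argument name.
st = tf.SparseTensor(indices=[[0, 0], [1, 2]], values=[1, 2], dense_shape=[3, 4])

# Materialize it as an ordinary dense tensor (zeros everywhere else).
print(tf.sparse.to_dense(st))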
This has been a brief overview of SparseMatrix. I am still studying, so I may well have gotten things wrong; if you have any suggestions, I would appreciate a comment. Sparse matrices are particularly effective for machine learning implementations such as online learning, so when you implement something, why not consider supporting a SparseMatrix alongside a DenseMatrix?