import timeit
import numpy as np
import scipy.sparse
def getSparse(length, size, todense=False):
    # Build a one-hot (length x size) CSR matrix: row i has a single 1 in a
    # random column. np.random.random_integers is deprecated; use randint.
    indices = np.random.randint(0, size, length)
    response = scipy.sparse.csr_matrix(
        (np.ones(length, dtype=indices.dtype), indices, np.arange(length + 1)),
        shape=(length, size),
    )
    return response.todense() if todense else response
def testDense():
    x = np.dot(np.random.rand(300000).reshape(300, 1000), getSparse(1000, 300, True))

def testSparse():
    x = np.dot(np.random.rand(300000).reshape(300, 1000), getSparse(1000, 300, False))
print(timeit.timeit(testDense, number=1))
# 0.08102297782897949
print(timeit.timeit(testSparse, number=1))
# 30.572995901107788
I expected the dot operation to be faster with a sparse matrix, but it was horribly slow. In theory it would not be strange for the sparse version to be faster, so it seems that np.dot simply does not handle scipy.sparse operands efficiently.
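One likely explanation, sketched below: np.dot does not dispatch to scipy.sparse's multiplication, so it falls back to treating the sparse matrix as a generic Python object, while going through the sparse matrix's own operator (scipy sparse matrices support `@` on either side) keeps the computation sparse. The shapes mirror the ones in the question; the variable names are my own.

```python
import numpy as np
import scipy.sparse

dense = np.random.rand(300, 1000)

# One-hot (1000 x 300) CSR matrix, same construction as getSparse above.
indices = np.random.randint(0, 300, 1000)
onehot = scipy.sparse.csr_matrix(
    (np.ones(1000), indices, np.arange(1001)), shape=(1000, 300)
)

# Dispatches to scipy's sparse matmul instead of np.dot's object fallback.
fast = dense @ onehot

# Dense reference result for comparison.
reference = np.dot(dense, onehot.toarray())
assert np.allclose(fast, reference)
```

Timing `dense @ onehot` instead of `np.dot(dense, onehot)` should show the sparse path is no longer pathologically slow, though for a matrix this small and this dense-friendly it may still not beat the plain dense product.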
As an aside, what I really want to do is aggregate values along one dimension of an ndarray: for example, transforming data of shape (store x date) into (region x date). In my use case computing with the dense matrix was fine, but as more data is added this slowdown looks like it will become a problem.
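A minimal sketch of that aggregation, assuming a hypothetical store-to-region mapping: a one-hot (region x store) matrix multiplied from the left collapses the store axis into region totals, and `np.add.at` gives a pure-NumPy alternative that avoids sparse matrices entirely.

```python
import numpy as np
import scipy.sparse

n_stores, n_dates, n_regions = 6, 4, 2
data = np.arange(n_stores * n_dates, dtype=float).reshape(n_stores, n_dates)
store_to_region = np.array([0, 0, 1, 1, 0, 1])  # hypothetical mapping

# One-hot (region x store) matrix: row r has ones at the stores in region r.
onehot = scipy.sparse.csr_matrix(
    (np.ones(n_stores), (store_to_region, np.arange(n_stores))),
    shape=(n_regions, n_stores),
)
by_region = onehot @ data  # (region x date), computed via sparse matmul

# Pure-NumPy alternative: scatter-add rows of `data` into region buckets.
alt = np.zeros((n_regions, n_dates))
np.add.at(alt, store_to_region, data)
assert np.allclose(by_region, alt)
```

For a mapping this simple, `np.add.at` (or a groupby in pandas) may be the easier route; the sparse one-hot product mainly pays off when the same mapping matrix is reused across many multiplications.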