In Numpy, using loops is slow, so various techniques are used. It's easy to write in vector or matrix format, but there are some that use broadcast. An example is the generation of a distance matrix. Given that the locations of multiple locations are stored in x
, how can we calculate the distance between them?
I've been familiar with Fortran for many years, so I want to write using loops.
do j = 1, n
do i = 1, n
r(i, j) = sqrt((x(i) - x(j))**2)
end do
end do
Translated literally into Python
for i in range(len(x)):
for j in range(len(x)):
r[i, j] = math.sqrt((x[i] - x[j])**2)
It will be. Even with ndarray, there is no difference if the length is about 10, but rather it is faster to keep the list.
Numpy's array operations (including products) are basically element-by-element. Operations on arrays of the same shape, such as 4x3 two-dimensional arrays, are performed on each element. For items with different sizes, such as a 4x3 2D array and a 1x3 1D array, the calculation is performed assuming that the rows of the 1D array are repeated so that the sizes are the same. This conversion is called "broadcast".
For the x
that stores the position, it seems that an array will be returned if the operation with the transpose is performed, but that is not the case. It will be regarded as the same shape and will be a list of 0.
To use broadcast to generate the desired distance matrix without loops, increase one dimension. Use np.newaxis
to increase the dimension. The same is true for None
. The length of the dimension for which np.newaxis
or None
is specified is 1. Although I have never used it, Fortran 95 has a built-in function called spread
.
np.sqrt((x - x[:, np.newaxis])**2)
It may be an operation of nx1 and 1xn.
np.sqrt((x[np.newaxis, :] - x[:, np.newaxis])**2)
In my environment, when measured with % timeit
for 10 points, the loop of ndarray was 69.1 μs, the loop of list was 54.1 μs, and the one that avoided the loop and increased the dimension of the matrix was 3.04 μs. At 1000 points, the loop of ndarray was 660 ms, and the one with increased matrix dimensions was 2.47 ms.
Using ...
is the same as arranging :
so that the number of dimensions is the dimension of the array. The Euclidean distance of a point in n-dimensional space can be calculated as follows.
np.sqrt(((x[..., np.newaxis, :] - x[..., :, np.newaxis])**2).sum(axis=0))
Recommended Posts