[PYTHON] A little niche feature introduction of faiss

Hello, this is day 23 of the ABEJA Advent Calendar.

Introduction

Do you know a library called faiss? Facebook Resarch's neighborhood search library is very easy to use and has a lot of tutorials, so I use it frequently for both business and hobbies. I will. However, such a convenient library also has its weaknesses. When I try to do something a little niche, I just say, "Well, explain the details and get a feel for the test code (-д ☆)." Screenshot_2019-12-23 facebookresearch faiss.png I learned how to use it by actually reading the test code and sometimes reading the C ++ code of the main body, so I will write it with a reminder for myself in the future.

A little niche feature introduction

Acquisition of features of faiss

In rare cases, you want to get the features that faiss holds, rather than searching for them. For example, with faiss.IndexFlatL2, the feature amount itself can be obtained from the attribute xb. You can also convert it to numpy.ndarray by using the function faiss.vector_float_to_array.

>>> d = 2
>>> nb = 10
>>> xb = np.random.random((nb, d)).astype(np.float32)
>>>
>>> index = faiss.IndexFlatL2(d)
>>> index.add(xb)
>>>
>>> index.xb
<faiss.swigfaiss.FloatVector; proxy of <Swig Object of type 'std::vector< float > *' at 0x7f7ff26dede0> >
>>> faiss.vector_float_to_array(index.xb)
array([0.70434606, 0.8172881 , 0.27514696, 0.04918063, 0.16584012,
       0.58303493, 0.83627784, 0.7318148 , 0.91633004, 0.16084996,
       0.6760264 , 0.65586466, 0.45432937, 0.35858378, 0.0895841 ,
       0.3424391 , 0.6606455 , 0.7392444 , 0.07704416, 0.13714503],
      dtype=float32)

You can check from which attribute the feature itself can be obtained from the header such as faiss / IndexFlat.h.

Numpy.ndarray=>float * x` conversion

[faiss / c_api /] to use swig interface functions not implemented in faiss / python / faiss.py If you look at IndexFlat_c.h](https://github.com/facebookresearch/faiss/blob/master/c_api/IndexFlat_c.h), you may encounter float * x. If you use faiss.swig_ptr for numpy.ndarray, you can convert it to a pointer of swig, so you can use such a little niche function.

>>> x = np.random.random(10).astype(np.float32)
>>> faiss.swig_ptr(x)
<Swig Object of type 'faiss::IndexReplicasTemplate< faiss::Index >::component_t *' at 0x7fe8a57c3b10>

Handle subspace

Until now, many basic functions have been introduced, but from now on, it will be a little more advanced. About the handling of subspace in faiss. First is the distance calculation in the subspace. You can use compute_distance_subset to do the following:

>>> def compute_distance_subset(index, xq, subset):
...     n, _ = xq.shape
...     _, k = subset.shape
...     distances = np.empty((n, k), dtype=np.float32)
...     index.compute_distance_subset(
...             n, faiss.swig_ptr(xq),
...             k, faiss.swig_ptr(distances),
...             faiss.swig_ptr(subset)
...     )
...     return distances
... 
>>>
>>> d = 2
>>> nb = 10000
>>> nq = 10
>>> k = 5
>>> xb = np.random.random((nb, d)).astype(np.float32)
>>> xq = np.random.random((nq, d)).astype(np.float32)
>>> subset = np.random.choice(range(nb), (nq, k))
>>> 
>>> index = faiss.IndexFlatL2(d)
>>> index.add(xb)
>>> 
>>> compute_distance_subset(index, xq, subset)
array([[0.04731181, 0.1585833 , 0.4276843 , 0.02083743, 0.14153683],
       [0.55289364, 0.19499591, 0.24127454, 0.16293366, 0.02044217],
       [0.61750704, 0.48981428, 0.51042193, 0.12334089, 0.55514073],
       [0.5959296 , 0.6604827 , 0.20945217, 0.10136123, 0.01619768],
       [0.13882631, 0.16818088, 0.01572821, 0.17454663, 0.03992677],
       [0.46265444, 0.70609426, 0.49902472, 0.730565  , 0.09248901],
       [1.1638596 , 1.1041545 , 0.73789394, 0.60920525, 0.21328084],
       [0.02405633, 0.00557276, 0.6880306 , 0.821055  , 0.0421453 ],
       [0.08726364, 0.33441633, 0.15067156, 0.28792596, 0.30785137],
       [0.04219329, 0.747885  , 0.01912764, 0.19305223, 0.51132184]],
      dtype=float32)

Note that compute_distance_subset only calculates the distance of the subset, and does not calculate the K-nearest neighbors.

Next is how to create the subspace itself. This can be achieved by using copy_subset_to. If you look at Splitting and merging indexes, you can roughly understand it, so I will write only the notes. The first point is written in the document, but it can only be used with ʻIndex IVF`. The second point is that there are some difficulties in how to create a subspace. As you can see from the Source Code, only continuous and periodic copies are supported.

faiss/IndexIVF.h


/** copy a subset of the entries index to the other index
 *
 * if subset_type == 0: copies ids in [a1, a2)
 * if subset_type == 1: copies ids if id % a1 == a2
 * if subset_type == 2: copies inverted lists such that a1
 *                      elements are left before and a2 elements are after
 */
virtual void copy_subset_to (IndexIVF & other, int subset_type,
                             idx_t a1, idx_t a2) const;

Use your own identity system

By default, consecutive IDs are assigned, but you can use your own ID system by using ʻIndexIDMap`. If you look at Faiss ID mapping, you can see it, but the unique ID is as follows. You can use the system.

>>> d = 2
>>> nb = 10000
>>> nq = 10
>>> k = 2
>>> xb = np.random.random((nb, d)).astype(np.float32)
>>> xq = np.random.random((nq, d)).astype(np.float32)
>>> ids = np.arange(nb) * 10
>>>
>>> index = faiss.IndexFlatL2(d)
>>> custom_index = faiss.IndexIDMap(index)
>>> custom_index.add_with_ids(xb, ids)
>>> custom_index.search(xq, k)
(array([[1.97887421e-05, 2.86698341e-05],
       [1.38282776e-05, 6.81877136e-05],
       [4.52995300e-06, 1.12056732e-05],
       [8.55922699e-05, 9.26256180e-05],
       [1.41859055e-05, 1.38044357e-04],
       [2.40206718e-05, 4.58657742e-05],
       [6.55651093e-06, 3.46302986e-05],
       [5.24520874e-06, 1.35898590e-05],
       [2.90870667e-05, 3.90410423e-05],
       [2.38418579e-07, 2.86102295e-05]], dtype=float32), array([[38210, 66060],
       [51500, 97890],
       [17100, 97780],
       [42300, 51430],
       [ 3790, 63660],
       [19050, 26470],
       [22070, 45900],
       [39140,  4190],
       [10040,  7850],
       [14390, 48690]]))

There is a caveat here. Since faiss.IndexIDMap only maps the original index and ID system, if you are using an older version of faiss, you will get a segmentation fault in the situation where the index can be GC as shown below. The latest version of faiss (1.5.3) seems to solve this problem.

>>> index = faiss.IndexIDMap(faiss.IndexFlatL2(d))
>>> index.add_with_ids(xb, ids)
Segmentation fault (core dumped)

By the way, there is not much use, but you can also search for the original index.

>>> index.search(xq, k)
(array([[1.97887421e-05, 2.86698341e-05],
       [1.38282776e-05, 6.81877136e-05],
       [4.52995300e-06, 1.12056732e-05],
       [8.55922699e-05, 9.26256180e-05],
       [1.41859055e-05, 1.38044357e-04],
       [2.40206718e-05, 4.58657742e-05],
       [6.55651093e-06, 3.46302986e-05],
       [5.24520874e-06, 1.35898590e-05],
       [2.90870667e-05, 3.90410423e-05],
       [2.38418579e-07, 2.86102295e-05]], dtype=float32), array([[3821, 6606],
       [5150, 9789],
       [1710, 9778],
       [4230, 5143],
       [ 379, 6366],
       [1905, 2647],
       [2207, 4590],
       [3914,  419],
       [1004,  785],
       [1439, 4869]]))

Summary

I introduced a little niche function of faiss, which is a neighborhood search library of Facebook Resarch, with a reminder to myself. It seems that it will be a niche article halfway and who will read it ... but I hope I can play the role of one person. I will continue to add it if I try a little niche function of faiss.

Recommended Posts

A little niche feature introduction of faiss
A little scrutiny of pandas 1.0 and dask
Introduction of new voice feature extraction library Surfboard
Introduction of PyGMT
Introduction of cymel
Introduction of Python
Add a list of numpy library functions little by little --a
[PyTorch] A little understanding of CrossEntropyLoss with mathematical formulas
[python] [Gracenote Web API] A little customization of pygn
Introduction of trac (Windows + trac 1.0.10)
Introduction of ferenOS 1 (installation)
Introduction of Virtualenv wrapper
Make a Linux version of OpenSiv3D with find_package a little easier
I just changed the sample source of Python a little.
Add a list of numpy library functions little by little --c
[Introduction to AWS] A memorandum of building a web server on AWS
I tried a little bit of the behavior of the zip function