Autofaiss getting started

Information

This demo notebook automatically builds a Faiss KNN index with well-tuned similarity search parameters.

It selects the indexing parameters that achieve the highest recall under the given memory and query speed constraints.

GitHub: https://github.com/criteo/autofaiss

Parameters

[1]:

#@title Index parameters

max_index_query_time_ms = 10 #@param {type: "number"}
max_index_memory_usage = "10MB" #@param
metric_type = "l2" #@param ['ip', 'l2']

Embeddings creation (add your own embeddings here)

[2]:

import numpy as np

# Create embeddings
embeddings = np.float32(np.random.rand(4000, 100))

Save your embeddings to disk

[3]:

# Create a new folder
import os
import shutil
embeddings_dir = "embeddings_folder"
if os.path.exists(embeddings_dir):
  shutil.rmtree(embeddings_dir)
os.makedirs(embeddings_dir)

# Save your embeddings
# You can split your embeddings into several files if they are too big for one
# The data will be read in the lexicographical order of the filenames
np.save(f"{embeddings_dir}/part1.npy", embeddings[:2000])
np.save(f"{embeddings_dir}/part2.npy", embeddings[2000:])
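Because the parts are read in lexicographical filename order, it can be worth checking that they stack back into the original array. The cell below is a minimal sketch of such a check using only NumPy and the standard library; `load_parts_in_order` is a hypothetical helper written for this notebook (not part of Autofaiss), and it assumes that Python's `sorted` on the file paths matches the order in which the parts are read.

```python
import glob
import os
import tempfile

import numpy as np


def load_parts_in_order(folder):
    """Load every .npy file in `folder`, sorted by filename, and stack the rows."""
    paths = sorted(glob.glob(os.path.join(folder, "*.npy")))
    return np.concatenate([np.load(path) for path in paths], axis=0)


# Demo on a throwaway folder so the real embeddings folder is left untouched
rng = np.random.default_rng(0)
demo_embeddings = rng.random((40, 8), dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    np.save(os.path.join(tmp_dir, "part1.npy"), demo_embeddings[:20])
    np.save(os.path.join(tmp_dir, "part2.npy"), demo_embeddings[20:])
    restored = load_parts_in_order(tmp_dir)

# True when the row order survived the split and reload
order_preserved = np.array_equal(restored, demo_embeddings)
print(restored.shape, order_preserved)
```

Note that in lexicographical order "part10.npy" sorts before "part2.npy", so with ten or more parts it is safer to zero-pad the names (for example "part02.npy").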

Build the KNN index with Autofaiss

[4]:

os.makedirs("my_index_folder", exist_ok=True)

[ ]:

# Install autofaiss
!pip install autofaiss &> /dev/null

# Build a KNN index
!autofaiss build_index --embeddings={embeddings_dir} \
                    --index_path="knn.index" \
                    --index_infos_path="infos.json" \
                    --metric_type={metric_type} \
                    --max_index_query_time_ms={max_index_query_time_ms} \
                    --max_index_memory_usage={max_index_memory_usage}
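The build command writes a description of the chosen index to the path given by `--index_infos_path`. The cell below is a small sketch for inspecting such a file, assuming it is a plain JSON object; `load_index_infos` is a hypothetical helper, and the demo uses a stand-in file with a made-up key, since the actual field names are not listed in this notebook.

```python
import json
import os
import tempfile


def load_index_infos(path):
    """Read an index description file, assumed to be a JSON object, into a dict."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Stand-in file with a made-up key; a real infos.json will have different fields
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "infos.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        json.dump({"hypothetical_key": 123}, f)
    demo_infos = load_index_infos(demo_path)

# Print every field without assuming any particular key name
for name, value in demo_infos.items():
    print(f"{name}: {value}")
```

After a successful build, calling `load_index_infos("infos.json")` should show what was actually recorded for the index.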

Load the index and play with it

[6]:

import faiss
import numpy as np

my_index = faiss.read_index("knn.index")

query_vector = np.float32(np.random.rand(1, 100))
k = 5
distances, indices = my_index.search(query_vector, k)

print(f"Top {k} nearest neighbors of the query vector ({metric_type} metric):")
for i, (dist, index) in enumerate(zip(distances[0], indices[0])):
  print(f"{i+1}: Vector number {index:4} with distance {dist}")

Top 5 nearest neighbors of the query vector (l2 metric):
1: Vector number 2933 with distance 10.404068946838379
2: Vector number  168 with distance 10.53512191772461
3: Vector number 2475 with distance 10.688979148864746
4: Vector number 2525 with distance 10.713528633117676
5: Vector number 3463 with distance 10.774477005004883
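Since the index is approximate, one way to sanity-check results like these is to compare them with an exact brute-force search. The cell below is a minimal NumPy sketch for the "l2" metric; `brute_force_knn` and `recall_at_k` are hypothetical helpers written for illustration, not part of Faiss or Autofaiss. It uses squared L2 distances, which is what Faiss reports for the L2 metric.

```python
import numpy as np


def brute_force_knn(vectors, query, k):
    """Exact k nearest neighbors of `query` among `vectors` by squared L2 distance."""
    diffs = vectors - query  # broadcasts a (d,) or (1, d) query over all rows
    sq_dists = np.einsum("ij,ij->i", diffs, diffs)  # row-wise squared norms
    nearest = np.argsort(sq_dists)[:k]
    return sq_dists[nearest], nearest


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact neighbors that also appear in the approximate result."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


# Demo on synthetic data: a vector taken from the dataset is its own nearest neighbor
rng = np.random.default_rng(0)
demo_vectors = rng.random((1000, 16), dtype=np.float32)
demo_query = demo_vectors[42].copy()

exact_dists, exact_ids = brute_force_knn(demo_vectors, demo_query, k=5)
print(exact_ids[0], exact_dists[0])
print(recall_at_k(exact_ids, exact_ids))
```

In this notebook, the same check would be `recall_at_k(indices[0], brute_force_knn(embeddings, query_vector, k)[1])`, where a value of 1.0 means the index returned exactly the true top-k neighbors for that query.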

(Bonus) Python version of the CLI

[ ]:

from autofaiss import build_index

build_index(embeddings=embeddings_dir,
            index_path="knn.index",
            index_infos_path="infos.json",
            max_index_query_time_ms=max_index_query_time_ms,
            max_index_memory_usage=max_index_memory_usage,
            metric_type=metric_type)
