Autofaiss getting started
Information
This demo notebook automatically builds a Faiss KNN index with well-tuned similarity search parameters.
It selects the indexing parameters that achieve the highest recall under the given memory and query speed constraints.
Parameters
[1]:
#@title Index parameters
max_index_query_time_ms = 10 #@param {type: "number"}
max_index_memory_usage = "10MB" #@param
metric_type = "l2" #@param ['ip', 'l2']
Embeddings creation (add your own embeddings here)
[2]:
import numpy as np
# Create embeddings
embeddings = np.float32(np.random.rand(4000, 100))
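Before choosing a memory budget, it can help to compare it with the raw size of the embeddings. The sketch below is only illustrative: `size_to_bytes` is a hypothetical helper (not part of autofaiss), and it assumes decimal units (1 MB = 10^6 bytes), which may differ from how autofaiss interprets the string.

```python
import numpy as np

def size_to_bytes(size: str) -> int:
    """Hypothetical helper: convert a string such as "10MB" to a byte count.

    Assumes decimal units (1 MB = 10**6 bytes); this is an assumption,
    not necessarily the convention autofaiss uses.
    """
    units = {"KB": 10**3, "MB": 10**6, "GB": 10**9, "B": 1}
    text = size.strip().upper()
    # Multi-letter suffixes are checked before the bare "B" suffix.
    for suffix, factor in units.items():
        if text.endswith(suffix):
            return int(float(text[: -len(suffix)]) * factor)
    raise ValueError(f"Unrecognized size string: {size!r}")

# Same shape and dtype as the embeddings created in this notebook.
embeddings = np.float32(np.random.rand(4000, 100))

raw_bytes = embeddings.nbytes              # 4000 vectors * 100 dims * 4 bytes
budget_bytes = size_to_bytes("10MB")
fits = raw_bytes <= budget_bytes

print(f"Raw embeddings: {raw_bytes} bytes, budget: {budget_bytes} bytes, fits: {fits}")
```

Here the uncompressed vectors take 1.6 MB, so they fit in a 10 MB budget under the stated assumption. With a budget smaller than the raw size, the index would need to compress the vectors, usually at some cost in recall.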
Save your embeddings to disk
[3]:
# Create a new folder
import os
import shutil
embeddings_dir = "embeddings_folder"
if os.path.exists(embeddings_dir):
    shutil.rmtree(embeddings_dir)
os.makedirs(embeddings_dir)
# Save your embeddings
# You can split your embeddings into several parts if they are too big
# The data will be read in the lexicographical order of the filenames
np.save(f"{embeddings_dir}/part1.npy", embeddings[:2000])
np.save(f"{embeddings_dir}/part2.npy", embeddings[2000:])
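Since the parts are read in lexicographical order of their filenames, a quick round trip can confirm that the saved files reassemble into the original array. This is a minimal sketch that uses a temporary directory (an assumption made to keep the example self-contained) instead of `embeddings_folder`.

```python
import os
import tempfile

import numpy as np

embeddings = np.float32(np.random.rand(4000, 100))

with tempfile.TemporaryDirectory() as tmp_dir:
    # Save the embeddings in two parts, as in the cell above.
    np.save(os.path.join(tmp_dir, "part1.npy"), embeddings[:2000])
    np.save(os.path.join(tmp_dir, "part2.npy"), embeddings[2000:])

    # Read the parts back in lexicographical order of the filenames.
    filenames = sorted(f for f in os.listdir(tmp_dir) if f.endswith(".npy"))
    reloaded = np.concatenate(
        [np.load(os.path.join(tmp_dir, f)) for f in filenames]
    )

# The reassembled array should match the original, row for row.
roundtrip_ok = np.array_equal(reloaded, embeddings)
print(filenames, reloaded.shape, roundtrip_ok)
```

Note that lexicographical order is not numeric order: `part10.npy` sorts before `part2.npy`. With ten or more parts, zero-padded names such as `part01.npy` keep the vector numbering aligned with the original array.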
Build the KNN index with Autofaiss
[4]:
os.makedirs("my_index_folder", exist_ok=True)
[ ]:
# Install autofaiss
!pip install autofaiss &> /dev/null
# Build a KNN index
!autofaiss build_index --embeddings={embeddings_dir} \
--index_path="knn.index" \
--index_infos_path="infos.json" \
--metric_type={metric_type} \
--max_index_query_time_ms={max_index_query_time_ms} \
--max_index_memory_usage={max_index_memory_usage}
Load the index and play with it
[6]:
import faiss
import numpy as np
my_index = faiss.read_index("knn.index")
query_vector = np.float32(np.random.rand(1, 100))
k = 5
distances, indices = my_index.search(query_vector, k)
print(f"Top {k} elements in the dataset for {metric_type} search:")
for i, (dist, idx) in enumerate(zip(distances[0], indices[0])):
    print(f"{i+1}: Vector number {idx:4} with distance {dist}")
Top 5 elements in the dataset for l2 search:
1: Vector number 2933 with distance 10.404068946838379
2: Vector number 168 with distance 10.53512191772461
3: Vector number 2475 with distance 10.688979148864746
4: Vector number 2525 with distance 10.713528633117676
5: Vector number 3463 with distance 10.774477005004883
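To see what recall means in practice, the results of an approximate index can be compared with an exact brute-force search. The sketch below uses only NumPy; `exact_l2_search` and `recall_at_k` are hypothetical helper names, and it assumes the `l2` metric, for which Faiss reports squared Euclidean distances.

```python
import numpy as np

def exact_l2_search(data, queries, k):
    """Brute-force k nearest neighbours using squared L2 distances."""
    # Pairwise differences: shape (n_queries, n_data, dim).
    diffs = queries[:, None, :] - data[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    # Indices of the k smallest distances per query, in ascending order.
    idx = np.argsort(sq_dists, axis=1)[:, :k]
    return np.take_along_axis(sq_dists, idx, axis=1), idx

def recall_at_k(approx_indices, exact_indices):
    """Fraction of the exact neighbours that the approximate search also returned."""
    hits = sum(
        len(set(a.tolist()) & set(e.tolist()))
        for a, e in zip(approx_indices, exact_indices)
    )
    return hits / exact_indices.size

rng = np.random.default_rng(0)
data = rng.random((4000, 100), dtype=np.float32)
queries = rng.random((3, 100), dtype=np.float32)

exact_distances, exact_indices = exact_l2_search(data, queries, k=5)

# Comparing the exact result with itself gives a recall of 1.0 by construction;
# in the notebook, pass the `indices` returned by `my_index.search` instead.
print(recall_at_k(exact_indices, exact_indices))
```

In the notebook, `data` would be the `embeddings` array and the first argument of `recall_at_k` would be the `indices` returned by `my_index.search` for the same queries. A brute-force pass like this is only practical for small datasets such as the 4000 vectors used here.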
(Bonus) Python version of the CLI
[ ]:
from autofaiss import build_index
build_index(
    embeddings="embeddings_folder",
    index_path="knn.index",
    index_infos_path="infos.json",
    max_index_query_time_ms=max_index_query_time_ms,
    max_index_memory_usage=max_index_memory_usage,
    metric_type=metric_type,
)