Creating an index

The use-case

You have limited RAM but need to run similarity search on a large number of vectors? Great! You are in the right place :) This lib automatically builds an index that maximizes recall under given memory and query-speed constraints.

The build_index command

The autofaiss build_index command takes the following parameters:

Parameters
Flag available	Default	Description
--embeddings	required	Source path of the directory containing your .npy embedding files. If there are several files, they are read in lexicographical order. This can be a local path or a path in another filesystem, e.g. hdfs://root/… or s3://…
--index_path	required	Destination path of the faiss index on the local machine.
--index_infos_path	required	Destination path of the faiss index infos on the local machine.
--save_on_disk	required	Save the index on the disk.
--file_format	"npy"	File format of the files in embeddings. Can be either npy for numpy matrix files or parquet for parquet serialized tables.
--embedding_column_name	"embeddings"	Only necessary when file_format=`parquet`. In this case, this is the name of the column containing the embeddings (one vector per row).
--id_columns	None	Can only be used when file_format=`parquet`. In this case, these are the names of the columns containing the IDs of the vectors, and separate files will be generated to map these IDs to indices in the KNN index.
--ids_path	None	Only useful when id_columns is not None and file_format=`parquet`. This is the path (in any filesystem) where the files mapping IDs to vector indices will be stored in parquet format.
--metric_type	"ip"	(Optional) Similarity function used for queries: "ip" for inner product, "l2" for Euclidean distance.
--max_index_memory_usage	"32GB"	(Optional) Maximum size in GB of the created index. This bound is strict.
--current_memory_available	"32GB"	(Optional) Memory available (in GB) on the machine creating the index. Having more memory speeds up the build because it reduces swapping between RAM and disk.
--max_index_query_time_ms	10	(Optional) Bound on the query time for KNN search. This bound is approximate.
--min_nearest_neighbors_to_retrieve	20	(Optional) Minimum number of nearest neighbors to retrieve when querying the index. This parameter is used only during the index hyperparameter fine-tuning step; it is not taken into account when selecting the indexing algorithm. It takes priority over the max_index_query_time_ms constraint.
--index_key	None	(Optional) If present, the Faiss index will be built using this description string in the index_factory. More detail in the [Faiss documentation](https://github.com/facebookresearch/faiss/wiki/The-index-factory)
--index_param	None	(Optional) If present, the Faiss index will be configured using this description string of hyperparameters. More detail in the [Faiss documentation](https://github.com/facebookresearch/faiss/wiki/Index-IO,-cloning-and-hyper-parameter-tuning)
--use_gpu	False	(Optional) Experimental. GPU training can be faster, but this feature is not tested so far.
--nb_cores	None	(Optional) The number of cores to use. By default, all cores are used.
--make_direct_map	False	(Optional) If set to True and the created index is an IVF, call .make_direct_map() on the index to build a mapping (stored in RAM only) that greatly speeds up calls to .reconstruct().
--should_be_memory_mappable	False	(Optional) If set to True, the created index will be selected only among the indices that can be memory-mapped on disk. This makes it possible to use 50GB indices on a machine with only 1GB of RAM.
--distributed	None	(Optional) If "pyspark", create the index using pyspark. Otherwise, the index is created on your local machine.
--temporary_indices_folder	"hdfs://root/tmp/distributed_autofaiss_indices"	(Optional) Folder to save the temporary small indices, only used when distributed = "pyspark".
--verbose	20	(Optional) Set verbosity of logging output: DEBUG=10, INFO=20, WARN=30, ERROR=40, CRITICAL=50
--nb_indices_to_keep	1	(Optional) Maximum number of indices to keep when distributed is "pyspark".

The same function can be called directly from a python environment (from autofaiss import build_index).
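As a rough sketch, the flags listed in the table can be assembled into a command line or passed as keyword arguments to the Python function. The paths and memory values below are placeholders chosen for illustration, and `autofaiss_command` and `run_build_index` are hypothetical helpers, not part of the library.

```python
import shlex


def autofaiss_command(subcommand, **params):
    """Build an `autofaiss <subcommand>` argument list from keyword arguments."""
    args = ["autofaiss", subcommand]
    for name, value in params.items():
        args.append(f"--{name}={value}")
    return args


# Placeholder values: adapt the paths and memory budgets to your own setup.
build_params = {
    "embeddings": "embeddings",        # folder containing the .npy files
    "index_path": "knn.index",
    "index_infos_path": "infos.json",
    "max_index_memory_usage": "4GB",
    "current_memory_available": "5GB",
    "metric_type": "ip",
}

command = autofaiss_command("build_index", **build_params)
print(shlex.join(command))


def run_build_index(params):
    """Call the Python entry point with the same parameters as the CLI."""
    # Imported lazily so the rest of this sketch runs without autofaiss installed.
    from autofaiss import build_index

    return build_index(**params)
```

The argument list can be handed to `subprocess.run(command, check=True)`, while `run_build_index(build_params)` is the in-process equivalent.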

It is possible to force the creation of a specific index with specific hyperparameters if more control is needed. The Faiss documentation at <https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index> and <https://github.com/facebookresearch/faiss/wiki/The-index-factory> can help you choose which index you need.

Time required

The time required to run this command is:

For 1TB of vectors -> 2 hours
For 150GB of vectors -> 1 hour
For 50GB of vectors -> 20 minutes

Tuning an existing index

The use-case

You have already created a Faiss index but you would like to have a better recall/query-time ratio? This command creates a new index with different hyperparameters to be closer to your requirements.

The tune_index command

The tune_index command sets the hyperparameters for the given index.

If an index_param is given, these hyperparameters are set on the index. Otherwise, a greedy heuristic is run to make the best of the max_index_query_time_ms constraint.

Parameters

index_path: Union[str, Any]: Path to the .index file on local disk if is_local_index_path is True, otherwise path on hdfs. Can also be an in-memory index.
index_key: str: String to give to the index factory in order to create the index.
index_param: Optional[str]: Optional string with hyperparameters to set on the index. If None, the hyperparameters are chosen based on a heuristic.
output_index_path: str: Path to the newly created .index file.
save_on_disk: bool: Whether to save the index on disk, defaults to True.
min_nearest_neighbors_to_retrieve: int: Minimum number of nearest neighbors to retrieve when querying the index.
max_index_query_time_ms: float: Query speed constraint for the index to create.
use_gpu: bool: Experimental. GPU training is faster, but not tested so far.
verbose: int: Set verbosity of outputs via logging level, default is logging.INFO.

Returns

index: The faiss index
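For illustration, here is a minimal sketch of how an index_param string might be assembled and passed to the command. It assumes the comma-separated `name=value` form that Faiss uses for hyperparameter strings, and that the command is exposed as `autofaiss tune_index`, mirroring the `autofaiss build_index` and `autofaiss score_index` forms in this document. The file names, the index_key and the hyperparameter values are placeholders, and `format_index_param` is a hypothetical helper.

```python
import shlex


def format_index_param(hyperparameters):
    """Join hyperparameters into a comma-separated `name=value` string."""
    return ",".join(f"{name}={value}" for name, value in hyperparameters.items())


# Placeholder hyperparameters for an IVF index with HNSW on top.
index_param = format_index_param({"nprobe": 16, "efSearch": 32, "ht": 2048})

# Placeholder paths and index_key: replace them with the ones of your own index.
command = [
    "autofaiss",
    "tune_index",
    "--index_path=knn.index",
    "--index_key=OPQ16_64,IVF1024_HNSW32,PQ16x8",
    f"--index_param={index_param}",
    "--output_index_path=tuned.index",
]
print(shlex.join(command))
```

Leaving out the index_param argument corresponds to the second mode described above, where the hyperparameters are chosen by the heuristic.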

Time required

The time required to run this command is around 1 minute.

What it does behind the scenes

The tuning only works for inverted indices with HNSW on top (95% of the indices created by the lib). There are 3 parameters to tune for such an index:

nprobe: The number of cells to visit, directly linked to query time (a cell contains on average nb_total_vectors/nb_clusters vectors)
efSearch: Search parameter of the HNSW on top of the cluster centers. It has a small impact on search time.
ht: The Hamming threshold, adds a boost in speed but reduces the recall.

efSearch is set to 2 times nprobe, and the Hamming threshold is deactivated by setting it to a high value.

By doing so, we can optimize on only one dimension by applying a binary search given a query time constraint.
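The one-dimensional search described above can be sketched as follows. This is not the library's actual implementation: it assumes that query time grows monotonically with nprobe, `measure_query_time_ms` is a hypothetical callable standing in for real timing measurements on the index, and the value 2048 used as the "high" Hamming threshold is an assumption.

```python
from typing import Callable


def tune_nprobe(
    measure_query_time_ms: Callable[[int], float],
    max_index_query_time_ms: float,
    max_nprobe: int,
) -> int:
    """Binary search for the largest nprobe whose query time fits the budget."""
    low, high = 1, max_nprobe
    best = 1  # fall back to the smallest value if even nprobe=1 is too slow
    while low <= high:
        mid = (low + high) // 2
        if measure_query_time_ms(mid) <= max_index_query_time_ms:
            best = mid       # fits the budget, try visiting more cells
            low = mid + 1
        else:
            high = mid - 1   # too slow, visit fewer cells
    return best


def index_param_for(nprobe: int, hamming_threshold: int = 2048) -> str:
    """Derive the other two parameters from nprobe, as described in the text."""
    return f"nprobe={nprobe},efSearch={2 * nprobe},ht={hamming_threshold}"


def toy_timer(nprobe: int) -> float:
    """Toy timing model: 0.5 ms of overhead plus 0.125 ms per visited cell."""
    return 0.5 + 0.125 * nprobe


best_nprobe = tune_nprobe(toy_timer, max_index_query_time_ms=10.0, max_nprobe=1024)
print(index_param_for(best_nprobe))  # nprobe=76,efSearch=152,ht=2048
```

Because a larger nprobe generally improves recall at the cost of query time, picking the largest value that still fits the time budget is a reasonable proxy for the best recall under the constraint.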

Getting scores on an index

The use-case

You have a faiss index and you would like to know its 1-recall, intersection recall, query speed, …? There is a command for that too: the score command.

The score command

You only need the path to your index and the embeddings it was built from. Be careful, computing accurate metrics is slow.

The command computes metrics on the given index and uses a cached ground truth to make subsequent scoring runs faster.

autofaiss score_index --embeddings="folder/embs" --index_path="some.index" --output_index_info_path "infos.json" --current_memory_available="4G"
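To make the two recall metrics concrete, here is a small sketch assuming their usual definitions: 1-recall@k is the fraction of queries whose true nearest neighbor appears among the k returned results, and intersection recall@k is the average overlap between the true and the returned top-k lists. The function names and the toy data are illustrative and are not taken from the library.

```python
def one_recall_at_k(ground_truth, retrieved, k):
    """Fraction of queries whose true nearest neighbor is in the top-k results."""
    hits = sum(
        1 for truth, found in zip(ground_truth, retrieved) if truth[0] in found[:k]
    )
    return hits / len(ground_truth)


def intersection_recall_at_k(ground_truth, retrieved, k):
    """Average fraction of the true top-k neighbors found in the top-k results."""
    overlap = sum(
        len(set(truth[:k]) & set(found[:k]))
        for truth, found in zip(ground_truth, retrieved)
    )
    return overlap / (k * len(ground_truth))


# Toy data: exact neighbor ids per query, and ids returned by an approximate index.
ground_truth = [[1, 2, 3], [4, 5, 6]]
retrieved = [[2, 1, 7], [5, 6, 8]]

print(one_recall_at_k(ground_truth, retrieved, k=3))           # 0.5
print(intersection_recall_at_k(ground_truth, retrieved, k=3))  # 4 of 6 neighbors found
```

The ground truth requires an exact search over all embeddings, which is why computing accurate metrics is slow.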

Parameters

index_path: Union[str, Any]: Path to the .index file, or an in-memory index.
embeddings: str: Local path containing all preprocessed vectors and cached files.
output_index_info_path: str: Path to index infos .json
save_on_disk: bool: Whether to save on disk
current_memory_available: str: Memory available on the current machine. Having more memory speeds up scoring because it reduces swapping between RAM and disk.
verbose: int: Set verbosity of outputs via logging level, default is logging.INFO

Time required

The time required to run this command is around 1 hour for 200M vectors of 1280d (1TB). If the whole dataset fits in RAM it can be much faster.
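The 1TB figure can be checked with a short calculation, assuming the vectors are stored as 4-byte float32 values (the storage type is an assumption, it is not stated here):

```python
nb_vectors = 200_000_000
dimension = 1280
bytes_per_value = 4  # assumption: embeddings stored as float32

total_bytes = nb_vectors * dimension * bytes_per_value
total_terabytes = total_bytes / 1e12

print(f"{total_bytes:,} bytes = {total_terabytes:.3f} TB")
```

Comparing this figure with the RAM of the machine gives a quick idea of whether the faster in-memory case applies.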

Creating partitioned indexes

The use-case

You have a partitioned parquet dataset and want to create one index per partition.

The build_partitioned_indexes command

The autofaiss build_partitioned_indexes command takes the following parameters:

Parameters
Flag available	Default	Description
--partitions	required	List of partitions containing embeddings. Paths can be local paths or paths in another filesystem, e.g. hdfs://root/… or s3://….
--output_root_dir	required	Output root directory where indexes, metrics and ids will be written.
--embedding_column_name	"embedding"	Parquet dataset column name containing embeddings.
--index_key	None	Optional string to give to the index factory in order to create the index. If None, an index is chosen based on a heuristic.
--id_columns	None	Parquet dataset column name(s) that are used as IDs for embeddings. A mapping from these IDs to faiss indices will be written in separate files.
--max_index_query_time_ms	10	Bound on the query time for KNN search. This bound is approximate.
--max_index_memory_usage	16GB	Maximum size allowed for the index. This bound is strict.
--min_nearest_neighbors_to_retrieve	20	Minimum number of nearest neighbors to retrieve when querying the index. This parameter is used only during the index hyperparameter fine-tuning step; it is not taken into account when selecting the indexing algorithm. It takes priority over the max_index_query_time_ms constraint.
--current_memory_available	32GB	Memory available on the machine creating the index. Having more memory speeds up the build because it reduces swapping between RAM and disk.
--use_gpu	False	Experimental. GPU training is faster, but not tested so far.
--metric_type	ip	Similarity function used for queries: "ip" for inner product or "l2" for Euclidean distance.
--nb_cores	None	Number of cores to use. Will try to guess the right number if not provided.
--make_direct_map	False	Create a direct map allowing reconstruction of embeddings. This is only needed for IVF indices. Note that this might increase RAM usage (approximately 8GB for 1 billion embeddings).
--should_be_memory_mappable	False	If set to True, the created index will be selected only among the indices that can be memory-mapped on disk. This makes it possible to use 50GB indices on a machine with only 1GB of RAM. Defaults to False.
--temp_root_dir	"hdfs://root/tmp/distributed_autofaiss_indices"	Temporary directory that will be used to store intermediate results/computation.
--verbose	logging.INFO	Set verbosity of outputs via logging level, default is logging.INFO.
--nb_splits_per_big_index	1	Number of indices to split a big index into. This lets you build indices bigger than current_memory_available.
--big_index_threshold	5_000_000	Threshold used to define big indexes. Indexes with more than big_index_threshold embeddings are considered big indexes.
--maximum_nb_threads	256	Maximum number of threads to parallelize index creation.

What it does behind the scenes

For each partition of the partitioned dataset, one index is trained and populated with the vectors of that partition. All indexes are created in parallel. For big partitions (with more than big_index_threshold vectors), vectors are also added to the index in a distributed way.
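The distinction between small and big partitions can be sketched as a simple planning step. This is only an illustration of the rule stated by the big_index_threshold and nb_splits_per_big_index parameters, not the library's code: `plan_partitions` is a hypothetical helper, and the partition names and sizes are made up.

```python
def plan_partitions(
    partition_sizes, big_index_threshold=5_000_000, nb_splits_per_big_index=1
):
    """Decide, per partition, whether its index is big and how many splits it gets."""
    plan = {}
    for name, nb_vectors in partition_sizes.items():
        # A partition is "big" when it holds more than big_index_threshold vectors.
        is_big = nb_vectors > big_index_threshold
        plan[name] = {
            "is_big": is_big,
            "nb_splits": nb_splits_per_big_index if is_big else 1,
        }
    return plan


# Made-up partition sizes (number of vectors per partition).
sizes = {"partition=a": 1_000_000, "partition=b": 12_000_000}
plan = plan_partitions(sizes, nb_splits_per_big_index=4)

for name, decision in plan.items():
    print(name, decision)
```

In this toy plan, only `partition=b` exceeds the default threshold of 5_000_000 vectors, so it is the only one treated as a big index.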