Creating an index

The use-case

You have limited RAM but need to run similarity search over a lot of vectors? Great! You are in the right place :) This lib automatically builds an optimal index that maximizes recall given memory and query speed constraints.

The build_index command

The autofaiss build_index command takes the following parameters:

Parameters

| Flag available | Default | Description |
|---|---|---|
| --embeddings | required | Source path of the directory containing your .npy embedding files. If there are several files, they are read in lexicographical order. This can be a local path or a path in another filesystem, e.g. hdfs://root/… or s3://… |
| --index_path | required | Destination path of the faiss index on the local machine. |
| --index_infos_path | required | Destination path of the faiss index infos on the local machine. |
| --save_on_disk | required | Save the index on the disk. |
| --file_format | "npy" | File format of the files in embeddings. Can be either npy for numpy matrix files or parquet for parquet serialized tables. |
| --embedding_column_name | "embeddings" | Only necessary when file_format=`parquet`. In this case, this is the name of the column containing the embeddings (one vector per row). |
| --id_columns | None | Can only be used when file_format=`parquet`. In this case, these are the names of the columns containing the ids of the vectors, and separate files will be generated to map these ids to indices in the KNN index. |
| --ids_path | None | Only useful when id_columns is not None and file_format=`parquet`. This is the path (in any filesystem) where the files mapping ids to vector indices will be stored in parquet format. |
| --metric_type | "ip" | (Optional) Similarity function used for queries: "ip" for inner product, "l2" for Euclidean distance. |
| --max_index_memory_usage | "32GB" | (Optional) Maximum size in GB of the created index; this bound is strict. |
| --current_memory_available | "32GB" | (Optional) Memory available (in GB) on the machine creating the index. Having more memory speeds things up because it reduces swapping between RAM and disk. |
| --max_index_query_time_ms | 10 | (Optional) Bound on the query time for KNN search; this bound is approximate. |
| --min_nearest_neighbors_to_retrieve | 20 | (Optional) Minimum number of nearest neighbors to retrieve when querying the index. This parameter is used only during the index hyperparameter fine-tuning step; it is not taken into account when selecting the indexing algorithm. It takes priority over the max_index_query_time_ms constraint. |
| --index_key | None | (Optional) If present, the Faiss index will be built using this description string in the index_factory. More detail in the [Faiss documentation](https://github.com/facebookresearch/faiss/wiki/The-index-factory). |
| --index_param | None | (Optional) If present, the Faiss index will be configured using this description string of hyperparameters. More detail in the [Faiss documentation](https://github.com/facebookresearch/faiss/wiki/Index-IO,-cloning-and-hyper-parameter-tuning). |
| --use_gpu | False | (Optional) Experimental. GPU training can be faster, but this feature is not tested so far. |
| --nb_cores | None | (Optional) The number of cores to use. By default, all cores are used. |
| --make_direct_map | False | (Optional) If set to True and the created index is an IVF, call .make_direct_map() on the index to build a mapping (stored in RAM only) that greatly speeds up calls to .reconstruct(). |
| --should_be_memory_mappable | False | (Optional) If set to true, the created index will be selected only among the indices that can be memory-mapped on disk. This makes it possible to use 50GB indices on a machine with only 1GB of RAM. |
| --distributed | None | (Optional) If "pyspark", create the index using pyspark. Otherwise, the index is created on your local machine. |
| --temporary_indices_folder | "hdfs://root/tmp/distributed_autofaiss_indices" | (Optional) Folder in which to save the temporary small indices. Only used when distributed = "pyspark". |
| --verbose | 20 | (Optional) Set verbosity of logging output: DEBUG=10, INFO=20, WARN=30, ERROR=40, CRITICAL=50. |
| --nb_indices_to_keep | 1 | (Optional) Maximum number of indices to keep when distributed is "pyspark". |

The same function can be called directly from a python environment (from autofaiss import build_index).
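As a minimal sketch of the Python route, the keyword arguments below mirror the CLI flags from the table. The paths and memory values are placeholders chosen for illustration, and the import is done lazily inside a helper (a hypothetical name, `run_build`) so the snippet loads even where autofaiss is not installed.

```python
# Keyword arguments mirror the CLI flags; all paths and sizes are placeholders.
build_kwargs = {
    "embeddings": "embeddings",  # folder containing the .npy files
    "index_path": "my_index_folder/knn.index",
    "index_infos_path": "my_index_folder/index_infos.json",
    "max_index_memory_usage": "4G",
    "current_memory_available": "4G",
}


def run_build(kwargs):
    """Call autofaiss.build_index with the given keyword arguments.

    The import is lazy so that defining this helper does not require autofaiss.
    """
    from autofaiss import build_index

    return build_index(**kwargs)


# Uncomment once autofaiss is installed and the embeddings folder exists:
# run_build(build_kwargs)
```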

It is possible to force the creation of a specific index with specific hyperparameters if more control is needed. Here is some documentation <https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index> and <https://github.com/facebookresearch/faiss/wiki/The-index-factory> to help you to choose which index you need.
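To illustrate forcing a specific index, the sketch below assembles a build_index command line with --index_key and --index_param set. The factory string `IVF1024,Flat` and the hyperparameter string `nprobe=16` are illustrative values only, not recommendations, and the paths are placeholders.

```python
import shlex

# Flags taken from the parameter table; the values are illustrative assumptions.
flags = {
    "embeddings": "embeddings",
    "index_path": "my_index_folder/knn.index",
    "index_infos_path": "my_index_folder/index_infos.json",
    "index_key": "IVF1024,Flat",  # index_factory description string
    "index_param": "nprobe=16",   # hyperparameters applied to the built index
}

# Build the argument list in the --flag=value form.
command = ["autofaiss", "build_index"] + [f"--{name}={value}" for name, value in flags.items()]

# Print a shell-safe version of the command for copy/paste.
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` on a machine where the autofaiss CLI is available.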

Time required

The time required to run this command is:

  • For 1TB of vectors -> 2 hours

  • For 150GB of vectors -> 1 hour

  • For 50GB of vectors -> 20 minutes

Tuning an existing index

The use-case

You have already created a Faiss index but you would like to have a better recall/query-time ratio? This command creates a new index with different hyperparameters to be closer to your requirements.

The tune_index command

The tune_index command sets the hyperparameters for the given index.

If an index_param is given, these hyperparameters are applied to the index; otherwise a greedy heuristic is used to make the best of the max_index_query_time_ms constraint.

Parameters

index_path: Union[str, Any]

Path to the .index file on local disk if is_local_index_path is True, otherwise path on hdfs. Can also be an index.

index_key: str

String to give to the index factory in order to create the index.

index_param: Optional[str]

Optional string with hyperparameters to set on the index. If None, the hyperparameters are chosen based on a heuristic.

output_index_path: str

Path to the newly created .index file

save_on_disk: bool

Whether to save the index on disk. Defaults to True.

min_nearest_neighbors_to_retrieve: int

Minimum number of nearest neighbors to retrieve when querying the index.

max_index_query_time_ms: float

Query speed constraint for the index to create.

use_gpu: bool

Experimental. GPU training can be faster, but this feature is not tested so far.

verbose: int

Sets the verbosity of outputs via the logging level. Default is logging.INFO.

Returns

index

The faiss index
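A minimal sketch of a Python call using the parameters listed above. It assumes tune_index is importable from the top-level autofaiss package in the same way as build_index; the paths and the index_key value are placeholders, and the index_key must match the index actually being tuned.

```python
import logging

# Arguments follow the parameter list above; every value is a placeholder.
tune_kwargs = {
    "index_path": "my_index_folder/knn.index",
    "index_key": "OPQ64_128,IVF4096_HNSW32,PQ64x8",  # hypothetical key of the existing index
    "index_param": None,  # None lets the heuristic choose the hyperparameters
    "output_index_path": "my_index_folder/tuned_knn.index",
    "save_on_disk": True,
    "min_nearest_neighbors_to_retrieve": 20,
    "max_index_query_time_ms": 10.0,
    "use_gpu": False,
    "verbose": logging.INFO,
}


def run_tuning(kwargs):
    """Call autofaiss.tune_index; the import is lazy so autofaiss is only needed at call time."""
    from autofaiss import tune_index  # assumption: exported at package level

    return tune_index(**kwargs)


# Uncomment once autofaiss is installed and the index file exists:
# index = run_tuning(tune_kwargs)
```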

Time required

The time required to run this command is around 1 minute.

What it does behind the scenes

The tuning only works for inverted indices with an HNSW on top (95% of the indices created by the lib). There are 3 parameters to tune for that kind of index:

  • nprobe: The number of cells to visit, directly linked to query time (a cell contains on average nb_total_vectors/nb_clusters vectors)

  • efSearch: Search parameter of the HNSW on top of the clusters centers. It has a small impact on search time.

  • ht: The Hamming threshold, adds a boost in speed but reduces the recall.

efSearch is set to 2 times nprobe, and the Hamming threshold is deactivated by setting it to a high value.

By doing so, we can optimize on only one dimension by applying a binary search given a query time constraint.
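The one-dimensional search described above can be sketched as follows. This is an illustration under stated assumptions, not the library's actual code: `measure_query_time_ms` is a hypothetical callback that returns the average query time for a given setting, query time is assumed to grow with nprobe, and the value 2048 used to deactivate the Hamming threshold is an assumed "high value".

```python
def tune_nprobe(measure_query_time_ms, max_query_time_ms, max_nprobe=4096):
    """Binary-search the largest nprobe whose query time fits the constraint.

    efSearch is tied to nprobe (2 * nprobe), so only one dimension is searched.
    Returns 1 if even nprobe=1 exceeds the constraint.
    """
    low, high, best = 1, max_nprobe, 1
    while low <= high:
        mid = (low + high) // 2
        if measure_query_time_ms(nprobe=mid, ef_search=2 * mid) <= max_query_time_ms:
            best = mid      # constraint met: try visiting more cells
            low = mid + 1
        else:
            high = mid - 1  # too slow: visit fewer cells
    return best


def to_param_string(nprobe, ht=2048):
    """Format the hyperparameters; ht=2048 is an assumed 'deactivated' value."""
    return f"nprobe={nprobe},efSearch={2 * nprobe},ht={ht}"


def fake_timer(nprobe, ef_search):
    """Stand-in for a real benchmark: 0.5 ms per visited cell."""
    return 0.5 * nprobe


best_nprobe = tune_nprobe(fake_timer, max_query_time_ms=10)
print(to_param_string(best_nprobe))  # nprobe=20,efSearch=40,ht=2048
```

In a real setting the timer would run a batch of queries against the index with the candidate parameters applied and return the mean latency.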

Getting scores on an index

The use-case

You have a faiss index and you would like to know its 1-recall, intersection recall, query speed, …? There is a command for that too: the score command.

The score command

You just need the path to your index and the embeddings it was built from. Be careful, computing accurate metrics is slow.

The command computes metrics on a given index and uses a cached ground truth to make later scoring runs faster.

autofaiss score_index --embeddings="folder/embs" --index_path="some.index" --output_index_info_path "infos.json" --current_memory_available="4G"
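For readers unfamiliar with the two recall figures, here is a small self-contained sketch of how they are commonly defined. It follows the usual definitions (1-recall@k: the true nearest neighbor appears among the top k results; intersection recall@k: average overlap between the true and returned top k) and is not taken from the library's implementation. The sample ids are made up.

```python
def one_recall_at_k(ground_truth, retrieved, k):
    """Fraction of queries whose true nearest neighbor is in the top-k results."""
    hits = sum(1 for truth, result in zip(ground_truth, retrieved) if truth[0] in result[:k])
    return hits / len(ground_truth)


def intersection_recall_at_k(ground_truth, retrieved, k):
    """Average share of the true top-k neighbors found in the returned top-k."""
    overlap = sum(
        len(set(truth[:k]) & set(result[:k]))
        for truth, result in zip(ground_truth, retrieved)
    )
    return overlap / (k * len(ground_truth))


# Two queries; each row lists neighbor ids ordered from closest to farthest.
exact_neighbors = [[1, 2, 3], [4, 5, 6]]    # from an exhaustive search
approx_neighbors = [[1, 3, 9], [5, 4, 7]]   # from the approximate index

print(one_recall_at_k(exact_neighbors, approx_neighbors, k=1))           # 0.5
print(one_recall_at_k(exact_neighbors, approx_neighbors, k=3))           # 1.0
print(intersection_recall_at_k(exact_neighbors, approx_neighbors, k=3))  # 4 of 6 neighbors found
```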

Parameters

index_path: Union[str, Any]

Path to the .index file, or an in-memory index.

embeddings: str

Local path containing all preprocessed vectors and cached files.

output_index_info_path: str

Path to index infos .json

save_on_disk: bool

Whether to save on disk.

current_memory_available: str

Memory available on the current machine. Having more memory speeds things up because it reduces swapping between RAM and disk.

verbose: int

Sets the verbosity of outputs via the logging level. Default is logging.INFO.

Time required

The time required to run this command is around 1 hour for 200M vectors of 1280d (1TB). If the whole dataset fits in RAM it can be much faster.
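The 1TB figure can be checked with a quick calculation, assuming the vectors are stored as 4-byte float32 values (an assumption, since the storage type is not stated here).

```python
nb_vectors = 200_000_000
dimension = 1280
bytes_per_value = 4  # assumption: float32 embeddings

# Raw size of the embedding matrix, ignoring any file format overhead.
total_bytes = nb_vectors * dimension * bytes_per_value
total_terabytes = total_bytes / 10**12

print(f"{total_terabytes:.3f} TB")  # 1.024 TB
```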

Creating partitioned indexes

The use-case

You have a partitioned parquet dataset and want to create one index per partition.

The build_partitioned_indexes command

The autofaiss build_partitioned_indexes command takes the following parameters:

Parameters

| Flag available | Default | Description |
|---|---|---|
| --partitions | required | List of partitions containing embeddings. Paths can be local paths or paths in another filesystem, e.g. hdfs://root/… or s3://…. |
| --output_root_dir | required | Output root directory where indexes, metrics and ids will be written. |
| --embedding_column_name | "embedding" | Parquet dataset column name containing embeddings. |
| --index_key | None | Optional string to give to the index factory in order to create the index. If None, an index is chosen based on a heuristic. |
| --id_columns | None | Parquet dataset column name(s) that are used as IDs for embeddings. A mapping from these IDs to faiss indices will be written in separate files. |
| --max_index_query_time_ms | 10 | Bound on the query time for KNN search; this bound is approximate. |
| --max_index_memory_usage | 16GB | Maximum size allowed for the index; this bound is strict. |
| --min_nearest_neighbors_to_retrieve | 20 | Minimum number of nearest neighbors to retrieve when querying the index. This parameter is used only during the index hyperparameter fine-tuning step; it is not taken into account when selecting the indexing algorithm. It takes priority over the max_index_query_time_ms constraint. |
| --current_memory_available | 32GB | Memory available on the machine creating the index. Having more memory speeds things up because it reduces swapping between RAM and disk. |
| --use_gpu | False | Experimental. GPU training can be faster, but this feature is not tested so far. |
| --metric_type | ip | Similarity function used for queries: "ip" for inner product or "l2" for Euclidean distance. |
| --nb_cores | None | Number of cores to use. Will try to guess the right number if not provided. |
| --make_direct_map | False | Create a direct map allowing reconstruction of embeddings. This is only needed for IVF indices. Note that this might increase the RAM usage (approximately 8GB for 1 billion embeddings). |
| --should_be_memory_mappable | False | If set to true, the created index will be selected only among the indices that can be memory-mapped on disk. This makes it possible to use 50GB indices on a machine with only 1GB of RAM. |
| --temp_root_dir | "hdfs://root/tmp/distributed_autofaiss_indices" | Temporary directory that will be used to store intermediate results/computation. |
| --verbose | logging.INFO | Sets the verbosity of outputs via the logging level. |
| --nb_splits_per_big_index | 1 | Number of indices to split a big index into. This allows you to build indices bigger than current_memory_available. |
| --big_index_threshold | 5_000_000 | Threshold used to define big indexes. Indexes with more than big_index_threshold embeddings are considered big indexes. |
| --maximum_nb_threads | 256 | Maximum number of threads used to parallelize index creation. |

What it does behind the scenes

For each partition of the partitioned dataset, one index is trained and populated with the vectors of that partition. All indexes are created in parallel. For big partitions (with more than big_index_threshold vectors), vectors are added to the index in a distributed way.
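The big/small distinction can be illustrated with a short sketch. The helper name `split_by_size` and the partition names and counts are hypothetical; only the rule itself (strictly more than big_index_threshold embeddings means "big") comes from the description above.

```python
def split_by_size(partition_sizes, big_index_threshold=5_000_000):
    """Separate partitions into (small, big) lists by their embedding count.

    A partition is 'big' when it holds more than big_index_threshold embeddings.
    """
    small, big = [], []
    for name, nb_embeddings in partition_sizes.items():
        if nb_embeddings > big_index_threshold:
            big.append(name)
        else:
            small.append(name)
    return small, big


# Hypothetical partitions with their number of embeddings.
sizes = {
    "partition=a": 120_000,
    "partition=b": 5_000_000,   # exactly at the threshold: still small
    "partition=c": 80_000_000,
}

small_partitions, big_partitions = split_by_size(sizes)
print(small_partitions)  # ['partition=a', 'partition=b']
print(big_partitions)    # ['partition=c']
```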