Creating an index
The use-case
You have a limited amount of RAM but need to run similarity search on a lot of vectors? Great! You are in the right place :) This lib automatically builds an optimal index that maximizes recall given memory and query speed constraints.
The build_index command
The autofaiss build_index command takes the following parameters:
| Flag | Default | Description |
|---|---|---|
| --embeddings | required | Source path of the directory containing your .npy embedding files. If there are several files, they are read in lexicographical order. This can be a local path or a path on another filesystem, e.g. hdfs://root/… or s3://… |
| --index_path | required | Destination path of the faiss index on the local machine. |
| --index_infos_path | required | Destination path of the faiss index infos on the local machine. |
| --save_on_disk | required | Whether to save the index on disk. |
| --file_format | "npy" | File format of the files in embeddings. Can be either npy for numpy matrix files or parquet for parquet serialized tables. |
| --embedding_column_name | "embeddings" | Only necessary when file_format=`parquet`; the name of the column containing the embeddings (one vector per row). |
| --id_columns | None | Can only be used when file_format=`parquet`. Names of the columns containing the ids of the vectors; separate files will be generated to map these ids to indices in the KNN index. |
| --ids_path | None | Only useful when id_columns is not None and file_format=`parquet`. Path (on any filesystem) where the ids->vector index mapping files will be stored in parquet format. |
| --metric_type | "ip" | (Optional) Similarity function used for the query: "ip" for inner product, "l2" for euclidean distance. |
| --max_index_memory_usage | "32GB" | (Optional) Maximum size in GB of the created index; this bound is strict. |
| --current_memory_available | "32GB" | (Optional) Memory available (in GB) on the machine creating the index; having more memory is a boost because it reduces swapping between RAM and disk. |
| --max_index_query_time_ms | 10 | (Optional) Bound on the query time for KNN search; this bound is approximate. |
| --min_nearest_neighbors_to_retrieve | 20 | (Optional) Minimum number of nearest neighbors to retrieve when querying the index. Used only during the index hyperparameter fine-tuning step; it is not taken into account when selecting the indexing algorithm. This parameter takes priority over the max_index_query_time_ms constraint. |
| --index_key | None | (Optional) If present, the Faiss index will be built using this description string in the index_factory; see the [Faiss documentation](https://github.com/facebookresearch/faiss/wiki/The-index-factory) for details. |
| --index_param | None | (Optional) If present, the Faiss index hyperparameters will be set using this description string; see the [Faiss documentation](https://github.com/facebookresearch/faiss/wiki/Index-IO,-cloning-and-hyper-parameter-tuning) for details. |
| --use_gpu | False | (Optional) Experimental. GPU training can be faster, but this feature is not tested so far. |
| --nb_cores | None | (Optional) Number of cores to use; by default all cores are used. |
| --make_direct_map | False | (Optional) If set to True and the created index is an IVF, call .make_direct_map() on the index to build a mapping (stored in RAM only) that greatly speeds up calls to .reconstruct(). |
| --should_be_memory_mappable | False | (Optional) If set to True, the created index will be selected only among the indices that can be memory-mapped on disk. This makes it possible to use 50GB indices on a machine with only 1GB of RAM. |
| --distributed | None | (Optional) If "pyspark", create the index using pyspark. Otherwise, the index is created on your local machine. |
| --temporary_indices_folder | "hdfs://root/tmp/distributed_autofaiss_indices" | (Optional) Folder in which to save the temporary small indices; only used when distributed="pyspark". |
| --verbose | 20 | (Optional) Verbosity of logging output: DEBUG=10, INFO=20, WARN=30, ERROR=40, CRITICAL=50. |
| --nb_indices_to_keep | 1 | (Optional) Maximum number of indices to keep when distributed is "pyspark". |
The same function can be called directly from a Python environment (`from autofaiss import build_index`).
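For example, here is a minimal sketch of such a call; the paths are placeholders and the keyword arguments mirror the CLI flags documented above:

```python
from autofaiss import build_index

# Minimal sketch: the paths are placeholders, and the keyword
# arguments mirror the CLI flags documented above.
build_index(
    embeddings="embeddings",              # folder of .npy embedding files
    index_path="knn.index",               # destination of the faiss index
    index_infos_path="index_infos.json",  # destination of the index infos
    max_index_memory_usage="4G",
    current_memory_available="4G",
)
```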
It is possible to force the creation of a specific index with specific hyperparameters if more control is needed. The documentation at <https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index> and <https://github.com/facebookresearch/faiss/wiki/The-index-factory> can help you choose which index you need.
Time required
The time required to run this command is:
- For 1TB of vectors -> 2 hours
- For 150GB of vectors -> 1 hour
- For 50GB of vectors -> 20 minutes
Tuning an existing index
The use-case
You have already created a Faiss index but would like a better recall/query-time trade-off? This command creates a new index with different hyperparameters to get closer to your requirements.
The tune_index command
The tune_index command sets the hyperparameters of the given index.
If index_param is given, these hyperparameters are applied to the index; otherwise, a greedy heuristic is used to make the best of the max_index_query_time_ms constraint.
Parameters
- index_path: Union[str, Any]
Path to the .index file on the local disk if is_local_index_path is True, otherwise a path on hdfs. Can also be an in-memory index.
- index_key: str
String to give to the index factory in order to create the index.
- index_param: Optional[str]
Optional string of hyperparameters to set on the index. If None, the hyperparameters are chosen based on a heuristic.
- output_index_path: str
Path to the newly created .index file.
- save_on_disk: bool
Whether to save the index on disk; defaults to True.
- min_nearest_neighbors_to_retrieve: int
Minimum number of nearest neighbors to retrieve when querying the index.
- max_index_query_time_ms: float
Query speed constraint for the index to create.
- use_gpu: bool
Experimental. GPU training can be faster, but this feature is not tested so far.
- verbose: int
Set the verbosity of outputs via logging level; default is logging.INFO.
Returns
- index
The faiss index
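A minimal sketch of a call from Python follows; the paths and the index_key value are placeholders, not recommendations:

```python
from autofaiss import tune_index

# Minimal sketch: paths and index_key are placeholders.
tuned_index = tune_index(
    index_path="knn.index",  # existing index to tune
    index_key="OPQ256_768,IVF65536_HNSW32,PQ256x8",
    output_index_path="tuned_knn.index",
    max_index_query_time_ms=10,
)
```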
Time required
The time required to run this command is around 1 minute.
What it does behind the scenes
The tuning only works for inverted indices with HNSW on top (95% of the indices created by the lib). There are 3 parameters to tune for such an index:
- nprobe: The number of cells to visit, directly linked to the query time (a cell contains on average nb_total_vectors/nb_clusters vectors)
- efSearch: Search parameter of the HNSW on top of the cluster centers. It has a small impact on search time.
- ht: The Hamming threshold; it adds a boost in speed but reduces the recall.
efSearch is set to be 2 times higher than nprobe, and the Hamming threshold is deactivated by setting it to a high value.
By doing so, we can optimize along only one dimension by applying a binary search given a query time constraint.
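For illustration, here is how such hyperparameters can be set on a faiss index through its ParameterSpace; the path and values below are arbitrary examples, not autofaiss defaults:

```python
import faiss

# Illustration: set the three hyperparameters discussed above on an
# IVF index with an HNSW coarse quantizer. Values are arbitrary examples;
# "ht" only applies to PQ indices with polysemous codes.
index = faiss.read_index("tuned_knn.index")  # placeholder path
faiss.ParameterSpace().set_index_parameters(
    index, "nprobe=64,quantizer_efSearch=128,ht=2048"
)
```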
Getting scores on an index
The use-case
You have a faiss index and would like to know its 1-recall, intersection recall, query speed, …? There is a command for that too: the score_index command.
The score_index command
You just need the path to your index and the embeddings for this one. Be careful: computing accurate metrics is slow.
The command computes metrics on a given index and caches the ground truth, so that later scoring runs are fast.
`autofaiss score_index --embeddings="folder/embs" --index_path="some.index" --output_index_info_path "infos.json" --current_memory_available="4G"`
Parameters
- index_path: Union[str, Any]
Path to the .index file, or an in-memory index.
- embeddings: str
Local path containing all preprocessed vectors and cached files.
- output_index_info_path: str
Path to the index infos .json file.
- save_on_disk: bool
Whether to save on disk.
- current_memory_available: str
Memory available on the current machine; having more memory is a boost because it reduces swapping between RAM and disk.
- verbose: int
Set the verbosity of outputs via logging level; default is logging.INFO.
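The same function can be called from Python; a minimal sketch with placeholder paths mirroring the CLI example above:

```python
from autofaiss import score_index

# Minimal sketch: paths are placeholders mirroring the CLI example above.
score_index(
    index_path="some.index",
    embeddings="folder/embs",
    output_index_info_path="infos.json",
    current_memory_available="4G",
)
```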
Time required
The time required to run this command is around 1 hour for 200M vectors of 1280d (1TB). If the whole dataset fits in RAM it can be much faster.
Creating partitioned indexes
The use-case
You have a partitioned parquet dataset and want to create one index per partition.
The build_partitioned_indexes command
The autofaiss build_partitioned_indexes command takes the following parameters:
| Flag | Default | Description |
|---|---|---|
| --partitions | required | List of partitions containing embeddings. Paths can be local paths or paths on another filesystem, e.g. hdfs://root/… or s3://…. |
| --output_root_dir | required | Output root directory where indexes, metrics and ids will be written. |
| --embedding_column_name | "embedding" | Parquet dataset column name containing the embeddings. |
| --index_key | None | Optional string to give to the index factory in order to create the index. If None, an index is chosen based on a heuristic. |
| --id_columns | None | Parquet dataset column name(s) used as IDs for the embeddings. A mapping from these IDs to faiss indices will be written in separate files. |
| --max_index_query_time_ms | 10 | Bound on the query time for KNN search; this bound is approximate. |
| --max_index_memory_usage | 16GB | Maximum size allowed for the index; this bound is strict. |
| --min_nearest_neighbors_to_retrieve | 20 | Minimum number of nearest neighbors to retrieve when querying the index. Used only during the index hyperparameter fine-tuning step; it is not taken into account when selecting the indexing algorithm. This parameter takes priority over the max_index_query_time_ms constraint. |
| --current_memory_available | 32GB | Memory available on the machine creating the index; having more memory is a boost because it reduces swapping between RAM and disk. |
| --use_gpu | False | Experimental. GPU training can be faster, but this feature is not tested so far. |
| --metric_type | ip | Similarity function used for the query: "ip" for inner product or "l2" for euclidean distance. |
| --nb_cores | None | Number of cores to use. Will try to guess the right number if not provided. |
| --make_direct_map | False | Create a direct map allowing reconstruction of embeddings. Only needed for IVF indices. Note that it might increase RAM usage (approximately 8GB for 1 billion embeddings). |
| --should_be_memory_mappable | False | If set to True, the created index will be selected only among the indices that can be memory-mapped on disk. This makes it possible to use 50GB indices on a machine with only 1GB of RAM. |
| --temp_root_dir | "hdfs://root/tmp/distributed_autofaiss_indices" | Temporary directory used to store intermediate results/computation. |
| --verbose | logging.INFO | Set the verbosity of outputs via logging level. |
| --nb_splits_per_big_index | 1 | Number of indices to split a big index into. This allows building indices bigger than current_memory_available. |
| --big_index_threshold | 5_000_000 | Threshold used to define big indexes. Indexes with more than big_index_threshold embeddings are considered big. |
| --maximum_nb_threads | 256 | Maximum number of threads used to parallelize index creation. |
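The same function can be called from Python (`from autofaiss import build_partitioned_indexes`). A minimal sketch follows; the partition paths and output directory are placeholders, and a running pyspark session is assumed since the indexes are created in a distributed way:

```python
from autofaiss import build_partitioned_indexes

# Minimal sketch: partition paths and output directory are placeholders;
# a running pyspark session is assumed for distributed index creation.
build_partitioned_indexes(
    partitions=[
        "hdfs://root/dataset/part_a",
        "hdfs://root/dataset/part_b",
    ],
    output_root_dir="hdfs://root/indexes",
    embedding_column_name="embedding",
)
```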
What it does behind the scenes
For each partition of the partitioned dataset, one index is trained and populated with the vectors of that partition. All indexes are created in parallel. For big partitions (with more than big_index_threshold vectors), vectors are added to the indexes in a distributed way.