autofaiss.external package
Submodules
autofaiss.external.build module
Gathers the functions necessary to build an index.
- autofaiss.external.build.add_embeddings_to_index(embedding_reader, trained_index_or_path, metadata, current_memory_available, embedding_ids_df_handler=None, distributed_engine=None, temporary_indices_folder='hdfs://root/tmp/distributed_autofaiss_indices', nb_indices_to_keep=1, index_optimizer=None)[source]
Add embeddings to the index
- autofaiss.external.build.create_index(embedding_reader, index_key, metric_type, current_memory_available, embedding_ids_df_handler=None, use_gpu=False, make_direct_map=False, distributed_engine=None, temporary_indices_folder='hdfs://root/tmp/distributed_autofaiss_indices', nb_indices_to_keep=1, index_optimizer=None)[source]
Create an index and add embeddings to the index
- autofaiss.external.build.create_partitioned_indexes(partitions, output_root_dir, embedding_column_name='embedding', index_key=None, index_path=None, id_columns=None, should_be_memory_mappable=False, max_index_query_time_ms=10.0, max_index_memory_usage='16G', min_nearest_neighbors_to_retrieve=20, current_memory_available='32G', use_gpu=False, metric_type='ip', nb_cores=None, make_direct_map=False, temp_root_dir='hdfs://root/tmp/distributed_autofaiss_indices', big_index_threshold=5000000, nb_splits_per_big_index=1, maximum_nb_threads=256)[source]
Create partitioned indexes from a list of parquet partitions, i.e. create one index per parquet partition
Only supported with PySpark. An active PySpark session must exist before calling this method.
autofaiss.external.descriptions module
File that contains the descriptions of the different indices features.
autofaiss.external.metadata module
Index metadata for Faiss indices.
- class autofaiss.external.metadata.IndexMetadata(index_key, nb_vectors, dim_vector, make_direct_map=False)[source]
Bases: object
Class to compute index metadata given the index_key, the number of vectors and their dimension.
Note: We don’t create classes for each index type in order to keep the code simple.
- compute_memory_necessary_for_ivf_flat(nb_training_vectors)[source]
Compute the memory estimation for index type IVF_FLAT.
- compute_memory_necessary_for_opq_ivf_hnsw_pq(nb_training_vectors)[source]
Compute the memory estimation for index type OPQ_IVF_HNSW_PQ.
- Return type:
- compute_memory_necessary_for_opq_ivf_pq(nb_training_vectors)[source]
Compute the memory estimation for index type OPQ_IVF_PQ.
- Return type:
- compute_memory_necessary_for_pad_ivf_hnsw_pq(nb_training_vectors)[source]
Compute the memory estimation for index type PAD_IVF_HNSW_PQ.
- compute_memory_necessary_for_training(nb_training_vectors)[source]
Function that computes the memory necessary to train an index with nb_training_vectors vectors
- Return type:
- estimated_index_size_in_bytes()[source]
Compute the estimated size of the index in bytes.
- Return type:
autofaiss.external.optimize module
Functions to find optimal index parameters
- autofaiss.external.optimize.binary_search_on_param(index, parameter_range, max_speed_ms, hyperparameter_str_from_param, timeout_boost_for_precision_search=6.0, use_gpu=False, max_timeout_per_iteration_s=1.0)[source]
Apply a binary search on a given hyperparameter to maximize the recall given a query speed constraint in milliseconds/query.
- Parameters:
index (faiss.Index) – Index to search on.
parameter_range (List[T]) – List of possible values for the hyperparameter. This list is sorted.
max_speed_ms (float) – Maximum query speed in milliseconds/query.
hyperparameter_str_from_param (Callable[[T], str]) – Function to generate a hyperparameter string from the hyperparameter value on which we do a binary search.
timeout_boost_for_precision_search (float) – Timeout boost for the precision search phase.
use_gpu (bool) – Whether the index is on the GPU.
max_timeout_per_iteration_s (float) – Maximum timeout per iteration in seconds.
- Return type:
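The search strategy described for binary_search_on_param can be sketched in plain Python. This is a minimal illustration, not the autofaiss implementation: `query_time_ms` is a hypothetical callable standing in for a timed search on the index, and the sketch assumes that query time never decreases as the parameter grows.

```python
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def largest_param_within_budget(
    parameter_range: List[T],
    max_speed_ms: float,
    query_time_ms: Callable[[T], float],  # hypothetical timing function
) -> Optional[T]:
    """Return the largest value of the sorted list whose query time fits the budget."""
    lo, hi = 0, len(parameter_range) - 1
    best: Optional[T] = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if query_time_ms(parameter_range[mid]) <= max_speed_ms:
            best = parameter_range[mid]  # fits the budget: try a larger value
            lo = mid + 1
        else:
            hi = mid - 1  # too slow: try a smaller value
    return best


# Toy latency model (assumption): 0.1 ms per unit of the parameter.
candidate_values = [1, 2, 4, 8, 16, 32, 64, 128]
chosen = largest_param_within_budget(candidate_values, 2.0, lambda p: 0.1 * p)
print(chosen)
```

Returning `None` when no value fits the budget is a choice made for this sketch only.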
- autofaiss.external.optimize.check_if_index_needs_training(index_key)[source]
Function that checks if the index needs to be trained
- Return type:
- autofaiss.external.optimize.get_min_param_value_for_best_neighbors_coverage(index, parameter_range, hyperparameter_str_from_param, targeted_nb_neighbors_to_query, *, targeted_coverage=0.99, use_gpu=False)[source]
This function returns the minimal value to set in the index hyperparameters so that, on average, the index retrieves 99% of the requested k=targeted_nb_neighbors_to_query nearest neighbors.
[Diagram: nearest neighbors coverage (vertical axis, from 0 to 1) plotted against param_value (horizontal axis). Coverage increases with param_value and levels off at 1. min_param_value is marked between min(parameter_range) and max(parameter_range).]
- Parameters:
index (faiss.Index) – Index to search on.
parameter_range (List[T]) – List of possible values for the hyperparameter. This list is sorted.
hyperparameter_str_from_param (Callable[[T], str]) – Function to generate a hyperparameter string from the hyperparameter value on which we do a binary search.
targeted_nb_neighbors_to_query (int) – Targeted number of neighbors to query.
targeted_coverage (float) – Targeted nearest neighbors coverage: the average ratio of neighbors actually retrieved when asking for k=targeted_nb_neighbors_to_query nearest neighbors.
use_gpu (bool) – Whether the index is on the GPU.
- Return type:
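The idea of finding the smallest parameter value that reaches the targeted coverage can be sketched as a binary search. This is an illustration under stated assumptions, not the library's code: `coverage` is a hypothetical callable returning the average fraction of requested neighbors actually returned, and it is assumed to be non-decreasing over `parameter_range`.

```python
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def min_param_for_coverage(
    parameter_range: List[T],
    coverage: Callable[[T], float],  # hypothetical coverage measurement
    targeted_coverage: float = 0.99,
) -> Optional[T]:
    """Return the smallest value of the sorted list reaching the targeted coverage."""
    lo, hi = 0, len(parameter_range) - 1
    best: Optional[T] = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if coverage(parameter_range[mid]) >= targeted_coverage:
            best = parameter_range[mid]  # good enough: look for a smaller value
            hi = mid - 1
        else:
            lo = mid + 1  # coverage too low: look at larger values
    return best


# Toy model (assumption): coverage grows linearly and saturates at 1.
values = [1, 2, 4, 8, 16, 32]
min_value = min_param_for_coverage(values, lambda p: min(1.0, p / 8))
print(min_value)
```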
- autofaiss.external.optimize.get_optimal_batch_size(vec_dim, current_memory_available)[source]
Compute the optimal batch size to use the RAM at its full potential when adding vectors.
- Return type:
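A rough way to reason about a batch size is to divide a memory budget by the size of one vector. The sketch below assumes float32 vectors (4 bytes per component) and uses a hypothetical `memory_fraction` safety margin; the function name and the heuristic are illustrative and may differ from what autofaiss actually computes.

```python
def batch_size_for_memory(
    vec_dim: int,
    memory_bytes: int,
    memory_fraction: float = 0.5,  # hypothetical safety margin
) -> int:
    """Number of float32 vectors that fit in a fraction of the memory budget."""
    bytes_per_vector = vec_dim * 4  # float32 components
    usable_bytes = int(memory_bytes * memory_fraction)
    return max(1, usable_bytes // bytes_per_vector)


# 4 GiB budget, 512-dimensional vectors.
batch_size = batch_size_for_memory(vec_dim=512, memory_bytes=4 * 1024**3)
print(batch_size)
```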
- autofaiss.external.optimize.get_optimal_hyperparameters(index, index_key, max_speed_ms, use_gpu=False, max_timeout_per_iteration_s=1.0, min_ef_search=32, min_nearest_neighbors_to_retrieve=20)[source]
Find the optimal hyperparameters to maximize the recall given a query speed in milliseconds/query
- Return type:
- autofaiss.external.optimize.get_optimal_index_keys_v2(nb_vectors, dim_vector, max_index_memory_usage, flat_threshold=1000, quantization_threshold=10000, force_pq=None, make_direct_map=False, should_be_memory_mappable=False, ivf_flat_threshold=1000000, use_gpu=False)[source]
Gives a list of interesting indices to try, the one at the top is the most promising
See: https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index for detailed explanations.
- autofaiss.external.optimize.get_optimal_ivf(nb_vectors)[source]
Function that returns a list of relevant index_keys to create non-quantized IVF indices.
- autofaiss.external.optimize.get_optimal_nb_clusters(nb_vectors)[source]
Returns a list with the recommended number of clusters for an index containing nb_vectors vectors. The first value is the most recommended one. See: https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index
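The linked Faiss guidelines suggest between 4*sqrt(N) and 16*sqrt(N) clusters below one million vectors, then fixed sizes for larger datasets. The sketch below encodes those guideline values directly; the function name and the ordering of the candidates are assumptions made for illustration, and the values returned by autofaiss may differ.

```python
import math
from typing import List


def suggested_nb_clusters(nb_vectors: int) -> List[int]:
    """Candidate cluster counts following the Faiss index selection guidelines."""
    if nb_vectors < 1_000_000:
        root = math.isqrt(nb_vectors)
        # Guideline range is 4*sqrt(N) to 16*sqrt(N); 8*sqrt(N) is a midpoint.
        return [max(1, factor * root) for factor in (4, 8, 16)]
    if nb_vectors < 10_000_000:
        return [65_536]
    if nb_vectors < 100_000_000:
        return [262_144]
    return [1_048_576]


print(suggested_nb_clusters(10_000))
print(suggested_nb_clusters(5_000_000))
```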
- autofaiss.external.optimize.get_optimal_quantization(nb_vectors, dim_vector, force_quantization_value=None, force_max_index_memory_usage=None)[source]
Function that returns a list of relevant index_keys to create quantized indices.
Parameters:
- nb_vectors: int
Number of vectors in the dataset.
- dim_vector: int
Dimension of the vectors in the dataset.
- force_quantization_value: Optional[int]
Force to use this value as the size of the quantized vectors (PQx). It can be used with the force_max_index_memory_usage parameter, but the result might be empty.
- force_max_index_memory_usage: Optional[str]
Add a memory constraint on the index. It can be used with the force_quantization_value parameter, but the result might be empty.
Returns:
- index_keys: List[str]
List of index_keys that would be good choices for quantization. The list can be empty if the given constraints are too strong.
- autofaiss.external.optimize.get_optimal_train_size(nb_vectors, index_key, current_memory_available, vec_dim)[source]
Function that determines the number of training points necessary to train the index, based on faiss heuristics for k-means clustering.
- Return type:
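Faiss's k-means uses default bounds of 39 (minimum) and 256 (maximum) training points per centroid. A simplified sketch of a training-size heuristic built on the upper bound follows. It ignores the memory constraint that get_optimal_train_size also takes into account, and the function name is hypothetical.

```python
def kmeans_train_size(
    nb_vectors: int,
    nb_clusters: int,
    points_per_cluster: int = 256,  # Faiss default max_points_per_centroid
) -> int:
    """Number of training vectors: enough for every cluster, capped by the dataset size."""
    return min(nb_vectors, nb_clusters * points_per_cluster)


print(kmeans_train_size(nb_vectors=10_000_000, nb_clusters=4096))
print(kmeans_train_size(nb_vectors=1_000, nb_clusters=4096))
```

Using more points than the upper bound brings little benefit, since Faiss subsamples the training set beyond that size.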
- autofaiss.external.optimize.index_key_to_nb_cluster(index_key)[source]
Function that takes an index key and returns the number of clusters
- Return type:
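For IVF-based keys, the number of clusters is the integer that follows `IVF` in the index factory string. The sketch below extracts it with a regular expression; it only handles the `IVF<n>` pattern and is not the library's implementation, which may support other index families.

```python
import re


def nb_clusters_from_index_key(index_key: str) -> int:
    """Extract the cluster count from an index factory string such as 'IVF4096,PQ64'."""
    match = re.search(r"IVF(\d+)", index_key)
    if match is None:
        raise ValueError(f"No IVF component found in index key: {index_key!r}")
    return int(match.group(1))


print(nb_clusters_from_index_key("OPQ64_128,IVF4096,PQ64"))
print(nb_clusters_from_index_key("OPQ16_64,IVF65536_HNSW32,PQ16"))
```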
- autofaiss.external.optimize.optimize_and_measure_index(embedding_reader, index, index_infos_path, index_key, index_param, index_path, *, max_index_query_time_ms, min_nearest_neighbors_to_retrieve, save_on_disk, use_gpu)[source]
Optimize one index by selecting the best hyperparameters and calculate its metrics
autofaiss.external.quantize module
Main file to create an index from the beginning.
- autofaiss.external.quantize.build_index(embeddings, index_path='knn.index', index_infos_path='index_infos.json', ids_path=None, save_on_disk=True, file_format='npy', embedding_column_name='embedding', id_columns=None, index_key=None, index_param=None, max_index_query_time_ms=10.0, max_index_memory_usage='16G', min_nearest_neighbors_to_retrieve=20, current_memory_available='32G', use_gpu=False, metric_type='ip', nb_cores=None, make_direct_map=False, should_be_memory_mappable=False, distributed=None, temporary_indices_folder='hdfs://root/tmp/distributed_autofaiss_indices', verbose=20, nb_indices_to_keep=1)[source]
Reads embeddings and creates a quantized index from them. The index is stored on the current machine at the given output path.
- Parameters:
embeddings (Union[str, np.ndarray, List[str]]) – Local path containing all preprocessed vectors and cached files. This can be a single directory or multiple directories. Files will be added if empty. Alternatively, the NumPy array of embeddings can be passed directly.
index_path (Optional(str)) – Destination path of the quantized model.
index_infos_path (Optional(str)) – Destination path of the metadata file.
ids_path (Optional(str)) – Only useful when id_columns is not None and file_format=`parquet`. This is the path (in any filesystem) where the files mapping IDs to vector indices will be stored in parquet format.
save_on_disk (bool) – Whether to save the index on disk, default to True.
file_format (Optional(str)) – npy or parquet ; default npy
embedding_column_name (Optional(str)) – embeddings column name for parquet ; default embedding
id_columns (Optional(List[str])) – Can only be used when file_format=`parquet`. In this case these are the names of the columns containing the Ids of the vectors, and separate files will be generated to map these ids to indices in the KNN index ; default None
index_key (Optional(str)) – Optional string to give to the index factory in order to create the index. If None, an index is chosen based on a heuristic.
index_param (Optional(str)) – Optional string with hyperparameters to set on the index. If None, the hyperparameters are chosen based on a heuristic.
max_index_query_time_ms (float) – Bound on the query time for KNN search. This bound is approximate.
max_index_memory_usage (str) – Maximum size allowed for the index, this bound is strict
min_nearest_neighbors_to_retrieve (int) – Minimum number of nearest neighbors to retrieve when querying the index. Parameter used only during index hyperparameter finetuning step, it is not taken into account to select the indexing algorithm. This parameter has the priority over the max_index_query_time_ms constraint.
current_memory_available (str) – Memory available on the machine creating the index. Having more memory helps because it reduces swapping between RAM and disk.
use_gpu (bool) – Experimental. GPU training is faster, but it has not been tested so far.
metric_type (str) –
- Similarity function used for query:
"ip" for inner product
"l2" for Euclidean distance
nb_cores (Optional[int]) – Number of cores to use. Will try to guess the right number if not provided
make_direct_map (bool) – Create a direct map allowing reconstruction of embeddings. This is only needed for IVF indices. Note that this might increase the RAM usage (approximately 8GB for 1 billion embeddings).
should_be_memory_mappable (bool) – If set to true, the created index will be selected only among the indices that can be memory-mapped on disk. This makes it possible to use 50GB indices on a machine with only 1GB of RAM. Defaults to False.
distributed (Optional[str]) – If “pyspark”, create the indices using pyspark. Only “parquet” file format is supported.
temporary_indices_folder (str) – Folder to save the temporary small indices that are generated by each spark executor. Only used when distributed = “pyspark”.
verbose (int) – set verbosity of outputs via logging level, default is logging.INFO
nb_indices_to_keep (int) –
Number of indices to keep at most when distributed is “pyspark”. It allows you to build an index larger than current_memory_available. If it is not equal to 1:
- You are expected to have at most nb_indices_to_keep indices with the following names: "{index_path}i", where i ranges from 1 to nb_indices_to_keep.
- build_index returns a mapping from index path to metrics.
Defaults to 1.
- Return type:
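The two metric_type options rank neighbors differently in general, but they agree when all vectors are unit-normalized, because the squared Euclidean distance between unit vectors equals 2 minus twice their inner product. The following plain-Python sketch checks this on toy data; the helper names are illustrative and no autofaiss or Faiss call is involved.

```python
import math
from typing import List, Sequence


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(inner_product(v, v))
    return [x / norm for x in v]


query = normalize([1.0, 0.0])
docs = [normalize(v) for v in ([1.0, 0.1], [0.5, 0.5], [0.0, 1.0])]

# "ip": larger is more similar; "l2": smaller is more similar.
by_ip = sorted(range(len(docs)), key=lambda i: -inner_product(query, docs[i]))
by_l2 = sorted(range(len(docs)), key=lambda i: squared_l2(query, docs[i]))
print(by_ip, by_l2)
```

With unnormalized embeddings the two rankings can diverge, so the choice of metric should match how the embeddings were trained.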
- autofaiss.external.quantize.build_partitioned_indexes(partitions, output_root_dir, embedding_column_name='embedding', index_key=None, index_path=None, id_columns=None, max_index_query_time_ms=10.0, max_index_memory_usage='16G', min_nearest_neighbors_to_retrieve=20, current_memory_available='32G', use_gpu=False, metric_type='ip', nb_cores=None, make_direct_map=False, should_be_memory_mappable=False, temp_root_dir='hdfs://root/tmp/distributed_autofaiss_indices', verbose=20, nb_splits_per_big_index=1, big_index_threshold=5000000, maximum_nb_threads=256)[source]
Create partitioned indexes from a partitioned parquet dataset, i.e. create one index per parquet partition
Only supported with PySpark. A PySpark session must be active before calling this function
- Parameters:
partitions (str) – List of partitions containing embeddings
output_root_dir (str) – Output root directory where indexes, metrics and ids will be written
embedding_column_name (str) – Parquet dataset column name containing embeddings
index_key (Optional(str)) – Optional string to give to the index factory in order to create the index. If None, an index is chosen based on a heuristic.
index_path (Optional(str)) – Optional path to an index to which embeddings will be added. This index must be pre-trained if its type requires training.
id_columns (Optional(List[str])) – Parquet dataset column name(s) that are used as IDs for embeddings. A mapping from these IDs to faiss indices will be written in separate files.
max_index_query_time_ms (float) – Bound on the query time for KNN search. This bound is approximate.
max_index_memory_usage (str) – Maximum size allowed for the index, this bound is strict
min_nearest_neighbors_to_retrieve (int) – Minimum number of nearest neighbors to retrieve when querying the index. Parameter used only during index hyperparameter finetuning step, it is not taken into account to select the indexing algorithm. This parameter has the priority over the max_index_query_time_ms constraint.
current_memory_available (str) – Memory available on the machine creating the index. Having more memory helps because it reduces swapping between RAM and disk.
use_gpu (bool) – Experimental. GPU training is faster, but it has not been tested so far.
metric_type (str) –
- Similarity function used for query:
"ip" for inner product
"l2" for Euclidean distance
nb_cores (Optional[int]) – Number of cores to use. Will try to guess the right number if not provided
make_direct_map (bool) – Create a direct map allowing reconstruction of embeddings. This is only needed for IVF indices. Note that this might increase the RAM usage (approximately 8GB for 1 billion embeddings).
should_be_memory_mappable (bool) – If set to true, the created index will be selected only among the indices that can be memory-mapped on disk. This makes it possible to use 50GB indices on a machine with only 1GB of RAM. Defaults to False.
temp_root_dir (str) – Temporary directory that will be used to store intermediate results/computation
verbose (int) – set verbosity of outputs via logging level, default is logging.INFO
nb_splits_per_big_index (int) – Number of indices to split a big index into. This allows you to build indices bigger than current_memory_available.
big_index_threshold (int) – Threshold used to define big indexes. Indexes with more than big_index_threshold embeddings are considered big indexes.
maximum_nb_threads (int) – Maximum number of threads to parallelize index creation
- Return type:
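The interaction between big_index_threshold and nb_splits_per_big_index can be pictured as a per-partition decision: partitions with more embeddings than the threshold are split, the others are built as a single index. The sketch below follows that reading of the parameter descriptions; the function name and the partition names are hypothetical, and the actual scheduling inside autofaiss may differ.

```python
from typing import Dict


def splits_per_partition(
    partition_sizes: Dict[str, int],
    big_index_threshold: int = 5_000_000,
    nb_splits_per_big_index: int = 1,
) -> Dict[str, int]:
    """Map each partition to the number of sub-indices it would be built from."""
    return {
        name: nb_splits_per_big_index if nb_embeddings > big_index_threshold else 1
        for name, nb_embeddings in partition_sizes.items()
    }


# Hypothetical partitions with their embedding counts.
sizes = {"partition=a": 1_000_000, "partition=b": 12_000_000}
plan = splits_per_partition(sizes, nb_splits_per_big_index=4)
print(plan)
```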
- autofaiss.external.quantize.score_index(index_path, embeddings, save_on_disk=True, output_index_info_path='infos.json', current_memory_available='32G', verbose=20)[source]
Compute metrics on a given index. The ground truth is cached so that subsequent scoring runs are faster.
- Parameters:
index_path (Union[str, faiss.Index]) – Path to the .index file, or an in-memory index.
embeddings (Union[str, np.ndarray]) – Path containing all preprocessed vectors and cached files. Can also be an in memory array.
save_on_disk (bool) – Whether to save on disk
output_index_info_path (str) – Path to index infos .json
current_memory_available (str) – Memory available on the current machine. Having more memory helps because it reduces swapping between RAM and disk.
verbose (int) – set verbosity of outputs via logging level, default is logging.INFO
- Returns:
metric_infos – Metric infos of the index.
- Return type:
- autofaiss.external.quantize.tune_index(index_path, index_key, index_param=None, output_index_path='tuned_knn.index', save_on_disk=True, min_nearest_neighbors_to_retrieve=20, max_index_query_time_ms=10.0, use_gpu=False, verbose=20)[source]
Set hyperparameters to the given index.
If an index_param is given, set these hyperparameters on the index; otherwise, apply a greedy heuristic to make the best of the max_index_query_time_ms constraint.
- Parameters:
index_path (Union[str, faiss.Index]) – Path to the .index file. Can also be an in-memory index.
index_key (str) – String to give to the index factory in order to create the index.
index_param (Optional(str)) – Optional string with hyperparameters to set on the index. If None, the hyperparameters are chosen based on a heuristic.
output_index_path (str) – Path to the newly created .index file
save_on_disk (bool) – Whether to save the index on disk, default to True.
min_nearest_neighbors_to_retrieve (int) – Minimum number of nearest neighbors to retrieve when querying the index.
max_index_query_time_ms (float) – Query speed constraint for the index to create.
use_gpu (bool) – Experimental. GPU training is faster, but it has not been tested so far.
verbose (int) – set verbosity of outputs via logging level, default is logging.INFO
- Returns:
The faiss index
- Return type:
index
autofaiss.external.scores module
Functions to compute metrics on an index
- autofaiss.external.scores.compute_fast_metrics(embedding_reader, index, omp_threads=None, query_max=1000)[source]
Compute the query speed, size and reconstruction error of an index.
- Return type: