autofaiss.external package

Submodules

autofaiss.external.build module

Functions necessary to build an index.

autofaiss.external.build.add_embeddings_to_index(embedding_reader, trained_index_or_path, metadata, current_memory_available, embedding_ids_df_handler=None, distributed_engine=None, temporary_indices_folder='hdfs://root/tmp/distributed_autofaiss_indices', nb_indices_to_keep=1, index_optimizer=None)[source]

Add embeddings to the index

Return type:

Tuple[Optional[Index], Optional[Dict[str, str]]]

autofaiss.external.build.create_index(embedding_reader, index_key, metric_type, current_memory_available, embedding_ids_df_handler=None, use_gpu=False, make_direct_map=False, distributed_engine=None, temporary_indices_folder='hdfs://root/tmp/distributed_autofaiss_indices', nb_indices_to_keep=1, index_optimizer=None)[source]

Create an index and add embeddings to the index

Return type:

Tuple[Optional[Index], Optional[Dict[str, str]]]

autofaiss.external.build.create_partitioned_indexes(partitions, output_root_dir, embedding_column_name='embedding', index_key=None, index_path=None, id_columns=None, should_be_memory_mappable=False, max_index_query_time_ms=10.0, max_index_memory_usage='16G', min_nearest_neighbors_to_retrieve=20, current_memory_available='32G', use_gpu=False, metric_type='ip', nb_cores=None, make_direct_map=False, temp_root_dir='hdfs://root/tmp/distributed_autofaiss_indices', big_index_threshold=5000000, nb_splits_per_big_index=1, maximum_nb_threads=256)[source]

Create partitioned indexes from a list of parquet partitions, i.e. create one index per parquet partition

Only supported with PySpark. An active PySpark session must exist before calling this function.

Return type:

List[Optional[Dict[str, str]]]

autofaiss.external.build.estimate_memory_required_for_index_creation(nb_vectors, vec_dim, index_key=None, max_index_memory_usage=None, make_direct_map=False, nb_indices_to_keep=1)[source]

Estimates the RAM necessary to create the index. The returned value is in bytes.

Return type:

Tuple[int, str]
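
As a rough illustration of the arithmetic behind such an estimate, the sketch below computes only the size of the uncompressed float32 vectors and formats it for display. Both helper names are hypothetical and not part of autofaiss; the library's own estimate also has to account for the index structure and training, which this sketch ignores.

```python
def raw_embeddings_size_in_bytes(nb_vectors: int, vec_dim: int, bytes_per_component: int = 4) -> int:
    """Size of the uncompressed vectors, assuming float32 components (4 bytes each)."""
    return nb_vectors * vec_dim * bytes_per_component


def format_bytes(nb_bytes: int) -> str:
    """Render a byte count with a binary unit suffix, e.g. 1536 -> '1.5KB'."""
    units = ["B", "KB", "MB", "GB", "TB"]
    size = float(nb_bytes)
    for unit in units:
        # Stop at the first unit that keeps the number below 1024, or at the last unit
        if size < 1024 or unit == units[-1]:
            return f"{size:.1f}{unit}"
        size /= 1024


raw_size = raw_embeddings_size_in_bytes(nb_vectors=1_000_000, vec_dim=128)
print(raw_size, format_bytes(raw_size))
```

The raw size is a useful baseline: a flat index needs at least this much, while a quantized index is chosen precisely to stay well below it.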

autofaiss.external.build.get_estimated_construction_time_infos(nb_vectors, vec_dim, indent=0)[source]

Gives a general approximation of the construction time of the index

Return type:

str

autofaiss.external.descriptions module

File that contains the descriptions of the different indices features.

class autofaiss.external.descriptions.IndexBlock(value)[source]

Bases: Enum

An enumeration.

FLAT = 0
HNSW = 3
IVF = 1
IVF_HNSW = 2
OPQ = 5
PAD = 6
PQ = 4

class autofaiss.external.descriptions.TunableParam(value)[source]

Bases: Enum

An enumeration.

EFSEARCH = 0
HT = 2
NPROBE = 1
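
To make the role of these tunable parameters more concrete, here is a minimal sketch that mirrors the enumeration locally and turns a set of values into a hyperparameter string. The mapping to the Faiss parameter names efSearch, nprobe and ht is an assumption made for illustration, and hyperparameter_str is a hypothetical helper, not an autofaiss function.

```python
from enum import Enum


class TunableParam(Enum):
    """Local mirror of the enumeration documented above."""

    EFSEARCH = 0
    NPROBE = 1
    HT = 2


# Assumed mapping to Faiss search-time parameter names (not taken from autofaiss)
FAISS_PARAM_NAMES = {
    TunableParam.EFSEARCH: "efSearch",
    TunableParam.NPROBE: "nprobe",
    TunableParam.HT: "ht",
}


def hyperparameter_str(values: dict) -> str:
    """Join parameter assignments into a 'name=value,name=value' string."""
    return ",".join(f"{FAISS_PARAM_NAMES[param]}={value}" for param, value in values.items())


print(hyperparameter_str({TunableParam.NPROBE: 16, TunableParam.HT: 2048}))
```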

autofaiss.external.metadata module

Index metadata for Faiss indices.

class autofaiss.external.metadata.IndexMetadata(index_key, nb_vectors, dim_vector, make_direct_map=False)[source]

Bases: object

Class to compute index metadata given the index_key, the number of vectors and their dimension.

Note: We don’t create classes for each index type in order to keep the code simple.

compute_memory_necessary_for_ivf_flat(nb_training_vectors)[source]

Compute the memory estimation for index type IVF_FLAT.

compute_memory_necessary_for_opq_ivf_hnsw_pq(nb_training_vectors)[source]

Compute the memory estimation for index type OPQ_IVF_HNSW_PQ.

Return type:

float

compute_memory_necessary_for_opq_ivf_pq(nb_training_vectors)[source]

Compute the memory estimation for index type OPQ_IVF_PQ.

Return type:

float

compute_memory_necessary_for_pad_ivf_hnsw_pq(nb_training_vectors)[source]

Compute the memory estimation for index type PAD_IVF_HNSW_PQ.

compute_memory_necessary_for_training(nb_training_vectors)[source]

Function that computes the memory necessary to train an index with nb_training_vectors vectors

Return type:

float

estimated_index_size_in_bytes()[source]

Compute the estimated size of the index in bytes.

Return type:

int

get_index_description(tunable_parameters_infos=False)[source]

Gives a generic description of the index.

Return type:

str

get_index_type()[source]

Return the index type.

Return type:

IndexType

class autofaiss.external.metadata.IndexType(value)[source]

Bases: Enum

An enumeration.

FLAT = 0
HNSW = 1
IVF_FLAT = 6
NOT_SUPPORTED = 5
OPQ_IVF_HNSW_PQ = 3
OPQ_IVF_PQ = 2
PAD_IVF_HNSW_PQ = 4

autofaiss.external.metadata.compute_memory_necessary_for_training_wrapper(nb_training_vectors, index_key, dim_vector)[source]

autofaiss.external.optimize module

Functions to find optimal index parameters

autofaiss.external.optimize.binary_search_on_param(index, parameter_range, max_speed_ms, hyperparameter_str_from_param, timeout_boost_for_precision_search=6.0, use_gpu=False, max_timeout_per_iteration_s=1.0)[source]

Apply a binary search on a given hyperparameter to maximize the recall given a query speed constraint in milliseconds/query.

Parameters:
  • index (faiss.Index) – Index to search on.

  • parameter_range (List[T]) – List of possible values for the hyperparameter. This list is sorted.

  • max_speed_ms (float) – Maximum query speed in milliseconds/query.

  • hyperparameter_str_from_param (Callable[[T], str]) – Function to generate a hyperparameter string from the hyperparameter value on which we do a binary search.

  • timeout_boost_for_precision_search (float) – Timeout boost for the precision search phase.

  • use_gpu (bool) – Whether the index is on the GPU.

  • max_timeout_per_iteration_s (float) – Maximum timeout per iteration in seconds.

Return type:

TypeVar(T, int, float)
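
The idea of a binary search under a speed constraint can be sketched without Faiss. The example below assumes that query time grows with the parameter value, so the largest value that fits the time budget is the one expected to give the best recall. The function largest_param_within_budget and the toy timing model are hypothetical; the real function measures an actual index and takes additional timeout arguments.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T", int, float)


def largest_param_within_budget(
    parameter_range: List[T],
    max_speed_ms: float,
    measure_speed_ms: Callable[[T], float],
) -> T:
    """Return the largest value whose measured query time fits the budget.

    Assumes parameter_range is sorted and that query time grows with the value.
    Falls back to the smallest value when nothing fits.
    """
    low, high = 0, len(parameter_range) - 1
    best = parameter_range[0]
    while low <= high:
        mid = (low + high) // 2
        if measure_speed_ms(parameter_range[mid]) <= max_speed_ms:
            best = parameter_range[mid]  # fits: try a larger value
            low = mid + 1
        else:
            high = mid - 1  # too slow: try a smaller value
    return best


# Toy timing model: 0.5 ms per probed cluster (purely illustrative)
nprobe_values = [1, 2, 4, 8, 16, 32, 64]
chosen = largest_param_within_budget(nprobe_values, 10.0, lambda nprobe: 0.5 * nprobe)
print(chosen)
```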

autofaiss.external.optimize.check_if_index_needs_training(index_key)[source]

Function that checks if the index needs to be trained

Return type:

bool

autofaiss.external.optimize.get_min_param_value_for_best_neighbors_coverage(index, parameter_range, hyperparameter_str_from_param, targeted_nb_neighbors_to_query, *, targeted_coverage=0.99, use_gpu=False)[source]

This function returns the minimal value to set in the index hyperparameters so that, on average, the index retrieves at least targeted_coverage (99% by default) of the requested k=targeted_nb_neighbors_to_query nearest neighbors.

[Diagram: nearest neighbors coverage (from 0 to 1) plotted against param_value. Coverage rises with param_value and then plateaus at 1; min_param_value lies between min(parameter_range) and max(parameter_range).]

Parameters:
  • index (faiss.Index) – Index to search on.

  • parameter_range (List[T]) – List of possible values for the hyperparameter. This list is sorted.

  • hyperparameter_str_from_param (Callable[[T], str]) – Function to generate a hyperparameter string from the hyperparameter value on which we do a binary search.

  • targeted_nb_neighbors_to_query (int) – Targeted number of neighbors to query.

  • targeted_coverage (float) – Targeted nearest neighbors coverage. The average ratio of neighbors actually retrieved when asking for k=targeted_nb_neighbors_to_query nearest neighbors.

  • use_gpu (bool) – Whether the index is on the GPU.

Return type:

TypeVar(T, int, float)
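
A minimal sketch of the search described above, assuming coverage never decreases as the parameter grows. The names min_param_for_coverage and toy_coverage are hypothetical, and the coverage model is invented for the example; the real function measures coverage by querying the index.

```python
from typing import Callable, List


def min_param_for_coverage(
    parameter_range: List[int],
    coverage_for_param: Callable[[int], float],
    targeted_coverage: float = 0.99,
) -> int:
    """Smallest value whose coverage reaches the target (assumes non-decreasing coverage)."""
    low, high = 0, len(parameter_range) - 1
    answer = parameter_range[-1]  # fallback when the target is never reached
    while low <= high:
        mid = (low + high) // 2
        if coverage_for_param(parameter_range[mid]) >= targeted_coverage:
            answer = parameter_range[mid]  # good enough: look for a smaller value
            high = mid - 1
        else:
            low = mid + 1
    return answer


def toy_coverage(nprobe: int, k: int = 20) -> float:
    """Toy model: each probed cluster yields 5 candidates, capped at k."""
    return min(1.0, 5 * nprobe / k)


min_nprobe = min_param_for_coverage([1, 2, 4, 8, 16], toy_coverage)
print(min_nprobe)
```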

autofaiss.external.optimize.get_optimal_batch_size(vec_dim, current_memory_available)[source]

Compute the optimal batch size to make full use of the available RAM when adding vectors.

Return type:

int
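
The underlying arithmetic can be sketched as follows, assuming float32 vectors and an arbitrary share of memory reserved for the batch. The helper batch_size_for_memory and its memory_fraction parameter are hypothetical; the heuristic actually used by autofaiss may differ.

```python
def batch_size_for_memory(vec_dim: int, memory_available_bytes: int, memory_fraction: float = 0.5) -> int:
    """Number of float32 vectors that fit in the given share of memory (at least 1)."""
    bytes_per_vector = vec_dim * 4  # float32 components
    return max(1, int(memory_available_bytes * memory_fraction) // bytes_per_vector)


# 128-dimensional vectors with 1 GiB available, half of it used for the batch
print(batch_size_for_memory(vec_dim=128, memory_available_bytes=1024**3))
```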

autofaiss.external.optimize.get_optimal_hyperparameters(index, index_key, max_speed_ms, use_gpu=False, max_timeout_per_iteration_s=1.0, min_ef_search=32, min_nearest_neighbors_to_retrieve=20)[source]

Find the optimal hyperparameters to maximize the recall given a query speed in milliseconds/query

Return type:

str

autofaiss.external.optimize.get_optimal_index_keys_v2(nb_vectors, dim_vector, max_index_memory_usage, flat_threshold=1000, quantization_threshold=10000, force_pq=None, make_direct_map=False, should_be_memory_mappable=False, ivf_flat_threshold=1000000, use_gpu=False)[source]

Gives a list of interesting indices to try, the one at the top is the most promising

See: https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index for detailed explanations.

Return type:

List[str]

autofaiss.external.optimize.get_optimal_ivf(nb_vectors)[source]

Function that returns a list of relevant index_keys to create non-quantized IVF indices.

Parameters:

nb_vectors (int) – Number of vectors in the dataset.

Return type:

List[str]

autofaiss.external.optimize.get_optimal_nb_clusters(nb_vectors)[source]

Returns a list with the recommended number of clusters for an index containing nb_vectors vectors. The first value is the most recommended one.

See: https://github.com/facebookresearch/faiss/wiki/Guidelines-to-choose-an-index

Return type:

List[int]
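
The Faiss guidelines linked above suggest between 4*sqrt(N) and 16*sqrt(N) clusters below one million vectors, and fixed cluster counts for larger datasets. A minimal sketch of that rule follows; suggested_nb_clusters is a hypothetical helper, and the ordering of the candidates is an arbitrary choice of this example rather than the ranking autofaiss returns.

```python
import math
from typing import List


def suggested_nb_clusters(nb_vectors: int) -> List[int]:
    """Candidate cluster counts following the Faiss wiki guidelines (a sketch)."""
    if nb_vectors < 1_000_000:
        root = math.sqrt(nb_vectors)
        # The guideline suggests between 4*sqrt(N) and 16*sqrt(N) clusters
        return [max(1, round(factor * root)) for factor in (16, 8, 4)]
    if nb_vectors < 10_000_000:
        return [65_536]
    if nb_vectors < 100_000_000:
        return [262_144]
    return [1_048_576]


print(suggested_nb_clusters(10_000))
```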

autofaiss.external.optimize.get_optimal_quantization(nb_vectors, dim_vector, force_quantization_value=None, force_max_index_memory_usage=None)[source]

Function that returns a list of relevant index_keys to create quantized indices.

Return type:

List[str]

Parameters:

nb_vectors: int

Number of vectors in the dataset.

dim_vector: int

Dimension of the vectors in the dataset.

force_quantization_value: Optional[int]

Force to use this value as the size of the quantized vectors (PQx). It can be used with the force_max_index_memory_usage parameter, but the result might be empty.

force_max_index_memory_usage: Optional[str]

Add a memory constraint on the index. It can be used with the force_quantization_value parameter, but the result might be empty.

Returns:

index_keys: List[str]

List of index_keys that would be good choices for quantization. The list can be empty if the given constraints are too strong.

autofaiss.external.optimize.get_optimal_train_size(nb_vectors, index_key, current_memory_available, vec_dim)[source]

Function that determines the number of training points necessary to train the index, based on faiss heuristics for k-means clustering.

Return type:

int
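
As an illustration of this kind of heuristic: Faiss k-means uses at most 256 training points per centroid by default, so a natural target is 256 times the number of clusters, capped by the dataset size and by what fits in memory. The sketch below encodes that reasoning; train_size_for_kmeans is a hypothetical helper and the exact rule applied by autofaiss is not confirmed here.

```python
def train_size_for_kmeans(
    nb_vectors: int,
    nb_clusters: int,
    vec_dim: int,
    memory_available_bytes: int,
    points_per_cluster: int = 256,
) -> int:
    """Training points for k-means, capped by dataset size and available memory."""
    wanted = nb_clusters * points_per_cluster
    fits_in_memory = memory_available_bytes // (vec_dim * 4)  # float32 vectors
    return max(1, min(nb_vectors, wanted, fits_in_memory))


# 10 million 128-dimensional vectors, 4096 clusters, 32 GiB of RAM
print(train_size_for_kmeans(10_000_000, 4096, 128, 32 * 1024**3))
```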

autofaiss.external.optimize.index_key_to_nb_cluster(index_key)[source]

Function that takes an index key and returns the number of clusters

Return type:

int
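
For index keys written in the Faiss index factory syntax, the number of clusters is the integer that follows IVF. A minimal parsing sketch is shown below; nb_clusters_from_index_key is a hypothetical helper that returns None when the key has no IVF component, which may differ from how the autofaiss function handles that case.

```python
import re
from typing import Optional


def nb_clusters_from_index_key(index_key: str) -> Optional[int]:
    """Extract K from the 'IVF<K>' component of a Faiss factory string, if any."""
    match = re.search(r"(?:^|,)IVF(\d+)", index_key)
    return int(match.group(1)) if match else None


print(nb_clusters_from_index_key("OPQ16_64,IVF4096_HNSW32,PQ16x8"))
```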

autofaiss.external.optimize.optimize_and_measure_index(embedding_reader, index, index_infos_path, index_key, index_param, index_path, *, max_index_query_time_ms, min_nearest_neighbors_to_retrieve, save_on_disk, use_gpu)[source]

Optimize one index by selecting the best hyperparameters and calculate its metrics

autofaiss.external.quantize module

Main file to create an index from the beginning.

autofaiss.external.quantize.build_index(embeddings, index_path='knn.index', index_infos_path='index_infos.json', ids_path=None, save_on_disk=True, file_format='npy', embedding_column_name='embedding', id_columns=None, index_key=None, index_param=None, max_index_query_time_ms=10.0, max_index_memory_usage='16G', min_nearest_neighbors_to_retrieve=20, current_memory_available='32G', use_gpu=False, metric_type='ip', nb_cores=None, make_direct_map=False, should_be_memory_mappable=False, distributed=None, temporary_indices_folder='hdfs://root/tmp/distributed_autofaiss_indices', verbose=20, nb_indices_to_keep=1)[source]

Reads embeddings and creates a quantized index from them. The index is stored on the current machine at the given output path.

Parameters:
  • embeddings (Union[str, np.ndarray, List[str]]) – Local path containing all preprocessed vectors and cached files. This can be a single directory or a list of directories. Files will be added if empty. Alternatively, the NumPy array of embeddings can be passed directly.

  • index_path (Optional(str)) – Destination path of the quantized model.

  • index_infos_path (Optional(str)) – Destination path of the metadata file.

  • ids_path (Optional(str)) – Only useful when id_columns is not None and file_format=`parquet`. This is the path (in any filesystem) where the files mapping ids to vector indices will be stored in parquet format.

  • save_on_disk (bool) – Whether to save the index on disk, default to True.

  • file_format (Optional(str)) – npy or parquet; default npy

  • embedding_column_name (Optional(str)) – Name of the embeddings column for parquet; default embedding

  • id_columns (Optional(List[str])) – Can only be used when file_format=`parquet`. In this case these are the names of the columns containing the ids of the vectors, and separate files will be generated to map these ids to indices in the KNN index; default None

  • index_key (Optional(str)) – Optional string to give to the index factory in order to create the index. If None, an index is chosen based on a heuristic.

  • index_param (Optional(str)) – Optional string with hyperparameters to set on the index. If None, the hyperparameters are chosen based on a heuristic.

  • max_index_query_time_ms (float) – Bound on the query time for KNN search. This bound is approximate.

  • max_index_memory_usage (str) – Maximum size allowed for the index. This bound is strict.

  • min_nearest_neighbors_to_retrieve (int) – Minimum number of nearest neighbors to retrieve when querying the index. This parameter is used only during the index hyperparameter fine-tuning step; it is not taken into account when selecting the indexing algorithm. It takes priority over the max_index_query_time_ms constraint.

  • current_memory_available (str) – Memory available on the machine creating the index. Having more memory helps because it reduces swapping between RAM and disk.

  • use_gpu (bool) – Experimental. GPU training is faster, but it has not been tested so far.

  • metric_type (str) –

    Similarity function used for query:
    • “ip” for inner product

    • “l2” for Euclidean distance

  • nb_cores (Optional[int]) – Number of cores to use. Will try to guess the right number if not provided

  • make_direct_map (bool) – Create a direct map allowing reconstruction of embeddings. This is only needed for IVF indices. Note that this might increase the RAM usage (approximately 8GB for 1 billion embeddings).

  • should_be_memory_mappable (bool) – If set to true, the created index will be selected only among the indices that can be memory-mapped on disk. This makes it possible to use 50GB indices on a machine with only 1GB of RAM. Defaults to False.

  • distributed (Optional[str]) – If “pyspark”, create the indices using pyspark. Only “parquet” file format is supported.

  • temporary_indices_folder (str) – Folder to save the temporary small indices that are generated by each spark executor. Only used when distributed = “pyspark”.

  • verbose (int) – set verbosity of outputs via logging level, default is logging.INFO

  • nb_indices_to_keep (int) –

    Number of indices to keep at most when distributed is “pyspark”. It allows you to build an index larger than current_memory_available. If it is not equal to 1:

    • You are expected to have at most nb_indices_to_keep indices with the following names:

      “{index_path}i” where i ranges from 1 to nb_indices_to_keep

    • build_index returns a mapping from index path to metrics

    Defaults to 1.

Return type:

Tuple[Optional[Index], Optional[Dict[str, str]]]
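
A minimal usage sketch based on the signature documented above. The memory limits, vector count and dimension are arbitrary values chosen for illustration, and build_demo_index is a hypothetical wrapper; the imports are placed inside the function so that the snippet can be loaded even where numpy and autofaiss are not installed.

```python
# Keyword arguments taken from the build_index signature documented above;
# the memory values are illustrative assumptions
BUILD_INDEX_KWARGS = {
    "index_path": "knn.index",
    "index_infos_path": "index_infos.json",
    "max_index_memory_usage": "4G",
    "current_memory_available": "8G",
    "metric_type": "ip",
    "save_on_disk": True,
}


def build_demo_index(nb_vectors: int = 1000, dim: int = 64):
    """Build an index over random float32 vectors; requires numpy and autofaiss."""
    import numpy as np
    from autofaiss import build_index

    embeddings = np.random.rand(nb_vectors, dim).astype("float32")
    # With the default nb_indices_to_keep=1, an (index, index_infos) pair is returned
    index, index_infos = build_index(embeddings, **BUILD_INDEX_KWARGS)
    return index, index_infos
```

Calling build_demo_index() writes knn.index and index_infos.json to the current directory, since save_on_disk is set to True.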

autofaiss.external.quantize.build_partitioned_indexes(partitions, output_root_dir, embedding_column_name='embedding', index_key=None, index_path=None, id_columns=None, max_index_query_time_ms=10.0, max_index_memory_usage='16G', min_nearest_neighbors_to_retrieve=20, current_memory_available='32G', use_gpu=False, metric_type='ip', nb_cores=None, make_direct_map=False, should_be_memory_mappable=False, temp_root_dir='hdfs://root/tmp/distributed_autofaiss_indices', verbose=20, nb_splits_per_big_index=1, big_index_threshold=5000000, maximum_nb_threads=256)[source]

Create partitioned indexes from a partitioned parquet dataset, i.e. create one index per parquet partition

Only supported with PySpark. A PySpark session must be active before calling this function

Parameters:
  • partitions (List[str]) – List of partitions containing embeddings

  • output_root_dir (str) – Output root directory where indexes, metrics and ids will be written

  • embedding_column_name (str) – Parquet dataset column name containing embeddings

  • index_key (Optional(str)) – Optional string to give to the index factory in order to create the index. If None, an index is chosen based on a heuristic.

  • index_path (Optional(str)) – Optional path to an index to which embeddings will be added. This index must be pre-trained if it requires training.

  • id_columns (Optional(List[str])) – Parquet dataset column name(s) that are used as IDs for embeddings. A mapping from these IDs to faiss indices will be written in separate files.

  • max_index_query_time_ms (float) – Bound on the query time for KNN search. This bound is approximate.

  • max_index_memory_usage (str) – Maximum size allowed for the index. This bound is strict.

  • min_nearest_neighbors_to_retrieve (int) – Minimum number of nearest neighbors to retrieve when querying the index. This parameter is used only during the index hyperparameter fine-tuning step; it is not taken into account when selecting the indexing algorithm. It takes priority over the max_index_query_time_ms constraint.

  • current_memory_available (str) – Memory available on the machine creating the index. Having more memory helps because it reduces swapping between RAM and disk.

  • use_gpu (bool) – Experimental. GPU training is faster, but it has not been tested so far.

  • metric_type (str) –

    Similarity function used for query:
    • “ip” for inner product

    • “l2” for Euclidean distance

  • nb_cores (Optional[int]) – Number of cores to use. Will try to guess the right number if not provided

  • make_direct_map (bool) – Create a direct map allowing reconstruction of embeddings. This is only needed for IVF indices. Note that this might increase the RAM usage (approximately 8GB for 1 billion embeddings).

  • should_be_memory_mappable (bool) – If set to true, the created index will be selected only among the indices that can be memory-mapped on disk. This makes it possible to use 50GB indices on a machine with only 1GB of RAM. Defaults to False.

  • temp_root_dir (str) – Temporary directory that will be used to store intermediate results/computation

  • verbose (int) – set verbosity of outputs via logging level, default is logging.INFO

  • nb_splits_per_big_index (int) – Number of indices to split a big index into. This allows you to build indices bigger than current_memory_available.

  • big_index_threshold (int) – Threshold used to define big indexes. Indexes with more than big_index_threshold embeddings are considered big indexes.

  • maximum_nb_threads (int) – Maximum number of threads to parallelize index creation

Return type:

List[Optional[Dict[str, str]]]
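
The interaction between big_index_threshold and nb_splits_per_big_index, as described in the parameter list above, can be sketched as a simple planning step. The helper plan_splits and the partition names are hypothetical; this only illustrates the documented rule and says nothing about how autofaiss schedules the work.

```python
from typing import Dict


def plan_splits(
    partition_sizes: Dict[str, int],
    big_index_threshold: int = 5_000_000,
    nb_splits_per_big_index: int = 1,
) -> Dict[str, int]:
    """Map each partition to the number of indices it would be built as."""
    return {
        # Partitions above the threshold count as big and are split
        name: nb_splits_per_big_index if size > big_index_threshold else 1
        for name, size in partition_sizes.items()
    }


plan = plan_splits(
    {"country=fr": 1_200_000, "country=us": 42_000_000},
    nb_splits_per_big_index=4,
)
print(plan)
```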

autofaiss.external.quantize.check_not_null_not_empty(name, value)[source]

autofaiss.external.quantize.main()[source]

Main entry point

autofaiss.external.quantize.score_index(index_path, embeddings, save_on_disk=True, output_index_info_path='infos.json', current_memory_available='32G', verbose=20)[source]

Compute metrics on a given index. The ground truth is cached so that subsequent scoring runs are faster.

Parameters:
  • index_path (Union[str, faiss.Index]) – Path to the .index file, or an in-memory index.

  • embeddings (Union[str, np.ndarray]) – Path containing all preprocessed vectors and cached files. Can also be an in memory array.

  • save_on_disk (bool) – Whether to save on disk

  • output_index_info_path (str) – Path to index infos .json

  • current_memory_available (str) – Memory available on the current machine. Having more memory helps because it reduces swapping between RAM and disk.

  • verbose (int) – set verbosity of outputs via logging level, default is logging.INFO

Returns:

metric_infos – Metric infos of the index.

Return type:

Optional[Dict[str, Union[str, float, int]]]

autofaiss.external.quantize.setup_logging(logging_level)[source]

Setup the logging.

autofaiss.external.quantize.tune_index(index_path, index_key, index_param=None, output_index_path='tuned_knn.index', save_on_disk=True, min_nearest_neighbors_to_retrieve=20, max_index_query_time_ms=10.0, use_gpu=False, verbose=20)[source]

Set hyperparameters to the given index.

If an index_param is given, set these hyperparameters on the index; otherwise, apply a greedy heuristic to make the best of the max_index_query_time_ms constraint.

Parameters:
  • index_path (Union[str, faiss.Index]) – Path to the .index file. Can also be an in-memory index.

  • index_key (str) – String to give to the index factory in order to create the index.

  • index_param (Optional(str)) – Optional string with hyperparameters to set on the index. If None, the hyperparameters are chosen based on a heuristic.

  • output_index_path (str) – Path to the newly created .index file

  • save_on_disk (bool) – Whether to save the index on disk, default to True.

  • min_nearest_neighbors_to_retrieve (int) – Minimum number of nearest neighbors to retrieve when querying the index.

  • max_index_query_time_ms (float) – Query speed constraint for the index to create.

  • use_gpu (bool) – Experimental. GPU training is faster, but it has not been tested so far.

  • verbose (int) – set verbosity of outputs via logging level, default is logging.INFO

Returns:

The faiss index

Return type:

index

autofaiss.external.scores module

Functions to compute metrics on an index

autofaiss.external.scores.compute_fast_metrics(embedding_reader, index, omp_threads=None, query_max=1000)[source]

Compute the query speed, size and reconstruction error of an index

Return type:

Dict

autofaiss.external.scores.compute_medium_metrics(embedding_reader, index, memory_available, ground_truth=None, eval_item_ids=None)[source]

Compute recall@R and intersection recall@R of an index

Return type:

Dict[str, float]
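
For readers unfamiliar with these two metrics, the sketch below uses their common definitions: recall@R counts the queries whose true nearest neighbor appears in the first R results, and intersection recall@R averages the overlap between the first R true neighbors and the first R results. Both function names are hypothetical, and whether autofaiss computes exactly these quantities is an assumption.

```python
from typing import Sequence


def one_recall_at_r(ground_truth: Sequence[Sequence[int]], results: Sequence[Sequence[int]], r: int) -> float:
    """Share of queries whose true nearest neighbor is among the first r results."""
    hits = sum(1 for truth, found in zip(ground_truth, results) if truth[0] in found[:r])
    return hits / len(ground_truth)


def intersection_recall_at_r(ground_truth: Sequence[Sequence[int]], results: Sequence[Sequence[int]], r: int) -> float:
    """Average overlap between the first r true neighbors and the first r results."""
    overlaps = [
        len(set(truth[:r]) & set(found[:r])) / r
        for truth, found in zip(ground_truth, results)
    ]
    return sum(overlaps) / len(overlaps)


# Two queries: ids of the true neighbors and ids returned by an approximate index
truth_ids = [[1, 2, 3], [4, 5, 6]]
found_ids = [[2, 1, 9], [7, 5, 4]]
print(one_recall_at_r(truth_ids, found_ids, r=2))
print(intersection_recall_at_r(truth_ids, found_ids, r=2))
```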

autofaiss.external.scores.get_ground_truth(faiss_metric_type, embedding_reader, query_embeddings, memory_available)[source]

Compute the ground truth (the result a perfect index would return) of the query on the embeddings
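
A ground truth of this kind can be obtained by exhaustive search. The sketch below does so in pure Python for small inputs, supporting the “ip” and “l2” metrics named in this documentation. The function brute_force_neighbors is hypothetical and only illustrates the idea; the library function works with an embedding reader and a memory budget instead.

```python
from typing import List, Sequence


def brute_force_neighbors(
    embeddings: Sequence[Sequence[float]],
    queries: Sequence[Sequence[float]],
    k: int,
    metric_type: str = "ip",
) -> List[List[int]]:
    """Return, for each query, the ids of its k exact nearest neighbors."""

    def score(a: Sequence[float], b: Sequence[float]) -> float:
        # Higher is better for both metrics, so the l2 distance is negated
        if metric_type == "ip":
            return sum(x * y for x, y in zip(a, b))
        if metric_type == "l2":
            return -sum((x - y) ** 2 for x, y in zip(a, b))
        raise ValueError(f"Unsupported metric_type: {metric_type}")

    results = []
    for query in queries:
        ranked = sorted(range(len(embeddings)), key=lambda i: score(query, embeddings[i]), reverse=True)
        results.append(ranked[:k])
    return results


vectors = [[3.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(brute_force_neighbors(vectors, [[1.0, 0.0]], k=2, metric_type="ip"))
print(brute_force_neighbors(vectors, [[1.0, 0.0]], k=2, metric_type="l2"))
```

The two metrics rank the same vectors differently here: inner product favors the longer vector, while Euclidean distance favors the vector identical to the query.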

Module contents