autofaiss.indices package

Submodules

autofaiss.indices.build module

Common functions to build an index

autofaiss.indices.build.add_embeddings_to_index_local(embedding_reader, trained_index_or_path, memory_available_for_adding, embedding_ids_df_handler=None, index_optimizer=None, add_embeddings_with_ids=False)[source]

Add embeddings to index from driver

Return type:: Tuple[Optional[Index], Optional[Dict[str, str]]]

autofaiss.indices.build.get_optimize_index_fn(embedding_reader, index_key, index_path, index_infos_path, use_gpu, save_on_disk, max_index_query_time_ms, min_nearest_neighbors_to_retrieve, make_direct_map, index_param)[source]

Create function to optimize index by choosing best hyperparameters and calculating metrics

Return type:: Callable[[Index, str], Dict]

autofaiss.indices.build.get_write_ids_df_to_parquet_fn(ids_root_dir)[source]

Create function to write ids from Pandas dataframe to parquet

Return type:: Callable[[DataFrame, int], None]

autofaiss.indices.distributed module

Building the index with pyspark.

autofaiss.indices.distributed.add_embeddings_to_index_distributed(trained_index_or_path, embedding_reader, memory_available_for_adding, nb_cores=None, temporary_indices_folder='hdfs://root/tmp/distributed_autofaiss_indices', embedding_ids_df_handler=None, nb_indices_to_keep=1, index_optimizer=None)[source]

Create indices with pyspark.

Parameters:

trained_index_or_path (trained faiss.Index or path to a trained faiss index) – Trained faiss index
embedding_reader (EmbeddingReader) – Embedding reader.
memory_available_for_adding (str) – Memory available for adding embeddings.
nb_cores (int) – Number of CPU cores per executor
temporary_indices_folder (str) – Folder to save the temporary small indices
embedding_ids_df_handler (Optional[Callable[[pd.DataFrame, int], Any]]) – The function that handles the embedding ids when id_columns is given
nb_indices_to_keep (int) – Number of indices to keep at most after the merging step
index_optimizer (Optional[Callable]) – The function that optimizes the index

Return type:

Tuple[Optional[Index], Optional[Dict[str, str]]]
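Several parameters in this module (memory_available_for_adding, max_index_memory_usage, current_memory_available) take a memory budget as a string such as '16G' or '32G'. The sketch below shows one way such a string could be turned into a byte count. The helper name parse_memory and the choice of binary units (1G = 1024**3 bytes) are assumptions made for illustration, not the library's own parsing code.

```python
import re

# Assumption: binary units, so "1G" means 1024**3 bytes.
_UNIT_FACTORS = {"B": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_memory(text: str) -> int:
    """Hypothetical helper: convert '16G', '512MB' or '1.5K' into bytes."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([BKMGT])B?\s*", text.upper())
    if match is None:
        raise ValueError(f"Unrecognised memory string: {text!r}")
    value, unit = match.groups()
    return int(float(value) * _UNIT_FACTORS[unit])


budget_in_bytes = parse_memory("16G")
```

A string without a unit is rejected rather than guessed at, which keeps a mistyped budget from being silently read as a few bytes.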

autofaiss.indices.distributed.create_big_index(embedding_root_dirs, output_root_dir, ss, id_columns=None, should_be_memory_mappable=False, max_index_query_time_ms=10.0, max_index_memory_usage='16G', min_nearest_neighbors_to_retrieve=20, embedding_column_name='embedding', index_key=None, index_path=None, current_memory_available='32G', nb_cores=None, use_gpu=False, metric_type='ip', nb_splits_per_big_index=1, make_direct_map=False, temp_root_dir='hdfs://root/tmp/distributed_autofaiss_indices')[source]

Create a big index

Return type:: Optional[Dict[str, str]]

autofaiss.indices.distributed.create_partitioned_indexes(partitions, big_index_threshold, output_root_dir, nb_cores, nb_splits_per_big_index, id_columns=None, max_index_query_time_ms=10.0, min_nearest_neighbors_to_retrieve=20, embedding_column_name='embedding', index_key=None, index_path=None, max_index_memory_usage='16G', current_memory_available='32G', use_gpu=False, metric_type='ip', make_direct_map=False, should_be_memory_mappable=False, temp_root_dir='hdfs://root/tmp/distributed_autofaiss_indices', maximum_nb_threads=256)[source]

Create partitioned indexes from a list of parquet partitions, i.e. create and train one index per parquet partition

Return type:: List[Optional[Dict[str, str]]]

autofaiss.indices.distributed.create_small_index(embedding_root_dirs, output_root_dir, id_columns=None, should_be_memory_mappable=False, max_index_query_time_ms=10.0, max_index_memory_usage='16G', min_nearest_neighbors_to_retrieve=20, embedding_column_name='embedding', index_key=None, index_path=None, current_memory_available='32G', use_gpu=False, metric_type='ip', nb_cores=None, make_direct_map=False)[source]

Create a small index

Return type:: Tuple[Optional[Index], Optional[Dict[str, str]]]

autofaiss.indices.faiss_index_wrapper module

This file contains a wrapper class to create Faiss-like indices

class autofaiss.indices.faiss_index_wrapper.FaissIndexWrapper(d, metric_type)[source]

Bases: ABC

This abstract class describes a Faiss-like index. Wrapping an index in it makes the benchmarking functions written for faiss in this library usable with that index.

abstract add(x)[source]

Function that adds vectors to the index

Parameters:: x (2D numpy.array of floats) – Batch of vectors of shape (batch_size, vector_dim)

abstract search(x, k)[source]

Function that searches the k nearest neighbours of a batch of vectors

Parameters:

x (2D numpy.array of floats) – Batch of vectors of shape (batch_size, vector_dim)
k (int) – Number of neighbours to retrieve for every vector

Returns:

D (2D numpy.array of floats) – Distances numpy array of shape (batch_size, k). Contains the distances computed by the index of the k nearest neighbours.
I (2D numpy.array of ints) – Labels numpy array of shape (batch_size, k). Contains the vectors’ labels of the k nearest neighbours.
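As an illustration of the add/search contract described above, here is a minimal sketch of a brute-force index with the same two methods and the same (D, I) return shapes. The class names FaissLikeIndex and BruteForceIndex are hypothetical stand-ins, metric_type is assumed to be the string "ip", and only inner product is handled; the real wrapper may represent the metric differently.

```python
from abc import ABC, abstractmethod

import numpy as np


class FaissLikeIndex(ABC):
    """Hypothetical stand-in for the abstract interface described above."""

    def __init__(self, d, metric_type):
        self.d = d
        self.metric_type = metric_type

    @abstractmethod
    def add(self, x):
        """Add a (batch_size, vector_dim) array of vectors."""

    @abstractmethod
    def search(self, x, k):
        """Return (D, I), each of shape (batch_size, k)."""


class BruteForceIndex(FaissLikeIndex):
    """Exact inner-product search over vectors held in memory."""

    def __init__(self, d, metric_type="ip"):
        super().__init__(d, metric_type)
        self.vectors = np.empty((0, d), dtype=np.float32)

    def add(self, x):
        x = np.asarray(x, dtype=np.float32)
        self.vectors = np.vstack([self.vectors, x])

    def search(self, x, k):
        x = np.asarray(x, dtype=np.float32)
        scores = x @ self.vectors.T                   # (batch_size, nb_vectors)
        I = np.argsort(-scores, axis=1)[:, :k]        # highest scores first
        D = np.take_along_axis(scores, I, axis=1)
        return D, I


index = BruteForceIndex(d=2)
index.add(np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))
D, I = index.search(np.array([[1.0, 0.2]]), k=2)
```

Labels here are simply insertion positions, which matches the usual behaviour of a flat index without explicit ids.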

autofaiss.indices.index_factory module

Functions that fix the faiss index_factory function

autofaiss.indices.index_factory.index_factory(d, index_key, metric_type, ef_construction=None)[source]: custom index_factory that fixes some issues of faiss.index_factory with inner product metrics.

autofaiss.indices.index_utils module

useful functions to apply on an index

autofaiss.indices.index_utils.format_speed_ms_per_query(speed)[source]

format the speed (ms/query) into a nice string

Return type:: str

autofaiss.indices.index_utils.get_bytes_from_index(index)[source]

Transforms a faiss index into a bytearray.

Return type:: bytearray

autofaiss.indices.index_utils.get_index_from_bytes(index_bytes)[source]

Transforms a bytearray containing a faiss index into the corresponding object.

Return type:: Index

autofaiss.indices.index_utils.get_index_size(index)[source]

Returns the size in RAM of a given index

Return type:: int

autofaiss.indices.index_utils.initialize_direct_map(index)[source]

Return type:: None

autofaiss.indices.index_utils.load_index(index_src_path, index_dst_path)[source]

Return type:: Index

autofaiss.indices.index_utils.parallel_download_indices_from_remote(fs, indices_file_paths, dst_folder)[source]: Download small indices in parallel.

autofaiss.indices.index_utils.quantize_vec_without_modifying_index(index, vecs)[source]

Quantize a batch of vectors

Return type:: ndarray

autofaiss.indices.index_utils.save_index(index, root_dir, index_filename)[source]

Save index

Return type:: str

autofaiss.indices.index_utils.search_speed_test(index, query=None, ksearch=40, timout_s=10.0)[source]

Return the average and 99th-percentile search speed

Return type:: Dict[str, float]
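To make the idea of an average and a 99th-percentile speed concrete, the sketch below times queries one at a time until the queries run out or a timeout elapses. The function name measure_search_speed, the search_one callable and the dictionary keys are assumptions for illustration; they are not taken from the library, and the percentile uses the simple nearest-rank rule.

```python
import math
import time
from typing import Callable, Dict, Sequence


def measure_search_speed(
    search_one: Callable[[object], object],
    queries: Sequence[object],
    timeout_s: float = 10.0,
) -> Dict[str, float]:
    """Hypothetical sketch: time single queries and summarise in milliseconds."""
    if not queries:
        raise ValueError("at least one query is required")

    timings_ms = []
    start = time.perf_counter()
    for query in queries:
        t0 = time.perf_counter()
        search_one(query)
        timings_ms.append((time.perf_counter() - t0) * 1000.0)
        if time.perf_counter() - start > timeout_s:
            break  # stop early, keeping the timings gathered so far

    timings_ms.sort()
    # Nearest-rank 99th percentile, without interpolation.
    rank = max(0, math.ceil(0.99 * len(timings_ms)) - 1)
    return {
        "avg_ms": sum(timings_ms) / len(timings_ms),
        "p99_ms": timings_ms[rank],
    }


result = measure_search_speed(lambda q: sum(range(1000)), list(range(50)))
```

With a real index, search_one would wrap a single-vector call to the index's search method, so that batching does not hide per-query latency.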

autofaiss.indices.index_utils.set_search_hyperparameters(index, param_str, use_gpu=False)[source]

set hyperparameters to an index

Return type:: None

autofaiss.indices.index_utils.speed_test_ms_per_query(index, query=None, ksearch=40, timout_s=5.0)[source]

Evaluate the average speed in milliseconds of the index without using batch

Return type:: float

autofaiss.indices.memory_efficient_flat_index module

This file contains a class describing a memory efficient flat index

class autofaiss.indices.memory_efficient_flat_index.MemEfficientFlatIndex(d, metric_type)[source]

Bases: FaissIndexWrapper

Faiss-like Flat index that can support any number of vectors without memory issues. Two search functions are available: one uses batches of smaller faiss flat indices, the other relies fully on numpy.

add(x)[source]

Function that adds vectors to the index

Parameters:: x (2D numpy.array of floats) – Batch of vectors of shape (batch_size, vector_dim)

add_all(filename, nb_items)[source]

Function that adds vectors to the index from a memory-mapped array

Parameters:

filename (string) – path of the 2D numpy array of shape (nb_items, vector_dim) on the disk
nb_items (int) – number of vectors in the 2D array (the dim is already known)
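The nb_items argument is needed because a raw array file stores no shape information. The sketch below writes a small array to disk and maps it back with numpy.memmap. The float32 dtype, the raw on-disk layout and the file name are assumptions; the library's own file format may differ.

```python
import os
import tempfile

import numpy as np

nb_items, vector_dim = 1000, 8
vectors = np.random.default_rng(0).random((nb_items, vector_dim), dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name; the raw dump contains values only, no header.
    filename = os.path.join(tmp_dir, "embeddings.bin")
    vectors.tofile(filename)

    # The shape must be supplied by the caller: (nb_items, vector_dim).
    mapped = np.memmap(filename, dtype=np.float32, mode="r",
                       shape=(nb_items, vector_dim))

    # Slicing reads only the requested rows from disk.
    first_batch = np.array(mapped[:100])

    del mapped  # release the mapping before the directory is removed
```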

add_files(embedding_reader)[source]

delete_vectors()[source]: delete the vectors of the index

search(x, k, batch_size=4000000)[source]

Function that searches the k nearest neighbours of a batch of vectors

Parameters:

x (2D numpy.array of floats) – Batch of vectors of shape (batch_size, vector_dim)
k (int) – Number of neighbours to retrieve for every vector
batch_size (int) – Size of the batch of vectors that are explored. A bigger value is preferred to avoid multiple loadings of the vectors from the disk.

Returns:

D (2D numpy.array of floats) – Distances numpy array of shape (batch_size, k). Contains the distances computed by the index of the k nearest neighbours.
I (2D numpy.array of ints) – Labels numpy array of shape (batch_size, k). Contains the vectors’ labels of the k nearest neighbours.

search_files(x, k, batch_size)[source]

search_numpy(xq, k, batch_size=4000000)[source]

Function that searches the k nearest neighbours of a batch of vectors. This implementation is based on vectorized numpy functions and is slower than the search function based on batches of faiss flat indices. We keep it because new functions can be built on this code. Moreover, the distance computation is more precise in numpy than in the faiss implementation, which optimizes speed over precision.

Parameters:

xq (2D numpy.array of floats) – Batch of vectors of shape (batch_size, vector_dim)
k (int) – Number of neighbours to retrieve for every vector
batch_size (int) – Size of the batch of vectors that are explored. A bigger value is preferred to avoid multiple loadings of the vectors from the disk.

Returns:

D (2D numpy.array of floats) – Distances numpy array of shape (batch_size, k). Contains the distances computed by the index of the k nearest neighbours.
I (2D numpy.array of ints) – Labels numpy array of shape (batch_size, k). Contains the vectors’ labels of the k nearest neighbours.
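The batch_size parameter suggests a scan in which the indexed vectors are visited one batch at a time while a running top-k is kept per query. A minimal numpy sketch of that pattern follows, assuming inner-product similarity. The function name batched_inner_product_search is hypothetical, and this is not the library's implementation.

```python
import numpy as np


def batched_inner_product_search(vectors, xq, k, batch_size):
    """Exact top-k by inner product, scanning `vectors` one batch at a time."""
    nb_queries = xq.shape[0]
    # Running best results; -inf and -1 mark slots that are not filled yet.
    best_D = np.full((nb_queries, k), -np.inf, dtype=np.float32)
    best_I = np.full((nb_queries, k), -1, dtype=np.int64)

    for start in range(0, vectors.shape[0], batch_size):
        # Only this slice needs to be in memory (useful with a memmap).
        batch = np.asarray(vectors[start:start + batch_size], dtype=np.float32)
        scores = xq @ batch.T                          # (nb_queries, len(batch))
        labels = np.broadcast_to(
            np.arange(start, start + batch.shape[0]), scores.shape
        )

        # Merge the previous best with the new candidates, keep the top k.
        all_D = np.concatenate([best_D, scores], axis=1)
        all_I = np.concatenate([best_I, labels], axis=1)
        order = np.argsort(-all_D, axis=1)[:, :k]
        best_D = np.take_along_axis(all_D, order, axis=1)
        best_I = np.take_along_axis(all_I, order, axis=1)

    return best_D, best_I


rng = np.random.default_rng(0)
vectors = rng.standard_normal((500, 16)).astype(np.float32)
xq = rng.standard_normal((3, 16)).astype(np.float32)
D, I = batched_inner_product_search(vectors, xq, k=5, batch_size=128)
```

Peak memory is governed by batch_size rather than by the total number of vectors, at the cost of one merge step per batch.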

autofaiss.indices.search module

Functions related to search on indices

autofaiss.indices.search.knn_query(index, query, ksearch)[source]

Do a knn search and return the closest items with their associated distances

Return type:: Iterable[Tuple[Tuple[int, int], float]]
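The return type pairs a two-integer key with a distance. One plausible reading, assumed here and not confirmed by this page, is that the key holds the query position and the neighbour label. The sketch below flattens a faiss-style (D, I) result into that shape; the function name flatten_knn_results is hypothetical.

```python
import numpy as np


def flatten_knn_results(distances, labels):
    """Yield ((query_position, neighbour_label), distance) for every hit."""
    for query_position, (row_d, row_i) in enumerate(zip(distances, labels)):
        for distance, label in zip(row_d, row_i):
            if label == -1:
                continue  # faiss pads missing results with label -1
            yield (query_position, int(label)), float(distance)


D = np.array([[0.9, 0.5], [0.8, 0.1]], dtype=np.float32)
I = np.array([[3, 7], [2, -1]])
results = list(flatten_knn_results(D, I))
```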

autofaiss.indices.training module

Index training

class autofaiss.indices.training.TrainedIndex(index_or_path, index_key, embedding_reader_or_path)[source]

Bases: NamedTuple

embedding_reader_or_path: Union[EmbeddingReader, str, List[str]]: Alias for field number 2

index_key: str: Alias for field number 1

index_or_path: Union[Index, str]: Alias for field number 0
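Because TrainedIndex is a NamedTuple, its fields can be read by name, by position, or by unpacking, in the order given in the constructor signature. The sketch below mirrors that layout with a stand-in class. TrainedIndexSketch, the example paths and the index key are illustrative assumptions, and Any replaces the faiss Index and EmbeddingReader types so the block runs without those packages.

```python
from typing import Any, List, NamedTuple, Union


class TrainedIndexSketch(NamedTuple):
    """Stand-in with the same field order as the documented tuple."""

    index_or_path: Union[Any, str]                         # field number 0
    index_key: str                                         # field number 1
    embedding_reader_or_path: Union[Any, str, List[str]]   # field number 2


# Hypothetical values, for illustration only.
trained = TrainedIndexSketch("/tmp/knn.index", "IVF256,Flat", ["/data/embeddings"])

# Access by name, by position, or by unpacking.
key_by_name = trained.index_key
key_by_position = trained[1]
index_or_path, index_key, embedding_source = trained
```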

autofaiss.indices.training.create_and_train_index_from_embedding_dir(embedding_root_dirs, embedding_column_name, max_index_memory_usage, make_direct_map, should_be_memory_mappable, current_memory_available, use_gpu=False, index_key=None, id_columns=None, metric_type='ip', nb_cores=None)[source]

Create and train index from embedding directory

Return type:: TrainedIndex

autofaiss.indices.training.create_and_train_new_index(embedding_reader, index_key, metadata, metric_type, current_memory_available, use_gpu=False)[source]

Create and train new index

Return type:: Index

autofaiss.indices.training.create_empty_index(vec_dim, index_key, metric_type)[source]

Create empty index

Return type:: Index
