autofaiss.indices package

Submodules

autofaiss.indices.build module

Common functions to build an index

autofaiss.indices.build.add_embeddings_to_index_local(embedding_reader, trained_index_or_path, memory_available_for_adding, embedding_ids_df_handler=None, index_optimizer=None, add_embeddings_with_ids=False)[source]

Add embeddings to index from driver

Return type:

Tuple[Optional[Index], Optional[Dict[str, str]]]

autofaiss.indices.build.get_optimize_index_fn(embedding_reader, index_key, index_path, index_infos_path, use_gpu, save_on_disk, max_index_query_time_ms, min_nearest_neighbors_to_retrieve, make_direct_map, index_param)[source]

Create function to optimize index by choosing best hyperparameters and calculating metrics

Return type:

Callable[[Index, str], Dict]

autofaiss.indices.build.get_write_ids_df_to_parquet_fn(ids_root_dir)[source]

Create function to write ids from Pandas dataframe to parquet

Return type:

Callable[[DataFrame, int], None]

autofaiss.indices.distributed module

Building the index with pyspark.

autofaiss.indices.distributed.add_embeddings_to_index_distributed(trained_index_or_path, embedding_reader, memory_available_for_adding, nb_cores=None, temporary_indices_folder='hdfs://root/tmp/distributed_autofaiss_indices', embedding_ids_df_handler=None, nb_indices_to_keep=1, index_optimizer=None)[source]

Create indices using pyspark.

Parameters:
  • trained_index_or_path (trained faiss.Index or path to a trained faiss index) – Trained faiss index

  • embedding_reader (EmbeddingReader) – Embedding reader.

  • memory_available_for_adding (str) – Memory available for adding embeddings.

  • nb_cores (int) – Number of CPU cores per executor

  • temporary_indices_folder (str) – Folder to save the temporary small indices

  • embedding_ids_df_handler (Optional[Callable[[pd.DataFrame, int], Any]]) – The function that handles the embedding IDs when id_columns is given

  • nb_indices_to_keep (int) – Number of indices to keep at most after the merging step

  • index_optimizer (Optional[Callable]) – The function that optimizes the index

Return type:

Tuple[Optional[Index], Optional[Dict[str, str]]]

autofaiss.indices.distributed.create_big_index(embedding_root_dirs, output_root_dir, ss, id_columns=None, should_be_memory_mappable=False, max_index_query_time_ms=10.0, max_index_memory_usage='16G', min_nearest_neighbors_to_retrieve=20, embedding_column_name='embedding', index_key=None, index_path=None, current_memory_available='32G', nb_cores=None, use_gpu=False, metric_type='ip', nb_splits_per_big_index=1, make_direct_map=False, temp_root_dir='hdfs://root/tmp/distributed_autofaiss_indices')[source]

Create a big index

Return type:

Optional[Dict[str, str]]

autofaiss.indices.distributed.create_partitioned_indexes(partitions, big_index_threshold, output_root_dir, nb_cores, nb_splits_per_big_index, id_columns=None, max_index_query_time_ms=10.0, min_nearest_neighbors_to_retrieve=20, embedding_column_name='embedding', index_key=None, index_path=None, max_index_memory_usage='16G', current_memory_available='32G', use_gpu=False, metric_type='ip', make_direct_map=False, should_be_memory_mappable=False, temp_root_dir='hdfs://root/tmp/distributed_autofaiss_indices', maximum_nb_threads=256)[source]

Create partitioned indexes from a list of parquet partitions, i.e. create and train one index per parquet partition

Return type:

List[Optional[Dict[str, str]]]

autofaiss.indices.distributed.create_small_index(embedding_root_dirs, output_root_dir, id_columns=None, should_be_memory_mappable=False, max_index_query_time_ms=10.0, max_index_memory_usage='16G', min_nearest_neighbors_to_retrieve=20, embedding_column_name='embedding', index_key=None, index_path=None, current_memory_available='32G', use_gpu=False, metric_type='ip', nb_cores=None, make_direct_map=False)[source]

Create a small index

Return type:

Tuple[Optional[Index], Optional[Dict[str, str]]]

autofaiss.indices.faiss_index_wrapper module

This file contains a wrapper class to create Faiss-like indices

class autofaiss.indices.faiss_index_wrapper.FaissIndexWrapper(d, metric_type)[source]

Bases: ABC

This abstract class describes a Faiss-like index. Wrapping an index in this class makes it usable with the benchmarking functions written for faiss in this library.

abstract add(x)[source]

Function that adds vectors to the index

Parameters:

x (2D numpy.array of floats) – Batch of vectors of shape (batch_size, vector_dim)

abstract search(x, k)[source]

Function that searches for the k nearest neighbours of a batch of vectors

Parameters:
  • x (2D numpy.array of floats) – Batch of vectors of shape (batch_size, vector_dim)

  • k (int) – Number of neighbours to retrieve for every vector

Returns:

  • D (2D numpy.array of floats) – Distances numpy array of shape (batch_size, k). Contains the distances computed by the index of the k nearest neighbours.

  • I (2D numpy.array of ints) – Labels numpy array of shape (batch_size, k). Contains the vectors’ labels of the k nearest neighbours.
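
A minimal sketch of the interface described above, assuming nothing beyond the documented add(x) and search(x, k) signatures. FaissLikeIndex and BruteForceIndex are hypothetical stand-ins written for illustration, not classes from autofaiss, and the "ip"/"l2" metric strings are an assumed convention.

```python
from abc import ABC, abstractmethod

import numpy as np


class FaissLikeIndex(ABC):
    """Hypothetical stand-in for a Faiss-like index interface."""

    def __init__(self, d, metric_type):
        self.d = d                      # vector dimension
        self.metric_type = metric_type  # assumed to be "ip" or "l2"

    @abstractmethod
    def add(self, x):
        """Add a batch of vectors of shape (batch_size, d)."""

    @abstractmethod
    def search(self, x, k):
        """Return (D, I), each of shape (batch_size, k)."""


class BruteForceIndex(FaissLikeIndex):
    """Exact search over all stored vectors, for illustration only."""

    def __init__(self, d, metric_type="ip"):
        super().__init__(d, metric_type)
        self.vectors = np.zeros((0, d), dtype=np.float32)

    def add(self, x):
        x = np.asarray(x, dtype=np.float32)
        if x.ndim != 2 or x.shape[1] != self.d:
            raise ValueError("expected an array of shape (batch_size, d)")
        self.vectors = np.vstack([self.vectors, x])

    def search(self, x, k):
        x = np.asarray(x, dtype=np.float32)
        if self.metric_type == "ip":
            # Larger inner product means closer, so sort in descending order.
            scores = x @ self.vectors.T
            order = np.argsort(-scores, axis=1)[:, :k]
        else:
            # Squared L2 distance: smaller means closer.
            diff = x[:, None, :] - self.vectors[None, :, :]
            scores = (diff ** 2).sum(axis=2)
            order = np.argsort(scores, axis=1)[:, :k]
        return np.take_along_axis(scores, order, axis=1), order


index = BruteForceIndex(d=2, metric_type="ip")
index.add(np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))
D, I = index.search(np.array([[1.0, 0.2]]), k=2)
```

Any object exposing these two methods with the same shapes can be passed to code that expects a faiss index for adding and searching.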

autofaiss.indices.index_factory module

Functions that fix the faiss index_factory function

autofaiss.indices.index_factory.index_factory(d, index_key, metric_type, ef_construction=None)[source]

Custom index_factory that fixes some issues of faiss.index_factory with inner product metrics.

autofaiss.indices.index_utils module

Useful functions to apply to an index

autofaiss.indices.index_utils.format_speed_ms_per_query(speed)[source]

Format the speed (ms/query) into a readable string

Return type:

str

autofaiss.indices.index_utils.get_bytes_from_index(index)[source]

Transforms a faiss index into a bytearray.

Return type:

bytearray

autofaiss.indices.index_utils.get_index_from_bytes(index_bytes)[source]

Transforms a bytearray containing a faiss index into the corresponding object.

Return type:

Index

autofaiss.indices.index_utils.get_index_size(index)[source]

Returns the size in RAM of a given index

Return type:

int

autofaiss.indices.index_utils.initialize_direct_map(index)[source]

Return type:

None

autofaiss.indices.index_utils.load_index(index_src_path, index_dst_path)[source]

Return type:

Index

autofaiss.indices.index_utils.parallel_download_indices_from_remote(fs, indices_file_paths, dst_folder)[source]

Download small indices in parallel.

autofaiss.indices.index_utils.quantize_vec_without_modifying_index(index, vecs)[source]

Quantize a batch of vectors

Return type:

ndarray

autofaiss.indices.index_utils.save_index(index, root_dir, index_filename)[source]

Save index

Return type:

str

autofaiss.indices.index_utils.search_speed_test(index, query=None, ksearch=40, timout_s=10.0)[source]

Return the average and 99th-percentile search speed

Return type:

Dict[str, float]
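
A rough sketch of how an average and a 99th-percentile latency can be measured under a time budget. The helper names (summarize_ms, time_queries_ms), the dictionary keys and the nearest-rank percentile rule are assumptions made for this example; they are not taken from the autofaiss source.

```python
import time


def summarize_ms(timings_ms):
    """Return the mean and the nearest-rank 99th percentile of the timings."""
    if not timings_ms:
        raise ValueError("no timings recorded")
    ordered = sorted(timings_ms)
    n = len(ordered)
    # Nearest-rank percentile: ceil(0.99 * n), computed with integers.
    rank = (99 * n + 99) // 100
    return {"avg_ms": sum(ordered) / n, "p99_ms": ordered[rank - 1]}


def time_queries_ms(search_fn, queries, timeout_s=10.0):
    """Time search_fn on each query, stopping once timeout_s has elapsed."""
    timings_ms = []
    start = time.perf_counter()
    for query in queries:
        t0 = time.perf_counter()
        search_fn(query)
        timings_ms.append((time.perf_counter() - t0) * 1000.0)
        if time.perf_counter() - start > timeout_s:
            break
    return timings_ms


# Stand-in workload: sorting a short list plays the role of an index search.
measured = time_queries_ms(sorted, [[3, 1, 2]] * 200, timeout_s=1.0)
stats = summarize_ms(measured)

# Deterministic check of the summary on known values 1.0 .. 200.0.
known = summarize_ms([float(v) for v in range(1, 201)])
```

Keeping the timing loop separate from the summary makes the percentile rule easy to verify on fixed inputs, independent of machine noise.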

autofaiss.indices.index_utils.set_search_hyperparameters(index, param_str, use_gpu=False)[source]

Set hyperparameters on an index

Return type:

None

autofaiss.indices.index_utils.speed_test_ms_per_query(index, query=None, ksearch=40, timout_s=5.0)[source]

Evaluate the average speed in milliseconds of the index without using batch

Return type:

float

autofaiss.indices.memory_efficient_flat_index module

This file contains a class describing a memory-efficient flat index

class autofaiss.indices.memory_efficient_flat_index.MemEfficientFlatIndex(d, metric_type)[source]

Bases: FaissIndexWrapper

Faiss-like Flat index that can support any size of vectors without memory issues. Two search functions are available: one uses batches of smaller faiss flat indices, the other relies fully on numpy.

add(x)[source]

Function that adds vectors to the index

Parameters:

x (2D numpy.array of floats) – Batch of vectors of shape (batch_size, vector_dim)

add_all(filename, nb_items)[source]

Function that adds vectors to the index from a memory-mapped array

Parameters:
  • filename (string) – path of the 2D numpy array of shape (nb_items, vector_dim) on the disk

  • nb_items (int) – number of vectors in the 2D array (the dim is already known)
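
The sketch below shows why the number of items has to be supplied when opening a memory-mapped array. It assumes the file holds raw float32 values in row-major order with no header; the file name and sizes are made up for the example, and the actual on-disk layout expected by add_all may differ.

```python
import os
import tempfile

import numpy as np

nb_items, vector_dim = 1000, 8
vectors = np.random.default_rng(0).random((nb_items, vector_dim), dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    filename = os.path.join(tmp_dir, "vectors.f32")  # hypothetical file name
    vectors.tofile(filename)  # raw values only: the shape is not stored

    # A raw file carries no shape, so (nb_items, vector_dim) must be given.
    mapped = np.memmap(filename, dtype=np.float32, mode="r",
                       shape=(nb_items, vector_dim))

    # Slicing reads only the requested rows from disk.
    first_rows = np.array(mapped[:3])
    del mapped  # release the mapping before the directory is removed
```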

add_files(embedding_reader)[source]

delete_vectors()[source]

delete the vectors of the index

search(x, k, batch_size=4000000)[source]

Function that searches for the k nearest neighbours of a batch of vectors

Parameters:
  • x (2D numpy.array of floats) – Batch of vectors of shape (batch_size, vector_dim)

  • k (int) – Number of neighbours to retrieve for every vector

  • batch_size (int) – Size of the batch of vectors that are explored. A larger value is preferred, to avoid loading the vectors from disk multiple times.

Returns:

  • D (2D numpy.array of floats) – Distances numpy array of shape (batch_size, k). Contains the distances computed by the index of the k nearest neighbours.

  • I (2D numpy.array of ints) – Labels numpy array of shape (batch_size, k). Contains the vectors’ labels of the k nearest neighbours.

search_files(x, k, batch_size)[source]

search_numpy(xq, k, batch_size=4000000)[source]

Function that searches for the k nearest neighbours of a batch of vectors. This implementation is based on vectorized numpy functions and is slower than the search function based on batches of faiss flat indices. It is kept because new functions can be built on this code. Moreover, the distance computation is more precise in numpy than in the faiss implementation, which optimizes speed over precision.

Parameters:
  • xq (2D numpy.array of floats) – Batch of vectors of shape (batch_size, vector_dim)

  • k (int) – Number of neighbours to retrieve for every vector

  • batch_size (int) – Size of the batch of vectors that are explored. A larger value is preferred, to avoid loading the vectors from disk multiple times.

Returns:

  • D (2D numpy.array of floats) – Distances numpy array of shape (batch_size, k). Contains the distances computed by the index of the k nearest neighbours.

  • I (2D numpy.array of ints) – Labels numpy array of shape (batch_size, k). Contains the vectors’ labels of the k nearest neighbours.
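
The batch_size parameter above implies that the stored vectors are scanned in chunks and the best k results are carried from one chunk to the next. A minimal numpy sketch of that idea for the inner-product metric follows; batched_ip_search is a hypothetical helper written for this example, not the library's implementation.

```python
import numpy as np


def batched_ip_search(vectors, xq, k, batch_size):
    """Exact inner-product top-k, scanning `vectors` batch_size rows at a time."""
    nq = xq.shape[0]
    best_scores = np.full((nq, k), -np.inf, dtype=np.float32)
    best_labels = np.full((nq, k), -1, dtype=np.int64)
    for start in range(0, vectors.shape[0], batch_size):
        # Only this slice has to be in memory (e.g. read from a memmap).
        batch = np.asarray(vectors[start:start + batch_size], dtype=np.float32)
        scores = xq @ batch.T  # shape (nq, len(batch))
        labels = np.arange(start, start + batch.shape[0], dtype=np.int64)
        labels = np.broadcast_to(labels, scores.shape)
        # Merge the running top-k with this batch and keep the k best.
        all_scores = np.concatenate([best_scores, scores], axis=1)
        all_labels = np.concatenate([best_labels, labels], axis=1)
        order = np.argsort(-all_scores, axis=1, kind="stable")[:, :k]
        best_scores = np.take_along_axis(all_scores, order, axis=1)
        best_labels = np.take_along_axis(all_labels, order, axis=1)
    return best_scores, best_labels


vectors = np.arange(40, dtype=np.float32).reshape(10, 4)
xq = np.array([[1, 0, 0, 0], [0, 0, 0, -1]], dtype=np.float32)
D, I = batched_ip_search(vectors, xq, k=3, batch_size=4)
```

A larger batch_size means fewer passes through the merge step and fewer separate reads, at the cost of holding a bigger score matrix in memory.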

autofaiss.indices.search module

Functions related to search on indices

autofaiss.indices.search.knn_query(index, query, ksearch)[source]

Do a knn search and return a list of the closest items and the associated distance

Return type:

Iterable[Tuple[Tuple[int, int], float]]
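
One plausible shape for such a helper, assuming the index exposes a Faiss-like search(x, k) that returns (D, I). TinyFlatIndex and knn_pairs are hypothetical names used only here. The sketch returns plain (label, distance) pairs; the annotated return type above nests the first element differently, so the pair layout should be read as an assumption.

```python
import numpy as np


class TinyFlatIndex:
    """Hypothetical exact inner-product index used only for this example."""

    def __init__(self, vectors):
        self.vectors = np.asarray(vectors, dtype=np.float32)

    def search(self, x, k):
        scores = x @ self.vectors.T
        labels = np.argsort(-scores, axis=1, kind="stable")[:, :k]
        return np.take_along_axis(scores, labels, axis=1), labels


def knn_pairs(index, query, ksearch):
    """Search a single query vector and pair each label with its distance."""
    # The index expects a batch, so the 1D query becomes a batch of one.
    distances, labels = index.search(np.expand_dims(query, 0), ksearch)
    return [(int(label), float(dist))
            for label, dist in zip(labels[0], distances[0])]


index = TinyFlatIndex([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
pairs = knn_pairs(index, np.array([1.0, 0.0], dtype=np.float32), ksearch=2)
```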

autofaiss.indices.training module

Index training

class autofaiss.indices.training.TrainedIndex(index_or_path, index_key, embedding_reader_or_path)[source]

Bases: NamedTuple

embedding_reader_or_path: Union[EmbeddingReader, str, List[str]]

Alias for field number 2

index_key: str

Alias for field number 1

index_or_path: Union[Index, str]

Alias for field number 0
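
Because the class derives from NamedTuple, each field can be read by name or by position, which is what the "Alias for field number" lines refer to. The stand-in below mirrors the three documented fields with loosened types; TrainedIndexSketch, the paths and the index key are illustrative values, not part of autofaiss.

```python
from typing import Any, NamedTuple


class TrainedIndexSketch(NamedTuple):
    """Hypothetical stand-in with the same field order as documented above."""

    index_or_path: Any             # field number 0
    index_key: str                 # field number 1
    embedding_reader_or_path: Any  # field number 2


trained = TrainedIndexSketch(
    index_or_path="/tmp/knn.index",                 # made-up path
    index_key="IVF256,Flat",                        # example factory string
    embedding_reader_or_path=["/data/embeddings"],  # made-up directory
)

# Access by name, by position, or by unpacking all give the same values.
by_name = trained.index_key
by_position = trained[1]
index_or_path, index_key, reader_or_path = trained
```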

autofaiss.indices.training.create_and_train_index_from_embedding_dir(embedding_root_dirs, embedding_column_name, max_index_memory_usage, make_direct_map, should_be_memory_mappable, current_memory_available, use_gpu=False, index_key=None, id_columns=None, metric_type='ip', nb_cores=None)[source]

Create and train index from embedding directory

Return type:

TrainedIndex

autofaiss.indices.training.create_and_train_new_index(embedding_reader, index_key, metadata, metric_type, current_memory_available, use_gpu=False)[source]

Create and train new index

Return type:

Index

autofaiss.indices.training.create_empty_index(vec_dim, index_key, metric_type)[source]

Create empty index

Return type:

Index

Module contents