autofaiss.indices package
Submodules
autofaiss.indices.build module
Common functions to build an index
- autofaiss.indices.build.add_embeddings_to_index_local(embedding_reader, trained_index_or_path, memory_available_for_adding, embedding_ids_df_handler=None, index_optimizer=None, add_embeddings_with_ids=False)[source]
Add embeddings to index from driver
- autofaiss.indices.build.get_optimize_index_fn(embedding_reader, index_key, index_path, index_infos_path, use_gpu, save_on_disk, max_index_query_time_ms, min_nearest_neighbors_to_retrieve, make_direct_map, index_param)[source]
Create function to optimize index by choosing best hyperparameters and calculating metrics
autofaiss.indices.distributed module
Building the index with pyspark.
- autofaiss.indices.distributed.add_embeddings_to_index_distributed(trained_index_or_path, embedding_reader, memory_available_for_adding, nb_cores=None, temporary_indices_folder='hdfs://root/tmp/distributed_autofaiss_indices', embedding_ids_df_handler=None, nb_indices_to_keep=1, index_optimizer=None)[source]
Create indices with pyspark.
- Parameters:
trained_index_or_path (trained faiss.Index or path to a trained faiss index) – Trained faiss index
embedding_reader (EmbeddingReader) – Embedding reader.
memory_available_for_adding (str) – Memory available for adding embeddings.
nb_cores (int) – Number of CPU cores per executor
temporary_indices_folder (str) – Folder to save the temporary small indices
embedding_ids_df_handler (Optional[Callable[[pd.DataFrame, int], Any]]) – The function that handles the embeddings Ids when id_columns is given
nb_indices_to_keep (int) – Number of indices to keep at most after the merging step
index_optimizer (Optional[Callable]) – The function that optimizes the index
- Return type:
- autofaiss.indices.distributed.create_big_index(embedding_root_dirs, output_root_dir, ss, id_columns=None, should_be_memory_mappable=False, max_index_query_time_ms=10.0, max_index_memory_usage='16G', min_nearest_neighbors_to_retrieve=20, embedding_column_name='embedding', index_key=None, index_path=None, current_memory_available='32G', nb_cores=None, use_gpu=False, metric_type='ip', nb_splits_per_big_index=1, make_direct_map=False, temp_root_dir='hdfs://root/tmp/distributed_autofaiss_indices')[source]
Create a big index
- autofaiss.indices.distributed.create_partitioned_indexes(partitions, big_index_threshold, output_root_dir, nb_cores, nb_splits_per_big_index, id_columns=None, max_index_query_time_ms=10.0, min_nearest_neighbors_to_retrieve=20, embedding_column_name='embedding', index_key=None, index_path=None, max_index_memory_usage='16G', current_memory_available='32G', use_gpu=False, metric_type='ip', make_direct_map=False, should_be_memory_mappable=False, temp_root_dir='hdfs://root/tmp/distributed_autofaiss_indices', maximum_nb_threads=256)[source]
Create partitioned indexes from a list of parquet partitions, i.e. create and train one index per parquet partition
- autofaiss.indices.distributed.create_small_index(embedding_root_dirs, output_root_dir, id_columns=None, should_be_memory_mappable=False, max_index_query_time_ms=10.0, max_index_memory_usage='16G', min_nearest_neighbors_to_retrieve=20, embedding_column_name='embedding', index_key=None, index_path=None, current_memory_available='32G', use_gpu=False, metric_type='ip', nb_cores=None, make_direct_map=False)[source]
Create a small index
autofaiss.indices.faiss_index_wrapper module
This file contains a wrapper class to create Faiss-like indices
- class autofaiss.indices.faiss_index_wrapper.FaissIndexWrapper(d, metric_type)[source]
Bases:
ABC
This abstract class describes a Faiss-like index. Wrapping an index in it makes the benchmarking functions written for faiss in this library usable on that index.
- abstract add(x)[source]
Function that adds vectors to the index
- Parameters:
x (2D numpy.array of floats) – Batch of vectors of shape (batch_size, vector_dim)
- abstract search(x, k)[source]
Function that searches the k nearest neighbours of a batch of vectors
- Parameters:
x (2D numpy.array of floats) – Batch of vectors of shape (batch_size, vector_dim)
k (int) – Number of neighbours to retrieve for every vector
- Returns:
D (2D numpy.array of floats) – Distances numpy array of shape (batch_size, k). Contains the distances computed by the index of the k nearest neighbours.
I (2D numpy.array of ints) – Labels numpy array of shape (batch_size, k). Contains the vectors’ labels of the k nearest neighbours.
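The add/search contract above can be illustrated with a small brute-force subclass. This is only a sketch: `IndexWrapperSketch` is a hypothetical local stand-in for FaissIndexWrapper so the example runs without autofaiss installed, `BruteForceInnerProductIndex` is an invented name, and the string "ip" passed as metric_type is a placeholder rather than the value the library expects.

```python
from abc import ABC, abstractmethod

import numpy as np


class IndexWrapperSketch(ABC):
    """Hypothetical stand-in mirroring FaissIndexWrapper(d, metric_type)."""

    def __init__(self, d, metric_type):
        self.d = d
        self.metric_type = metric_type

    @abstractmethod
    def add(self, x):
        """Add a batch of vectors of shape (batch_size, vector_dim)."""

    @abstractmethod
    def search(self, x, k):
        """Return (D, I), each of shape (batch_size, k)."""


class BruteForceInnerProductIndex(IndexWrapperSketch):
    """Exact inner-product search over vectors kept in memory."""

    def __init__(self, d):
        super().__init__(d, "ip")  # placeholder metric identifier (assumption)
        self.vectors = np.empty((0, d), dtype=np.float32)

    def add(self, x):
        x = np.asarray(x, dtype=np.float32)
        if x.ndim != 2 or x.shape[1] != self.d:
            raise ValueError("expected an array of shape (batch_size, d)")
        self.vectors = np.vstack([self.vectors, x])

    def search(self, x, k):
        x = np.asarray(x, dtype=np.float32)
        scores = x @ self.vectors.T  # shape (batch_size, n_stored)
        # Order each row by decreasing score and keep the k best labels.
        I = np.argsort(-scores, axis=1, kind="stable")[:, :k]
        D = np.take_along_axis(scores, I, axis=1)
        return D, I


index = BruteForceInnerProductIndex(d=2)
index.add(np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))
D, I = index.search(np.array([[1.0, 0.2]]), k=2)
```

Here D and I both have shape (1, 2), matching the (batch_size, k) shapes documented for search. Because the metric is an inner product, larger values in D mean closer neighbours.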
autofaiss.indices.index_factory module
Functions that fix the faiss index_factory function
autofaiss.indices.index_utils module
Useful functions to apply to an index
- autofaiss.indices.index_utils.format_speed_ms_per_query(speed)[source]
Format the speed (ms/query) as a human-readable string
- Return type:
- autofaiss.indices.index_utils.get_bytes_from_index(index)[source]
Transforms a faiss index into a bytearray.
- Return type:
- autofaiss.indices.index_utils.get_index_from_bytes(index_bytes)[source]
Transforms a bytearray containing a faiss index into the corresponding object.
- Return type:
Index
- autofaiss.indices.index_utils.get_index_size(index)[source]
Returns the size in RAM of a given index
- Return type:
- autofaiss.indices.index_utils.load_index(index_src_path, index_dst_path)[source]
- Return type:
Index
- autofaiss.indices.index_utils.parallel_download_indices_from_remote(fs, indices_file_paths, dst_folder)[source]
Download small indices in parallel.
- autofaiss.indices.index_utils.quantize_vec_without_modifying_index(index, vecs)[source]
Quantize a batch of vectors
- Return type:
- autofaiss.indices.index_utils.save_index(index, root_dir, index_filename)[source]
Save index
- Return type:
- autofaiss.indices.index_utils.search_speed_test(index, query=None, ksearch=40, timout_s=10.0)[source]
Return the average and 99th-percentile search speed
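As a rough illustration of what an average and 99th-percentile latency measurement with a timeout involves, the sketch below times an arbitrary search callable. It is not the library's implementation: `measure_search_speed`, its parameters, and its tuple return value are hypothetical, and the stand-in search function uses plain Python rather than a faiss index.

```python
import math
import time


def measure_search_speed(search_fn, queries, timeout_s=10.0):
    """Time search_fn on each query and return (average_ms, p99_ms).

    Hypothetical helper: stops issuing queries once timeout_s has elapsed.
    """
    latencies_ms = []
    start = time.perf_counter()
    for query in queries:
        t0 = time.perf_counter()
        search_fn(query)
        latencies_ms.append((time.perf_counter() - t0) * 1000.0)
        if time.perf_counter() - start > timeout_s:
            break
    if not latencies_ms:
        raise ValueError("no query was executed")
    latencies_ms.sort()
    average_ms = sum(latencies_ms) / len(latencies_ms)
    # Nearest-rank 99th percentile over the sorted latencies.
    rank = math.ceil(0.99 * len(latencies_ms))
    p99_ms = latencies_ms[rank - 1]
    return average_ms, p99_ms


# Stand-in for an index search: the 40 values closest to the query.
data = list(range(10_000))


def toy_search(query):
    return sorted(data, key=lambda value: abs(value - query))[:40]


average_ms, p99_ms = measure_search_speed(toy_search, queries=range(50))
```

Reporting the 99th percentile next to the average exposes tail latency that the mean alone hides, which matters when an index has to respect a per-query time budget such as max_index_query_time_ms.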
autofaiss.indices.memory_efficient_flat_index module
This file contains a class describing a memory-efficient flat index
- class autofaiss.indices.memory_efficient_flat_index.MemEfficientFlatIndex(d, metric_type)[source]
Bases:
FaissIndexWrapper
Faiss-like Flat index that can support any size of vectors without memory issues. Two search functions are available: one uses batches of smaller faiss flat indices, the other relies fully on numpy.
- add(x)[source]
Function that adds vectors to the index
- Parameters:
x (2D numpy.array of floats) – Batch of vectors of shape (batch_size, vector_dim)
- add_all(filename, nb_items)[source]
Function that adds vectors to the index from a memory-mapped array
- Parameters:
filename (string) – path of the 2D numpy array of shape (nb_items, vector_dim) on the disk
nb_items (int) – number of vectors in the 2D array (the dim is already known)
- search(x, k, batch_size=4000000)[source]
Function that searches the k nearest neighbours of a batch of vectors
- Parameters:
- Returns:
D (2D numpy.array of floats) – Distances numpy array of shape (batch_size, k). Contains the distances computed by the index of the k nearest neighbours.
I (2D numpy.array of ints) – Labels numpy array of shape (batch_size, k). Contains the vectors’ labels of the k nearest neighbours.
- search_numpy(xq, k, batch_size=4000000)[source]
Function that searches the k nearest neighbours of a batch of vectors. This implementation is based on vectorized numpy functions and is slower than the search function based on batches of faiss flat indices. We keep it because new functions can be built on this code. Moreover, the distance computation is more precise in numpy than in the faiss implementation, which optimizes for speed over precision.
- Parameters:
- Returns:
D (2D numpy.array of floats) – Distances numpy array of shape (batch_size, k). Contains the distances computed by the index of the k nearest neighbours.
I (2D numpy.array of ints) – Labels numpy array of shape (batch_size, k). Contains the vectors’ labels of the k nearest neighbours.
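The idea behind a flat index that scans a memory-mapped array in batches can be sketched in plain numpy. This is a minimal illustration under stated assumptions, not the class's actual code: `batched_inner_product_search` is a hypothetical function, the on-disk layout is assumed to be raw float32 values of shape (nb_items, vector_dim), and the inner product is assumed as the metric.

```python
import os
import tempfile

import numpy as np


def batched_inner_product_search(vectors, queries, k, batch_size):
    """Exact top-k inner-product search that scans `vectors` in batches."""
    nq = queries.shape[0]
    best_D = np.full((nq, k), -np.inf, dtype=np.float32)
    best_I = np.full((nq, k), -1, dtype=np.int64)
    for start in range(0, vectors.shape[0], batch_size):
        # Only this slice of the memory-mapped array is read from disk.
        batch = np.asarray(vectors[start:start + batch_size], dtype=np.float32)
        scores = queries @ batch.T  # shape (nq, len(batch))
        labels = np.broadcast_to(
            np.arange(start, start + batch.shape[0]), scores.shape
        )
        # Merge the running best results with this batch, then keep the top k.
        all_D = np.hstack([best_D, scores])
        all_I = np.hstack([best_I, labels])
        order = np.argsort(-all_D, axis=1, kind="stable")[:, :k]
        best_D = np.take_along_axis(all_D, order, axis=1)
        best_I = np.take_along_axis(all_I, order, axis=1)
    return best_D, best_I


rng = np.random.default_rng(0)
data = rng.standard_normal((1000, 8)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "vectors.bin")  # hypothetical file name
    data.tofile(path)
    mapped = np.memmap(path, dtype=np.float32, mode="r", shape=(1000, 8))
    queries = data[:3]
    D, I = batched_inner_product_search(mapped, queries, k=5, batch_size=256)
    del mapped  # release the mapping before the directory is removed
```

Peak memory is bounded by batch_size rather than by the number of stored vectors, which is the trade-off the batch_size parameter of search and search_numpy exposes. The merge step keeps only k candidates per query between batches.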
autofaiss.indices.search module
Functions related to searching indices
autofaiss.indices.training module
Index training
- class autofaiss.indices.training.TrainedIndex(index_or_path, index_key, embedding_reader_or_path)[source]
Bases:
NamedTuple
- autofaiss.indices.training.create_and_train_index_from_embedding_dir(embedding_root_dirs, embedding_column_name, max_index_memory_usage, make_direct_map, should_be_memory_mappable, current_memory_available, use_gpu=False, index_key=None, id_columns=None, metric_type='ip', nb_cores=None)[source]
Create and train index from embedding directory
- Return type: