autofaiss.external.quantize.build_index

autofaiss.external.quantize.build_index(embeddings, index_path='knn.index', index_infos_path='index_infos.json', ids_path=None, save_on_disk=True, file_format='npy', embedding_column_name='embedding', id_columns=None, index_key=None, index_param=None, max_index_query_time_ms=10.0, max_index_memory_usage='16G', min_nearest_neighbors_to_retrieve=20, current_memory_available='32G', use_gpu=False, metric_type='ip', nb_cores=None, make_direct_map=False, should_be_memory_mappable=False, distributed=None, temporary_indices_folder='hdfs://root/tmp/distributed_autofaiss_indices', verbose=20, nb_indices_to_keep=1)

Reads embeddings and creates a quantized index from them. The index is stored on the current machine at the given output path.

Parameters:
  • embeddings (Union[str, np.ndarray, List[str]]) – Source of the embeddings: a local path containing all preprocessed vectors and cached files (a single directory or a list of directories; files will be added if empty), or the NumPy array of embeddings directly

  • index_path (Optional[str]) – Destination path of the quantized model.

  • index_infos_path (Optional[str]) – Destination path of the metadata file.

  • ids_path (Optional[str]) – Only useful when id_columns is not None and file_format="parquet". This is the path (in any filesystem) where the Ids->vector index mapping files will be stored in parquet format

  • save_on_disk (bool) – Whether to save the index on disk; defaults to True.

  • file_format (Optional[str]) – "npy" or "parquet"; default "npy"

  • embedding_column_name (Optional[str]) – Name of the embeddings column for parquet files; default "embedding"

  • id_columns (Optional[List[str]]) – Can only be used when file_format="parquet". These are the names of the columns containing the ids of the vectors; separate files will be generated to map these ids to indices in the KNN index. Default None

  • index_key (Optional[str]) – Optional string to give to the index factory in order to create the index. If None, an index is chosen based on a heuristic.

  • index_param (Optional[str]) – Optional string with hyperparameters to set on the index. If None, the hyperparameters are chosen based on a heuristic.

  • max_index_query_time_ms (float) – Approximate upper bound on the query time for KNN search

  • max_index_memory_usage (str) – Maximum size allowed for the index; this bound is strict

  • min_nearest_neighbors_to_retrieve (int) – Minimum number of nearest neighbors to retrieve when querying the index. This parameter is used only during the index hyperparameter fine-tuning step; it is not taken into account when selecting the indexing algorithm. It takes priority over the max_index_query_time_ms constraint.

  • current_memory_available (str) – Memory available on the machine creating the index; having more memory helps because it reduces swapping between RAM and disk.

  • use_gpu (bool) – Experimental: GPU training is faster, but this option has not been tested so far

  • metric_type (str) –

    Similarity function used for queries:
    • "ip" for inner product

    • "l2" for Euclidean distance

  • nb_cores (Optional[int]) – Number of cores to use. If not provided, a sensible value is guessed automatically

  • make_direct_map (bool) – Create a direct map allowing reconstruction of embeddings. This is only needed for IVF indices. Note that it might increase the RAM usage (approximately 8GB for 1 billion embeddings)

  • should_be_memory_mappable (bool) – If set to True, the created index will be selected only among the indices that can be memory-mapped on disk. This makes it possible to use a 50GB index on a machine with only 1GB of RAM. Defaults to False

  • distributed (Optional[str]) – If "pyspark", create the indices using pyspark. Only the "parquet" file format is supported.

  • temporary_indices_folder (str) – Folder in which to save the temporary small indices generated by each spark executor. Only used when distributed="pyspark".

  • verbose (int) – Set the verbosity of outputs via a logging level; default is logging.INFO (20)

  • nb_indices_to_keep (int) –

    Maximum number of indices to keep when distributed is "pyspark". This makes it possible to build an index larger than current_memory_available. If it is not equal to 1:

    • you will get at most nb_indices_to_keep indices, named "{index_path}i" where i ranges from 1 to nb_indices_to_keep;

    • build_index returns a mapping from index path to metrics.

    Defaults to 1.

Return type:

Tuple[Optional[Index], Optional[Dict[str, str]]]