
store_dataset CAGRA parameter #4274

Closed
@rchitale7

Description

Hi,

I am trying to build a GPU index using GpuIndexCagra, and was under the impression that the store_dataset parameter prevents the dataset from being attached to the index. I am converting the index to HNSW afterwards, so I don't want to load the dataset into GPU memory. Recently, there was a PR that seemed to address this: #4173.

However, for a 6.144 GB example dataset, I noticed the GPU memory spike as high as 10.5 GB while monitoring GPU usage with nvidia-smi in the background. The code I'm using to test is here: https://github.com/navneet1v/VectorSearchForge/tree/main/cuvs_benchmarks. Specifically, this function is used to build the index on the GPU: https://github.com/navneet1v/VectorSearchForge/blob/main/cuvs_benchmarks/main.py#L324.

Interestingly, when I used numpy's mmap feature to load the dataset, the GPU memory never exceeded 5.039 GB, regardless of the value I set for the store_dataset parameter. It looks like CAGRA supports keeping the dataset on disk, which is probably why the GPU memory doesn't spike in that case. However, we want to keep the dataset entirely in CPU memory, without loading it into GPU memory and without using disk. Is the store_dataset parameter supposed to do this? If not, is there another way to do it with the faiss Python API? Please let me know, thank you!
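For context, the relevant part of my build code looks roughly like this (a simplified sketch, not the exact benchmark code; d and xb stand in for the real dimension and dataset, and the field/class names reflect my understanding of the GpuIndexCagra API):

```python
import faiss
import numpy as np

d = 128                                           # placeholder dimension
xb = np.random.rand(10_000, d).astype('float32')  # stand-in for the real ~6 GB dataset

res = faiss.StandardGpuResources()
config = faiss.GpuIndexCagraConfig()
config.store_dataset = False   # my expectation: don't attach the dataset to the index

gpu_index = faiss.GpuIndexCagra(res, d, faiss.METRIC_L2, config)
gpu_index.train(xb)            # builds the CAGRA graph on the GPU

# afterwards, convert to HNSW on the CPU
cpu_index = faiss.IndexHNSWCagra(d, 16, faiss.METRIC_L2)
gpu_index.copyTo(cpu_index)
```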

Additional Background
Faiss version: We are using Faiss as a git submodule at version 1.10.0, with the submodule pointed to commit df6a8f6

OS version:

NAME="Amazon Linux"
VERSION="2023"
ID="amzn"
ID_LIKE="fedora"
VERSION_ID="2023"
PLATFORM_ID="platform:al2023"
PRETTY_NAME="Amazon Linux 2023.6.20250218"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023"
HOME_URL="https://aws.amazon.com/linux/amazon-linux-2023/"
DOCUMENTATION_URL="https://docs.aws.amazon.com/linux/"
SUPPORT_URL="https://aws.amazon.com/premiumsupport/"
BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023"
VENDOR_NAME="AWS"
VENDOR_URL="https://aws.amazon.com/"
SUPPORT_END="2029-06-30"

Type of GPU:

00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)

EC2 Instance Type: g5.2xlarge

Reproduction Instructions

  1. On a server with GPUs, clone https://github.com/navneet1v/VectorSearchForge
    • The server must have git and docker installed
    • The server must have NVIDIA developer tools installed, such as nvidia-smi and nvidia-container-toolkit
  2. cd into the cuvs_benchmarks folder and create a temp directory to store the faiss graph files:

    mkdir ./benchmarks_files
    chmod 777 ./benchmarks_files

  3. Build the docker image:

    docker build -t <your_image_name> .

  4. Run the image:

    docker run -v ./benchmarks_files:/tmp/files --gpus all <your_image_name>

In a separate terminal, run nvidia-smi to monitor the GPU memory:

nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used,temperature.gpu --format=csv -l 1

For loading the numpy dataset with mmap, I added the following lines below https://github.com/navneet1v/VectorSearchForge/blob/main/cuvs_benchmarks/main.py#L253:

    # ... lines 1-253 above ...
    np.save("array.npy", xb)                   # write the dataset to disk once
    del xb                                     # drop the fully in-RAM copy
    xb = np.load("array.npy", mmap_mode='r+')  # reopen as a memory-mapped array
    # ... rest of code below ...
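The same mmap pattern in isolation, in case it helps reproduce without the full benchmark (array shape and file path are placeholders; mmap_mode='r' is enough here since the array is only read):

```python
import os
import tempfile
import numpy as np

# stand-in dataset; the real one is ~6 GB of float32 vectors
xb = np.random.rand(1000, 128).astype('float32')

path = os.path.join(tempfile.mkdtemp(), "array.npy")
np.save(path, xb)   # write the dataset to disk once
del xb              # drop the fully in-RAM copy

# reopen as a memory map: pages are faulted in from disk on demand,
# so the whole array never has to be resident in RAM at once
xb = np.load(path, mmap_mode='r')
print(type(xb).__name__)  # memmap
print(xb.shape)           # (1000, 128)
```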
