store_dataset CAGRA parameter #4274

Open · rchitale7 opened this issue Apr 3, 2025 · 11 comments

@rchitale7

rchitale7 commented Apr 3, 2025

Hi,

I am trying to build a GPU index using GpuIndexCagra, and was under the impression that the store_dataset parameter prevents the dataset from being attached to the index. I am converting the index to HNSW afterwards, so I don't want to load the dataset into GPU memory. Recently, there was a PR that seemed to address this: #4173.

However, for a 6.144 GB example dataset, I noticed that GPU memory spiked as high as 10.5 GB when I monitored GPU usage with nvidia-smi in the background. The code I'm using to test is here: https://github.com/navneet1v/VectorSearchForge/tree/main/cuvs_benchmarks. Specifically, this function is used to build the index on the GPU: https://github.com/navneet1v/VectorSearchForge/blob/main/cuvs_benchmarks/main.py#L324

Interestingly, when I used the numpy mmap feature to load the dataset, I did not see GPU memory exceed 5.039 GB, regardless of the value I set for the store_dataset parameter. It looks like CAGRA supports keeping the dataset on disk, which is probably why GPU memory doesn't spike in that case. However, we want to see if it's possible to keep the dataset entirely in CPU memory, without loading it into GPU memory and without using disk. Is the store_dataset parameter supposed to do this? If not, is there any other way to do this with the faiss Python API? Please let me know, thank you!
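
For context, here is a minimal sketch of the build-and-convert flow I have in mind (illustrative only, not the exact benchmark code; it assumes faiss is built with cuVS support, so that GpuIndexCagraConfig exposes the store_dataset field added in #4173, and that the GPU index can be copied into an IndexHNSWCagra):

    import faiss
    import numpy as np

    d = 1536  # openai_1536-style dataset
    xb = np.random.rand(100_000, d).astype('float32')  # stand-in for the real dataset

    res = faiss.StandardGpuResources()
    config = faiss.GpuIndexCagraConfig()
    config.store_dataset = False  # intent: do not keep the dataset attached to the GPU index

    gpu_index = faiss.GpuIndexCagra(res, d, faiss.METRIC_L2, config)
    gpu_index.train(xb)  # builds the CAGRA graph on the GPU

    # Convert to a CPU HNSW index, since we only search on the CPU afterwards
    # (M=32 is an arbitrary choice here).
    cpu_index = faiss.IndexHNSWCagra(d, 32, faiss.METRIC_L2)
    gpu_index.copyTo(cpu_index)
    faiss.write_index(cpu_index, "cagra_hnsw.index")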

Additional Background
Faiss version: We are using Faiss as a git submodule, with version 1.10.0; the submodule is pinned to commit df6a8f6.

OS version:

NAME="Amazon Linux"
VERSION="2023"
ID="amzn"
ID_LIKE="fedora"
VERSION_ID="2023"
PLATFORM_ID="platform:al2023"
PRETTY_NAME="Amazon Linux 2023.6.20250218"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023"
HOME_URL="https://aws.amazon.com/linux/amazon-linux-2023/"
DOCUMENTATION_URL="https://docs.aws.amazon.com/linux/"
SUPPORT_URL="https://aws.amazon.com/premiumsupport/"
BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023"
VENDOR_NAME="AWS"
VENDOR_URL="https://aws.amazon.com/"
SUPPORT_END="2029-06-30"

Type of GPU:

00:1e.0 3D controller: NVIDIA Corporation GA102GL [A10G] (rev a1)

EC2 Instance Type: g5.2xlarge

Reproduction Instructions

  1. On a server with GPUs, clone https://github.com/navneet1v/VectorSearchForge
    • Server must have git and docker installed
    • Server must have nvidia developer tools installed, such as nvidia-smi and nvidia-container-toolkit
  2. cd into cuvs_benchmarks folder, and create a temp directory to store the faiss graph files:
mkdir ./benchmarks_files
chmod 777 ./benchmarks_files
  3. Build the docker image:
docker build -t <your_image_name> .
  4. Run the image:
docker run -v ./benchmarks_files:/tmp/files --gpus all <your_image_name>

In a separate terminal, run nvidia-smi to monitor the GPU memory:

nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used,temperature.gpu --format=csv -l 1

For loading the numpy dataset with mmap, I added the following lines below https://github.com/navneet1v/VectorSearchForge/blob/main/cuvs_benchmarks/main.py#L253:

    # Line 1-253 code above ...
    np.save("array.npy",xb)
    del xb
    xb = np.load("array.npy", mmap_mode='r+')
    # rest of code below ...
@tarang-jain
Contributor

The store_dataset parameter is used to attach the dataset to the CAGRA index. That is, it will bring the dataset into device memory (if it is not already on the device) so that the CAGRA index can be searched.
For the IVF-PQ graph build, part of the dataset will have to be loaded onto the GPU for training the IVF centroids. In your benchmarking script, I see the trainset_ratio is set to 0.5, which means that at some point, half of the dataset is loaded onto the device. Reducing that could potentially reduce the peak memory pressure (if that training part is the culprit for your peak GPU memory usage). Furthermore, reducing pq_dim could also help (for openai_1536, a pq_dim of 384 should work).
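
For reference, a rough sketch of where those knobs live in the faiss config (assuming the Python bindings expose the IVFPQBuildCagraConfig struct from GpuIndexCagra.h; field names may differ slightly in your build):

    import faiss

    config = faiss.GpuIndexCagraConfig()  # IVF_PQ is the default graph build algorithm

    ivf_pq = faiss.IVFPQBuildCagraConfig()
    ivf_pq.kmeans_trainset_fraction = 0.1  # subsample less of the dataset for k-means training
    ivf_pq.pq_dim = 384                    # for openai_1536, compress 1536 dims to 384 PQ dims
    config.ivf_pq_params = ivf_pq          # attach the IVF-PQ build parameters to the CAGRA config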

@tarang-jain
Contributor

To answer your original question, at least some of your dataset will have to be loaded onto the GPU for IVF-PQ build and search, during the CAGRA graph build.

@navneet1v
Contributor

navneet1v commented Apr 4, 2025

@tarang-jain Thanks for responding to this. We understand that some part of the dataset needs to be loaded into GPU memory for clustering, but this is where it is not consistent with trainset_ratio. A simple question is why we see the following behavior:

However, for a 6.144 GB example dataset, I noticed that GPU memory spiked as high as 10.5 GB when I monitored GPU usage with nvidia-smi in the background.

This is very bizarre.
cc: @cjnolet and @divyegala

@navneet1v
Contributor

@bshethmeta this issue should not be closed, as I think this is a bug in the code.

@tfeher
Copy link

tfeher commented Apr 15, 2025

The memory usage reported above is higher than it should be. Below I show the expected memory usage when using cuVS natively (via cuvs-bench). I will need to repeat the same test with the faiss-integrated version of cuVS to see where the additional allocations happen. I suspect some temporary allocations grow larger than expected.

Here is the GPU memory usage for CAGRA index building when the dataset is in host memory. Initially we subsample the dataset, and that subsample is used for k-means clustering. As discussed above, the size of that allocation can be controlled by the kmeans_trainset_fraction parameter. Afterwards, the memory allocated on the GPU is mainly determined by the IVF-PQ index size.

[Figure: GPU memory usage over time during a CAGRA index build with the dataset in host memory]

Note that the IVF index has nlist clusters, each allocated separately. The allocator might round up small allocations to MB size, and that would add some overhead.
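
As a rough back-of-the-envelope (my own numbers, assuming the 6.144 GB dataset above is ~1M float32 vectors of dim 1536 and 8-bit PQ codes):

    # Back-of-the-envelope only; actual usage also includes workspace and allocator rounding.
    n_vectors, dim = 1_000_000, 1536
    dataset_gb = n_vectors * dim * 4 / 1e9               # ~6.1 GB of float32 vectors

    kmeans_trainset_fraction = 0.5
    trainset_gb = dataset_gb * kmeans_trainset_fraction  # ~3.1 GB subsampled for k-means

    pq_dim, pq_bits = 384, 8
    pq_codes_gb = n_vectors * pq_dim * pq_bits / 8 / 1e9 # ~0.38 GB of IVF-PQ codes

    print(f"trainset ~{trainset_gb:.2f} GB, IVF-PQ codes ~{pq_codes_gb:.2f} GB")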

@rchitale7
Author

@tfeher @tarang-jain Thanks for the responses. I am most confused about why we see such a dramatic difference in peak GPU memory usage when loading the dataset into CPU memory vs. storing it on disk with numpy.mmap. When I keep the dataset entirely in CPU memory, the peak GPU memory usage during the CAGRA index build is 10.5 GB. If I keep all of the CAGRA hyperparameters (such as kmeans_trainset_fraction and nlist) the same but only change how the dataset is stored (by using numpy.mmap), the peak GPU memory usage drops to 5.039 GB. I understand that changing certain CAGRA hyperparameters can reduce GPU memory usage, but I'm not sure why storing the dataset on disk instead of in CPU memory would also affect GPU memory usage when store_dataset=False. Does this seem like a bug with the store_dataset parameter?

@navneet1v
Contributor

navneet1v commented Apr 16, 2025

I will need to repeat the same test with the faiss-integrated version of cuVS to see where the additional allocations happen.

@tfeher That would be great, because we use faiss in our testing.

@mnorris11 mnorris11 added the GPU label Apr 16, 2025
@tfeher

tfeher commented Apr 25, 2025

@rchitale7 you are right that from a GPU memory usage point of view, it should not matter whether the dataset is accessed using numpy.mmap or sits in CPU memory.

Note that the store_dataset option only influences what we do with the dataset after the neighborhood graph is created. When store_dataset=True, the dataset is copied to the GPU and we return a GPU index that can immediately be used for searching on the GPU.

When store_dataset=False, the returned index does not contain the dataset; it stores only the graph. The index can be saved or converted to a CPU index and searched using HNSW on the CPU.

If we run a CAGRA search on an index created with store_dataset=False, then that will load the dataset onto the GPU.

The memory usage that you report is higher than expected, please see my answer here: rapidsai/cuvs#566 (comment)

Could you run your test with RMM logging enabled (as described in the linked answer) to see when these allocations happen?
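
For reference, one way to enable this from Python (a sketch only; it assumes the cuVS allocations inside faiss go through RMM's current device resource, and the exact steps in the linked comment take precedence):

    import rmm

    # Wrap the default device memory resource in a logging adaptor;
    # each allocation/deallocation is appended to the CSV log file.
    rmm.reinitialize(logging=True, log_file_name="rmm_log.csv")

    # ... run the CAGRA index build here, then inspect the log for the large allocations ...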

@navneet1v
Contributor

@rchitale7 can you please run the benchmarks as suggested by @tfeher?

@navneet1v
Contributor

@tfeher did you try running the code (referenced here: rapidsai/cuvs#566 (comment)) without the memory-mapped file? Because it's the non-memory-mapped code that was spiking the memory.

@rchitale7
Author

I added the RMM memory log results here: rapidsai/cuvs#566 (comment).
