
Faiss IVFPQ (GPU): enabling RAFT causes /dev/nvidiactl ioctl race #4272


Open
neezeeyee opened this issue Apr 3, 2025 · 6 comments
@neezeeyee

neezeeyee commented Apr 3, 2025

Severe Performance Degradation Due to NVIDIA Driver (nvidiactl) ioctl Contention When RAFT is Enabled in Faiss GPU Build

Environment

  • Faiss Version: 1.9.0

  • RAFT Version: 24.06.00

  • GPU

  • Faiss Build Configuration:

  # Docker build snippet
RUN set -ex \
  && mkdir -p /root/logs \
  && rm -rf /lib64/libstdc++.so.6.0.25-gdb.py \
  && ldconfig \
  && cd cmake-3.29.4 \
  && ./configure --prefix=/usr/local/cmake \
  && make && make install \
  && ln -s /usr/local/cmake/bin/cmake /usr/bin/cmake \
  && cd .. && rm -rf cmake-3.29.4 \
  && cd raft-24.06.00 \
  && ./build.sh libraft \
  && cd ..

ENV raft_DIR=/home/raft-24.06.00/cpp/build/

RUN cd faiss-1.9.0 \
  && cmake -B build . -DFAISS_ENABLE_GPU=ON -DFAISS_ENABLE_PYTHON=OFF -DBUILD_TESTING=OFF -DCUDAToolkit_ROOT=/usr/local/cuda/targets/x86_64-linux/include -DBUILD_SHARED_LIBS=ON -DFAISS_ENABLE_RAFT=ON  -DCMAKE_BUILD_TYPE=Release  -DCMAKE_CUDA_ARCHITECTURES="70;75;80;86;90" \
  && make -C build install -j6 && cd .. \
  && rm -rf faiss-1.9.0

Problem Description

We run GPU searches using the following workflow:

// Load the index on the CPU, then clone it onto GPU 0 with a dedicated resources object.
std::unique_ptr<faiss::Index> index_cpu_ptr(faiss::read_index(faiss_local_file_path.c_str(), 0));
faiss::Index* cpu_ptr = index_cpu_ptr.release();
faiss::gpu::StandardGpuResources res;
faiss::gpu::GpuClonerOptions options;
faiss::Index* index_gpu = faiss::gpu::index_cpu_to_gpu(&res, 0, cpu_ptr, &options);
// n, queries, k, distances, labels are the usual search buffers.
index_gpu->search(n, queries, k, distances, labels);

Within a single k8s Pod, multiple GPU indices (gpu_index) may coexist, each bound to its own isolated StandardGpuResources. Under normal circumstances, search operations across these indices should not interfere with each other, because their resources are isolated.
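
As a rough illustration of that setup, here is a minimal sketch; the index paths and device number are hypothetical placeholders, not taken from the original report:

#include <faiss/Index.h>
#include <faiss/gpu/GpuCloner.h>
#include <faiss/gpu/StandardGpuResources.h>
#include <faiss/index_io.h>

#include <memory>

int main() {
    // Two indices in one process, each bound to its own (isolated) resources object.
    faiss::gpu::StandardGpuResources res_a, res_b;

    std::unique_ptr<faiss::Index> cpu_a(faiss::read_index("/data/index_a.ivfpq", 0));  // hypothetical paths
    std::unique_ptr<faiss::Index> cpu_b(faiss::read_index("/data/index_b.ivfpq", 0));

    std::unique_ptr<faiss::Index> gpu_a(faiss::gpu::index_cpu_to_gpu(&res_a, /*device=*/0, cpu_a.get()));
    std::unique_ptr<faiss::Index> gpu_b(faiss::gpu::index_cpu_to_gpu(&res_b, /*device=*/0, cpu_b.get()));

    // Each index has its own temporary-memory pool and default stream, so in principle
    // searches against gpu_a should not be serialized behind searches against gpu_b.
    return 0;
}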

1. Cross-Pod Driver-Level Contention

When monitoring multiple k8s Pods sharing the same GPU host, we observed heavy contention on the following /dev/nvidiactl ioctl calls:

@ioctl_file[b, /dev/nvidiactl, NV_ESC_RM_FREE]:
[0]                 2248 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

@ioctl_file[b, /dev/nvidiactl, NV_ESC_RM_VID_HEAP_CONTROL]:
[0]                 2248 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

2. Within-Pod Contention

[Image: raft race]

Workaround

Remove -DFAISS_ENABLE_RAFT=ON and recompile Faiss.

@cjnolet
Contributor

cjnolet commented Apr 4, 2025

Hi @neezeeyee. There have been significant updates to this since June of last year (version 24.06 denotes June of 2024). Are you able to update FAISS to use the new cuVS library? It's also possible that updating RAFT could work. @tarang-jain, do you recall what version of RAFT was released with FAISS 1.9.0?

@tarang-jain
Contributor

@cjnolet
Contributor

cjnolet commented Apr 4, 2025

Thanks @tarang-jain. @neezeeyee, in that case it might be easier to try the latest version of FAISS (1.10), which upgrades the RAFT dependency to cuVS (and provides a wealth of new capabilities and bug fixes).

@neezeeyee
Author

neezeeyee commented Apr 6, 2025

@cjnolet Are there any known issues with FAISS 1.9.0 and RAFT 24.06 when performing GPU-based searches? Will upgrading versions definitely resolve the problem, or only potentially resolve it? What is the root cause of the currently observed issue? Thanks.

@mdouze
Contributor

mdouze commented Apr 7, 2025

Normally there should be one StandardGpuResources per device, not per index.
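
For reference, a minimal sketch of that recommendation, assuming all indices live on device 0; the index paths are hypothetical:

#include <faiss/Index.h>
#include <faiss/gpu/GpuCloner.h>
#include <faiss/gpu/StandardGpuResources.h>
#include <faiss/index_io.h>

#include <memory>
#include <vector>

int main() {
    // One resources object for the whole device, shared by every index cloned onto it.
    faiss::gpu::StandardGpuResources res;

    std::vector<std::unique_ptr<faiss::Index>> gpu_indices;
    for (const char* path : {"/data/index_a.ivfpq", "/data/index_b.ivfpq"}) {  // hypothetical paths
        std::unique_ptr<faiss::Index> cpu(faiss::read_index(path, 0));
        gpu_indices.emplace_back(faiss::gpu::index_cpu_to_gpu(&res, /*device=*/0, cpu.get()));
    }

    // All indices now share the same temporary-memory pool and default stream on device 0.
    return 0;
}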

@neezeeyee
Author

neezeeyee commented Apr 7, 2025

@mdouze But the concurrency level is determined by the GpuResources configuration: each GpuResources instance can handle only one request at a time, which means that if all indexes share the same GpuResources, search operations across different indexes will be queued sequentially.
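
A minimal sketch of the pattern described here: two indices, each with its own StandardGpuResources, searched from separate threads. The paths, sizes, and query data are hypothetical placeholders, and whether the two searches actually overlap on the GPU depends on stream scheduling and the driver:

#include <faiss/Index.h>
#include <faiss/gpu/GpuCloner.h>
#include <faiss/gpu/StandardGpuResources.h>
#include <faiss/index_io.h>

#include <functional>
#include <memory>
#include <thread>
#include <vector>

// Each worker owns an index backed by its own resources object, so its queries
// are not queued behind searches issued against the other index.
static void search_worker(faiss::Index* index, const std::vector<float>& queries,
                          faiss::idx_t n, faiss::idx_t k,
                          std::vector<float>& distances, std::vector<faiss::idx_t>& labels) {
    index->search(n, queries.data(), k, distances.data(), labels.data());
}

int main() {
    faiss::gpu::StandardGpuResources res_a, res_b;  // one per index, as in the issue

    std::unique_ptr<faiss::Index> cpu_a(faiss::read_index("/data/index_a.ivfpq", 0));  // hypothetical paths
    std::unique_ptr<faiss::Index> cpu_b(faiss::read_index("/data/index_b.ivfpq", 0));
    std::unique_ptr<faiss::Index> gpu_a(faiss::gpu::index_cpu_to_gpu(&res_a, 0, cpu_a.get()));
    std::unique_ptr<faiss::Index> gpu_b(faiss::gpu::index_cpu_to_gpu(&res_b, 0, cpu_b.get()));

    // Placeholder query/result buffers: n queries of dimension d, top-k results each.
    const faiss::idx_t n = 16, k = 10;
    std::vector<float> q_a(n * gpu_a->d), q_b(n * gpu_b->d);
    std::vector<float> d_a(n * k), d_b(n * k);
    std::vector<faiss::idx_t> l_a(n * k), l_b(n * k);

    // Issue the two searches concurrently, each through its own StandardGpuResources.
    std::thread t_a(search_worker, gpu_a.get(), std::cref(q_a), n, k, std::ref(d_a), std::ref(l_a));
    std::thread t_b(search_worker, gpu_b.get(), std::cref(q_b), n, k, std::ref(d_b), std::ref(l_b));
    t_a.join();
    t_b.join();
    return 0;
}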

bshethmeta added the GPU label Apr 14, 2025