Faiss IVFPQ (GPU): enabling RAFT causes /dev/nvidiactl ioctl race #4272

Closed
@neezeeyee

Severe Performance Degradation Due to NVIDIA Driver (nvidiactl) ioctl Contention When RAFT is Enabled in Faiss GPU Build

Environment

  • Faiss Version: 1.9.0

  • RAFT Version: 24.06.00

  • GPU

  • Faiss Build Configuration:

  # Docker build snippet
RUN set -ex \
  && mkdir -p /root/logs \
  && rm -rf /lib64/libstdc++.so.6.0.25-gdb.py \
  && ldconfig \
  && cd cmake-3.29.4 \
  && ./configure --prefix=/usr/local/cmake \
  && make && make install \
  && ln -s /usr/local/cmake/bin/cmake /usr/bin/cmake \
  && cd .. && rm -rf cmake-3.29.4 \
  && cd raft-24.06.00 \
  && ./build.sh libraft \
  && cd ..

ENV raft_DIR=/home/raft-24.06.00/cpp/build/

RUN cd faiss-1.9.0 \
  && cmake -B build . \
       -DFAISS_ENABLE_GPU=ON \
       -DFAISS_ENABLE_PYTHON=OFF \
       -DBUILD_TESTING=OFF \
       -DCUDAToolkit_ROOT=/usr/local/cuda/targets/x86_64-linux/include \
       -DBUILD_SHARED_LIBS=ON \
       -DFAISS_ENABLE_RAFT=ON \
       -DCMAKE_BUILD_TYPE=Release \
       -DCMAKE_CUDA_ARCHITECTURES="70;75;80;86;90" \
  && make -C build install -j6 && cd .. \
  && rm -rf faiss-1.9.0

Problem Description

We run GPU searches using the following workflow:

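// Load the CPU index from disk (io_flags = 0).
std::unique_ptr<faiss::Index> index_cpu_ptr(faiss::read_index(faiss_local_file_path.c_str(), 0));
faiss::Index* cpu_ptr = index_cpu_ptr.release();
// Each GPU index gets its own StandardGpuResources instance.
faiss::gpu::StandardGpuResources res;
// options is a faiss::gpu::GpuClonerOptions set up elsewhere (not shown).
faiss::Index* index_gpu = faiss::gpu::index_cpu_to_gpu(&res, 0, cpu_ptr, &options);
// Search arguments omitted, as in the original report.
index_gpu->search(/* n, x, k, distances, labels */);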

Within a single k8s Pod, multiple GPU indices (gpu_index) may coexist, each bound to its own isolated StandardGpuResources instance. Because the resources are isolated, search operations across these indices should not interfere with each other under normal circumstances.

1. Cross-Pod Driver-Level Contention

When monitoring multiple k8s Pods sharing the same GPU host, we observed heavy contention on the following /dev/nvidiactl ioctl calls:

@ioctl_file[b, /dev/nvidiactl, NV_ESC_RM_FREE]:
[0]                 2248 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

@ioctl_file[b, /dev/nvidiactl, NV_ESC_RM_VID_HEAP_CONTROL]:
[0]                 2248 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

2. Within-Pod Contention

[Image: raft race]

Solution

Remove -DFAISS_ENABLE_RAFT=ON from the CMake configuration and recompile Faiss.
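
As a rough sketch of that rebuild, the configure step from the Docker snippet in the Environment section can be repeated with RAFT turned off. All flags are copied from that snippet; the variable name FAISS_CMAKE_FLAGS is hypothetical. Setting -DFAISS_ENABLE_RAFT=OFF explicitly, instead of simply dropping the flag, is an assumption that should be equivalent, since the option defaults to OFF in Faiss 1.9.0. The snippet only prints the command (a dry run).

```shell
# Configure flags copied from the issue's Docker snippet.
# Only FAISS_ENABLE_RAFT differs: ON -> OFF (assumed equivalent to omitting it).
FAISS_CMAKE_FLAGS="-DFAISS_ENABLE_GPU=ON \
-DFAISS_ENABLE_PYTHON=OFF \
-DBUILD_TESTING=OFF \
-DCUDAToolkit_ROOT=/usr/local/cuda/targets/x86_64-linux/include \
-DBUILD_SHARED_LIBS=ON \
-DFAISS_ENABLE_RAFT=OFF \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_CUDA_ARCHITECTURES=70;75;80;86;90"

# Dry run: print the configure command instead of executing it.
# The variable is left unquoted on purpose so it splits into separate arguments.
echo cmake -B build . $FAISS_CMAKE_FLAGS
```

To run the build for real, remove the leading echo, execute the command from inside the faiss-1.9.0 source directory, and follow it with the same `make -C build install` step as before.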
