AWS OFI NCCL v1.15.0

@AvivBenchorin AvivBenchorin released this 04 Jun 04:18

v1.15.0 (2025-06)

The 1.15.x release series supports NCCL 2.26.6-1 while maintaining backward compatibility with older NCCL versions (NCCL v2.17.1 and later).

With this release, building with platform-aws requires Libfabric v1.22.0amzn4.0 or greater. It is currently tested with versions up to Libfabric 2.1.0amzn3.

Bug Fixes and Improvements:

  • Build system and platform support
    • Added AWS P6-B200 platform support
    • Changed the default plugin library name to libnccl-net-ofi.so. By default, a symlink from libnccl-net-ofi.so to libnccl-net.so is created to maintain backward compatibility. The new name allows users to set NCCL_NET_PLUGIN=ofi to force NCCL to use the OFI plugin for communication. Passing --disable-nccl-net-symlink to configure skips the symlink, allowing multiple plugins to be installed in the same container.
  • Tuning and performance improvements
    • Added tuner support on P6-B200 for AllReduce, AllGather, and ReduceScatter regions for the 0x0 and 0x7 bitmasks
    • Updated default latency for P5en and P6-B200 platforms based on empirical results and analysis
  • Updated to use the NCCL v10 API, with trafficClass parameter support for future traffic prioritization
  • Migrated plugin code base from C to C++
  • Added support for jobs where the number of NICs per GPU is different across systems. See the OFI_NCCL_FORCE_NUM_RAILS runtime environment variable documentation for more information.
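
As an illustration of the NCCL_NET_PLUGIN=ofi setting mentioned above, here is a minimal Python sketch that prepares an environment for a launched job. The helper name ofi_plugin_env and the job command in the comment are hypothetical and not part of the plugin; only the NCCL_NET_PLUGIN variable and its ofi value come from these notes.

```python
import os


def ofi_plugin_env(base_env=None):
    """Return a copy of the environment with NCCL_NET_PLUGIN=ofi set.

    base_env defaults to the current process environment. The input
    mapping is copied, so the caller's environment is left untouched.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Ask NCCL to load the OFI plugin (libnccl-net-ofi.so) by name.
    env["NCCL_NET_PLUGIN"] = "ofi"
    return env


# Hypothetical usage; "./my_nccl_job" is a placeholder for a real launcher:
#   import subprocess
#   subprocess.run(["./my_nccl_job"], env=ofi_plugin_env(), check=True)
```

Setting the variable per job rather than globally may be convenient when several network plugins are installed side by side, as in the --disable-nccl-net-symlink case.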

OFI NCCL plugin runtime environment variable changes:

Deprecated environment variables

  • OFI_NCCL_RDMA_MIN_POSTED_BOUNCE_BUFFERS
  • OFI_NCCL_RDMA_MAX_POSTED_BOUNCE_BUFFERS

New environment variables

  • OFI_NCCL_SCHED_MAX_SMALL_RR_SIZE
  • OFI_NCCL_RDMA_MIN_POSTED_EAGER_BUFFERS
  • OFI_NCCL_RDMA_MAX_POSTED_EAGER_BUFFERS
  • OFI_NCCL_RDMA_MIN_POSTED_CONTROL_BUFFERS
  • OFI_NCCL_RDMA_MAX_POSTED_CONTROL_BUFFERS
  • OFI_NCCL_CQ_SIZE

Updated environment variable defaults

  • OFI_NCCL_RR_CTRL_MSG: default changed from 0 to 1
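
When upgrading, it may help to check whether a job environment still sets the deprecated variables or relies on the changed default. The following is a rough sketch rather than a tool shipped with the plugin: the function names are hypothetical, and only the variable names are taken from the lists above.

```python
import os

# Variable names copied from the "Deprecated environment variables" list.
DEPRECATED_VARS = (
    "OFI_NCCL_RDMA_MIN_POSTED_BOUNCE_BUFFERS",
    "OFI_NCCL_RDMA_MAX_POSTED_BOUNCE_BUFFERS",
)


def find_deprecated(env=None):
    """Return the deprecated variable names that are set in env."""
    env = os.environ if env is None else env
    return [name for name in DEPRECATED_VARS if name in env]


def uses_default_rr_ctrl_msg(env=None):
    """Return True when OFI_NCCL_RR_CTRL_MSG is unset.

    In that case the job picks up the new default (1) instead of the
    previous default (0).
    """
    env = os.environ if env is None else env
    return "OFI_NCCL_RR_CTRL_MSG" not in env
```

For example, `find_deprecated()` could be called in a job wrapper script, logging a warning for each name it returns.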

Checksum (sha512) for the release tarball aws-ofi-nccl-1.15.0.tar.gz:

9d529512927d3b2d1387f942283846889d0679dfd21b427f72e90d89d43bceb301e9f839a0290df3accb1ca9929818e811b94517241722becf6878d6d8646242  aws-ofi-nccl-1.15.0.tar.gz
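
One possible way to verify a downloaded tarball against the checksum above, sketched in Python with the standard hashlib module. The function names are illustrative, and the path passed in is assumed to be wherever the tarball was saved.

```python
import hashlib

# SHA-512 value copied from the release notes above.
EXPECTED_SHA512 = (
    "9d529512927d3b2d1387f942283846889d0679dfd21b427f72e90d89d43bceb3"
    "01e9f839a0290df3accb1ca9929818e811b94517241722becf6878d6d8646242"
)


def sha512_of(path, chunk_size=1 << 20):
    """Compute the hex SHA-512 digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_release_checksum(path):
    """Return True if the file at path has the published SHA-512."""
    return sha512_of(path) == EXPECTED_SHA512
```

Calling `matches_release_checksum("aws-ofi-nccl-1.15.0.tar.gz")` from the download directory should return True for an intact copy of the tarball.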