Skip to content

Refactor endpoint types to use C++ inheritance #851

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

AvivBenchorin
Copy link
Contributor

Draft PR for a first iteration of using proper C++ inheritance in the OFI NCCL plugin.

Replaces the C-style inheritance of having a base endpoint instance as the first member of the transport endpoint types with the C++ approach of making nccl_net_ofi_ep_t a parent class with virtual functions that the transport endpoint types override. This also refactors the allocation and free functions for the base and transport endpoint types to be constructors and destructors.

Some changes this draft PR does not yet include:

  1. Moving endpoint constructor and destructor helper functions to be private member functions. For example, nccl_net_ofi_rdma_ep_t's constructor calls helper functions init_rail_ofi_resources and init_rx_buffers, which are currently free functions in nccl_ofi_rdma.cpp and should be private member functions of nccl_net_ofi_rdma_ep_t
  2. Sort out constructor and destructor for endpoint rails: nccl_net_ofi_ep_rail_t is currently initialized and torn-down using free functions ep_rail_init and ep_rail_release, which could be turned into the rail's constructor and destructor respectively

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@AvivBenchorin AvivBenchorin force-pushed the feature/base_endpoint_inheritance branch from b152394 to f47456c Compare April 10, 2025 21:44
@AvivBenchorin AvivBenchorin force-pushed the feature/base_endpoint_inheritance branch from bbd2e8e to f47456c Compare April 14, 2025 18:39
@AvivBenchorin
Copy link
Contributor Author

I made a mistake by undesirably sync-ing my feature branch with master on Github, which then pushed to the PR. I've force pushed the same version of the PR that had been reviewed, and I will push again once I've actually updated the PR to address the received feedback.

@vidsouza
Copy link
Collaborator

bot:aws:retest

2 similar comments
@AvivBenchorin
Copy link
Contributor Author

bot:aws:retest

@AvivBenchorin
Copy link
Contributor Author

bot:aws:retest

Copy link
Contributor

@bwbarrett bwbarrett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a bunch of whitespace inconsistencies and things like that which need cleaning up, but generally this seems to be the right next step.

@AvivBenchorin AvivBenchorin force-pushed the feature/base_endpoint_inheritance branch from f47456c to 2140b2e Compare April 16, 2025 21:34
@AvivBenchorin
Copy link
Contributor Author

AvivBenchorin commented Apr 16, 2025

The latest force-push to the PR both addresses feedback from @bwbarrett, and refactors all free functions associated with the SENDRECV and RDMA transport endpoint types to be member functions. I also updated the RDMA endpoint's control and data rail arrays and deque from being member pointers to being member instances (if this change isn't relevant enough I can punt it to a later PR). Some call-outs for the draft PR in its current state:

  1. I did not update the RDMA endpoint rail type in any way: it is currently still a publicly accessible type and its members can be directly accessed by non-endpoint types. There has been internal discussion regarding whether the endpoint rail should be more encapsulated by the endpoint type, and since the encapsulation could require significant changes to non-endpoint functions and types, I would personally prefer to punt it to a later PR to avoid scope-creeping this one.
  2. I did not update the rx_buff_req_alloc function pointer in the endpoint rail type, or the ctrl_rx_buff_req_alloc and eager_rx_buff_req_alloc functions that are assigned to the rx_buff_req_alloc pointer. I feel that ctrl_rx_buff_req_alloc and eager_rx_buff_req_alloc should be refactored as the constructor for a rx_buff RDMA request, and wanted to avoid coupling the functions closer to endpoint if turning them into member functions of the endpoint or endpoint rail types. Thus I lean towards punting addressing rx_buff_req_alloc to a PR where I more broadly refactor RDMA requests.
  3. I've split up the PR to have the endpoint inheritance refactor and member function refactors in separate commits. This means that the endpoint inheritance commit (Refactor endpoint types to use C++ inheritance) still has endpoint function calls with this being the first argument (e.g. as called out here). If this wouldn't be a sane state to have the code in at this commit I can merge the commits together, but I'm concerned that the consolidated commit would have too many different functional changes and be more difficult to revert.
  4. The PR has been failing on all Trainium platform CI tests, which I'll need to investigate and root cause whether its my code causing the errors.

@AvivBenchorin AvivBenchorin force-pushed the feature/base_endpoint_inheritance branch from 2140b2e to b11b94c Compare April 17, 2025 21:13
@AvivBenchorin AvivBenchorin force-pushed the feature/base_endpoint_inheritance branch from b11b94c to 3a70bbb Compare April 17, 2025 23:32
@AvivBenchorin AvivBenchorin marked this pull request as ready for review April 22, 2025 15:30
@AvivBenchorin AvivBenchorin requested a review from a team as a code owner April 22, 2025 15:30
@AvivBenchorin AvivBenchorin force-pushed the feature/base_endpoint_inheritance branch 2 times, most recently from cb10c71 to ffcffd3 Compare June 13, 2025 16:26
@AvivBenchorin
Copy link
Contributor Author

Previous revision testing failed with:

nccl_ofi_rdma.cpp:1227:67: error: lvalue required as unary ‘&’ operand
   assert(rx_buff_data->rail == &ep->rdma_endpoint_get_rail(rail_id));
                                                                   ^
nccl_ofi_rdma.cpp:1230:75: error: lvalue required as unary ‘&’ operand
   assert(rx_buff_data->rail == &ep->rdma_endpoint_get_control_rail(rail_id));

Should be fixed in latest force push, along with some comment fixes.

Copy link
Contributor

@rauteric rauteric left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made it through first commit; waypoint comments posted.

Copy link
Contributor

@rauteric rauteric left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got through the rest of the PR. Added a few more minor comments.

@AvivBenchorin AvivBenchorin force-pushed the feature/base_endpoint_inheritance branch from ffcffd3 to f18a1e8 Compare June 17, 2025 22:42
Replaces the C-style inheritance of having a base endpoint instance as
the first member of the transport endpoint types with the C++ approach
of making nccl_net_ofi_ep_t a parent class with virtual functions that
the transport endpoint types override. This also refactors the
allocation and free functions for the base and transport endpoint types
to be constructors and destructors.

Signed-off-by: Aviv Benchorin <[email protected]>
@AvivBenchorin AvivBenchorin force-pushed the feature/base_endpoint_inheritance branch from f18a1e8 to b9e362a Compare June 18, 2025 00:41
Refactors all stand-alone functions that should belong to the RDMA
endpoint type to become member functions, including functions associated
with RDMA endpoint rails. Helper functions for the RDMA endpoint
constructor and destructor were kept as private functions. Initializing
and cleaning up the RDMA endpoint rails has been kept as a part of the
endpoint constructor and destructor rather than moved to a separate
constructor and destructor for the endpoint rail.

This commit does not refactor the `rx_buff_req_alloc` function pointer
or free functions associated with it, this refactor should wait until
a later PR that more broadly refactors the RDMA request types.

Signed-off-by: Aviv Benchorin <[email protected]>
Refactors all stand-alone functions that should belong to the SENDRECV
transport endpoint type to become member functions.

Signed-off-by: Aviv Benchorin <[email protected]>
Previously, the RDMA transport endpoint had endpoint rail pointers to an
array of control rails and an array of data rails, which required
manually allocating and freeing the memory associated with the arrays.
By replacing the endpoint rail pointers with std::vector instances, the
RDMA endpoint contructor and destructor can automatically manage the
rails' memory.

Signed-off-by: Aviv Benchorin <[email protected]>
Previously, the RDMA transport endpoint had a std::deque pointer which
required manually allocating and freeing the memory associated with the
deque. By replacing the std::deque pointer with an std::deque instance,
the RDMA endpoint constructor and destructor can automatically manage
the deque's memory.

Signed-off-by: Aviv Benchorin <[email protected]>
@AvivBenchorin AvivBenchorin force-pushed the feature/base_endpoint_inheritance branch from b9e362a to 88219d0 Compare June 18, 2025 21:33
@AvivBenchorin
Copy link
Contributor Author

After comparing my changes against the C++ code already added to the codebase, I see I've been using the this keyword excessively (I used it when referencing every endpoint member variable or function). Latest force push removes all of my unnecessary uses of this.

@rauteric
Copy link
Contributor

After comparing my changes against the C++ code already added to the codebase, I see I've been using the this keyword excessively (I used it when referencing every endpoint member variable or function). Latest force push removes all of my unnecessary uses of this.

Ha, I actually mildly preferred the this to make clear when we are referencing member variables. But we won't split hairs. :)

@bwbarrett
Copy link
Contributor

After comparing my changes against the C++ code already added to the codebase, I see I've been using the this keyword excessively (I used it when referencing every endpoint member variable or function). Latest force push removes all of my unnecessary uses of this.

Ha, I actually mildly preferred the this to make clear when we are referencing member variables. But we won't split hairs. :)

I don't think this is a place where we should be hyper pedantic. In something like the environ code, where we're talking about 4-5 variables, I think explicitly using this-> adds text without adding clarity and probably isn't helpful. But especially for some of the more complex code where there is a ton of state, I do think the this-> adds clarity. So I would say do what you think is most readable.

@xwangxin
Copy link

I remember more than 20 years ago, "m_" prefix was widely used for class member variables, is it now less preferred?

@AvivBenchorin
Copy link
Contributor Author

Since I have approvals, I will resolve comments and merge this PR. I can address readability/formatting improvement in a fast-follow PR.

@AvivBenchorin AvivBenchorin merged commit 024fb2e into aws:master Jun 19, 2025
24 checks passed
@AvivBenchorin AvivBenchorin deleted the feature/base_endpoint_inheritance branch June 19, 2025 20:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants