Feature: Use NMX-C gRPC API for NVLink partition management #1580

Open
rtamma-nv wants to merge 23 commits into NVIDIA:main from rtamma-nv:feature/nmxc-paritioning

@rtamma-nv rtamma-nv commented May 12, 2026

Description

Use the NMX-C gRPC API directly to manage NVLink partitions. Previously, core communicated with NMX-M for partition management. These changes remove all NMX-M support and replace it with the NMX-C equivalent.
There are no changes to the tenant/user-facing APIs for logical partition management, or to the general algorithm by which nvlink_partition_monitor applies tenant-defined NVLink config and returns status/observations.

This PR brings in:
  • a new admin-cli command to add rack serial to NMX-C endpoint mappings
  • libnmxc, a gRPC client library for NMX-C
  • new integration tests that can interact with a real NMX-C running in simulator mode (in addition to the mock)
  • version 2.5 of nmx_c.proto
  • replacement of all functionality provided by the NMX-M REST API with the NMX-C gRPC equivalent
  • other associated miscellaneous changes, such as nmxc_browser and metrics
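
To illustrate what a rack serial to NMX-C endpoint mapping could look like, here is a minimal in-memory sketch. Everything in it is an assumption for illustration: the `NmxcEndpointMap` type, its methods, and the example serials and URLs are hypothetical and are not taken from this PR, which may well persist the mappings elsewhere.

```rust
use std::collections::HashMap;

/// Hypothetical registry: rack serial number -> NMX-C endpoint URL.
/// Not the PR's actual type; an illustration of the lookup only.
#[derive(Debug, Default)]
pub struct NmxcEndpointMap {
    by_rack_serial: HashMap<String, String>,
}

impl NmxcEndpointMap {
    /// Adds or replaces the endpoint for a rack.
    /// Returns the endpoint that was previously mapped, if any.
    pub fn add(&mut self, rack_serial: &str, endpoint_url: &str) -> Option<String> {
        self.by_rack_serial
            .insert(rack_serial.to_owned(), endpoint_url.to_owned())
    }

    /// Looks up the endpoint responsible for a rack.
    pub fn endpoint_for(&self, rack_serial: &str) -> Option<&str> {
        self.by_rack_serial.get(rack_serial).map(String::as_str)
    }
}
```

Keying by rack serial means the partition monitor only needs the rack a machine belongs to in order to pick which NMX-C instance to talk to.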

Relevant NMX-C documentation changes will be part of a subsequent PR.

Type of Change

  • Add - New feature or capability
  • Change - Changes in existing functionality
  • Fix - Bug fixes
  • Remove - Removed features or deprecated functionality
  • Internal - Internal changes (refactoring, tests, docs, etc.)

Related Issues (Optional)

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

… instead of NMX-M.

Signed-off-by: Roopesh Tamma <rtamma@nvidia.com>
Add integration tests that communicate with a real NMX-C simulator rather than the mock. These tests run only when the RUN_NMXC_SIMULATOR_TESTS environment variable is set.

Signed-off-by: Roopesh Tamma <rtamma@nvidia.com>
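
One way such an opt-in gate can be written is sketched below. Only the variable name `RUN_NMXC_SIMULATOR_TESTS` comes from the commit message; the helper names (`enabled_from`, `simulator_tests_enabled`, `run_if_enabled`) are hypothetical, and treating any value, including an empty one, as "set" is an assumption about how the PR checks it.

```rust
use std::env;
use std::ffi::OsString;

/// Name of the opt-in variable, taken from the commit message.
pub const SIMULATOR_TESTS_VAR: &str = "RUN_NMXC_SIMULATOR_TESTS";

/// Pure helper so the decision can be tested without touching the
/// process environment. Assumption: any value counts as "set".
pub fn enabled_from(value: Option<OsString>) -> bool {
    value.is_some()
}

/// Reads the real environment.
pub fn simulator_tests_enabled() -> bool {
    enabled_from(env::var_os(SIMULATOR_TESTS_VAR))
}

/// Hypothetical wrapper for a test body: runs `body` only when enabled
/// and reports whether it ran.
pub fn run_if_enabled(enabled: bool, body: impl FnOnce()) -> bool {
    if !enabled {
        eprintln!("skipping: {SIMULATOR_TESTS_VAR} is not set");
        return false;
    }
    body();
    true
}
```

A simulator test would then start with something like `if !simulator_tests_enabled() { return; }`, so a plain `cargo test` run stays independent of a running simulator.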
…ulator

Signed-off-by: Roopesh Tamma <rtamma@nvidia.com>
@rtamma-nv rtamma-nv requested review from a team and Coco-Ben as code owners May 12, 2026 01:37
@rtamma-nv rtamma-nv requested review from chet and tmcroberts97 May 12, 2026 01:57
rtamma-nv added 4 commits May 11, 2026 21:16
Signed-off-by: Roopesh Tamma <rtamma@nvidia.com>

@kensimon kensimon left a comment

Haven't finished with the review yet, but I'd say the biggest problem right now is the breaking CLI changes, which don't immediately seem necessary to me.

Comment thread crates/rpc/proto/nmx_c.proto Outdated
Comment thread crates/admin-cli/src/cfg/cli_options.rs Outdated
Comment thread crates/libnmxc/src/lib.rs Outdated
Comment thread crates/libnmxc/tests/src/main.rs
Comment thread crates/api/src/handlers/nmxc_browse.rs Outdated
Comment thread crates/api/src/handlers/nvlink_nmxc_endpoints.rs Outdated
@copy-pr-bot

copy-pr-bot Bot commented May 13, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Comment thread crates/api/src/handlers/machine_discovery.rs
Comment thread crates/api/src/handlers/machine_discovery.rs Outdated
Comment thread crates/api/src/handlers/nvlink_nmxc_endpoints.rs Outdated
Comment thread crates/libnmxc/src/nmxc_api.rs Outdated
@rtamma-nv rtamma-nv force-pushed the feature/nmxc-paritioning branch from 24dda3f to 4d774b4 Compare May 14, 2026 04:10
rtamma-nv added 3 commits May 14, 2026 09:54
Signed-off-by: Roopesh Tamma <rtamma@nvidia.com>
@rtamma-nv rtamma-nv force-pushed the feature/nmxc-paritioning branch from 4d774b4 to 065a4b7 Compare May 14, 2026 17:55
Restore the separate commands. This will get added in another PR.

Signed-off-by: Thomas McRoberts <tmcroberts@nvidia.com>
…oning

Signed-off-by: Thomas McRoberts <tmcroberts@nvidia.com>

@srinivasadmurthy srinivasadmurthy left a comment

LGTM

@rtamma-nv rtamma-nv removed the request for review from chet May 18, 2026 14:00

@kensimon kensimon left a comment

Nitpick is minor and can be addressed in a followup

Comment thread crates/libnmxc/src/lib.rs
Comment on lines +68 to +76
impl Endpoint {
    pub fn new(url: impl AsRef<str>) -> Result<Self, NmxcError> {
        let uri = url
            .as_ref()
            .parse::<Uri>()
            .map_err(|e| NmxcError::InvalidEndpoint(format!("{}: {e}", url.as_ref())))?;
        Ok(Self { uri })
    }
}

Nit: This can be impl FromStr for Endpoint if you want it to be a bit more idiomatic (let e: Endpoint = s.parse()?, etc)
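
A rough sketch of what that suggestion could look like follows. To keep it compilable with the standard library alone, `Uri` and `NmxcError` here are simplified stand-ins (assumptions): the real `Uri` in the diff comes from a dependency and validates far more than this placeholder does, and the real error enum likely has more variants.

```rust
use std::str::FromStr;

/// Stand-in for the crate's error type; only the variant visible in
/// the diff is modeled.
#[derive(Debug, PartialEq)]
pub enum NmxcError {
    InvalidEndpoint(String),
}

/// Stand-in for the real `Uri` type, with deliberately minimal
/// validation (assumption: rejects empty input and whitespace only).
#[derive(Debug, Clone, PartialEq)]
pub struct Uri(String);

impl FromStr for Uri {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        if s.is_empty() || s.chars().any(char::is_whitespace) {
            Err("empty or contains whitespace".to_string())
        } else {
            Ok(Uri(s.to_string()))
        }
    }
}

#[derive(Debug, PartialEq)]
pub struct Endpoint {
    uri: Uri,
}

/// The reviewer's suggestion: parsing lives in `FromStr`, so callers
/// can write `let e: Endpoint = s.parse()?`.
impl FromStr for Endpoint {
    type Err = NmxcError;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        let uri = s
            .parse::<Uri>()
            .map_err(|e| NmxcError::InvalidEndpoint(format!("{s}: {e}")))?;
        Ok(Self { uri })
    }
}

impl Endpoint {
    /// The existing constructor can stay and delegate to `FromStr`.
    pub fn new(url: impl AsRef<str>) -> Result<Self, NmxcError> {
        url.as_ref().parse()
    }
}
```

Keeping `Endpoint::new` as a thin wrapper would let existing call sites compile unchanged while new code uses `parse()`.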

5 participants