Skip to content

feat(hardware): use NVML to grab the hardware profile during the regi…#314

Merged
patelspratik merged 13 commits intomainfrom
codedor/register/hardware_profiler
Mar 6, 2026
Merged

feat(hardware): use NVML to grab the hardware profile during the regi…#314
patelspratik merged 13 commits intomainfrom
codedor/register/hardware_profiler

Conversation

@drewmalin
Copy link
Contributor

@drewmalin drewmalin commented Mar 5, 2026

Tested on a g4 (Tesla T4 GPU) EC2 instance:

 Registering your device with Brev
  Name:         dmec2
  Organization: d
  Registering for Linux user:   ec2-user
This will perform the following steps:
  1. Set up Brev tunnel
  2. Collect hardware profile
  3. Register this machine with Brev
✔ Yes, proceed
[Step 1/3] Setting up Brev tunnel...
  Brev tunnel ready.
[Step 2/3] Collecting hardware profile...
  Hardware profile:
    Product: g4dn.xlarge
    CPU:     4 cores
    RAM:     15.4 GB
    GPUs:    1 x Tesla T4 (15.0 GB) [Volta/Turing]
    Arch:    amd64
    OS:      Amazon Linux 2023
    Link:    PCIe (GPU 0)
    Storage: 15.0 GB (NVMe) [nvme0n1]
    Storage: 116.4 GB (NVMe) [nvme1n1]

[Step 3/3] Registering with Brev...
  Registration complete.
✔ Yes, proceed
Enabling SSH access on this device
  Node:       dmec2 (extnode-3AXjSn7LsRDkLoWz2gSLXqtL4OS)
  Brev user:  user-2zNbeNob6WIEQqW7c9Fe3xyILYb
  Linux user: ec2-user
  SSH port (default 22):  █
  Brev public key added to authorized_keys.
  SSH access not yet granted; retrying in: 1s...
  SSH access not yet granted; retrying in: 1.5s...
  SSH access not yet granted; retrying in: 1.5s...
SSH access enabled. You can now SSH to this device via: brev shell dmec2
[ec2-user@ip-172-16-51-235 ~]$ 
Screenshot 2026-03-05 at 13 24 12

Copilot AI review requested due to automatic review settings March 5, 2026 20:09
@drewmalin drewmalin requested a review from a team as a code owner March 5, 2026 20:09
patelspratik
patelspratik previously approved these changes Mar 5, 2026
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the register flow to collect a richer cross-platform hardware profile, leveraging NVML on supported platforms, and persists/sends that profile during node registration.

Changes:

  • Replaced the old command/file-based hardware collection with a HardwareProfiler interface and per-OS implementations (Linux/macOS + stub).
  • Added NVML-based GPU + interconnect probing and expanded the local model from NodeSpec to HardwareProfile.
  • Updated registration persistence and RPC conversion/tests to use HardwareProfile.

Reviewed changes

Copilot reviewed 13 out of 14 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
pkg/cmd/register/rpcclient_test.go Updates RPC conversion tests to use HardwareProfile and new fields.
pkg/cmd/register/rpcclient.go Converts HardwareProfile into proto NodeSpec (with TODOs for new proto fields).
pkg/cmd/register/register_test.go Switches register tests to stub hardware profiling via mockHardwareProfiler.
pkg/cmd/register/register.go Wires registration flow to HardwareProfiler and persists HardwareProfile.
pkg/cmd/register/hardware_test.go Updates formatting/storage parsing tests and removes old nvidia-smi parsing tests.
pkg/cmd/register/hardware_stub.go Adds unsupported-platform stub SystemHardwareProfiler.
pkg/cmd/register/hardware_linux.go Adds Linux hardware profiling via sysfs/procfs + NVML probing.
pkg/cmd/register/hardware_darwin.go Adds macOS hardware profiling via sysctl/sw_vers/diskutil parsing.
pkg/cmd/register/hardware.go Introduces HardwareProfile/supporting structs + formatting and parsing helpers.
pkg/cmd/register/gpu_nvml.go Adds NVML GPU/interconnect probing (linux/windows build tags).
pkg/cmd/register/device_registration_store_test.go Updates persistence round-trip test to use HardwareProfile.
pkg/cmd/register/device_registration_store.go Renames persisted field to hardware_profile and updates struct type.
go.mod Adds NVML dependency and makes x/sys a direct dependency.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +1 to +9
//go:build !linux && !darwin && !windows

package register

// SystemHardwareProfiler is a no-op adapter for unsupported platforms.
type SystemHardwareProfiler struct{}

func (p *SystemHardwareProfiler) Profile() (*HardwareProfile, error) {
return &HardwareProfile{}, nil
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This stub excludes Windows (!windows), but defaultRegisterDeps() constructs &SystemHardwareProfiler{} and gpu_nvml.go is built for windows, so the repo will fail to compile on Windows because no SystemHardwareProfiler implementation exists under //go:build windows. Fix by adding a hardware_windows.go implementation (recommended) or adjusting build tags so Windows is covered by a real implementation or a stub. Also, returning an entirely empty profile loses Architecture; consider populating Architecture: runtime.GOARCH at minimum so registration doesn’t send a blank arch on unsupported platforms.

Suggested change
//go:build !linux && !darwin && !windows
package register
// SystemHardwareProfiler is a no-op adapter for unsupported platforms.
type SystemHardwareProfiler struct{}
func (p *SystemHardwareProfiler) Profile() (*HardwareProfile, error) {
return &HardwareProfile{}, nil
//go:build !linux && !darwin
package register
import "runtime"
// SystemHardwareProfiler is a no-op adapter for unsupported platforms.
type SystemHardwareProfiler struct{}
func (p *SystemHardwareProfiler) Profile() (*HardwareProfile, error) {
return &HardwareProfile{
Architecture: runtime.GOARCH,
}, nil

Copilot uses AI. Check for mistakes.
Comment on lines +1 to +9
//go:build !linux && !darwin && !windows

package register

// SystemHardwareProfiler is a no-op adapter for unsupported platforms.
type SystemHardwareProfiler struct{}

func (p *SystemHardwareProfiler) Profile() (*HardwareProfile, error) {
return &HardwareProfile{}, nil
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This stub excludes Windows (!windows), but defaultRegisterDeps() constructs &SystemHardwareProfiler{} and gpu_nvml.go is built for windows, so the repo will fail to compile on Windows because no SystemHardwareProfiler implementation exists under //go:build windows. Fix by adding a hardware_windows.go implementation (recommended) or adjusting build tags so Windows is covered by a real implementation or a stub. Also, returning an entirely empty profile loses Architecture; consider populating Architecture: runtime.GOARCH at minimum so registration doesn’t send a blank arch on unsupported platforms.

Suggested change
//go:build !linux && !darwin && !windows
package register
// SystemHardwareProfiler is a no-op adapter for unsupported platforms.
type SystemHardwareProfiler struct{}
func (p *SystemHardwareProfiler) Profile() (*HardwareProfile, error) {
return &HardwareProfile{}, nil
//go:build !linux && !darwin
package register
import "runtime"
// SystemHardwareProfiler is a no-op adapter for unsupported platforms.
type SystemHardwareProfiler struct{}
func (p *SystemHardwareProfiler) Profile() (*HardwareProfile, error) {
return &HardwareProfile{
Architecture: runtime.GOARCH,
}, nil

Copilot uses AI. Check for mistakes.
Comment on lines +131 to +143
func parseDiskutilSize(val string, dev *StorageDevice) {
parts := strings.Fields(val)
if len(parts) < 1 {
return
}
var size int64
for _, c := range parts[0] {
if c >= '0' && c <= '9' {
size = size*10 + int64(c-'0')
}
}
dev.StorageBytes = size
}
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This parser will produce incorrect sizes for common diskutil info formats (e.g., Disk Size: 1.0 TB (1,000,000,000,000 Bytes)), because it only extracts digits from the first token (1.0 -> 10). Parse the byte count inside parentheses (the (... Bytes) portion) or use a more robust strategy (e.g., regexp to capture the integer byte value) so StorageBytes is accurate.

Copilot uses AI. Check for mistakes.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude independently found this issue as well

Comment on lines +66 to +67
// parseDiskutilInfoOutput parses "diskutil info -all" output into StorageDevice entries.
func parseDiskutilInfoOutput(output string) []StorageDevice {
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New macOS parsing logic (parseDiskutilInfoOutput / parseDiskutilLine / parseDiskutilSize) is introduced without unit tests, while this package already has parsing unit tests (e.g., hardware_test.go). Add table-driven tests covering representative diskutil info -all snippets (whole disk vs partition identifiers, size formats with parentheses/commas, SSD/HDD/NVMe protocol) to prevent silent mis-parsing.

Copilot uses AI. Check for mistakes.
Comment on lines +84 to +86
// parseDiskutilLine processes a single line of diskutil output, updating the
// current device state. Returns the (possibly new) current device pointer.
func parseDiskutilLine(line string, current *StorageDevice, devices *[]StorageDevice) *StorageDevice {
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New macOS parsing logic (parseDiskutilInfoOutput / parseDiskutilLine / parseDiskutilSize) is introduced without unit tests, while this package already has parsing unit tests (e.g., hardware_test.go). Add table-driven tests covering representative diskutil info -all snippets (whole disk vs partition identifiers, size formats with parentheses/commas, SSD/HDD/NVMe protocol) to prevent silent mis-parsing.

Copilot uses AI. Check for mistakes.
type StorageDevice struct {
Name string `json:"name,omitempty"`
StorageBytes int64 `json:"storage_bytes"`
StorageType string `json:"storage_type,omitempty"` // "SSD" or "HDD"
Copy link

Copilot AI Mar 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StorageType is documented as only "SSD" or "HDD", but Linux and macOS profilers can emit "NVMe". Update the comment (and any assumptions elsewhere) to reflect the actual set of values (e.g., "SSD", "HDD", "NVMe", or consider separating interface type from media type).

Suggested change
StorageType string `json:"storage_type,omitempty"` // "SSD" or "HDD"
StorageType string `json:"storage_type,omitempty"` // e.g., "SSD", "HDD", or "NVMe"

Copilot uses AI. Check for mistakes.
Comment on lines -308 to 332
go get -u github.com/brevdev/dev-plane@$$COMMIT; \
GOPRIVATE=github.com/brevdev/* go get -u github.com/brevdev/dev-plane@$$COMMIT; \
go get buf.build/gen/go/brevdev/devplane/grpc/go@$$COMMIT; \
go get buf.build/gen/go/brevdev/devplane/protocolbuffers/go@$$COMMIT; \
go mod tidy; \
GOPRIVATE=github.com/brevdev/* go mod tidy; \
echo "Successfully updated to $$COMMIT" No newline at end of file
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated but this allows for fetching internal dependencies without relying on the GOPRIVATE to have been set by the user.

builds:
- env:
- CGO_ENABLED=0
- CGO_ENABLED=1
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We now always need CGO

)

// archNames maps NVML compute capability (major version) to GPU architecture name.
var archNames = map[int]string{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. GPU architecture mapping inaccuracies
  • Compute major version 7 → "Volta/Turing" is imprecise (Volta=sm_70, Turing=sm_75). A T4 showing "Volta/Turing" is technically misleading.
  • Ada Lovelace GPUs have compute capability 8.9 (major=8), so they'd map to "Ampere", not "Hopper/Ada Lovelace". Major version 9 should just be "Hopper".

}

// parseDiskutilInfoOutput parses "diskutil info -all" output into StorageDevice entries.
func parseDiskutilInfoOutput(output string) []StorageDevice {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also add tests for this?

@drewmalin
Copy link
Contributor Author

Dropping this comment I made in the teams chat for posterity. Dear @jarrad :

I ran into compilation issues as it looks like the new dependency go-nvml uses C libraries under the hood, which means we need to compile those (even though those are going to point to dynamic libraries, we still need to compile the go-nvml glue)

so this means that we enter into the world of CGO + cross-compiling, which is notoriously silly

this commit adds some code to our Makefile to perform cross-compiling in a dedicated docker container that matches the target architecture, that way we then simply enable CGO during compilation within the container to take advantage of its C toolchain

this allows the cross-compilation from my mac to work again, and I tested this in my local VM and am seeing the host details come through

func parseStorageOutput(output string) []NodeStorage {
var devices []NodeStorage
// returning one StorageDevice entry per disk device. ROTA=0 → SSD, ROTA=1 → HDD.
func parseStorageOutput(output string) []StorageDevice {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unused (only referenced in tests)

Comment on lines +68 to +69
owner: brevdev
name: brev-cli
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are now required in version: 2

Comment on lines -20 to +45
folder: Formula
directory: Formula
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed in version: 2

Comment on lines -40 to +65
format: zip
formats: [zip]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

format deprecated in favor of formats in version: 2

- -X github.com/brevdev/brev-cli/pkg/cmd/version.Version={{.Tag}}
- id: darwin-arm64
env:
- CGO_ENABLED=1
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like under the hood the goreleaser image finds its own C toolchain for OSX

@patelspratik
Copy link
Contributor

Blocking Issues

  1. Profile() never returns an error — interface contract violation

Files: hardware_linux.go, hardware_darwin.go

Both platform implementations always return nil error. The HardwareProfiler interface declares Profile() (*HardwareProfile, error), but the error is never used. This means a complete profiling failure (every probe fails) silently returns a near-empty HardwareProfile with only Architecture set — indistinguishable from success.

Options:

  • Collect errors from sub-probes and return a combined error if all critical probes fail
  • Or remove error from the interface if best-effort is truly the design intent (and document why)
  • At minimum, log warnings when individual probes fail so operators can diagnose
  1. parseDiskutilSize fallback produces garbage values

File: hardware_darwin.go

The fallback digit-extraction logic on strings like "500.1 GB" would produce 5001 (bytes), not ~500 GB:

for _, c := range parts[0] {
if c >= '0' && c <= '9' {
size = size*10 + int64(c-'0')
}
}

Recommendation: Remove the fallback entirely (leave StorageBytes as 0 if the regex doesn't match) or parse it properly with unit awareness.

  1. FormatHardwareProfile panics on nil input

File: hardware.go

The function accesses s.ProductName without a nil check on s. Add a guard:
if s == nil {
return ""
}

  1. Missing test for PCIe interconnect proto mapping

File: rpcclient_test.go

toProtoNodeSpec maps PCIe Generation and Width to proto fields, but only the NVLink branch is tested. A mapping bug in the PCIe path would go undetected. Add a test case with a PCIe interconnect.


Should Fix

  1. Dead code after refactor

File: hardware.go

parseMemInfoContent() and parseStorageOutput() appear unused — Linux now uses syscall.Sysinfo and probeStorageSysfs() respectively. If they're intentionally kept for testing/fallback, add a comment. Otherwise remove them.

  1. Variable shadowing in probeGPUsNVML

File: gpu_nvml.go

Inner name := archName(...) shadows the outer name (GPU device name) from a few lines above. Rename to archStr or similar for clarity.

  1. Magic number for NVLink max links

File: gpu_nvml.go

18 is hardcoded as the max NVLink link count. Extract to a named constant:
const maxNVLinks = 18 // Blackwell supports up to 18

  1. build-linux-amd target missing CGO_ENABLED=1

File: Makefile

This target doesn't set CGO_ENABLED=1, but all other build targets now require it. Since Go disables CGO by default when cross-compiling, this would produce a binary without CGO support. Verify this doesn't break smoke tests or workspace setup.

  1. FormatHardwareProfile doesn't display PCIe details

File: hardware.go

NVLink shows ActiveLinks and Version, but PCIe interconnects don't show Generation or Width. The PR output shows "Link: PCIe (GPU 0)" without gen/width. Consider:
Link: PCIe Gen4 x16 (GPU 0)

  1. Three untested error paths in register_test.go
  • Platform incompatibility (IsCompatible() == false)
  • Hardware profiler failure (Profile() returns error)
  • NetBird installation failure (Install() returns error)

These are one-line mock changes to the existing infrastructure and would meaningfully improve confidence.

patelspratik
patelspratik previously approved these changes Mar 6, 2026
@patelspratik patelspratik force-pushed the codedor/register/hardware_profiler branch from 93a9a5b to e75f724 Compare March 6, 2026 23:46
@patelspratik patelspratik merged commit cc138e8 into main Mar 6, 2026
9 checks passed
@patelspratik patelspratik deleted the codedor/register/hardware_profiler branch March 6, 2026 23:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants