feat(hardware): use NVML to grab the hardware profile during the regi…#314
feat(hardware): use NVML to grab the hardware profile during the regi…#314patelspratik merged 13 commits intomainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR updates the register flow to collect a richer cross-platform hardware profile, leveraging NVML on supported platforms, and persists/sends that profile during node registration.
Changes:
- Replaced the old command/file-based hardware collection with a
HardwareProfilerinterface and per-OS implementations (Linux/macOS + stub). - Added NVML-based GPU + interconnect probing and expanded the local model from
NodeSpectoHardwareProfile. - Updated registration persistence and RPC conversion/tests to use
HardwareProfile.
Reviewed changes
Copilot reviewed 13 out of 14 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/cmd/register/rpcclient_test.go | Updates RPC conversion tests to use HardwareProfile and new fields. |
| pkg/cmd/register/rpcclient.go | Converts HardwareProfile into proto NodeSpec (with TODOs for new proto fields). |
| pkg/cmd/register/register_test.go | Switches register tests to stub hardware profiling via mockHardwareProfiler. |
| pkg/cmd/register/register.go | Wires registration flow to HardwareProfiler and persists HardwareProfile. |
| pkg/cmd/register/hardware_test.go | Updates formatting/storage parsing tests and removes old nvidia-smi parsing tests. |
| pkg/cmd/register/hardware_stub.go | Adds unsupported-platform stub SystemHardwareProfiler. |
| pkg/cmd/register/hardware_linux.go | Adds Linux hardware profiling via sysfs/procfs + NVML probing. |
| pkg/cmd/register/hardware_darwin.go | Adds macOS hardware profiling via sysctl/sw_vers/diskutil parsing. |
| pkg/cmd/register/hardware.go | Introduces HardwareProfile/supporting structs + formatting and parsing helpers. |
| pkg/cmd/register/gpu_nvml.go | Adds NVML GPU/interconnect probing (linux/windows build tags). |
| pkg/cmd/register/device_registration_store_test.go | Updates persistence round-trip test to use HardwareProfile. |
| pkg/cmd/register/device_registration_store.go | Renames persisted field to hardware_profile and updates struct type. |
| go.mod | Adds NVML dependency and makes x/sys a direct dependency. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
pkg/cmd/register/hardware_stub.go
Outdated
| //go:build !linux && !darwin && !windows | ||
|
|
||
| package register | ||
|
|
||
| // SystemHardwareProfiler is a no-op adapter for unsupported platforms. | ||
| type SystemHardwareProfiler struct{} | ||
|
|
||
| func (p *SystemHardwareProfiler) Profile() (*HardwareProfile, error) { | ||
| return &HardwareProfile{}, nil |
There was a problem hiding this comment.
This stub excludes Windows (!windows), but defaultRegisterDeps() constructs &SystemHardwareProfiler{} and gpu_nvml.go is built for windows, so the repo will fail to compile on Windows because no SystemHardwareProfiler implementation exists under //go:build windows. Fix by adding a hardware_windows.go implementation (recommended) or adjusting build tags so Windows is covered by a real implementation or a stub. Also, returning an entirely empty profile loses Architecture; consider populating Architecture: runtime.GOARCH at minimum so registration doesn’t send a blank arch on unsupported platforms.
| //go:build !linux && !darwin && !windows | |
| package register | |
| // SystemHardwareProfiler is a no-op adapter for unsupported platforms. | |
| type SystemHardwareProfiler struct{} | |
| func (p *SystemHardwareProfiler) Profile() (*HardwareProfile, error) { | |
| return &HardwareProfile{}, nil | |
| //go:build !linux && !darwin | |
| package register | |
| import "runtime" | |
| // SystemHardwareProfiler is a no-op adapter for unsupported platforms. | |
| type SystemHardwareProfiler struct{} | |
| func (p *SystemHardwareProfiler) Profile() (*HardwareProfile, error) { | |
| return &HardwareProfile{ | |
| Architecture: runtime.GOARCH, | |
| }, nil |
pkg/cmd/register/hardware_stub.go
Outdated
| //go:build !linux && !darwin && !windows | ||
|
|
||
| package register | ||
|
|
||
| // SystemHardwareProfiler is a no-op adapter for unsupported platforms. | ||
| type SystemHardwareProfiler struct{} | ||
|
|
||
| func (p *SystemHardwareProfiler) Profile() (*HardwareProfile, error) { | ||
| return &HardwareProfile{}, nil |
There was a problem hiding this comment.
This stub excludes Windows (!windows), but defaultRegisterDeps() constructs &SystemHardwareProfiler{} and gpu_nvml.go is built for windows, so the repo will fail to compile on Windows because no SystemHardwareProfiler implementation exists under //go:build windows. Fix by adding a hardware_windows.go implementation (recommended) or adjusting build tags so Windows is covered by a real implementation or a stub. Also, returning an entirely empty profile loses Architecture; consider populating Architecture: runtime.GOARCH at minimum so registration doesn’t send a blank arch on unsupported platforms.
| //go:build !linux && !darwin && !windows | |
| package register | |
| // SystemHardwareProfiler is a no-op adapter for unsupported platforms. | |
| type SystemHardwareProfiler struct{} | |
| func (p *SystemHardwareProfiler) Profile() (*HardwareProfile, error) { | |
| return &HardwareProfile{}, nil | |
| //go:build !linux && !darwin | |
| package register | |
| import "runtime" | |
| // SystemHardwareProfiler is a no-op adapter for unsupported platforms. | |
| type SystemHardwareProfiler struct{} | |
| func (p *SystemHardwareProfiler) Profile() (*HardwareProfile, error) { | |
| return &HardwareProfile{ | |
| Architecture: runtime.GOARCH, | |
| }, nil |
| func parseDiskutilSize(val string, dev *StorageDevice) { | ||
| parts := strings.Fields(val) | ||
| if len(parts) < 1 { | ||
| return | ||
| } | ||
| var size int64 | ||
| for _, c := range parts[0] { | ||
| if c >= '0' && c <= '9' { | ||
| size = size*10 + int64(c-'0') | ||
| } | ||
| } | ||
| dev.StorageBytes = size | ||
| } |
There was a problem hiding this comment.
This parser will produce incorrect sizes for common diskutil info formats (e.g., Disk Size: 1.0 TB (1,000,000,000,000 Bytes)), because it only extracts digits from the first token (1.0 -> 10). Parse the byte count inside parentheses (the (... Bytes) portion) or use a more robust strategy (e.g., regexp to capture the integer byte value) so StorageBytes is accurate.
There was a problem hiding this comment.
Claude independently found this issue as well
| // parseDiskutilInfoOutput parses "diskutil info -all" output into StorageDevice entries. | ||
| func parseDiskutilInfoOutput(output string) []StorageDevice { |
There was a problem hiding this comment.
New macOS parsing logic (parseDiskutilInfoOutput / parseDiskutilLine / parseDiskutilSize) is introduced without unit tests, while this package already has parsing unit tests (e.g., hardware_test.go). Add table-driven tests covering representative diskutil info -all snippets (whole disk vs partition identifiers, size formats with parentheses/commas, SSD/HDD/NVMe protocol) to prevent silent mis-parsing.
| // parseDiskutilLine processes a single line of diskutil output, updating the | ||
| // current device state. Returns the (possibly new) current device pointer. | ||
| func parseDiskutilLine(line string, current *StorageDevice, devices *[]StorageDevice) *StorageDevice { |
There was a problem hiding this comment.
New macOS parsing logic (parseDiskutilInfoOutput / parseDiskutilLine / parseDiskutilSize) is introduced without unit tests, while this package already has parsing unit tests (e.g., hardware_test.go). Add table-driven tests covering representative diskutil info -all snippets (whole disk vs partition identifiers, size formats with parentheses/commas, SSD/HDD/NVMe protocol) to prevent silent mis-parsing.
pkg/cmd/register/hardware.go
Outdated
| type StorageDevice struct { | ||
| Name string `json:"name,omitempty"` | ||
| StorageBytes int64 `json:"storage_bytes"` | ||
| StorageType string `json:"storage_type,omitempty"` // "SSD" or "HDD" |
There was a problem hiding this comment.
StorageType is documented as only "SSD" or "HDD", but Linux and macOS profilers can emit "NVMe". Update the comment (and any assumptions elsewhere) to reflect the actual set of values (e.g., "SSD", "HDD", "NVMe", or consider separating interface type from media type).
| StorageType string `json:"storage_type,omitempty"` // "SSD" or "HDD" | |
| StorageType string `json:"storage_type,omitempty"` // e.g., "SSD", "HDD", or "NVMe" |
| go get -u github.com/brevdev/dev-plane@$$COMMIT; \ | ||
| GOPRIVATE=github.com/brevdev/* go get -u github.com/brevdev/dev-plane@$$COMMIT; \ | ||
| go get buf.build/gen/go/brevdev/devplane/grpc/go@$$COMMIT; \ | ||
| go get buf.build/gen/go/brevdev/devplane/protocolbuffers/go@$$COMMIT; \ | ||
| go mod tidy; \ | ||
| GOPRIVATE=github.com/brevdev/* go mod tidy; \ | ||
| echo "Successfully updated to $$COMMIT" No newline at end of file |
There was a problem hiding this comment.
Unrelated but this allows for fetching internal dependencies without relying on the GOPRIVATE to have been set by the user.
| builds: | ||
| - env: | ||
| - CGO_ENABLED=0 | ||
| - CGO_ENABLED=1 |
There was a problem hiding this comment.
We now always need CGO
pkg/cmd/register/gpu_nvml.go
Outdated
| ) | ||
|
|
||
| // archNames maps NVML compute capability (major version) to GPU architecture name. | ||
| var archNames = map[int]string{ |
There was a problem hiding this comment.
- GPU architecture mapping inaccuracies
- Compute major version 7 → "Volta/Turing" is imprecise (Volta=sm_70, Turing=sm_75). A T4 showing "Volta/Turing" is technically misleading.
- Ada Lovelace GPUs have compute capability 8.9 (major=8), so they'd map to "Ampere", not "Hopper/Ada Lovelace". Major version 9 should just be "Hopper".
| } | ||
|
|
||
| // parseDiskutilInfoOutput parses "diskutil info -all" output into StorageDevice entries. | ||
| func parseDiskutilInfoOutput(output string) []StorageDevice { |
There was a problem hiding this comment.
also add tests for this?
|
Dropping this comment I made in the teams chat for posterity. Dear @jarrad :
|
| func parseStorageOutput(output string) []NodeStorage { | ||
| var devices []NodeStorage | ||
| // returning one StorageDevice entry per disk device. ROTA=0 → SSD, ROTA=1 → HDD. | ||
| func parseStorageOutput(output string) []StorageDevice { |
There was a problem hiding this comment.
unused (only referenced in tests)
| owner: brevdev | ||
| name: brev-cli |
There was a problem hiding this comment.
These are now required in version: 2
| folder: Formula | ||
| directory: Formula |
There was a problem hiding this comment.
Renamed in version: 2
| format: zip | ||
| formats: [zip] |
There was a problem hiding this comment.
format deprecated in favor of formats in version: 2
| - -X github.com/brevdev/brev-cli/pkg/cmd/version.Version={{.Tag}} | ||
| - id: darwin-arm64 | ||
| env: | ||
| - CGO_ENABLED=1 |
There was a problem hiding this comment.
It looks like under the hood the goreleaser image finds its own C toolchain for OSX
|
Blocking Issues
Files: hardware_linux.go, hardware_darwin.go Both platform implementations always return nil error. The HardwareProfiler interface declares Profile() (*HardwareProfile, error), but the error is never used. This means a complete profiling failure (every probe fails) silently returns a near-empty HardwareProfile with only Architecture set — indistinguishable from success. Options:
File: hardware_darwin.go The fallback digit-extraction logic on strings like "500.1 GB" would produce 5001 (bytes), not ~500 GB: for _, c := range parts[0] { Recommendation: Remove the fallback entirely (leave StorageBytes as 0 if the regex doesn't match) or parse it properly with unit awareness.
File: hardware.go The function accesses s.ProductName without a nil check on s. Add a guard:
File: rpcclient_test.go toProtoNodeSpec maps PCIe Generation and Width to proto fields, but only the NVLink branch is tested. A mapping bug in the PCIe path would go undetected. Add a test case with a PCIe interconnect. Should Fix
File: hardware.go parseMemInfoContent() and parseStorageOutput() appear unused — Linux now uses syscall.Sysinfo and probeStorageSysfs() respectively. If they're intentionally kept for testing/fallback, add a comment. Otherwise remove them.
File: gpu_nvml.go Inner name := archName(...) shadows the outer name (GPU device name) from a few lines above. Rename to archStr or similar for clarity.
File: gpu_nvml.go 18 is hardcoded as the max NVLink link count. Extract to a named constant:
File: Makefile This target doesn't set CGO_ENABLED=1, but all other build targets now require it. Since Go disables CGO by default when cross-compiling, this would produce a binary without CGO support. Verify this doesn't break smoke tests or workspace setup.
File: hardware.go NVLink shows ActiveLinks and Version, but PCIe interconnects don't show Generation or Width. The PR output shows "Link: PCIe (GPU 0)" without gen/width. Consider:
These are one-line mock changes to the existing infrastructure and would meaningfully improve confidence. |
93a9a5b to
e75f724
Compare
Tested on a g4 (Tesla T4 GPU) EC2 instance: