
Feat/cluster ops suite #4

Merged
OneNoted merged 6 commits into main from feat/cluster-ops-suite
Apr 20, 2026

@OneNoted
Owner

Expands pvt into a usable operational surface for Proxmox + Talos clusters and replaces the Zig TUI with a Rust Ratatui TUI (because I like Ratatui, sorry!).

  • TUI rewrite — vitui reimplemented in Rust; preserves the four operational views, the pvt.yaml
    contract, and the pvt tui entrypoint.
  • Centralized health — shared health snapshot for configured clusters; status now reads from the same
    path, and a new doctor command covers local setup checks.
  • Drift + machine config diffs — plan-only drift detection with known remediation output and normalized
    Talos machine config diffs so operators can review desired vs. live state before acting.
  • Safe backup & upgrade workflows — Proxmox backup views/pruning scoped to configured VMIDs, plan-first
    node lifecycle commands, and upgrade preflight/postflight reports that fail on unhealthy results and honor
    the configured health timeout.
  • Cleanup — removed stale status helpers left by the health migration, reused shared config loading in
    upgrade, made backup VMID filtering directly testable.

Design notes

  • Remediation is plan-only — applying qm/talosctl changes is environment-specific and gated on
    operator approval.
  • Lifecycle commands use structured argv (not reparsed display strings) so config values can't leak as
    extra flags.
  • Talos config path and kubeconfig are kept distinct; talos.config_path is never reused as kubectl --kubeconfig.
  • No secret-bearing args on the Proxmox API auth command line.

…tui app

The existing TUI path was unstable enough to be effectively unusable, so this change rewrites vitui in Rust, keeps the four operational views, preserves the pvt.yaml contract, and rewires `pvt tui` to launch the new binary.

Constraint: Preserve the existing pvt config contract and `pvt tui` entrypoint while removing the Zig TUI as the primary path
Rejected: Full Go-to-Rust CLI rewrite now | too broad for the immediate stability problem
Rejected: Native HTTP/SDK rewrites for every integration | higher risk than parity-first subprocess/API compatibility
Confidence: medium
Scope-risk: broad
Reversibility: messy
Directive: Keep Talos config and kubeconfig handling distinct; do not reuse `talos.config_path` as `kubectl --kubeconfig`
Directive: Do not reintroduce secret-bearing command-line args for Proxmox API auth
Tested: cargo test; cargo build --release; cargo run -- --help; go test ./...
Not-tested: Live Proxmox/Talos/Kubernetes connectivity against a real cluster

Create a shared health snapshot for configured clusters and add a doctor command for local setup checks. Status now reads from the same snapshot so later commands can build on one observation path.

Constraint: Live Proxmox and Talos calls must degrade into reportable warnings where possible.
Rejected: Keep status on a separate Talos-only path | that would duplicate health collection and weaken later drift checks.
Confidence: medium
Scope-risk: moderate
Tested: go test ./...; go build ./...; go vet ./...
Not-tested: Live Proxmox, Talos, and Kubernetes infrastructure

Add drift detection, known remediation plan output, and normalized Talos machine config diffing so operators can compare desired and live-adjacent state before making changes.

Constraint: Remediation must remain plan-only because applying qm/talosctl changes is environment-specific.
Rejected: Auto-apply drift fixes | too destructive without a cluster-specific approval model.
Confidence: medium
Scope-risk: moderate
Tested: go test ./...; go build ./...; go vet ./...
Not-tested: Live drift against production Proxmox or Talos clusters

Round out the Go operational surface with Proxmox backup retention commands, plan-first node lifecycle commands, and upgrade preflight/postflight reports. Backup views and pruning are scoped to configured VMIDs, postflight reports fail on unhealthy results, and the upgrade safety gate honors the configured health timeout.

Constraint: Node and backup operations can be destructive in real clusters.
Constraint: gRPC v1.76.0 has a critical advisory and must be updated.
Rejected: Enable lifecycle mutation by default | operators need reviewable plans before host or backup changes.
Rejected: Reparse executable lifecycle commands from display strings | structured argv avoids config values becoming extra flags.
Confidence: medium
Scope-risk: broad
Tested: gofmt clean; go test -count=1 ./...; go build ./...; go vet ./...; cargo fmt --check; cargo test; cargo check; cargo clippy --all-targets -- -D warnings; validator approvals from architect/security/code-reviewer
Not-tested: Live backup deletion, kubectl drain, talosctl reboot, or live upgrade reports

Tighten the recent operational feature paths without changing behavior. Removed stale status helpers left behind by the health snapshot migration, reused shared config loading in the upgrade command, and made backup VMID filtering directly testable.

Constraint: Preserve behavior locked by existing Go and Rust quality gates.
Rejected: Broader command refactors | outside the cleanup scope and unnecessary for the identified smells.
Confidence: high
Scope-risk: narrow
Tested: gofmt clean; go test -count=1 ./...; go build ./...; go vet ./...; cargo fmt --check; cargo test; cargo check; cargo clippy --all-targets -- -D warnings
Not-tested: Live Proxmox, Talos, and Kubernetes workflows
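
Making VMID filtering "directly testable" usually means pulling it into a pure function. The sketch below shows one possible shape; `Backup`, `FilterByVMID`, and the sample volume IDs are illustrative and not taken from the pvt code.

```go
package main

import "fmt"

// Backup is a hypothetical, trimmed-down backup record.
type Backup struct {
	VMID  int
	Volid string
}

// FilterByVMID keeps only backups whose VMID appears in the configured set.
// It does no I/O, so it can be unit tested without a Proxmox API.
func FilterByVMID(backups []Backup, configured []int) []Backup {
	allowed := make(map[int]struct{}, len(configured))
	for _, id := range configured {
		allowed[id] = struct{}{}
	}
	var out []Backup
	for _, b := range backups {
		if _, ok := allowed[b.VMID]; ok {
			out = append(out, b)
		}
	}
	return out
}

func main() {
	all := []Backup{
		{VMID: 100, Volid: "local:backup/vzdump-qemu-100-2026_04_01-00_00_00.vma.zst"},
		{VMID: 999, Volid: "local:backup/vzdump-qemu-999-2026_04_01-00_00_00.vma.zst"},
	}
	// Only VMID 100 is configured, so the unrelated VM's backup is ignored.
	for _, b := range FilterByVMID(all, []int{100, 101}) {
		fmt.Println(b.VMID, b.Volid)
	}
}
```

Scoping the list before any pruning decision keeps backups of VMs outside the configured cluster out of view and out of reach.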
@OneNoted OneNoted merged commit 8a8863b into main Apr 20, 2026
1 check passed