fix(packaging): add upgrade migration docs and podman socket retry #1507

Merged
TaylorMutch merged 1 commit into NVIDIA:main from maxamillion:fix/packaging-follow-ups
May 21, 2026

Conversation

@maxamillion
Collaborator

Summary

Follow-up to #1415. Adds the upgrade migration guidance that was missing from the original PR, and makes the Podman driver resilient to the transient socket unavailability observed during RPM upgrade testing.

Intended to land after #1415 merges. The branch is based on `use-toml-in-install`.

Related Issue

Follow-up to #1415

Changes

  • docs(rpm): Add "Migrating from gateway.env" section to `TROUBLESHOOTING.md` covering:
    • Backward compatibility: existing `gateway.env` files are still honored via `EnvironmentFile`
    • Env-var to TOML key mapping table
    • Three breaking changes from #1415 (refactor(packaging): rely on gateway runtime defaults): default port 8080→17670, bind address `0.0.0.0`→`127.0.0.1`, and a database path move that requires a manual migration command
    • `podman.socket` restart added to the upgrade procedure with explanation
  • docs(rpm): Add upgrade callout block to `CONFIGURATION.md` pointing at the migration section
  • fix(podman): Retry the `PodmanComputeDriver::new()` ping up to 5 times (2s between attempts) to tolerate transient socket unavailability. During RPM upgrade testing, `podman.socket` became non-functional when its unit file changed on disk mid-run. Because the unit uses `Wants=podman.socket` (not `Requires=`), the gateway can start before the socket recovers, hit `Connection refused`, and enter a crash-loop. The retry gives the socket a roughly 10s window to re-activate.
  • chore(rpm): Update `EnvironmentFile` comment in the RPM spec to document the backward-compatibility intent.
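
For readers who have not opened the diff, the bounded retry described in the fix(podman) item can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: `retry_with_delay` is a hypothetical helper name, and the closure stands in for whatever ping `PodmanComputeDriver::new()` actually performs.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical helper (not from the PR): call `op` up to `max_attempts`
/// times, sleeping `delay` after each failed attempt except the last.
/// Returns the first success, or the error from the final attempt.
fn retry_with_delay<T, E, F>(max_attempts: u32, delay: Duration, mut op: F) -> Result<T, E>
where
    F: FnMut(u32) -> Result<T, E>,
{
    assert!(max_attempts >= 1, "need at least one attempt");
    let mut attempt = 1;
    loop {
        match op(attempt) {
            // The operation (e.g. a socket ping) succeeded: stop retrying.
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller.
            Err(err) if attempt >= max_attempts => return Err(err),
            // Transient failure: wait, then try again.
            Err(_) => {
                sleep(delay);
                attempt += 1;
            }
        }
    }
}
```

With the values from this PR the call would look like `retry_with_delay(5, Duration::from_secs(2), |_| ping())`, where `ping` is a placeholder. Note that this sketch sleeps only between attempts, so it waits about 8s in total plus the time each attempt takes; the 10s window quoted above would match an implementation that also sleeps after the final failure, which may be how the merged code behaves.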

Testing

  • `mise run pre-commit` — lint, format, license, markdown, helm, Python tests all pass
  • `cargo test -p openshell-driver-podman` — 58/58 pass
  • Manual RPM install + upgrade test on Fedora VM — confirmed the `podman.socket` crash-loop scenario and validated the retry fix resolves it

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)

@maxamillion maxamillion requested review from a team, derekwaynecarr and mrunalp as code owners May 21, 2026 20:10

copy-pr-bot Bot commented May 21, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@maxamillion maxamillion force-pushed the fix/packaging-follow-ups branch from 632bfda to 62eb7e7 Compare May 21, 2026 20:15
TaylorMutch previously approved these changes May 21, 2026
@TaylorMutch
Collaborator

/ok to test 62eb7e7

After NVIDIA#1415 ships, users upgrading from previous releases need guidance
on the gateway.env deprecation, port/bind/database path changes, and
the podman.socket restart requirement.

- docs(rpm): add 'Migrating from gateway.env' section to TROUBLESHOOTING
  covering backward compatibility, env-to-TOML key mapping, and three
  breaking changes (default port 8080->17670, bind address 0.0.0.0->127.0.0.1,
  database path move). Add podman.socket restart step to upgrade procedure.
- docs(rpm): add upgrade callout to CONFIGURATION.md pointing at migration
  section.
- fix(podman): retry PodmanComputeDriver ping up to 5 times with 2s delay
  to tolerate transient socket unavailability after package upgrades.
  The systemd unit uses Wants=podman.socket (not Requires) so the gateway
  can start while the socket is briefly re-activating after an RPM upgrade
  changes its unit file on disk.
- chore(rpm): update EnvironmentFile comment in RPM spec to explain
  backward-compatibility intent.

Signed-off-by: Adam Miller <admiller@redhat.com>
@maxamillion maxamillion force-pushed the fix/packaging-follow-ups branch from 4409315 to 790adb8 Compare May 21, 2026 21:10
@TaylorMutch
Collaborator

/ok to test 790adb8

@TaylorMutch TaylorMutch merged commit f5b0ad7 into NVIDIA:main May 21, 2026
30 checks passed
3 participants