Conversation

@goyalpalak18

Description

I’ve fixed a critical race condition in our container lifecycle where we were telling the runtime (CRI) that the container had started successfully before we actually knew if the VMM was running.

Previously, we were using syscall.Exec() to replace the urunc process with the VMM. The problem with exec is that it doesn't return on success, so we were forced to send the StartSuccess signal optimistically before making the call. This meant that if the VMM failed to launch (missing binary, bad config, etc.), we ended up with a "zombie" state: Kubernetes thought the container was Running, but the process was actually dead.

To fix this, I refactored the startup logic to use a fork+exec pattern. My new flow looks like this (a rough sketch follows the list):

  1. Spawn: I start the VMM as a child process using exec.Command().Start().
  2. Verify: I wait a brief window (100ms) and then check if the process is still alive using kill -0.
  3. Record: I capture the actual child PID and save it to state.json.
  4. Signal: I only send StartSuccess once I've confirmed the VMM is up and running.
  5. Detach: Finally, I exit the reexec process, leaving the VMM correctly orphaned to the shim.
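
For illustration, here is a rough Go sketch of that flow. The function name, saveState, and sendStartSuccess are placeholders for the real state and signalling plumbing, not actual urunc identifiers:

```go
package main

import (
	"fmt"
	"os/exec"
	"syscall"
	"time"
)

// spawnAndVerifyVMM sketches the spawn/verify/record/signal/detach flow
// described above (illustrative names only).
func spawnAndVerifyVMM(vmmPath string, vmmArgs []string,
	saveState func(pid int) error, sendStartSuccess func()) error {

	// 1. Spawn: start the VMM as a child instead of replacing ourselves with it.
	cmd := exec.Command(vmmPath, vmmArgs...)
	if err := cmd.Start(); err != nil {
		return fmt.Errorf("spawning VMM: %w", err)
	}

	// 2. Verify: wait a short window, then do the equivalent of `kill -0 <pid>`.
	// Signal 0 only checks that the process exists (it also succeeds for an
	// exited child that has not been reaped yet).
	time.Sleep(100 * time.Millisecond)
	if err := cmd.Process.Signal(syscall.Signal(0)); err != nil {
		return fmt.Errorf("VMM died right after start: %w", err)
	}

	// 3. Record: persist the real child PID (e.g. into state.json).
	if err := saveState(cmd.Process.Pid); err != nil {
		return err
	}

	// 4. Signal: report success only after the VMM was observed alive.
	sendStartSuccess()

	// 5. Detach: the caller exits the reexec process, leaving the VMM to be
	// re-parented to the shim.
	return nil
}
```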

Changes

  • pkg/unikontainers/unikontainers.go:
    • Switched from syscall.Exec to exec.Command().Start().
    • Added the liveness check (wait 100ms + kill -0) to catch immediate crashes.
    • Updated the logic to ensure we write the correct child PID to state.json before exiting.
  • pkg/unikontainers/types/types.go:
    • I updated the VMM interface with a BuildExecCmd method. This lets us build the command arguments separately from executing them (see the interface sketch after this list).
  • Hypervisor Implementations (qemu.go, firecracker.go, hvt.go, spt.go):
    • I implemented BuildExecCmd for all our supported hypervisors.
    • Side fix: While I was in firecracker.go, I also fixed a swallowed json.Marshal error I noticed.
  • pkg/unikontainers/hypervisors/hedge.go:
    • Added a stub for BuildExecCmd to keep the interface satisfied.
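
For reference, a minimal sketch of the interface change is below. BuildExecCmd's exact return type is not visible in this PR, so an argv slice is assumed; ExecArgs and Unikernel already exist in urunc and are stubbed here only so the snippet compiles on its own:

```go
package types

// Stubs standing in for the existing urunc types.
type ExecArgs struct{}

type Unikernel interface{}

type VMM interface {
	// BuildExecCmd builds the monitor command without executing it, so
	// argument construction can be validated separately from the exec call.
	BuildExecCmd(args ExecArgs, ukernel Unikernel) ([]string, error)
}
```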

Testing

I ran a few scenarios to confirm this behaves as expected:

  • Reproduction (The "Before" State): I intentionally configured urunc with a bad path to the QEMU binary. Before this fix, the pod would falsely report as Running with no logs and a bogus PID.
  • Verification (The Fix): Using that same bad configuration, the pod now correctly fails to start. Kubernetes catches the error immediately (CreateContainerError / CrashLoopBackOff), which is exactly what we want.
  • Success Path: I also verified that valid unikernels still boot up fine, network comes up, and the process handoff works correctly.

Impact

  • No More Silent Failures: If the VMM breaks, the user will actually see it fail now.
  • Better Observability: Kubernetes will receive proper exit codes and error messages.
  • Accurate State: The PID inside state.json is finally trustworthy.

@netlify

netlify bot commented Jan 24, 2026

Deploy Preview for urunc ready!

🔨 Latest commit: b728321
🔍 Latest deploy log: https://app.netlify.com/projects/urunc/deploys/69788ab6a073b50008b84ea5
😎 Deploy Preview: https://deploy-preview-397--urunc.netlify.app

@goyalpalak18
Author

Hi @cmainas , this fixes the critical race condition causing silent zombie containers by switching to a fork+exec model with a liveness check.

Please take a look when you have a moment.

Contributor

@cmainas left a comment


Hello @goyalpalak18,

thank you for this contribution. This is indeed a well-known race, also documented in the comments inside the codebase. However, I have some comments:

  • I was not able to reproduce your Testing description. K8s was correctly identifying the container as failed.
  • This proposal is a significant change to urunc's execution model, where the monitor process is the "init" (PID 1) of the host container. It is not easy to make such a design change this way. There are ways to resolve the race issue without changing the design; the suggested change of pre-building the monitor command before sending the started message is one of them.
  • As you can see from the tests, the changes do not work.

Add BuildExecCmd() method to VMM interface to validate command
construction before sending StartSuccess message. This prevents
reporting a container as "Running" when the VMM command cannot
be built (e.g., due to invalid arguments or failed config writes).

The fix preserves the existing PID 1 execution model where the
reexec process uses syscall.Exec to become the VMM.
@goyalpalak18 force-pushed the fix/vmm-startup-verification branch from d40f5ac to f1ee61b on January 26, 2026, 18:03
@goyalpalak18
Author

@cmainas Thanks for the review. I have updated the PR based on your feedback:

Reverted Execution Model: I undid the PID 1 changes, so the original process hierarchy is preserved.

Implemented Suggestion: I added BuildExecCmd to validate the command before sending the StartSuccess message.

This resolves the race condition without changing the design (roughly as sketched below).
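
Roughly, the updated flow looks like the sketch below. Names and stub types are illustrative only; the real definitions live in pkg/unikontainers/types:

```go
package main

import "fmt"

// Stubs standing in for the existing urunc types.
type ExecArgs struct{}

type Unikernel interface{}

type VMM interface {
	BuildExecCmd(args ExecArgs, u Unikernel) ([]string, error)
	Execve(args ExecArgs, u Unikernel) error
}

// startVMM sketches the revised flow: build the command once purely to
// validate it, send StartSuccess, then let the driver perform the exec,
// preserving the original PID 1 model.
func startVMM(vmm VMM, args ExecArgs, u Unikernel, sendStartSuccess func()) error {
	// Invalid arguments or failed config writes are caught here, before the
	// container is ever reported as Running.
	if _, err := vmm.BuildExecCmd(args, u); err != nil {
		return fmt.Errorf("validating VMM command: %w", err)
	}

	// Only a buildable command reaches this point.
	sendStartSuccess()

	// Each hypervisor's Execve builds its command again and performs the
	// actual exec, so the reexec process still becomes the VMM.
	return vmm.Execve(args, u)
}
```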

Contributor

@cmainas left a comment


Thank you @goyalpalak18 for the changes, but why are we building the exec command twice? Also, if we have the build command, then we can simply exec from unikontainers.Exec rather than inside each monitor.

}

func (fc *Firecracker) Execve(args types.ExecArgs, ukernel types.Unikernel) error {
	exArgs, err := fc.BuildExecCmd(args, ukernel)

Why are we building the exec command twice?

@goyalpalak18
Author

@cmainas Thanks for catching that! I've refactored the logic to simplify the flow as you suggested:

  • Removed the double build: BuildExecCmd is now called only once to get the arguments, which we then pass directly to the exec call.
  • Centralized Exec: I moved syscall.Exec up into the common unikontainers.go logic, so we don't need to repeat it inside every driver.
  • Simplified the Interface: I replaced Execve with a lighter PreExec hook in the VMM interface to handle just the driver-specific setup (like seccomp) before the main exec happens (see the sketch after this list).
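
A rough sketch of the resulting shape is below. The PreExec signature is an assumption based on the description above, and the stub types stand in for the real ones in pkg/unikontainers/types:

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// Stubs standing in for the existing urunc types.
type ExecArgs struct{}

type Unikernel interface{}

type VMM interface {
	// BuildExecCmd builds the monitor argv once, without executing it.
	BuildExecCmd(args ExecArgs, u Unikernel) ([]string, error)
	// PreExec performs driver-specific setup (e.g. seccomp) just before exec.
	PreExec(args ExecArgs, u Unikernel) error
}

// startVMM stands in for the shared logic in unikontainers.go, so the
// syscall.Exec call is no longer repeated inside every hypervisor driver.
func startVMM(vmm VMM, args ExecArgs, u Unikernel, sendStartSuccess func()) error {
	// Single build: the same argv is validated and then executed.
	argv, err := vmm.BuildExecCmd(args, u)
	if err != nil {
		return fmt.Errorf("building VMM command: %w", err)
	}

	sendStartSuccess()

	// Driver-specific hook before the handoff.
	if err := vmm.PreExec(args, u); err != nil {
		return err
	}

	// One exec site: the reexec process replaces itself with the VMM.
	return syscall.Exec(argv[0], argv, os.Environ())
}
```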
