OCPBUGS-83817: Fix orphaned shell processes in pod terminal on WebSocket disconnect#16307
Conversation
…ket disconnect When a terminal WebSocket connection drops, the shell process inside the container was never terminated, leading to accumulation of orphaned processes. This occurred because gorilla/websocket's Close() only drops the TCP connection without sending a WebSocket close frame, and no exit command was sent to the shell. Send "exit" to the exec STDIN channel and a proper WebSocket close frame to the Kubernetes API server when the proxy connection closes. Use a dedicated backendWriteMutex to prevent concurrent writes to the backend connection, and check the negotiated subprotocol to use the correct frame encoding.
|
@Leo6Leo: This pull request references Jira Issue OCPBUGS-83817, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/jira refresh |
|
@Leo6Leo: This pull request references Jira Issue OCPBUGS-83817, which is invalid:
Comment DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@Leo6Leo: This pull request references Jira Issue OCPBUGS-83817, which is invalid:
Comment DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/jira refresh |
|
@Leo6Leo: This pull request references Jira Issue OCPBUGS-83817, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@Leo6Leo: This pull request references Jira Issue OCPBUGS-83817, which is valid. 3 validation(s) were run on this bug
DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/cherry-pick release-4.22 |
|
@Leo6Leo: once the present PR merges, I will cherry-pick it on top of DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
📝 WalkthroughWalkthroughThis change modifies the WebSocket HTTP proxy handling in 🚥 Pre-merge checks | ✅ 9 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (9 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/proxy/proxy.go`:
- Around line 260-265: The deferred cleanup grabs backendWriteMutex
unconditionally which can block forever if copyMsgs is stuck writing; change the
deferred closure in the function containing backendWriteMutex so it attempts to
acquire the lock with a bounded wait (spawn a goroutine that calls
backendWriteMutex.Lock() and signals on a channel, then select between that
channel and time.After(timeout)); if the lock isn't acquired within the timeout,
log a warning and proceed to call backend.Close() (and still try to
sendExecExitCommand in a non-blocking manner or drop it) to avoid stalling
disconnect; apply the same bounded-lock pattern wherever backendWriteMutex is
used in deferred teardown (e.g., around sendExecExitCommand in the defer and the
analogous case near line 309).
- Line 259: Normalize the request path before checking for the "/exec" suffix to
handle trailing-slash variants: compute a normalizedPath from r.URL.Path (e.g.,
trim trailing slashes or use path.Clean) and then set isExec by testing
strings.HasSuffix(normalizedPath, "/exec"); update the reference to r.URL.Path
where isExec is computed so the detection logic uses normalizedPath instead of
the raw r.URL.Path.
- Around line 335-345: The switch on backend.Subprotocol() in proxy.go currently
handles "base64.channel.k8s.io", "v4.channel.k8s.io", and "v5.channel.k8s.io"
but omits v2 and v3 so exec exit is skipped for those protocols; update the
switch to include "v2.channel.k8s.io" and "v3.channel.k8s.io" and treat them the
same as the "v4.channel.k8s.io"/"v5.channel.k8s.io" branch (i.e., set msg to
append([]byte{0}, []byte("exit\r")...) and msgType to websocket.BinaryMessage)
OR, if you intentionally restrict to v4+, add a clear comment and observable
log/error where backend.Subprotocol() is checked (the switch and the default
klog.V(4).Infof) documenting that only v4+ is supported so the behavior is
explicit.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Pro Plus
Run ID: 68d7e985-8731-4746-9550-c4d09c61ad53
📒 Files selected for processing (1)
pkg/proxy/proxy.go
|
/retest-required |
1 similar comment
|
/retest-required |
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
QA Verification Evidence
Verification Steps
Before / After
Evidence GIFs
NotesThis PR modifies only Warning This verification was performed by an AI agent. Results may contain false positives or miss Automated QA verification by Claude Code |
|
/lgtm |
|
@jhadvig: This PR has been marked as verified by DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jhadvig, Leo6Leo The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/cherry-pick release-4.22 |
|
@jhadvig: once the present PR merges, I will cherry-pick it on top of DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@Leo6Leo: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
@Leo6Leo: Jira Issue Verification Checks: Jira Issue OCPBUGS-83817 Jira Issue OCPBUGS-83817 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@Leo6Leo: new pull request created: #16694 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
Fix included in release 5.0.0-0.nightly-2026-07-01-190408 |










Summary
shprocessesClose()only drops the TCP connection without sending a WebSocket close frame, and noexitcommand was sent to the shell on disconnectexitthrough the exec STDIN channel and a proper WebSocket close frame to the Kubernetes API server when the proxy connection closesbackendWriteMutexto prevent concurrent writes to the backend connection (gorilla/websocket's one-writer contract)base64.channel.k8s.iovsv4/v5.channel.k8s.io) to use the correct frame encodingDetails
The cleanup in
componentWillUnmount(frontend) only runs when the user navigates away from the terminal page. When the WebSocket drops while the page stays loaded (e.g., load balancer idle timeout, network interruption), the shell process inside the container survives indefinitely. Each reconnection spawns a new shell, causing orphaned processes to accumulate. This is especially impactful for high-usage pods like DB2U.The fix adds cleanup at the proxy layer (
pkg/proxy/proxy.go), which handles all disconnect scenarios regardless of how the frontend behaves.Test plan
ps auxviaoc exec)go test ./pkg/proxy/)oc rsh/oc exec) unaffected (fix only targets/execWebSocket path in the proxy)Summary by CodeRabbit