http2: fix zombie session crash on socket close #61702

Open
suuuuuuminnnnnn wants to merge 4 commits into nodejs:main from suuuuuuminnnnnn:fix-http2-zombie-session
@suuuuuuminnnnnn

Summary

This PR fixes a crash caused by HTTP/2 zombie sessions when the underlying socket is dead but the HTTP/2 session remains “alive” in Node.js.

  • The underlying socket can become closed at the OS level without Node.js receiving a close event (e.g. packet drop “black hole”).
  • Subsequent writes can hit internal invariants and crash the process:
    • CHECK(is_write_in_progress()) in Http2Session::OnStreamAfterWrite
    • CHECK(!current_write_) in TLSWrap::DoWrite
  • Fix: Close the HTTP/2 session when a read error occurs (nread < 0) in Http2Session::OnStreamRead.

Fixes: #61304


What is the current behavior?

When the network enters a “black hole” state (packets dropped without RST/FIN), the OS may consider the TCP socket dead/closed, but Node.js can fail to observe the close.

In this case:

  • The HTTP/2 session stays in a zombie state:
    • session.closed === false
    • session.destroyed === false
    • outbound queue grows (session.state.outboundQueueSize increases continuously)
  • Later write attempts can break invariants and crash the process via assertions.
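
The zombie symptoms listed above can be checked from JavaScript using the public session properties. The following is a minimal sketch, not part of this PR: `looksLikeZombie` is a hypothetical helper name, and the idea of comparing two samples of `outboundQueueSize` is an assumption about how one might monitor a session.

```javascript
'use strict';

// Hypothetical helper (not a Node.js API): reports whether a session still
// claims to be open while its outbound queue has grown since the last sample.
function looksLikeZombie(session, previousQueueSize) {
  // session.state may be empty once the native handle is gone, so default to 0.
  const queueSize = session.state?.outboundQueueSize ?? 0;
  return session.closed === false &&
         session.destroyed === false &&
         queueSize > previousQueueSize;
}
```

A caller would sample `session.state.outboundQueueSize` at intervals and pass the previous value in. A queue that keeps growing on an open session is only a hint, since a slow but healthy peer can look the same for a while.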

Relevant code path (before):

// Only pass data on if nread > 0
if (nread <= 0) {
  if (nread < 0) {
    PassReadErrorToPreviousListener(nread);
  }
  return;  // session not closed -> zombie possible
}

What is the new behavior?

On read error (nread < 0), we still notify the previous listener, then close the HTTP/2 session to prevent zombie state and subsequent assertion crashes.

if (nread <= 0) {
  if (nread < 0) {
    PassReadErrorToPreviousListener(nread);
    // Close the session to prevent zombie state when the underlying socket is dead.
    Close(NGHTTP2_NO_ERROR, true);
  }
  return;
}

Key points:

  • Minimal change: only affects the read-error path.
  • Close() is idempotent / safe to call redundantly.
  • Preserves callback ordering (previous listener notified first).

Why is this correct?

A negative nread indicates an error/EOF condition at the stream/socket read layer. Keeping the HTTP/2 session alive after a read error allows the session to continue queueing outbound frames while the transport is effectively dead, which eventually leads to inconsistent internal write state and assertions.

Closing the session on read error restores the expected lifecycle invariant: dead transport ⇒ closed session.
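
To illustrate the invariant, here is a toy model in JavaScript. It only mirrors the shape of the `nread` branch quoted above; `ModelSession`, `closeOnReadError`, and `queueFrame` are hypothetical names, and nothing here reflects the real internals of `Http2Session` in src/node_http2.cc.

```javascript
'use strict';

// Toy model only: mimics the nread <= 0 branch, not the real C++ session.
class ModelSession {
  constructor({ closeOnReadError }) {
    this.closeOnReadError = closeOnReadError; // false = before, true = after
    this.closed = false;
    this.outboundQueue = [];
    this.listenerErrors = [];
  }

  onStreamRead(nread) {
    if (nread <= 0) {
      if (nread < 0) {
        // The previous listener is notified first, as in the PR.
        this.listenerErrors.push(nread);
        if (this.closeOnReadError) this.closed = true;
      }
      return;
    }
    // nread > 0: incoming data would be processed here.
  }

  // Returns true if the frame was queued, false if the session is closed.
  queueFrame(frame) {
    if (this.closed) return false;
    this.outboundQueue.push(frame);
    return true;
  }
}
```

With `closeOnReadError: false`, a call to `onStreamRead(-4095)` (libuv's `UV_EOF`) leaves the model open, so `queueFrame()` keeps accepting frames for a dead transport. With `closeOnReadError: true`, the same read error closes the model and `queueFrame()` returns `false`.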


How was this tested?

New test (CI-friendly)

Adds a new parallel test that reproduces the “zombie” behavior without OS firewall rules by destroying the underlying socket directly:

  • client[kSocket].destroy() to kill the transport.
  • Attempt follow-up requests.
  • Verify: no process crash, session cleans up, and the close path is taken.
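
For readers without the test file at hand, the steps above can be sketched roughly as follows. This is not the test from this PR: it assumes the public `createConnection` option to keep a handle on the raw socket instead of the internal `kSocket` symbol, and `destroyTransportUnderSession` is a hypothetical name. As noted in the review further down, destroying the socket directly may not be enough to trigger the original crash.

```javascript
'use strict';

const http2 = require('node:http2');
const net = require('node:net');

// Sketch only: start a local cleartext HTTP/2 server, complete one request,
// destroy the raw client socket, then resolve once the session emits 'close'.
function destroyTransportUnderSession() {
  return new Promise((resolve, reject) => {
    const server = http2.createServer();
    server.on('error', reject);
    server.on('stream', (stream) => {
      stream.respond({ ':status': 200 });
      stream.end('ok');
    });

    server.listen(0, () => {
      const { port } = server.address();
      // Keep the raw socket; client.socket is a proxy that rejects destroy().
      const socket = net.connect(port);
      const client = http2.connect(`http://localhost:${port}`, {
        createConnection: () => socket,
      });
      client.on('error', () => {}); // transport errors are expected here
      client.on('close', () => {
        server.close();
        resolve({ closed: client.closed, destroyed: client.destroyed });
      });

      const req = client.request({ ':path': '/' });
      req.on('error', () => {});
      req.resume();
      req.on('close', () => {
        socket.destroy(); // kill the transport directly
        try {
          // Follow-up request against the now-dead transport.
          const followUp = client.request({ ':path': '/' });
          followUp.on('error', () => {});
          followUp.resume();
        } catch {
          // The session may already refuse new requests; that is acceptable.
        }
      });
    });
  });
}
```

Calling `destroyTransportUnderSession().then(console.log)` runs the scenario once. Cleanup is driven by the session's `'close'` event rather than a timer, so the process should exit as soon as the session has been torn down.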

New file:

  • test/parallel/test-http2-zombie-session-61304.js

Commands

Built locally (macOS arm64):

./configure
make -j1

Ran test:

out/Release/node test/parallel/test-http2-zombie-session-61304.js

Spot-checked a related HTTP/2 test:

out/Release/node test/parallel/test-http2-client-socket-destroy.js

Additional context / reproduction (original issue)

The original issue can be reproduced reliably on macOS using pf firewall rules to drop packets (reported 100% reproduction), creating a network “black hole” where the socket becomes dead without Node observing a clean close.

Issue: #61304


Files changed

  • src/node_http2.cc (small change: close session on nread < 0)
  • test/parallel/test-http2-zombie-session-61304.js (new)

@nodejs-github-bot
Collaborator

Review requested:

  • @nodejs/http2
  • @nodejs/net

@nodejs-github-bot added the c++, http2, and needs-ci labels on Feb 6, 2026
@codecov

codecov bot commented Feb 6, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.76%. Comparing base (ae2ffce) to head (020578f).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main   #61702   +/-   ##
=======================================
  Coverage   89.75%   89.76%           
=======================================
  Files         674      674           
  Lines      204416   204417    +1     
  Branches    39285    39284    -1     
=======================================
+ Hits       183472   183488   +16     
- Misses      13227    13228    +1     
+ Partials     7717     7701   -16     
Files with missing lines Coverage Δ
src/node_http2.cc 82.05% <100.00%> (+0.19%) ⬆️

... and 42 files with indirect coverage changes

Member

@mcollina mcollina left a comment

lgtm

@mcollina added the request-ci label on Feb 6, 2026
@suuuuuuminnnnnn
Author

Hi @mcollina, I pushed a lint fix commit but CI didn’t re-run—could you please re-trigger the checks for this PR?

@github-actions added the request-ci-failed label and removed the request-ci label on Feb 6, 2026
@github-actions
Contributor

github-actions bot commented Feb 6, 2026

Failed to start CI
   ⚠  Commits were pushed since the last approving review:
   ⚠  - http2: fix zombie session crash on socket close
   ⚠  - http2: prevent assertion failure in OnStreamAfterWrite
   ⚠  - http2: handle write callback gracefully in zombie sessions
   ✘  Refusing to run CI on potentially unsafe PR
https://github.com/nodejs/node/actions/runs/21752594207

Member

@pimterry pimterry left a comment

Thanks for the contribution @suuuuuuminnnnnn!

It looks like you've removed the fix that was described in the README here as part of your changes (in 6da4fe9) and now in the complete changeset it's just disabling an assertion. Is that intentional?

In the test, I think there's a little more work required to properly reproduce the issue described. I've just checked, and it passes on my machine with current versions of Node, without your src changes. The test should fail without your fix, and then pass once the fix is applied.

It would also be good to avoid the 1 second setTimeout here - timeouts like that are generally a fragile solution to race conditions and collectively they make our test suite much slower. It's better to use events & callbacks to cleanup at the correct time instead of guessing a duration. Once you have a failing test & working fix for it, let me know if you need help finding a good approach to avoid that.
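
As a rough illustration of the event-driven cleanup suggested above, here is a minimal sketch. `cleanupOnClose` is a hypothetical helper, and the `EventEmitter` instance stands in for a real HTTP/2 session, which also emits `'close'`. Tests in the Node.js tree would typically wrap such callbacks in `common.mustCall()`.

```javascript
'use strict';

const { EventEmitter } = require('node:events');

// Hypothetical helper: run cleanup exactly once, when 'close' is emitted,
// instead of guessing a delay with setTimeout.
function cleanupOnClose(emitter, cleanup) {
  emitter.once('close', cleanup);
}

// Usage with a stand-in emitter; a real test would pass the client session.
const fakeSession = new EventEmitter();
let cleanedUp = 0;
cleanupOnClose(fakeSession, () => { cleanedUp += 1; });
fakeSession.emit('close');
fakeSession.emit('close'); // ignored: the listener was registered with once()
console.log(cleanedUp); // 1
```

Because cleanup is tied to the event, the test finishes as soon as the session is gone and does not depend on how fast the machine is.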

Development

Successfully merging this pull request may close these issues.

Assertion failure crash in TLSWrap::DoWrite with zombie HTTP/2 session (close event not propagated from OS-level CLOSED socket)
