Skip to content

fix: Recover Hutch worker when consumer channel is closed due to delivery_ack_timeout?#414

Merged
michaelklishin merged 2 commits intoruby-amqp:mainfrom
Garaio-REM:fix-channel-closed-error
Apr 2, 2026
Merged

fix: Recover Hutch worker when consumer channel is closed due to delivery_ack_timeout?#414
michaelklishin merged 2 commits intoruby-amqp:mainfrom
Garaio-REM:fix-channel-closed-error

Conversation

@eglitobias
Copy link
Copy Markdown
Collaborator

Summary

This MR fixes the case where RabbitMQ closes the consumer channel and Hutch gets stuck because subsequent ack/nack calls raise ChannelAlreadyClosed.

When that happens today, Hutch can stay alive but stop consuming new messages until the process is restarted. That is the behavior described in issue #364.

What changed

  • expose ChannelAlreadyClosed from both adapters:
    • Hutch::Adapter::Bunny::ChannelAlreadyClosed
    • Hutch::Adapter::MarchHare::ChannelAlreadyClosed
  • update Hutch::Worker to:
    • rescue ChannelAlreadyClosed when acknowledge_error tries to ack/nack
    • trigger channel recovery instead of letting the worker remain stuck
  • add guarded recovery logic using a mutex so only one recovery runs at a time
  • recover by:
    • reopening the AMQP channel on the existing connection
    • re-running queue setup / consumer subscriptions
  • log recovery success or failure
  • add specs for:
    • closed-channel ack not raising
    • closed-channel nack after consumer error not raising
    • end-to-end recovery after the consumer channel is closed

Copy link
Copy Markdown
Member

@michaelklishin michaelklishin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-opening channels that were closed by the server is OK in some cases (e.g. due to a delivery acknowledgement timeout)
but not others.

This reinvents a portion of connection recovery in Bunny, ignores Bunny::Channel#open?, existing eventual assertion helpers.

@michaelklishin
Copy link
Copy Markdown
Member

One thing that this PR does get right: it recovers in one specific place instead of using Bunny::Channel#on_error. This is acceptable and such recover should succeed in most cases in practice (assuming that only the channel was closed).

I'm working on adding a couple of Bunny::Channel and Bunny::Session methods to Bunny 3.0 for such cases.

michaelklishin added a commit to ruby-amqp/bunny that referenced this pull request Mar 30, 2026
@michaelklishin
Copy link
Copy Markdown
Member

With [1][2], the implementation of this PR can shrink significantly, coming down to something like this:

ch.on_error do |channel, channel_close_method|
  if channel_close_method.delivery_ack_timeout?
    channel.reopen
    connection.recover_channel_topology(channel)
  end
end
  1. ruby-amqp/amq-protocol@dea05d4
  2. ruby-amqp/bunny@4b2fa43

@eglitobias eglitobias force-pushed the fix-channel-closed-error branch 3 times, most recently from be883d7 to 1f23bfc Compare March 31, 2026 11:54
Comment thread Gemfile Outdated

gemspec

gem 'bunny', github: 'ruby-amqp/bunny', branch: 'main'
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only used this for local testing. As soon as bunny 3 is released we can change this in the hutch.gemspec

@eglitobias
Copy link
Copy Markdown
Collaborator Author

Thank you very much for the quick review.

I’ve updated the PR accordingly. Most of the work went into building an integration test that reliably reproduces the issue.

One thing I’m still unsure about is whether reopening the channel from a separate thread is the best approach here. I wasn’t able to find a cleaner way to make channel.reopen succeed in this setup, so I’d be glad to hear if there is a better alternative.

I also opened a small follow-up PR in amq-protocol because delivery_ack_timeout? does not return a boolean value:
ruby-amqp/amq-protocol#85

@eglitobias eglitobias changed the title fix: Recover Hutch worker when consumer channel is closed during ack/… fix: Recover Hutch worker when consumer channel is closed due to delivery_ack_timeout? Mar 31, 2026
@michaelklishin
Copy link
Copy Markdown
Member

If the channel is already closed by the server, there won't be any activity on it, so all scenarios where reopening the channel would entail a concurrency hazard come down to the channel ID allocator safety.

All key methods of the allocator are synchronized.

@michaelklishin
Copy link
Copy Markdown
Member

@eglitobias voilà, Bunny 3.0.0 is here. Happens to be on Apr 1st in Europe but I am dead serious :)

@michaelklishin michaelklishin mentioned this pull request Apr 1, 2026
@michaelklishin
Copy link
Copy Markdown
Member

michaelklishin commented Apr 1, 2026

Hutch main now depends on Bunny 3.0 #415

@eglitobias eglitobias force-pushed the fix-channel-closed-error branch from 1f23bfc to f664026 Compare April 1, 2026 05:17
@eglitobias
Copy link
Copy Markdown
Collaborator Author

eglitobias commented Apr 1, 2026

Rebased on top of Bunny 3.0 and updated the branch.
@michaelklishin Thanks again for the super fast follow-up and for the very convincing "totally not an April 1st joke" release.
I was briefly worried Bunny 3.0 would ship with an April-1st-only channel allocator mode.

@eglitobias
Copy link
Copy Markdown
Collaborator Author

Not sure why the pipeline is failing now with Ruby 3.3.
But the error in the log is related to the deprecated ddtrace Gem.
I opened this MR #416: Switch to the new gem (datadog) and also add Ruby 3.4 and Ruby 4 pipelines.

@michaelklishin
Copy link
Copy Markdown
Member

@eglitobias having considered the "offloading" thread workaround you have ended up with, I think another change to Bunny can be justified: the Bunny::Channel#on_error callback can be invoked in a new thread by Bunny itself.

Otherwise the entire feature set would require every single user to start a new thread.

Bunny's I/O loop thread is different from the application thread (let's ignore the single threaded mode, virtually no one even knows about it even though it was contributed around 2012). Which means using a yet another separate thread won't make much difference for the end user.

michaelklishin added a commit to ruby-amqp/bunny that referenced this pull request Apr 1, 2026
This callback already runs on a thread that's
different from the application thread, so
this behavior change should not affect any
apps.

On the upside, this allows the callback use
`Bunny::Session` methods, including protocol
operation ones, without blocking the I/O loop.

References ruby-amqp/hutch#414
michaelklishin added a commit that referenced this pull request Apr 2, 2026
@michaelklishin michaelklishin merged commit d299c67 into ruby-amqp:main Apr 2, 2026
4 of 6 checks passed
@michaelklishin michaelklishin added this to the 1.4.0 milestone Apr 2, 2026
@michaelklishin
Copy link
Copy Markdown
Member

@eglitobias I have released Bunny 3.1.0 and refactored this PR to not use a separate thread.

We can still introduce more tweaks in follow-up PRs.

@eglitobias eglitobias deleted the fix-channel-closed-error branch April 8, 2026 18:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants