Fix OOM when allEventsUpToLockProcessed buffer equals MaxRetries() by dnovitski · Pull Request #1666 · github/gh-ost

dnovitski · 2026-04-30T11:28:48Z

Problem

PR #1637 buffered allEventsUpToLockProcessed to MaxRetries() to prevent a goroutine deadlock when waitForEventsUpToLock times out during cutover. However, when --default-retries is set to a very large value (e.g. 9999999999999), Go tries to allocate a channel with trillions of buffer slots, causing an immediate OOM crash before the migration even starts.

Crash

fatal error: runtime: out of memory

runtime.makechan(...)
github.com/github/gh-ost/go/logic.NewMigrator(...)
    go/logic/migrator.go:116

Fix

Replace the MaxRetries()-sized buffer with a buffer of 1 and overwrite-oldest (latest-wins) send semantics.

When the buffer is full (receiver timed out on a previous attempt), the stale message is drained before sending the current sentinel:

select {
case this.allEventsUpToLockProcessed <- lps:
default:
    // Buffer full: drain the stale value, then send the current one.
    select {
    case <-this.allEventsUpToLockProcessed:
    default:
    }
    select {
    case this.allEventsUpToLockProcessed <- lps:
    default:
    }
}

This guarantees:

✅ Current sentinel is always delivered — stale messages are drained first
✅ executeWriteFuncs worker never blocks — the send is always non-blocking
✅ Constant memory — buffer of 1 regardless of MaxRetries()
✅ No extra timeouts — unlike a drop-newest approach, the current attempt's message is preserved
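The guarantees above can be exercised in isolation with a minimal, self-contained sketch. `sendLatest` is a hypothetical helper that mirrors the send pattern on a plain `chan string`; it is not the actual gh-ost code, and it assumes a single sender (with concurrent senders, the final non-blocking send could still lose a race).

```go
package main

import "fmt"

// sendLatest is a hypothetical helper: it never blocks, and on a full
// buffer it discards the stale value so the newest one takes its place.
func sendLatest(ch chan string, v string) {
	select {
	case ch <- v:
		return // buffer had room
	default:
	}
	// Buffer full: drain the stale value if it is still there.
	select {
	case <-ch:
	default:
	}
	// Send the current value; non-blocking so the caller never stalls.
	select {
	case ch <- v:
	default:
	}
}

func main() {
	ch := make(chan string, 1)
	sendLatest(ch, "attempt-1") // never consumed: the receiver timed out
	sendLatest(ch, "attempt-2") // replaces the stale sentinel
	n := len(ch)
	got := <-ch
	fmt.Println(n, got) // prints "1 attempt-2"
}
```

The channel never holds more than one element, so memory stays constant no matter how many attempts are made.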

Why drop-newest would be wrong

A naive non-blocking send (select { default: }) would drop the current sentinel when the buffer is full. Since waitForEventsUpToLock only succeeds on an exact challenge match, dropping the current message causes a false timeout — potentially while the table is locked during two-step cutover.

Overwrite-oldest avoids this: the stale message (which would only be skipped by the receiver anyway) is drained to make room for the current one.

Tests

Overwrite-oldest test: sends two sentinels without consuming the first, verifies the channel contains only the latest message
Extreme MaxRetries regression test: sets MaxRetries() to 9999999999999 and verifies NewMigrator succeeds with channel capacity 1

PR github#1637 buffered allEventsUpToLockProcessed to MaxRetries() to prevent a goroutine deadlock when waitForEventsUpToLock times out during cutover. However, when --default-retries is set to a very large value (e.g. 9999999999999), Go tries to allocate a channel with trillions of buffer slots, causing an immediate OOM crash before the migration even starts.

Replace the MaxRetries()-sized buffer with a buffer of 1 and overwrite-oldest (latest-wins) send semantics. When the buffer is full (receiver timed out on a previous attempt), the stale message is drained before sending the current sentinel.

This guarantees:

- The current sentinel is always delivered (no message loss)
- The executeWriteFuncs worker is never blocked (no deadlock)
- Memory usage is constant regardless of MaxRetries() (no OOM)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

meiji163 · 2026-04-30T17:49:27Z

LGTM 👍

dnovitski requested review from meiji163, rashiq and timvaillancourt as code owners April 30, 2026 11:28

dnovitski force-pushed the fix/cap-allEventsUpToLockProcessed-buffer branch from 87754d5 to 9956f87 Compare April 30, 2026 11:32

dnovitski force-pushed the fix/cap-allEventsUpToLockProcessed-buffer branch from 9956f87 to 7fd3640 Compare April 30, 2026 11:34

meiji163 approved these changes Apr 30, 2026

meiji163 merged commit b13e116 into github:master Apr 30, 2026
9 checks passed