Skip to content

fix(pusher): Prevent batch expiration recalculation on retry#1988

Draft
agarakan wants to merge 11 commits intosender-block-on-failurefrom
remove_maxretryduration
Draft

fix(pusher): Prevent batch expiration recalculation on retry#1988
agarakan wants to merge 11 commits intosender-block-on-failurefrom
remove_maxretryduration

Conversation

@agarakan
Copy link
Copy Markdown
Contributor

@agarakan agarakan commented Jan 16, 2026

Description of the issue

Store expireAfter instead of maxRetryDuration for calculating the expiration of a batch request PLE.

Description of changes

  • When instantiating the batch object, we now set expireAfter based on maxRetryTimeout
  • Both startTime and expireAfter are only set once (idempotent) to prevent re-calculation on each retry
  • Added test to verify idempotency of initializeStartTime()

License

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Tests

Unit tests

make lint
make fmt
make fmt-sh
make test

Requirements

Before committing your code, please do the following steps.

  1. Run make fmt and make fmt-sh
  2. Run make lint

Integration Tests

To run integration tests against this PR, add the ready for testing label.

@agarakan agarakan requested a review from a team as a code owner January 16, 2026 22:26
@agarakan agarakan changed the base branch from main to enable-multithreaded-logging-by-default January 16, 2026 22:26
Comment thread plugins/outputs/cloudwatchlogs/internal/pusher/batch.go Outdated

// Initialize start time before build()
batch.initializeStartTime()
batch.initializeStartTime(s.RetryDuration())
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update line 121 as well to use expireAfter.

Comment thread plugins/outputs/cloudwatchlogs/internal/pusher/batch.go Outdated
wg.Wait()
}

func TestResendWouldStopAfterExhaustedRetries(t *testing.T) {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test is removed as redundant and would otherwise run for the entire default timeout period (14 days). Tested in Sender_test.go instead DropOnRetryExhaustion

@agarakan agarakan force-pushed the remove_maxretryduration branch from 6e76d60 to 409633b Compare January 20, 2026 21:39
Comment thread plugins/outputs/cloudwatchlogs/internal/pusher/batch.go Outdated
Comment thread plugins/outputs/cloudwatchlogs/internal/pusher/batch_test.go Outdated
@github-actions
Copy link
Copy Markdown
Contributor

This PR was marked stale due to lack of activity.

@github-actions github-actions bot added the Stale label Jan 28, 2026
Verifies that startTime and expireAfter are only set once on first call
and remain unchanged on subsequent calls, ensuring the 14-day expiration
is measured from the first send attempt, not from each retry.
@the-mann the-mann force-pushed the remove_maxretryduration branch from 409633b to 4d798e2 Compare February 10, 2026 18:38
@the-mann the-mann changed the base branch from enable-multithreaded-logging-by-default to sender-block-on-failure February 10, 2026 18:38
Concurrency is now determined by whether workerPool and retryHeap are
provided, making the explicit concurrency parameter redundant.

🤖 Assisted by AI
@the-mann the-mann changed the title Store expireAfter instead of maxRetryDuration fix(pusher): Prevent batch expiration recalculation on retry Feb 10, 2026
@github-actions github-actions bot removed the Stale label Feb 11, 2026
- TestPoisonPillScenario: Validates continuous batch generation with 10 denied + 1 allowed log group
- TestSingleDeniedLogGroup: Baseline test with 1 denied + 1 allowed log group
- TestRetryHeapSmallerThanFailingLogGroups: Demonstrates deadlock when heap size < failing log groups (SKIPPED)

The third test intentionally deadlocks to prove the bug exists when:
- Retry heap size = concurrency (2)
- Number of failing log groups (10) > heap size (2)
- Workers block trying to push to full heap
- System deadlocks, starving allowed log group
@the-mann the-mann marked this pull request as draft February 11, 2026 20:08
Remove max size constraint from retry heap to prevent deadlock when
failing log groups exceed concurrency limit.

Changes:
- Remove maxSize and semaphore from retryHeap struct
- Make Push() non-blocking (no semaphore wait)
- Remove semaphore release from PopReady()
- Update NewRetryHeap() to ignore maxSize parameter (kept for API compatibility)
- Update TestRetryHeap_SemaphoreBlockingAndUnblocking -> TestRetryHeap_UnboundedPush
- Update TestRetryHeapSmallerThanFailingLogGroups to validate fix

Before: With concurrency=2 and 10 failing log groups, retry heap (size=2)
would fill up, causing workers to block on Push(), leading to deadlock.

After: Retry heap is unbounded, allowing all failed batches to be queued
without blocking workers. Allowed log groups continue publishing normally.

Test results:
- TestRetryHeapSmallerThanFailingLogGroups: PASS (5/5 allowed batches published)
- Heap grew to size 28 (beyond concurrency limit of 2)
- No deadlock or starvation
@github-actions
Copy link
Copy Markdown
Contributor

This PR was marked stale due to lack of activity.

@github-actions github-actions bot added the Stale label Feb 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants