feat(pusher): Add concurrent log publishing with circuit breaker and retry heap by the-mann · Pull Request #2023 · aws/amazon-cloudwatch-agent

the-mann · 2026-02-13T03:44:01Z

Summary

Add multi-threaded log publishing to CloudWatch Logs with circuit breaker isolation and retry heap for failed batches.

Problem

The CloudWatch agent publishes logs synchronously — one batch at a time per target. When a target encounters persistent failures (e.g., AccessDenied), the retry loop blocks the entire pipeline, starving healthy log groups.

Solution

Architecture

Queue → SenderPool → WorkerPool (N workers) → Sender → PutLogEvents
                                                  ↓ (on failure)
                                            RetryHeap (unbounded min-heap)
                                                  ↓ (on retry time)
                                         RetryHeapProcessor → WorkerPool

Key Components

WorkerPool: Shared pool of N goroutines for concurrent PutLogEvents calls
RetryHeap: Unbounded min-heap ordered by next retry time for failed batches
RetryHeapProcessor: Periodically pops ready batches and resubmits to worker pool
Circuit Breaker: Per-target halt/resume — failing target blocks its own queue without affecting others

Circuit Breaker Flow

Batch fails → batch.fail() → queue.halt() — queue stops sending new batches
Failed batch pushed to RetryHeap with backoff
RetryHeapProcessor retries batch later
Success → batch.done() → queue.resume() — queue resumes
Expired (14d) → batch.done() → queue resumes, batch dropped
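The halt/resume flow above can be sketched with channels. This is a minimal illustration and not the PR's code: the breaker type and its fields are assumptions, loosely modeled on the haltCh and stopCh names mentioned in the commit messages.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// breaker is a hypothetical per-target gate.
type breaker struct {
	mu     sync.Mutex
	haltCh chan struct{} // non-nil while halted; closed on resume
	stopCh chan struct{} // closed on shutdown
}

func newBreaker() *breaker { return &breaker{stopCh: make(chan struct{})} }

// halt makes later waitIfHalted calls block until resume is called.
func (b *breaker) halt() {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.haltCh == nil {
		b.haltCh = make(chan struct{})
	}
}

// resume releases any goroutine blocked in waitIfHalted.
func (b *breaker) resume() {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.haltCh != nil {
		close(b.haltCh)
		b.haltCh = nil
	}
}

// waitIfHalted returns true when sending may proceed and false on shutdown.
func (b *breaker) waitIfHalted() bool {
	b.mu.Lock()
	ch := b.haltCh
	b.mu.Unlock()
	if ch == nil {
		return true // not halted
	}
	select {
	case <-ch:
		return true
	case <-b.stopCh:
		return false
	}
}

func main() {
	b := newBreaker()
	b.halt() // a batch failed
	go func() {
		time.Sleep(10 * time.Millisecond)
		b.resume() // the retried batch succeeded or expired
	}()
	fmt.Println("proceed after resume:", b.waitIfHalted())

	b.halt()
	close(b.stopCh) // shutdown while halted must not deadlock
	fmt.Println("proceed after stop:", b.waitIfHalted())
}
```

Selecting on both channels is what lets a halted queue exit during shutdown instead of waiting for a resume that may never come.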

Poison Pill Fix

The retry heap is unbounded. A previous bounded implementation (size = concurrency) caused deadlock when failing log groups exceeded the heap size — workers blocked on Push(), starving all targets.

Changes

batch.go: Add retry metadata, state/fail/done callbacks, expiry logic
queue.go: Add channel-based halt/resume circuit breaker with mutex protection
sender.go: Push to retry heap on failure, call batch.fail() for circuit breaker
retryheap.go: Unbounded min-heap + RetryHeapProcessor with periodic retry loop
pool.go: WorkerPool and SenderPool for concurrent dispatch
pusher.go: Wire up retry heap and worker pool when concurrency enabled
cloudwatchlogs.go: Create shared WorkerPool and RetryHeap

Testing

Unit tests: 27 new tests covering retry heap, circuit breaker, poison pill, state callbacks, shutdown deadlock
Race detector: All tests pass with -race
Key scenarios tested:
- 10 denied + 1 allowed log group with concurrency=2 (poison pill)
- Shutdown while queue is halted (no deadlock)
- Batch expiry after 14 days resumes circuit breaker and persists state
- State callbacks fire on success and expiry but not on shutdown

Related PRs

feat(pusher): Add circuit breaker to halt queue on target failure #1975 — Circuit breaker (sender-block-on-failure)
Reapply "Add default log publishing concurrency (#1770)" (#1819) #1977 — Enable multi-threaded logging by default
fix(clean): Delete inline policies before IAM role deletion #2022 — IAM role cleanup fix (unrelated, merged from main)

Co-authored-by: Akansha Agarwal <agarakan@users.noreply.github.com>

…ock-on-failure
# Conflicts:
#   plugins/outputs/cloudwatchlogs/cloudwatchlogs.go
#   plugins/outputs/cloudwatchlogs/internal/pusher/batch.go
#   plugins/outputs/cloudwatchlogs/internal/pusher/batch_test.go
#   plugins/outputs/cloudwatchlogs/internal/pusher/pool_test.go
#   plugins/outputs/cloudwatchlogs/internal/pusher/pusher.go
#   plugins/outputs/cloudwatchlogs/internal/pusher/pusher_test.go
#   plugins/outputs/cloudwatchlogs/internal/pusher/queue_test.go
#   plugins/outputs/cloudwatchlogs/internal/pusher/retryheap.go
#   plugins/outputs/cloudwatchlogs/internal/pusher/retryheap_test.go
#   plugins/outputs/cloudwatchlogs/internal/pusher/sender.go
#   plugins/outputs/cloudwatchlogs/internal/pusher/sender_test.go

- Add mutex protection to Stop() method to prevent race conditions
- Add stopped flag checks in Push() to prevent pushing after Stop()
- Ensure Push() checks stopped flag both before and after acquiring semaphore
- Fix TestRetryHeapStopTwice to verify correct behavior

- Add TestRetryHeapProcessorExpiredBatchShouldResume to demonstrate bug
- When a batch expires after 14 days, RetryHeapProcessor calls updateState() but not done(), leaving the circuit breaker permanently halted
- Target remains blocked forever even though the bad batch was dropped
- Test currently fails, demonstrating the bug from PR comment

- Replace sync.Cond with channel-based halt/resume to prevent shutdown deadlock (waitIfHalted now selects on haltCh and stopCh)
- Add mutex to halt/resume/waitIfHalted for thread safety
- Add TestQueueStopWhileHalted to verify no shutdown deadlock
- Add TestQueueHaltResume with proper resume assertions
- Clean up verbose test comments and weak assertions
- Remove orphaned TestQueueResumeOnBatchExpiry comment
🤖 Assisted by AI

Verify state file management during retry, expiry, and shutdown:
- Successful retry persists file offsets via state callbacks
- Expired batch (14d) still persists offsets to prevent re-read
- Clean shutdown does not persist state for unprocessed batches
🤖 Assisted by AI

- Fix TestRetryHeapProcessorSendsBatch: add events to batch, verify PutLogEvents is called and done callback fires (was testing empty batch)
- Fix TestRetryHeapProcessorExpiredBatch: set expireAfter field so isExpired() actually returns true, verify done() is called
- Fix race in TestRetryHeapProcessorSendsBatch: use atomic.Bool
- Reduce TestRetryHeap_UnboundedPush sleep from 3s to 100ms
🤖 Assisted by AI

…Groups TestPoisonPillScenario already covers the same scenario (10 denied + 1 allowed with low concurrency). The bounded heap no longer exists so the 'smaller than' framing is no longer meaningful. 🤖 Assisted by AI

CRITICAL fixes:
- Handle retryHeap.Push() error in sender.Send() when heap is stopped during shutdown. Now calls batch.done() to persist state and resume circuit breaker instead of silently dropping the batch.
- Fix Close() ordering: pushers stop before heap to allow in-flight sends to push failed batches. Remove duplicate Stop() calls.

HIGH priority fixes:
- Remove dead maxRetryDuration field from RetryHeapProcessor (batch expiry is handled by batch.expireAfter set in initializeStartTime)
- Remove duplicate maxRetryTimeout constant from cloudwatchlogs.go (canonical definition is in batch.go)
- Add clarifying comment about circuit breaker in synchronous mode

MEDIUM priority fixes:
- Add stopMu mutex to RetryHeapProcessor.Stop() for thread safety
- Rename TestPoisonPillScenario to TestRetryHeapProcessorDoesNotStarveAllowedTarget (test doesn't exercise full pipeline)
- Delete TestRecoveryAfterSystemRestart (doesn't test actual restart)
- Delete TestRecoveryWithMultipleTargets (duplicates TestSingleDeniedLogGroup)

LOW priority fixes:
- Fix TestQueueHaltResume to avoid race condition
- Replace stringPtr/int64Ptr helpers with aws.String()/aws.Int64()

the-mann · 2026-02-26T20:41:00Z

Incorporated review feedback. Integration tests passing:

Full suite: https://github.com/aws/amazon-cloudwatch-agent/actions/runs/22458358617
Concurrency tests (recovery-test branch): https://github.com/aws/amazon-cloudwatch-agent/actions/runs/22458610567

the-mann · 2026-02-26T20:46:59Z

Updated PR with review feedback and merged latest from main. Commit to review: 9aabc04

Integration tests passing:

Full suite: https://github.com/aws/amazon-cloudwatch-agent/actions/runs/22458358617
Concurrency tests (recovery-test branch): https://github.com/aws/amazon-cloudwatch-agent/actions/runs/22458610567

github-actions · 2026-03-06T00:22:54Z

This PR was marked stale due to lack of activity.

agarakan and others added 30 commits December 30, 2025 05:00

introduce retry metadata to batch struct

6b231dc

Remove unused reset method

8521373

add unit tests for retryMetadata

7244af8

fix lint

d2f21e1

Introduce retryHeap and retryHeapProcessor

66186a7

Exchange pushch for semaphore to enforce heap size and blocking

83224b4

Add conditional logic to sender to call batch.Fail() during concurrency

7cfc794

Add unit tests

b4ffd7a

Instantiate RetryHeap and RetryHeapProcessor if concurrency enabled

0e4b0bc

Add unit tests for retryheap instantiation

9c1332a

Update sender to reference retryHeap to call push on fail

dddb691

Add unit tests for sender logic

02bc5c6

Implement halt on target logic

ef7d627

lint

309f904

Introduce retry metadata to batch struct (#1970)

049ab97

Co-authored-by: Akansha Agarwal <agarakan@users.noreply.github.com>

Introduce RetryHeap and RetryHeapProcessor (#1972)

e7dae34

Sender invoke Fail Notification (#1973)

f89a8a7

Instantiate RetryHeap and RetryHeapProcessor if concurrency enabled (#…

87319ad

…1974)

lint

d9296a6

Merge remote-tracking branch 'origin/sender-block-on-failure' into se…

a5621fc

…nder-block-on-failure
# Conflicts:
#   plugins/outputs/cloudwatchlogs/internal/pusher/pool_test.go
#   plugins/outputs/cloudwatchlogs/internal/pusher/retryheap_test.go

fix tests

7051a0c

lx

de410f1

Remove configurable maxRetryTimeout in favor of default hardcoded value

f4c7620

Update tests for removed retryDuration parameter

c791abd

Add test for initializeStartTime idempotency

4d798e2

Verifies that startTime and expireAfter are only set once on first call and remain unchanged on subsequent calls, ensuring the 14-day expiration is measured from the first send attempt, not from each retry.

refactor(pusher): Remove unused concurrency parameter from NewPusher

cdf1651

Concurrency is now determined by whether workerPool and retryHeap are provided, making the explicit concurrency parameter redundant. 🤖 Assisted by AI

the-mann added 16 commits February 12, 2026 10:23

Merge main into fix/poison-pill-deadlock

5e215da

refactor(pusher): Remove unused maxSize parameter from NewRetryHeap

fd2ea56

The retry heap is now unbounded, so maxSize is no longer used. 🤖 Assisted by AI

fix(pusher): Remove redundant updateState call in retryheap

38afc5f

batch.done() already calls updateState() internally, so the explicit call is unnecessary. 🤖 Assisted by AI

test(pusher): Remove empty TestSenderPoolRetryHeap test

3e1ed82

Test had no assertions and was not validating any behavior. 🤖 Assisted by AI

docs(pusher): Clean up verbose test comment in queue_test.go

2d07b38

🤖 Assisted by AI

docs(pusher): Clean up verbose test comment in retryheap_expiry_test.go

c739bb9

🤖 Assisted by AI

test(pusher): Remove unused circuitBreakerHalted variable

4e7393a

Variable was set but never checked in the test. 🤖 Assisted by AI

test(pusher): Use exact assertion for circuit breaker send count

f640a19

Circuit breaker should always block after exactly 1 send attempt, not "at most 1". 🤖 Assisted by AI

test(pusher): Remove ineffective dummyBatch code in TestQueueHaltResume

eb58912

The dummyBatch was not connected to the queue's circuit breaker, so calling done() on it had no effect. Simplified test to only verify halt behavior. 🤖 Assisted by AI

Merge main into fix/poison-pill-deadlock

6f941fe

Merge main into enable-multithreaded-logging-by-default

c0a43b0

Merge enable-multithreaded-logging-by-default into fix/poison-pill-de…

e59274a

…adlock

the-mann added the "ready for testing" label (Indicates this PR is ready for integration tests to run) Feb 13, 2026

the-mann mentioned this pull request Feb 13, 2026

Fix poison pill deadlock: Make retry heap unbounded #2014

Open

the-mann added 3 commits February 12, 2026 22:48

docs(pusher): Remove internal ticket references from test comments

334acdf

🤖 Assisted by AI

refactor(pusher): Simplify fail callback to direct method reference

b6f3b3e

🤖 Assisted by AI

style(pusher): Fix unused parameter lint warnings

98bdc89

🤖 Assisted by AI

the-mann force-pushed the fix/poison-pill-deadlock branch 3 times, most recently from a45d9be to 98bdc89, February 19, 2026 18:24

Merge remote-tracking branch 'origin/main' into fix-pr-2023

eb798b5

github-actions bot added the Stale label Mar 6, 2026

the-mann wants to merge 65 commits into main from fix/poison-pill-deadlock

2 participants
