Skip to content

fix(purchases): tighten AWS SDK retry budget + bump Lambda timeout to 300s (closes #683)#684

Open
cristim wants to merge 12 commits into
feat/multicloud-web-frontendfrom
fix/issue-683-lambda-sdk-tightening
Open

fix(purchases): tighten AWS SDK retry budget + bump Lambda timeout to 300s (closes #683)#684
cristim wants to merge 12 commits into
feat/multicloud-web-frontendfrom
fix/issue-683-lambda-sdk-tightening

Conversation

@cristim
Copy link
Copy Markdown
Member

@cristim cristim commented May 22, 2026

Summary

Third recurrence of #667 prompted this PR. Original scope was tighten the AWS SDK retry budget; user pushback on the "hanging goroutine" hypothesis correctly identified that SDK-side bounds alone don't cover hangs upstream of the SDK call (STS validation, credential resolution, fan-out scheduling). Scope expanded to a defense-in-depth approach: bound the per-rec budget AND make hangs observable.

Closes #683. Refs #667 #668 #632 #681 #600.

What changed (3 parts)

Part 1 — bound the AWS SDK retry storm

  • providers/aws/internal/purchasecfg new package, NewConfig(base aws.Config) returns a copy with RetryMaxAttempts=2 + http.Client{Timeout: 15s}. Worst-case wall clock per SDK request: 2 × 15s = 30s.
  • 7 purchase-path service constructors (EC2, RDS, ElastiCache, MemoryDB, OpenSearch, Redshift, SavingsPlans) wired through purchasecfg.NewConfig. OpenSearch + Redshift also tighten their STS client used for ARN/tagging.
  • Recommendation-collection clients are NOT touchedrecommendations.NewClient is independent and legitimately needs the SDK default 3-retry budget.

Part 2 — bound the per-rec execution itself

  • context.WithTimeout(ctx, 30s) wrapping the entire purchase chain inside executeSinglePurchase (provider construction + STS validation + service-client lookup + details decode + the AWS SDK call). Caps the per-rec budget regardless of which step hangs. 30s matches the part-1 SDK ceiling.
  • All 42 mock.On(...) call sites in the purchase tests updated from ctx literal to mock.MatchedBy(func(c context.Context) bool { _, ok := c.Deadline(); return ok }) — positively asserts the deadline propagates.

Part 3 — make hangs observable

Every step of the purchase chain now emits an Info log tagged with the execution ID so a single CloudWatch filter (filter @message like /purchase\[<exec-id>\]/) returns the full timing breakdown. Lines added (one per commit):

Stage Sample log line
Approve handler entry/exit purchase[<id>]: approve entry (auth=session) / ... exit success in 4.2s
resolveAccountProvider + AWS/Azure/GCP sub-resolvers purchase[<id>]: resolved AWS provider for account <X> in 142ms
AWS STS GetCallerIdentity + Azure subscriptions ListPager validation purchase[<id>]: STS GetCallerIdentity for account <X> starting / ... returned in 87ms
findOfferingID pagination in all 7 services purchase[<id>]: EC2 offering page 1 (token=*) returned in 612ms with 14 offerings
FanOutWithConcurrency per-goroutine fan-out goroutine entry (account=<X>) / ... exit success in 1.8s

Token / credential values are masked to the last 8 chars per the feedback_pii_in_logs.md memory entry.

Part 4 — Lambda budget headroom

  • lambda_timeout bumped 60 -> 300s in github-{dev,staging,prod}.tfvars. With the per-rec 30s ceiling from Part 2, a single hung rec is now nowhere near the Lambda ceiling — the bigger budget exists to absorb large multi-rec executions, not to mask hangs.

Commit list (10)

  1. df4755556 purchasecfg helper
  2. fe72342f7 wire purchasecfg into 7 service constructors
  3. 7a899ed0e lambda_timeout 60 -> 300
  4. a8677e939 per-rec context.WithTimeout(30s)
  5. 3d10f285c approve handler entry/exit logs
  6. 6804727eb resolveAccountProvider timing logs
  7. 8f49cf2b2 STS + Azure validation call timing logs
  8. 2919058f5 per-page logs in 7 AWS findOfferingID loops
  9. 09462dd9d FanOutWithConcurrency per-goroutine timing
  10. e4ad3eacd test fixups (mock.MatchedBy for the 42 deadline-aware ctx call sites + 1 savingsplans test parameter)

Test plan

  • go test ./internal/purchase/... ./internal/execution/... ./internal/api/... ./providers/aws/services/... -count=1 -short — 1732 tests pass across 10 packages
  • go build ./... — clean
  • terraform validate in terraform/environments/aws/ — clean
  • Recommendation-collection path NOT affected (purchasecfg only applied to purchase-path clients)
  • Post-deploy: t4g.nano repro attempt either succeeds OR fails within ≤30s with the new log lines pinning which step hung. CloudWatch query: filter @message like /purchase\[<exec-id>\]/ | sort @timestamp asc

Cross-references

Summary by CodeRabbit

  • Improvements
    • Expanded structured logging and timing across approval, execution, fan-out and provider flows (with actor masking) for better diagnostics and correlation.
    • Added per-recommendation execution timeouts and tightened SDK retry/HTTP timeouts for more predictable purchases.
    • Enhanced offering-lookup and per-account timing logs.
    • Increased Lambda timeout limits to 300s.
  • Tests
    • Relaxed context matching in tests to tolerate per-recommendation deadline contexts.

Review Change Stack

cristim added 3 commits May 22, 2026 18:46
…K retries

Introduce providers/aws/internal/purchasecfg.NewConfig which returns a copy
of the caller's aws.Config with RetryMaxAttempts=2 and http.Client.Timeout=15s.
Worst-case wall clock: 2 * 15s = 30s, well inside the 300s Lambda budget.

Recommendation-collection clients (recommendations.NewClient) are explicitly
NOT touched; they call Cost Explorer and need the higher default retry budget.

Adds five unit tests covering: retry cap, HTTP timeout, immutability of the
base config, region preservation, and the wall-clock bound invariant.

Closes #683. Refs #667 #668 #632 #681.
…ctors

Switch EC2, RDS, ElastiCache, MemoryDB, OpenSearch, Redshift, and SavingsPlans
NewClient calls to use purchasecfg.NewConfig(cfg) so every purchase-path SDK
client gets RetryMaxAttempts=2 and http.Client.Timeout=15s.

OpenSearch and Redshift also pass the tightened config to their STS client
(used for ARN construction / tagging on the purchase path).

Recommendation-collection clients (recommendations.NewClient) are not touched;
they use their own constructor and deliberately keep the SDK default retry
budget for Cost Explorer calls.

All 694 tests pass. go build ./... clean.

Closes #683. Refs #667.
Raise lambda_timeout from 60s to 300s in github-dev, github-staging, and
github-prod tfvars. The SDK retry tightening (purchasecfg, max 30s wall clock)
combined with the increased Lambda budget means transient AWS API slowness
either completes within 30s or surfaces a clear error to the user -- no more
silent context deadline exceeded from the Lambda itself expiring mid-retry.

Lambda Function URL has no upstream timeout ceiling, so 300s is safe.

Closes #683. Refs #667 #681.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 22, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3896640f-783a-4270-be2b-927a038ed776

📥 Commits

Reviewing files that changed from the base of the PR and between ee235cf and a477d7c.

📒 Files selected for processing (3)
  • internal/purchase/coverage_extra_test.go
  • internal/purchase/execution_test.go
  • internal/purchase/manager_test.go

📝 Walkthrough

Walkthrough

Adds timing and structured logs to approval and provider flows, enforces 30s per-recommendation timeouts, introduces purchase-specific AWS SDK config (2 retries, 15s HTTP timeout) wired to purchase-path clients, relaxes tests to accept deadline-bearing contexts, and increases Lambda timeout to 300s.

Changes

Purchase Approval and Execution Hardening

Layer / File(s) Summary
Approval handler and flow observability
internal/api/handler_purchases.go, internal/purchase/approvals.go
Request entry, completion, and failure logging in approvePurchaseViaSession, ApproveExecution, and ApproveAndExecute with elapsed-time measurement and masked actor values.
AWS SDK purchase-path configuration package
providers/aws/internal/purchasecfg/config.go, providers/aws/internal/purchasecfg/config_test.go
New purchasecfg package with MaxAttempts=2, HTTPTimeout=15s, and NewConfig(base aws.Config) aws.Config that returns a copied config with capped retries and a per-request HTTP timeout; unit tests verify behavior and immutability.
Wiring purchase config across AWS service clients
providers/aws/services/{ec2,rds,elasticache,memorydb,opensearch,redshift,savingsplans}/client.go
All purchase-path AWS clients now use purchasecfg.NewConfig(cfg), pass opts.ExecutionID to offering lookups for log correlation, and add per-page timing/structured logs during pagination.
Provider credential resolution & provider ValidateCredentials timing
internal/purchase/execution.go, providers/aws/provider.go, providers/azure/provider.go
Credential-resolution paths (resolveAccountProvider, provider-specific resolvers) instrumented with start/failure/success elapsed-time logs; AWS/Azure ValidateCredentials now time STS/subscriptions calls and log durations.
Per-recommendation 30-second execution timeout and diagnostics
internal/purchase/execution.go
executeSinglePurchase creates a 30s recCtx per recommendation and uses it for provider creation, GetServiceClient, and PurchaseCommitment; adds logRecCtxErr to distinguish DeadlineExceeded vs Canceled with elapsed timing.
Fan-out execution per-account timing and observability
internal/execution/fanout.go
Per-account goroutine invocations are wrapped with start/elapsed timing and Info/Error logs on start, completion, and failure.
Test mock expectations relaxed for deadline-bearing contexts
internal/purchase/{coverage_extra_test.go,execution_test.go,manager_test.go}
Tests updated to match contexts via mock.Anything or mock.MatchedBy(deadline-check) rather than exact ctx, accommodating derived/deadlined contexts.
Lambda execution timeout increase 60s → 300s
terraform/environments/aws/github-{dev,staging,prod}.tfvars
Lambda timeout extended from 60 to 300 seconds to accommodate AWS SDK retry/HTTP-timeout settings and purchase request handling.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

Possibly related PRs

  • LeanerCloud/CUDly#613: Overlaps on orphan-check logging and approval-path changes.
  • LeanerCloud/CUDly#668: Overlaps on purchase execution logging/timing and complements this PR’s per-rec timeouts and credential-resolution logs.
  • LeanerCloud/CUDly#373: Prior edits to approval flows where similar logging/masking was touched.

Poem

🐰 I time each hop and guard the log,

Mask my name, avoid the fog.
Thirty seconds, retries few,
Lambdas wake, not stuck like glue.
Hooray — purchases sing anew.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 75.61% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and accurately summarizes the main changes: AWS SDK retry tightening plus Lambda timeout increase to 300 seconds.
Linked Issues check ✅ Passed The PR implements all coding requirements from issue #683: Lambda timeout bumped to 300s in all three tfvars, purchase-path clients (EC2, RDS, ElastiCache, MemoryDB, OpenSearch, Redshift, SavingsPlans) configured with RetryMaxAttempts=2 and HTTP timeout=15s, per-record 30s timeout applied, exec-tagged timing logs added, and recommendations path unchanged.
Out of Scope Changes check ✅ Passed All changes are directly aligned with issue #683 requirements: SDK configuration tightening, Lambda timeout bump, per-record deadline enforcement, and observability logging are all in-scope.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/issue-683-lambda-sdk-tightening

Comment @coderabbitai help to get the list of available commands and usage tips.

@cristim cristim added priority/p0 Drop everything; same-day fix severity/critical Major harm when it happens urgency/now Drop other things impact/all-users Affects every user effort/m Days type/bug Defect triaged Item has been triaged labels May 22, 2026
@cristim
Copy link
Copy Markdown
Member Author

cristim commented May 22, 2026

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 22, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

cristim added 7 commits May 22, 2026 20:48
Each recommendation in processPurchaseRecommendations now runs under a
30s per-rec deadline derived from context.WithTimeout. This matches the
purchasecfg hard limits (2 retries * 15s HTTP timeout) so a single hung
rec (SDK retry storm, STS hang, or pagination hang) cannot exhaust the
Lambda budget and starve the remaining recs in the same execution.

On timeout, a purchase[<id>]: per-rec deadline 30s exceeded log line
fires so CloudWatch pinpoints which rec hung and what the elapsed time
was, without requiring a second failure to reproduce.

Closes part of issue #683.
Adds purchase[<exec-id>]: tagged timing logs at every stage of the
approval path so CloudWatch can pinpoint exactly where a future hang
occurs:

- approvePurchaseViaSession: entry (auth=session) + elapsed on exit
- ApproveExecution: entry (auth=token) + elapsed on exit for both the
  preflight-reject path and the full execute path
- ApproveAndExecute: entry + status-transition elapsed + execute elapsed

Actor emails are masked to ***@Domain before logging (PII rule).

Part of issue #683 diagnostic logging extension.
…olvers

Adds purchase[resolveAccountProvider/resolveAWSProvider/resolveAzureProvider/
resolveGCPProvider]: entry + elapsed-on-exit logs so CloudWatch can show
whether a hang is in credential resolution (STS AssumeRole, OIDC token
exchange) vs the cloud API call itself.

Each log includes provider, account ID, auth mode, and elapsed time.
Error paths also log before returning so a failed resolution is visible
in the trace even if the caller swallows the error in an aggregator.

Part of issue #683 diagnostic logging extension.
…ming logs

Instruments the two network calls that ValidateCredentials makes during
provider construction so CloudWatch distinguishes a credential-validation
hang from the subsequent offering-search or purchase hang:

- providers/aws/provider.go: purchase[STS]: GetCallerIdentity starting /
  returned in <elapsed> / failed after <elapsed>
- providers/azure/provider.go: purchase[Azure]: subscriptions.NewListPager/
  NextPage starting / returned in <elapsed> / failed after <elapsed>

Both use the purchase[...]: prefix for grep-ability. Profile/subscription
context is logged but no tokens, key material, or full identities.

Part of issue #683 diagnostic logging extension.
…gID in all 7 AWS service clients

Adds purchase[<exec-id>]: tagged per-pagination-page timing logs to the
findOfferingID loop in each of the 7 AWS purchase-path service clients
(EC2, RDS, ElastiCache, MemoryDB, OpenSearch, Redshift, SavingsPlans).

Each log line records:
- Entry: service, instance/node type, term, payment option
- Per page: page number, offering count returned, page elapsed
- Hit: page number where match was found, total elapsed
- Exhausted: total pages scanned, total elapsed (no-match path)

The exec-id is threaded from PurchaseCommitment(opts.ExecutionID) into
findOfferingID so all lines share the same purchase[<uuid>]: prefix and a
single CloudWatch Logs Insights query can reconstruct the full pagination
trace for any hanging execution.

ValidateOffering and GetOfferingDetails callers pass "" (rendered as
"no-exec") since they run outside the purchase flow.

Part of issue #683 diagnostic logging extension.
…currency

Adds purchase[fan-out]: entry + elapsed-on-exit log lines to each
goroutine launched by FanOutWithConcurrency. Both the account fan-out
(multi-account execution) and the per-rec fan-out (processPurchaseRec-
ommendations) go through FanOutWithConcurrency, so one addition covers
both call sites.

Logs:
- purchase[fan-out]: goroutine starting for account=<id>
- purchase[fan-out]: goroutine for account=<id> completed in <elapsed>
- purchase[fan-out]: goroutine for account=<id> failed after <elapsed>: <err>

Together with the per-rec ctx.WithTimeout and findOfferingID page logs,
a CloudWatch Logs Insights query on purchase[ now reconstructs the full
timing tree for a stuck execution: fan-out goroutine start -> credential
resolution -> STS -> service client -> findOfferingID pagination ->
PurchaseCommitment.

Part of issue #683 diagnostic logging extension.
…tchers

Move context.WithTimeout from the fan-out closure (processPurchaseRecommendations)
into executeSinglePurchase where it belongs -- the budget governs the cloud API
call chain, not the fan-out bookkeeping.

Update all GetServiceClient and PurchaseCommitment mock.On() setups in
execution_test.go, coverage_extra_test.go, and manager_test.go to match
on deadline-carrying contexts via:

  mock.MatchedBy(func(c context.Context) bool { _, ok := c.Deadline(); return ok })

This positively asserts the 30s per-rec timeout is propagated to every
cloud SDK call rather than silently accepting any context (mock.Anything).

Also fix the savingsplans/client_test.go build error: findOfferingID gained
an execID string parameter in commit 5 but the direct test call was not
updated (only the PurchaseCommitment path was updated).

All 180 purchase package tests and 308 AWS service tests now pass.

Part of issue #683 diagnostic logging extension.
@cristim
Copy link
Copy Markdown
Member Author

cristim commented May 22, 2026

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 22, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
internal/purchase/execution_test.go (1)

70-73: ⚡ Quick win

Strengthen provider-factory context assertions to preserve the 30s budget contract.

Using mock.Anything for CreateAndValidateProvider no longer verifies deadline propagation on the provider-construction step. A deadline-aware matcher here would keep this contract covered end-to-end.

Suggested pattern
+deadlineCtx := mock.MatchedBy(func(c context.Context) bool {
+	d, ok := c.Deadline()
+	if !ok {
+		return false
+	}
+	remaining := time.Until(d)
+	return remaining > 0 && remaining <= 31*time.Second
+})
-
-mockFactory.On("CreateAndValidateProvider", mock.Anything, "aws", mock.Anything).Return(mockProviderInst, nil)
+mockFactory.On("CreateAndValidateProvider", deadlineCtx, "aws", mock.Anything).Return(mockProviderInst, nil)

Also applies to: 130-135, 186-191, 531-534, 759-762, 924-931, 1042-1054, 1171-1180, 1394-1397, 1468-1476, 1550-1557, 1688-1691

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/purchase/execution_test.go` around lines 70 - 73, The test currently
uses mockFactory.On("CreateAndValidateProvider", mock.Anything, "aws",
mock.Anything) which doesn't assert the 30s deadline on the
provider-construction context; update the matcher for the first argument to a
deadline-aware matcher (e.g., mock.MatchedBy that checks ctx.Deadline() is set)
so CreateAndValidateProvider receives a context with a deadline, similar to how
GetServiceClient and PurchaseCommitment are asserted; apply the same change to
other occurrences listed (lines around 130-135, 186-191, etc.) referencing the
same method name CreateAndValidateProvider to preserve the deadline propagation
contract end-to-end.
providers/aws/internal/purchasecfg/config.go (1)

41-45: ⚡ Quick win

Preserve caller-provided aws.Config.HTTPClient at Line 44 (composability).

Line 44 unconditionally replaces base.HTTPClient with a new *http.Client{Timeout: HTTPTimeout}, which would drop any caller/custom transport/proxy/TLS/instrumentation settings if base.HTTPClient is non-default. In the main AWS provider path, aws.Config is loaded via config.LoadDefaultConfig with only region/profile/credentials options, so this likely doesn’t affect current production flows—but it reduces future safety and test composability.

Suggested fix
 func NewConfig(base aws.Config) aws.Config {
 	cfg := base.Copy()
 	cfg.RetryMaxAttempts = MaxAttempts
-	cfg.HTTPClient = &http.Client{Timeout: HTTPTimeout}
+	if hc, ok := base.HTTPClient.(*http.Client); ok && hc != nil {
+		clone := hc.Clone()
+		clone.Timeout = HTTPTimeout
+		cfg.HTTPClient = clone
+	} else {
+		cfg.HTTPClient = &http.Client{Timeout: HTTPTimeout}
+	}
 	return cfg
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@providers/aws/internal/purchasecfg/config.go` around lines 41 - 45, NewConfig
currently overwrites any caller-provided HTTP client, dropping custom
transports/proxies/TLS/instrumentation; change it to preserve base.HTTPClient:
after copying base into cfg, only set cfg.HTTPClient = &http.Client{Timeout:
HTTPTimeout} when base.HTTPClient is nil, otherwise leave cfg.HTTPClient
untouched (or if you want to ensure a timeout, detect if cfg.HTTPClient is an
*http.Client and only set its Timeout when it's zero). Keep the rest
(cfg.RetryMaxAttempts = MaxAttempts) unchanged; this uses the NewConfig function
and symbols RetryMaxAttempts/HTTPTimeout to locate the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/purchase/execution.go`:
- Around line 791-794: The current check treats any non-nil recCtx.Err() as the
30s per-rec timeout; update the branch that inspects recCtx.Err() to distinguish
context.DeadlineExceeded from context.Canceled (and other errors). Use
errors.Is(recCtx.Err(), context.DeadlineExceeded) to log the "per-rec deadline
30s exceeded" message, use errors.Is(recCtx.Err(), context.Canceled) to log a
"rec cancelled by parent/ctx canceled" message (including opts.ExecutionID,
recTuple, elapsed), and otherwise log the actual error from recCtx.Err(); ensure
to import the errors package if not present and keep logging via logging.Errorf
and the existing variables (recCtx, opts.ExecutionID, recTuple, elapsed).

---

Nitpick comments:
In `@internal/purchase/execution_test.go`:
- Around line 70-73: The test currently uses
mockFactory.On("CreateAndValidateProvider", mock.Anything, "aws", mock.Anything)
which doesn't assert the 30s deadline on the provider-construction context;
update the matcher for the first argument to a deadline-aware matcher (e.g.,
mock.MatchedBy that checks ctx.Deadline() is set) so CreateAndValidateProvider
receives a context with a deadline, similar to how GetServiceClient and
PurchaseCommitment are asserted; apply the same change to other occurrences
listed (lines around 130-135, 186-191, etc.) referencing the same method name
CreateAndValidateProvider to preserve the deadline propagation contract
end-to-end.

In `@providers/aws/internal/purchasecfg/config.go`:
- Around line 41-45: NewConfig currently overwrites any caller-provided HTTP
client, dropping custom transports/proxies/TLS/instrumentation; change it to
preserve base.HTTPClient: after copying base into cfg, only set cfg.HTTPClient =
&http.Client{Timeout: HTTPTimeout} when base.HTTPClient is nil, otherwise leave
cfg.HTTPClient untouched (or if you want to ensure a timeout, detect if
cfg.HTTPClient is an *http.Client and only set its Timeout when it's zero). Keep
the rest (cfg.RetryMaxAttempts = MaxAttempts) unchanged; this uses the NewConfig
function and symbols RetryMaxAttempts/HTTPTimeout to locate the change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a97129d2-95a8-45dc-ab31-48891e82df8d

📥 Commits

Reviewing files that changed from the base of the PR and between b727ecb and e4ad3ea.

📒 Files selected for processing (22)
  • internal/api/handler_purchases.go
  • internal/execution/fanout.go
  • internal/purchase/approvals.go
  • internal/purchase/coverage_extra_test.go
  • internal/purchase/execution.go
  • internal/purchase/execution_test.go
  • internal/purchase/manager_test.go
  • providers/aws/internal/purchasecfg/config.go
  • providers/aws/internal/purchasecfg/config_test.go
  • providers/aws/provider.go
  • providers/aws/services/ec2/client.go
  • providers/aws/services/elasticache/client.go
  • providers/aws/services/memorydb/client.go
  • providers/aws/services/opensearch/client.go
  • providers/aws/services/rds/client.go
  • providers/aws/services/redshift/client.go
  • providers/aws/services/savingsplans/client.go
  • providers/aws/services/savingsplans/client_test.go
  • providers/azure/provider.go
  • terraform/environments/aws/github-dev.tfvars
  • terraform/environments/aws/github-prod.tfvars
  • terraform/environments/aws/github-staging.tfvars

Comment thread internal/purchase/execution.go Outdated
Three changes per CR pass-1 triage:

1. Actionable (execution.go): distinguish context.DeadlineExceeded from
   context.Canceled in the post-PurchaseCommitment error log. The previous
   branch logged "per-rec deadline 30s exceeded" for any non-nil recCtx.Err(),
   including parent cancellations -- mislabeling Canceled as a timeout makes
   diagnostic triage harder. Logic extracted into logRecCtxErr helper to keep
   executeSinglePurchase cyclomatic complexity within the gocyclo limit.

2. Nitpick (execution_test.go): replace mock.Anything with a deadline-aware
   mock.MatchedBy on the first arg of all 12 CreateAndValidateProvider
   expectations. The 30s per-rec budget is set before the factory call, so
   all three mocked stages (factory, GetServiceClient, PurchaseCommitment)
   now assert deadline propagation end-to-end.

3. Nitpick (purchasecfg/config.go + config_test.go): preserve the caller's
   *http.Client when one is provided in base config. NewConfig now shallow-
   copies the client and only overrides Timeout, keeping any custom Transport,
   Jar, or CheckRedirect. Falls back to a fresh client when base.HTTPClient is
   nil or a non-*http.Client implementation. Adds
   TestNewConfig_PreservesCustomTransport to pin the contract.
@cristim
Copy link
Copy Markdown
Member Author

cristim commented May 22, 2026

CodeRabbit pass-1 response

All three findings addressed in commit ee235cf.

# File Finding Action Rationale
1 internal/purchase/execution.go:791 Mislabels all context cancellations as 30s per-rec timeout Fixed Extracted logRecCtxErr helper that distinguishes context.DeadlineExceeded from context.Canceled; kept out of the main function body so gocyclo stays under 10.
2 internal/purchase/execution_test.go (12 occurrences) CreateAndValidateProvider uses mock.Anything for ctx, not asserting 30s deadline propagation Fixed Replaced all 12 mock.Anything first-args with mock.MatchedBy(func(c context.Context) bool { _, ok := c.Deadline(); return ok }) to cover the full factory->client->SDK chain end-to-end.
3 providers/aws/internal/purchasecfg/config.go:44 NewConfig unconditionally replaces HTTPClient, dropping caller custom transports Fixed Shallow-copies the *http.Client when one is provided and only overrides Timeout; falls back to fresh client otherwise. Added TestNewConfig_PreservesCustomTransport to pin the contract.

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 22, 2026

@cristim Thanks for the detailed summary — all three fixes look well-targeted:

  1. logRecCtxErr helper — cleanly separates DeadlineExceeded (per-rec 30 s budget hit) from Canceled (upstream cancellation), without inflating cyclomatic complexity. Good call keeping it out of the main function body.
  2. mock.MatchedBy deadline assertions — covering all 12 call sites ensures the full factory → client → SDK chain actually receives a deadline-bound context, not just the outer one.
  3. Shallow-copy *http.Client in NewConfig — preserving custom transports while only overriding Timeout is the correct contract; pinning it with TestNewConfig_PreservesCustomTransport makes the behaviour explicit.

Triggering a fresh review now.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/purchase/execution_test.go`:
- Line 70: The current test uses a context-deadline matcher that only checks for
any deadline, which allows reusing a longer parent deadline; update the mock
expectations that call mockFactory.On("CreateAndValidateProvider", ...) (and the
other identical sites) to assert the remaining deadline is ~30s (with a small
tolerance) by reusing the existing timeout-deadline matcher helper used
elsewhere in tests (the helper that checks WithTimeout(..., 30*time.Second)
behavior) so executeSinglePurchase is verified to create a 30s per-request
deadline rather than inheriting a longer parent context.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 473da1cd-7126-4151-a248-dab9dda3b0b4

📥 Commits

Reviewing files that changed from the base of the PR and between e4ad3ea and ee235cf.

📒 Files selected for processing (4)
  • internal/purchase/execution.go
  • internal/purchase/execution_test.go
  • providers/aws/internal/purchasecfg/config.go
  • providers/aws/internal/purchasecfg/config_test.go

Comment thread internal/purchase/execution_test.go Outdated
…684)

CodeRabbit on PR #684 (review 4348606803) noted that the 53 mock matchers
using `_, ok := c.Deadline(); return ok` only assert *some* deadline is
set -- they would still pass if `executeSinglePurchase` silently reused
the parent's longer deadline instead of its own
`context.WithTimeout(ctx, 30*time.Second)` wrap.

Tighten the contract: introduce `hasPerRecDeadline(max time.Duration)` in
`execution_test.go` and replace all 53 sites across `execution_test.go`,
`coverage_extra_test.go`, and `manager_test.go` with
`mock.MatchedBy(hasPerRecDeadline(30*time.Second))`. The matcher now
positively asserts the remaining deadline is in (0, 30s], so removing or
loosening the wrap fails the test.
@cristim
Copy link
Copy Markdown
Member Author

cristim commented May 22, 2026

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 22, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

cristim added a commit that referenced this pull request May 22, 2026
* fix(purchases): narrow Describe*Offerings + cap pagination (#688)

Root cause: EC2 findOfferingID was missing the offering-type filter,
causing AWS to return all payment variants, triggering indefinite
pagination through sparse empty pages and burning the entire Lambda
60s budget (execution 9336a111).

Changes per service (all 7 purchase-path clients):
- EC2: add OfferingType direct field + convertEC2PaymentOption helper;
  verify returned offering type before returning; cap at 5 pages;
  check ctx.Err() between pages; add timing log per page
- RDS: add pagination cap + ctx.Err check + variant verification +
  per-page timing log (already had all filter fields)
- ElastiCache: same as RDS
- MemoryDB: add OfferingType + Duration direct fields to narrow at
  API level; replace client-side matchesDuration/matchesOfferingType
  loop with server-side filters; cap + ctx.Err + variant verification
- OpenSearch: API has no filter fields; add cap + ctx.Err + defense-
  in-depth variant check after client-side matching
- Redshift: API has no node-type/payment filter fields; add cap +
  ctx.Err + unknown-type guard
- SavingsPlans: add pagination loop + cap + ctx.Err to lookupOfferingID
  (was a single non-paginated call)

MaxResults/MaxRecords: EC2/Redshift/RDS/ElastiCache APIs have a hard
max of 100 per page; no change from current value. MemoryDB/OpenSearch
were already at 100.

Regression tests (3 per service = 21 new tests):
- Pagination cap fires after maxOfferingPages empty pages + next token
- Wrong-variant offering rejected by defense-in-depth guard
- Happy path returns correct offering ID on first page

Refs #684 #667 #632

* fix(ec2): use typed top-level fields, not Filters[], on offering describe (#688)

Followup on the same #688 issue. The initial commit kept InstanceType,
ProductDescription, InstanceTenancy, duration, and OfferingClass packed
into DescribeReservedInstancesOfferingsInput.Filters[], with only
OfferingType promoted to the typed top-level field.

Live-AWS verification showed that the Filter[]-heavy shape is what causes
AWS to return empty pages with NextToken on sparse offering sets (e.g.
t4g.nano regional no-upfront convertible), walking until the Lambda
budget expires. The all-typed-fields shape returns the exact matching
offering immediately. Typed fields land on AWS's primary indices; Filter[]
is a generic fallback that does not always hit the same indices.

Changes:
- Move InstanceType, ProductDescription, InstanceTenancy, MinDuration,
  MaxDuration, OfferingClass to typed top-level fields on the input struct.
- Keep only scope as a Filter[] entry (no typed equivalent on the input).
- Use the typed enum constants (types.InstanceType, types.RIProductDescription,
  types.Tenancy, types.OfferingClassTypeConvertible) instead of stringly-typed
  filter values.
- Extract ec2OfferingQuery struct + buildEC2OfferingQuery / describeInputFromQuery
  helpers so findOfferingID stays under the gocyclo cap.
- Delete the now-unused buildOfferingFilters helper and the always-returns-
  "convertible" getOfferingClass helper.
- Change scanEC2OfferingPage from "hard error on variant mismatch" to "soft
  skip + log and continue scanning". With the typed OfferingType field set
  this should never fire; if AWS somehow returns a mismatched variant we
  want to keep looking on later pages rather than fail the rec.
- Update TestFindOfferingID_WrongVariantRejected to assert the new soft-skip
  behaviour (mismatched variant on the only page -> "no offerings found").
- Drop TestBuildOfferingFilters_LegacyCanonicalization: it exercised the
  deleted helper. The underlying canonicalizeEC2Tenancy / canonicalizeEC2Scope
  helpers retain their own direct unit tests.

* fix(purchases): soft-skip guards + nil/empty token parity (closes #688)

Three related cleanups on the offering-lookup pipeline:

1. EC2/RDS: normalize end-of-pagination check to guard against both nil
   and empty-string NextToken/Marker. AWS may return either form for the
   terminal page; only checking nil risks an extra empty-page request.
   EC2 uses a new isLastEC2Page helper so the intent is explicit.

2. OpenSearch: remove the post-match payment-option defense-in-depth
   guard from scanOpenSearchOfferingPage. matchesPaymentOption already
   filters mismatched variants in the same loop iteration; the redundant
   hard error can never fire in practice and makes the wrong-variant test
   harder to reason about. Soft outcome (loop exhausts -> "no offerings
   found") is safer for the caller.

3. Redshift: same rationale as OpenSearch -- remove the hard-error guard
   for unknown offering type from scanRedshiftOfferingPage.

Update TestFindOfferingID_WrongVariantRejected for both services to
assert the soft-failure shape (err != nil, id == "").

* fix(purchases): address CR #690 category A/B/C findings

Category A (4 sites, nil-or-empty pagination terminator):
- elasticache: guard result.Marker against both nil and "" in
  paginateElastiCacheOfferings; avoid the redundant page that a
  non-nil empty-string marker would otherwise trigger
- opensearch, rds, redshift: same nil-or-empty guard on their
  respective NextToken/Marker fields (EC2 was already fixed in
  5f8f269)

Category B (ElastiCache payment option coercion):
- convertPaymentOption now returns (string, error); unknown values
  return an explicit "unsupported ElastiCache payment option" error
  instead of silently mapping to "Partial Upfront"
- findOfferingID propagates the error before any pagination begins
- TestClient_ConvertPaymentOption updated to assert the (result,
  error) return shape and verify the unknown-input error path

Category C (unreachable variant-mismatch guard):
- OpenSearch: replace the matchesPaymentOption pre-filter continue
  with a direct gotPayment != wantPayment check that returns an
  explicit "payment option mismatch" error; guard is now reachable
- Redshift: replace matchesOfferingType pre-filter continue with a
  direct "Regular"/"Upgradable" enum check that returns an explicit
  "unexpected type" error on first mismatch
- Test comments and assertions updated to match the new behaviour
  (mismatch surfaces as a diagnostic error, not a silent skip)

CR #690 review round 1

* chore: restore Terraform lock files to feat branch state

The previous commit accidentally included Terraform provider constraint
changes unrelated to the #688 offering-lookup fix. Restore the three
lock files to their state on feat/multicloud-web-frontend so this
branch stays focused on the purchase-path narrowing changes only.
@cristim
Copy link
Copy Markdown
Member Author

cristim commented May 22, 2026

@coderabbitai review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 22, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

effort/m Days impact/all-users Affects every user priority/p0 Drop everything; same-day fix severity/critical Major harm when it happens triaged Item has been triaged type/bug Defect urgency/now Drop other things

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant