fix(purchases): tighten AWS SDK retry budget + bump Lambda timeout to 300s (closes #683)#684
fix(purchases): tighten AWS SDK retry budget + bump Lambda timeout to 300s (closes #683)#684cristim wants to merge 12 commits into
Conversation
…K retries Introduce providers/aws/internal/purchasecfg.NewConfig which returns a copy of the caller's aws.Config with RetryMaxAttempts=2 and http.Client.Timeout=15s. Worst-case wall clock: 2 * 15s = 30s, well inside the 300s Lambda budget. Recommendation-collection clients (recommendations.NewClient) are explicitly NOT touched; they call Cost Explorer and need the higher default retry budget. Adds five unit tests covering: retry cap, HTTP timeout, immutability of the base config, region preservation, and the wall-clock bound invariant. Closes #683. Refs #667 #668 #632 #681.
…ctors Switch EC2, RDS, ElastiCache, MemoryDB, OpenSearch, Redshift, and SavingsPlans NewClient calls to use purchasecfg.NewConfig(cfg) so every purchase-path SDK client gets RetryMaxAttempts=2 and http.Client.Timeout=15s. OpenSearch and Redshift also pass the tightened config to their STS client (used for ARN construction / tagging on the purchase path). Recommendation-collection clients (recommendations.NewClient) are not touched; they use their own constructor and deliberately keep the SDK default retry budget for Cost Explorer calls. All 694 tests pass. go build ./... clean. Closes #683. Refs #667.
Raise lambda_timeout from 60s to 300s in github-dev, github-staging, and github-prod tfvars. The SDK retry tightening (purchasecfg, max 30s wall clock) combined with the increased Lambda budget means transient AWS API slowness either completes within 30s or surfaces a clear error to the user -- no more silent context deadline exceeded from the Lambda itself expiring mid-retry. Lambda Function URL has no upstream timeout ceiling, so 300s is safe. Closes #683. Refs #667 #681.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthroughAdds timing and structured logs to approval and provider flows, enforces 30s per-recommendation timeouts, introduces purchase-specific AWS SDK config (2 retries, 15s HTTP timeout) wired to purchase-path clients, relaxes tests to accept deadline-bearing contexts, and increases Lambda timeout to 300s. ChangesPurchase Approval and Execution Hardening
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related issues
Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
Each recommendation in processPurchaseRecommendations now runs under a 30s per-rec deadline derived from context.WithTimeout. This matches the purchasecfg hard limits (2 retries * 15s HTTP timeout) so a single hung rec (SDK retry storm, STS hang, or pagination hang) cannot exhaust the Lambda budget and starve the remaining recs in the same execution. On timeout, a purchase[<id>]: per-rec deadline 30s exceeded log line fires so CloudWatch pinpoints which rec hung and what the elapsed time was, without requiring a second failure to reproduce. Closes part of issue #683.
Adds purchase[<exec-id>]: tagged timing logs at every stage of the approval path so CloudWatch can pinpoint exactly where a future hang occurs: - approvePurchaseViaSession: entry (auth=session) + elapsed on exit - ApproveExecution: entry (auth=token) + elapsed on exit for both the preflight-reject path and the full execute path - ApproveAndExecute: entry + status-transition elapsed + execute elapsed Actor emails are masked to ***@Domain before logging (PII rule). Part of issue #683 diagnostic logging extension.
…olvers Adds purchase[resolveAccountProvider/resolveAWSProvider/resolveAzureProvider/ resolveGCPProvider]: entry + elapsed-on-exit logs so CloudWatch can show whether a hang is in credential resolution (STS AssumeRole, OIDC token exchange) vs the cloud API call itself. Each log includes provider, account ID, auth mode, and elapsed time. Error paths also log before returning so a failed resolution is visible in the trace even if the caller swallows the error in an aggregator. Part of issue #683 diagnostic logging extension.
…ming logs Instruments the two network calls that ValidateCredentials makes during provider construction so CloudWatch distinguishes a credential-validation hang from the subsequent offering-search or purchase hang: - providers/aws/provider.go: purchase[STS]: GetCallerIdentity starting / returned in <elapsed> / failed after <elapsed> - providers/azure/provider.go: purchase[Azure]: subscriptions.NewListPager/ NextPage starting / returned in <elapsed> / failed after <elapsed> Both use the purchase[...]: prefix for grep-ability. Profile/subscription context is logged but no tokens, key material, or full identities. Part of issue #683 diagnostic logging extension.
…gID in all 7 AWS service clients Adds purchase[<exec-id>]: tagged per-pagination-page timing logs to the findOfferingID loop in each of the 7 AWS purchase-path service clients (EC2, RDS, ElastiCache, MemoryDB, OpenSearch, Redshift, SavingsPlans). Each log line records: - Entry: service, instance/node type, term, payment option - Per page: page number, offering count returned, page elapsed - Hit: page number where match was found, total elapsed - Exhausted: total pages scanned, total elapsed (no-match path) The exec-id is threaded from PurchaseCommitment(opts.ExecutionID) into findOfferingID so all lines share the same purchase[<uuid>]: prefix and a single CloudWatch Logs Insights query can reconstruct the full pagination trace for any hanging execution. ValidateOffering and GetOfferingDetails callers pass "" (rendered as "no-exec") since they run outside the purchase flow. Part of issue #683 diagnostic logging extension.
…currency Adds purchase[fan-out]: entry + elapsed-on-exit log lines to each goroutine launched by FanOutWithConcurrency. Both the account fan-out (multi-account execution) and the per-rec fan-out (processPurchaseRec- ommendations) go through FanOutWithConcurrency, so one addition covers both call sites. Logs: - purchase[fan-out]: goroutine starting for account=<id> - purchase[fan-out]: goroutine for account=<id> completed in <elapsed> - purchase[fan-out]: goroutine for account=<id> failed after <elapsed>: <err> Together with the per-rec ctx.WithTimeout and findOfferingID page logs, a CloudWatch Logs Insights query on purchase[ now reconstructs the full timing tree for a stuck execution: fan-out goroutine start -> credential resolution -> STS -> service client -> findOfferingID pagination -> PurchaseCommitment. Part of issue #683 diagnostic logging extension.
…tchers
Move context.WithTimeout from the fan-out closure (processPurchaseRecommendations)
into executeSinglePurchase where it belongs -- the budget governs the cloud API
call chain, not the fan-out bookkeeping.
Update all GetServiceClient and PurchaseCommitment mock.On() setups in
execution_test.go, coverage_extra_test.go, and manager_test.go to match
on deadline-carrying contexts via:
mock.MatchedBy(func(c context.Context) bool { _, ok := c.Deadline(); return ok })
This positively asserts the 30s per-rec timeout is propagated to every
cloud SDK call rather than silently accepting any context (mock.Anything).
Also fix the savingsplans/client_test.go build error: findOfferingID gained
an execID string parameter in commit 5 but the direct test call was not
updated (only the PurchaseCommitment path was updated).
All 180 purchase package tests and 308 AWS service tests now pass.
Part of issue #683 diagnostic logging extension.
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
internal/purchase/execution_test.go (1)
70-73: ⚡ Quick winStrengthen provider-factory context assertions to preserve the 30s budget contract.
Using
mock.AnythingforCreateAndValidateProviderno longer verifies deadline propagation on the provider-construction step. A deadline-aware matcher here would keep this contract covered end-to-end.Suggested pattern
+deadlineCtx := mock.MatchedBy(func(c context.Context) bool { + d, ok := c.Deadline() + if !ok { + return false + } + remaining := time.Until(d) + return remaining > 0 && remaining <= 31*time.Second +}) - -mockFactory.On("CreateAndValidateProvider", mock.Anything, "aws", mock.Anything).Return(mockProviderInst, nil) +mockFactory.On("CreateAndValidateProvider", deadlineCtx, "aws", mock.Anything).Return(mockProviderInst, nil)Also applies to: 130-135, 186-191, 531-534, 759-762, 924-931, 1042-1054, 1171-1180, 1394-1397, 1468-1476, 1550-1557, 1688-1691
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/purchase/execution_test.go` around lines 70 - 73, The test currently uses mockFactory.On("CreateAndValidateProvider", mock.Anything, "aws", mock.Anything) which doesn't assert the 30s deadline on the provider-construction context; update the matcher for the first argument to a deadline-aware matcher (e.g., mock.MatchedBy that checks ctx.Deadline() is set) so CreateAndValidateProvider receives a context with a deadline, similar to how GetServiceClient and PurchaseCommitment are asserted; apply the same change to other occurrences listed (lines around 130-135, 186-191, etc.) referencing the same method name CreateAndValidateProvider to preserve the deadline propagation contract end-to-end.providers/aws/internal/purchasecfg/config.go (1)
41-45: ⚡ Quick winPreserve caller-provided
aws.Config.HTTPClientat Line 44 (composability).Line 44 unconditionally replaces
base.HTTPClientwith a new*http.Client{Timeout: HTTPTimeout}, which would drop any caller/custom transport/proxy/TLS/instrumentation settings ifbase.HTTPClientis non-default. In the main AWS provider path,aws.Configis loaded viaconfig.LoadDefaultConfigwith only region/profile/credentials options, so this likely doesn’t affect current production flows—but it reduces future safety and test composability.Suggested fix
func NewConfig(base aws.Config) aws.Config { cfg := base.Copy() cfg.RetryMaxAttempts = MaxAttempts - cfg.HTTPClient = &http.Client{Timeout: HTTPTimeout} + if hc, ok := base.HTTPClient.(*http.Client); ok && hc != nil { + clone := hc.Clone() + clone.Timeout = HTTPTimeout + cfg.HTTPClient = clone + } else { + cfg.HTTPClient = &http.Client{Timeout: HTTPTimeout} + } return cfg }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@providers/aws/internal/purchasecfg/config.go` around lines 41 - 45, NewConfig currently overwrites any caller-provided HTTP client, dropping custom transports/proxies/TLS/instrumentation; change it to preserve base.HTTPClient: after copying base into cfg, only set cfg.HTTPClient = &http.Client{Timeout: HTTPTimeout} when base.HTTPClient is nil, otherwise leave cfg.HTTPClient untouched (or if you want to ensure a timeout, detect if cfg.HTTPClient is an *http.Client and only set its Timeout when it's zero). Keep the rest (cfg.RetryMaxAttempts = MaxAttempts) unchanged; this uses the NewConfig function and symbols RetryMaxAttempts/HTTPTimeout to locate the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/purchase/execution.go`:
- Around line 791-794: The current check treats any non-nil recCtx.Err() as the
30s per-rec timeout; update the branch that inspects recCtx.Err() to distinguish
context.DeadlineExceeded from context.Canceled (and other errors). Use
errors.Is(recCtx.Err(), context.DeadlineExceeded) to log the "per-rec deadline
30s exceeded" message, use errors.Is(recCtx.Err(), context.Canceled) to log a
"rec cancelled by parent/ctx canceled" message (including opts.ExecutionID,
recTuple, elapsed), and otherwise log the actual error from recCtx.Err(); ensure
to import the errors package if not present and keep logging via logging.Errorf
and the existing variables (recCtx, opts.ExecutionID, recTuple, elapsed).
---
Nitpick comments:
In `@internal/purchase/execution_test.go`:
- Around line 70-73: The test currently uses
mockFactory.On("CreateAndValidateProvider", mock.Anything, "aws", mock.Anything)
which doesn't assert the 30s deadline on the provider-construction context;
update the matcher for the first argument to a deadline-aware matcher (e.g.,
mock.MatchedBy that checks ctx.Deadline() is set) so CreateAndValidateProvider
receives a context with a deadline, similar to how GetServiceClient and
PurchaseCommitment are asserted; apply the same change to other occurrences
listed (lines around 130-135, 186-191, etc.) referencing the same method name
CreateAndValidateProvider to preserve the deadline propagation contract
end-to-end.
In `@providers/aws/internal/purchasecfg/config.go`:
- Around line 41-45: NewConfig currently overwrites any caller-provided HTTP
client, dropping custom transports/proxies/TLS/instrumentation; change it to
preserve base.HTTPClient: after copying base into cfg, only set cfg.HTTPClient =
&http.Client{Timeout: HTTPTimeout} when base.HTTPClient is nil, otherwise leave
cfg.HTTPClient untouched (or if you want to ensure a timeout, detect if
cfg.HTTPClient is an *http.Client and only set its Timeout when it's zero). Keep
the rest (cfg.RetryMaxAttempts = MaxAttempts) unchanged; this uses the NewConfig
function and symbols RetryMaxAttempts/HTTPTimeout to locate the change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a97129d2-95a8-45dc-ab31-48891e82df8d
📒 Files selected for processing (22)
internal/api/handler_purchases.gointernal/execution/fanout.gointernal/purchase/approvals.gointernal/purchase/coverage_extra_test.gointernal/purchase/execution.gointernal/purchase/execution_test.gointernal/purchase/manager_test.goproviders/aws/internal/purchasecfg/config.goproviders/aws/internal/purchasecfg/config_test.goproviders/aws/provider.goproviders/aws/services/ec2/client.goproviders/aws/services/elasticache/client.goproviders/aws/services/memorydb/client.goproviders/aws/services/opensearch/client.goproviders/aws/services/rds/client.goproviders/aws/services/redshift/client.goproviders/aws/services/savingsplans/client.goproviders/aws/services/savingsplans/client_test.goproviders/azure/provider.goterraform/environments/aws/github-dev.tfvarsterraform/environments/aws/github-prod.tfvarsterraform/environments/aws/github-staging.tfvars
Three changes per CR pass-1 triage: 1. Actionable (execution.go): distinguish context.DeadlineExceeded from context.Canceled in the post-PurchaseCommitment error log. The previous branch logged "per-rec deadline 30s exceeded" for any non-nil recCtx.Err(), including parent cancellations -- mislabeling Canceled as a timeout makes diagnostic triage harder. Logic extracted into logRecCtxErr helper to keep executeSinglePurchase cyclomatic complexity within the gocyclo limit. 2. Nitpick (execution_test.go): replace mock.Anything with a deadline-aware mock.MatchedBy on the first arg of all 12 CreateAndValidateProvider expectations. The 30s per-rec budget is set before the factory call, so all three mocked stages (factory, GetServiceClient, PurchaseCommitment) now assert deadline propagation end-to-end. 3. Nitpick (purchasecfg/config.go + config_test.go): preserve the caller's *http.Client when one is provided in base config. NewConfig now shallow- copies the client and only overrides Timeout, keeping any custom Transport, Jar, or CheckRedirect. Falls back to a fresh client when base.HTTPClient is nil or a non-*http.Client implementation. Adds TestNewConfig_PreservesCustomTransport to pin the contract.
CodeRabbit pass-1 responseAll three findings addressed in commit ee235cf.
@coderabbitai review |
|
Triggering a fresh review now. ✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/purchase/execution_test.go`:
- Line 70: The current test uses a context-deadline matcher that only checks for
any deadline, which allows reusing a longer parent deadline; update the mock
expectations that call mockFactory.On("CreateAndValidateProvider", ...) (and the
other identical sites) to assert the remaining deadline is ~30s (with a small
tolerance) by reusing the existing timeout-deadline matcher helper used
elsewhere in tests (the helper that checks WithTimeout(..., 30*time.Second)
behavior) so executeSinglePurchase is verified to create a 30s per-request
deadline rather than inheriting a longer parent context.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 473da1cd-7126-4151-a248-dab9dda3b0b4
📒 Files selected for processing (4)
internal/purchase/execution.gointernal/purchase/execution_test.goproviders/aws/internal/purchasecfg/config.goproviders/aws/internal/purchasecfg/config_test.go
…684) CodeRabbit on PR #684 (review 4348606803) noted that the 53 mock matchers using `_, ok := c.Deadline(); return ok` only assert *some* deadline is set -- they would still pass if `executeSinglePurchase` silently reused the parent's longer deadline instead of its own `context.WithTimeout(ctx, 30*time.Second)` wrap. Tighten the contract: introduce `hasPerRecDeadline(max time.Duration)` in `execution_test.go` and replace all 53 sites across `execution_test.go`, `coverage_extra_test.go`, and `manager_test.go` with `mock.MatchedBy(hasPerRecDeadline(30*time.Second))`. The matcher now positively asserts the remaining deadline is in (0, 30s], so removing or loosening the wrap fails the test.
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
* fix(purchases): narrow Describe*Offerings + cap pagination (#688) Root cause: EC2 findOfferingID was missing the offering-type filter, causing AWS to return all payment variants, triggering indefinite pagination through sparse empty pages and burning the entire Lambda 60s budget (execution 9336a111). Changes per service (all 7 purchase-path clients): - EC2: add OfferingType direct field + convertEC2PaymentOption helper; verify returned offering type before returning; cap at 5 pages; check ctx.Err() between pages; add timing log per page - RDS: add pagination cap + ctx.Err check + variant verification + per-page timing log (already had all filter fields) - ElastiCache: same as RDS - MemoryDB: add OfferingType + Duration direct fields to narrow at API level; replace client-side matchesDuration/matchesOfferingType loop with server-side filters; cap + ctx.Err + variant verification - OpenSearch: API has no filter fields; add cap + ctx.Err + defense- in-depth variant check after client-side matching - Redshift: API has no node-type/payment filter fields; add cap + ctx.Err + unknown-type guard - SavingsPlans: add pagination loop + cap + ctx.Err to lookupOfferingID (was a single non-paginated call) MaxResults/MaxRecords: EC2/Redshift/RDS/ElastiCache APIs have a hard max of 100 per page; no change from current value. MemoryDB/OpenSearch were already at 100. Regression tests (3 per service = 21 new tests): - Pagination cap fires after maxOfferingPages empty pages + next token - Wrong-variant offering rejected by defense-in-depth guard - Happy path returns correct offering ID on first page Refs #684 #667 #632 * fix(ec2): use typed top-level fields, not Filters[], on offering describe (#688) Followup on the same #688 issue. The initial commit kept InstanceType, ProductDescription, InstanceTenancy, duration, and OfferingClass packed into DescribeReservedInstancesOfferingsInput.Filters[], with only OfferingType promoted to the typed top-level field. Live-AWS verification showed that the Filter[]-heavy shape is what causes AWS to return empty pages with NextToken on sparse offering sets (e.g. t4g.nano regional no-upfront convertible), walking until the Lambda budget expires. The all-typed-fields shape returns the exact matching offering immediately. Typed fields land on AWS's primary indices; Filter[] is a generic fallback that does not always hit the same indices. Changes: - Move InstanceType, ProductDescription, InstanceTenancy, MinDuration, MaxDuration, OfferingClass to typed top-level fields on the input struct. - Keep only scope as a Filter[] entry (no typed equivalent on the input). - Use the typed enum constants (types.InstanceType, types.RIProductDescription, types.Tenancy, types.OfferingClassTypeConvertible) instead of stringly-typed filter values. - Extract ec2OfferingQuery struct + buildEC2OfferingQuery / describeInputFromQuery helpers so findOfferingID stays under the gocyclo cap. - Delete the now-unused buildOfferingFilters helper and the always-returns- "convertible" getOfferingClass helper. - Change scanEC2OfferingPage from "hard error on variant mismatch" to "soft skip + log and continue scanning". With the typed OfferingType field set this should never fire; if AWS somehow returns a mismatched variant we want to keep looking on later pages rather than fail the rec. - Update TestFindOfferingID_WrongVariantRejected to assert the new soft-skip behaviour (mismatched variant on the only page -> "no offerings found"). - Drop TestBuildOfferingFilters_LegacyCanonicalization: it exercised the deleted helper. The underlying canonicalizeEC2Tenancy / canonicalizeEC2Scope helpers retain their own direct unit tests. * fix(purchases): soft-skip guards + nil/empty token parity (closes #688) Three related cleanups on the offering-lookup pipeline: 1. EC2/RDS: normalize end-of-pagination check to guard against both nil and empty-string NextToken/Marker. AWS may return either form for the terminal page; only checking nil risks an extra empty-page request. EC2 uses a new isLastEC2Page helper so the intent is explicit. 2. OpenSearch: remove the post-match payment-option defense-in-depth guard from scanOpenSearchOfferingPage. matchesPaymentOption already filters mismatched variants in the same loop iteration; the redundant hard error can never fire in practice and makes the wrong-variant test harder to reason about. Soft outcome (loop exhausts -> "no offerings found") is safer for the caller. 3. Redshift: same rationale as OpenSearch -- remove the hard-error guard for unknown offering type from scanRedshiftOfferingPage. Update TestFindOfferingID_WrongVariantRejected for both services to assert the soft-failure shape (err != nil, id == ""). * fix(purchases): address CR #690 category A/B/C findings Category A (4 sites, nil-or-empty pagination terminator): - elasticache: guard result.Marker against both nil and "" in paginateElastiCacheOfferings; avoid the redundant page that a non-nil empty-string marker would otherwise trigger - opensearch, rds, redshift: same nil-or-empty guard on their respective NextToken/Marker fields (EC2 was already fixed in 5f8f269) Category B (ElastiCache payment option coercion): - convertPaymentOption now returns (string, error); unknown values return an explicit "unsupported ElastiCache payment option" error instead of silently mapping to "Partial Upfront" - findOfferingID propagates the error before any pagination begins - TestClient_ConvertPaymentOption updated to assert the (result, error) return shape and verify the unknown-input error path Category C (unreachable variant-mismatch guard): - OpenSearch: replace the matchesPaymentOption pre-filter continue with a direct gotPayment != wantPayment check that returns an explicit "payment option mismatch" error; guard is now reachable - Redshift: replace matchesOfferingType pre-filter continue with a direct "Regular"/"Upgradable" enum check that returns an explicit "unexpected type" error on first mismatch - Test comments and assertions updated to match the new behaviour (mismatch surfaces as a diagnostic error, not a silent skip) CR #690 review round 1 * chore: restore Terraform lock files to feat branch state The previous commit accidentally included Terraform provider constraint changes unrelated to the #688 offering-lookup fix. Restore the three lock files to their state on feat/multicloud-web-frontend so this branch stays focused on the purchase-path narrowing changes only.
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
Summary
Third recurrence of #667 prompted this PR. Original scope was tighten the AWS SDK retry budget; user pushback on the "hanging goroutine" hypothesis correctly identified that SDK-side bounds alone don't cover hangs upstream of the SDK call (STS validation, credential resolution, fan-out scheduling). Scope expanded to a defense-in-depth approach: bound the per-rec budget AND make hangs observable.
Closes #683. Refs #667 #668 #632 #681 #600.
What changed (3 parts)
Part 1 — bound the AWS SDK retry storm
providers/aws/internal/purchasecfgnew package,NewConfig(base aws.Config)returns a copy withRetryMaxAttempts=2+http.Client{Timeout: 15s}. Worst-case wall clock per SDK request: 2 × 15s = 30s.purchasecfg.NewConfig. OpenSearch + Redshift also tighten their STS client used for ARN/tagging.recommendations.NewClientis independent and legitimately needs the SDK default 3-retry budget.Part 2 — bound the per-rec execution itself
context.WithTimeout(ctx, 30s)wrapping the entire purchase chain insideexecuteSinglePurchase(provider construction + STS validation + service-client lookup + details decode + the AWS SDK call). Caps the per-rec budget regardless of which step hangs. 30s matches the part-1 SDK ceiling.mock.On(...)call sites in the purchase tests updated fromctxliteral tomock.MatchedBy(func(c context.Context) bool { _, ok := c.Deadline(); return ok })— positively asserts the deadline propagates.Part 3 — make hangs observable
Every step of the purchase chain now emits an Info log tagged with the execution ID so a single CloudWatch filter (
filter @message like /purchase\[<exec-id>\]/) returns the full timing breakdown. Lines added (one per commit):purchase[<id>]: approve entry (auth=session)/... exit success in 4.2sresolveAccountProvider+ AWS/Azure/GCP sub-resolverspurchase[<id>]: resolved AWS provider for account <X> in 142mspurchase[<id>]: STS GetCallerIdentity for account <X> starting/... returned in 87msfindOfferingIDpagination in all 7 servicespurchase[<id>]: EC2 offering page 1 (token=*) returned in 612ms with 14 offeringsFanOutWithConcurrencyper-goroutinefan-out goroutine entry (account=<X>)/... exit success in 1.8sToken / credential values are masked to the last 8 chars per the
feedback_pii_in_logs.mdmemory entry.Part 4 — Lambda budget headroom
lambda_timeoutbumped 60 -> 300s ingithub-{dev,staging,prod}.tfvars. With the per-rec 30s ceiling from Part 2, a single hung rec is now nowhere near the Lambda ceiling — the bigger budget exists to absorb large multi-rec executions, not to mask hangs.Commit list (10)
df4755556purchasecfg helperfe72342f7wire purchasecfg into 7 service constructors7a899ed0elambda_timeout 60 -> 300a8677e939per-rec context.WithTimeout(30s)3d10f285capprove handler entry/exit logs6804727ebresolveAccountProvider timing logs8f49cf2b2STS + Azure validation call timing logs2919058f5per-page logs in 7 AWS findOfferingID loops09462dd9dFanOutWithConcurrency per-goroutine timinge4ad3eacdtest fixups (mock.MatchedBy for the 42 deadline-aware ctx call sites + 1 savingsplans test parameter)Test plan
go test ./internal/purchase/... ./internal/execution/... ./internal/api/... ./providers/aws/services/... -count=1 -short— 1732 tests pass across 10 packagesgo build ./...— cleanterraform validateinterraform/environments/aws/— cleanfilter @message like /purchase\[<exec-id>\]/ | sort @timestamp ascCross-references
Summary by CodeRabbit