refactor: harden core infrastructure, tenant-manager, and runtime packages #405

fredcamaral wants to merge 1 commit into main from
Conversation
…kages

Applies comprehensive code review findings across all non-systemplane packages.

Tenant-manager:
- TenantAwareLogger.With/WithGroup now preserve tenant_id injection chain
- ShouldSchedule adds hasClient guard and uses deterministic time source
- Cert path traversal prevention via ValidateCertPath in core/security.go
- TLS/SSL runtime warnings for insecure configurations
- Postgres Close() harmonized to snapshot-then-cleanup pattern
- SetAuthVerified helper and migration guide for upstream auth assertion
- Tenant ID removed from 404 responses to prevent enumeration
- HandleFetchError logs close errors instead of swallowing them
- New unit tests for configfetch and revalidation packages
- Dead logcompat methods removed

Core infrastructure:
- Panic recovery consolidated to single logPanicWithStack path
- Dead logPanic wrapper removed
- CORS helper renamed for clarity (isUnrestrictedCORSOrigin)
- requestScopedLogger nil guard added
- Circuitbreaker nil guard symmetry restored
- snapshotConnectState precondition documented
- Various test improvements (serial execution docs, edge cases)

Build/docs:
- REVIEW.md removed (superseded by review process)
- MIGRATION_MAP.md updated

X-Lerian-Ref: 0x1
Walkthrough

This pull request introduces dependency-injection patterns across multiple subsystems, refactors error-handling and logging infrastructure, and enhances security validation. Key changes include consolidating per-package function dependencies into dependency structs (circuitbreaker, mongo, postgres, redis, rabbitmq), introducing helper packages for tenant revalidation and configuration fetching, standardizing HTTP error responses, and replacing reflection-based nil checks with …

Suggested reviewers
🚥 Pre-merge checks | ✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 25
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
commons/net/http/withLogging_middleware.go (1)
123-129: ⚠️ Potential issue | 🟠 Major — Guard the direct logger call in the gRPC override path.

This block still calls `mid.Logger.Log(...)` directly. A custom `LogMiddlewareOption` can set `l.Logger = nil`, which would panic here despite the new scoped nil guard.

Proposed fix

```diff
 if rid, ok := getValidBodyRequestID(req); ok {
 	if prev := getMetadataID(ctx); prev != "" && prev != rid {
-		mid.Logger.Log(ctx, log.LevelDebug, "Overriding correlation id from metadata with body request_id",
+		requestScopedLogger(mid.Logger, requestID).Log(ctx, log.LevelDebug, "Overriding correlation id from metadata with body request_id",
 			log.String("metadata_id", prev),
 			log.String("body_request_id", rid),
 		)
 	}
 }
```

As per coding guidelines, "Keep behavior nil-safe and concurrency-safe by default in exported APIs".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@commons/net/http/withLogging_middleware.go` around lines 123 - 129, The gRPC override path calls mid.Logger.Log(...) directly which can panic if a LogMiddlewareOption set Logger to nil; update the block in the withLogging middleware (the code using getValidBodyRequestID and getMetadataID) to guard the logger before calling Log (e.g., check mid.Logger != nil or use the existing safe logging helper used elsewhere), so the call becomes nil-safe; ensure you reference and protect the mid.Logger.Log invocation and keep behavior consistent with other guarded log calls so exported APIs remain nil-safe and concurrency-safe.

commons/rabbitmq/rabbitmq.go (1)
170-176: ⚠️ Potential issue | 🟡 Minor — Documentation inconsistency: missing `applyDefaults()` precondition.

This method uses `rc.deps.isConnClosed` and `rc.deps.isChannelClosed`, which require `applyDefaults()` to have been called. The comment documents the lock requirement but not the deps precondition, unlike `snapshotConnectState` (lines 195-197) which explicitly documents both.

📝 Suggested documentation update

```diff
 // isFullyConnected reports whether the connection and channel are both open.
-// The caller MUST hold rc.mu.
+// The caller MUST hold rc.mu AND must have called applyDefaults() before
+// this method to ensure deps.isConnClosed and deps.isChannelClosed are set.
 func (rc *RabbitMQConnection) isFullyConnected() bool {
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@commons/rabbitmq/rabbitmq.go` around lines 170 - 176, The comment for isFullyConnected is missing the precondition that rc.deps must be initialized via applyDefaults(); update the doc comment above func (rc *RabbitMQConnection) isFullyConnected() to state that the caller MUST hold rc.mu and MUST have called applyDefaults() (or that rc.deps has been initialized), since the method calls rc.deps.isConnClosed and rc.deps.isChannelClosed; keep wording consistent with snapshotConnectState's comment to make the dependency explicit.

commons/runtime/recover_test.go (1)
225-235: ⚠️ Potential issue | 🟡 Minor — This renamed test still doesn't verify the panic value.

Lines 233-234 only prove that an error log happened. A regression that stopped attaching the `value` field would still pass, so this no longer covers what `TestLogPanicWithStack_LogsPanicValue` claims. Please assert the emitted `value` field from `logger.errorCalls` in the non-production path this test exercises.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@commons/runtime/recover_test.go` around lines 225 - 235, The test TestLogPanicWithStack_LogsPanicValue currently only checks that an error was logged; update it to also assert that the logged entry contains the expected panic value: after calling logPanicWithStack(logger, "test-handler", "panic value", []byte("fake stack")), inspect logger.errorCalls (the non-production path) and add an assertion that the record's "value" field equals "panic value" (or otherwise matches the passed value), so the test verifies the panic payload was attached; reference logPanicWithStack, logger.errorCalls and logger.wasPanicLogged() to locate the relevant assertions to extend.

commons/tenant-manager/postgres/manager.go (1)
966-999: ⚠️ Potential issue | 🔴 Critical — `Close` can still leak a resolver through the reconnect path.

An in-flight `reconnectPostgres` can pass `canStorePostgresConnection` before Line 970 flips `p.closed`, then win the second lock in `swapPostgresConnection` after this snapshot/cleanup phase. That repopulates `p.connections` after `Close` has already decided what to close, leaving a live resolver outside the shutdown loop.

As per coding guidelines, "Keep behavior nil-safe and concurrency-safe by default in exported APIs".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@commons/tenant-manager/postgres/manager.go` around lines 966 - 999, Close currently can race with reconnectPostgres: reconnectPostgres may call canStorePostgresConnection before p.closed is set and later win swapPostgresConnection, repopulating p.connections after Close snapped and will not be closed. Fix by making shutdown visible to reconnections: ensure Close sets a shutdown flag that canStorePostgresConnection and swapPostgresConnection check atomically (or via a closed shutdown channel) and refuse to store/swap when p.closed is true; keep setting p.closed under p.mu in Close early, and update reconnectPostgres to handle a rejected store (close the new connection and return) so no resolver can be re-added after Close; reference functions/fields: Close, reconnectPostgres, canStorePostgresConnection, swapPostgresConnection, p.closed, p.connections, p.revalidateWG.
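The shutdown-flag fix described in this comment can be sketched generically. The `pool`/`tryStore` names below are illustrative stand-ins, not the manager's actual API: the point is that once `Close` flips the flag under the mutex, a racing reconnect is refused and must clean up its own connection.

```go
package main

import (
	"fmt"
	"sync"
)

// pool sketches the suggested fix: once Close sets closed under the mutex,
// an in-flight reconnect that tries to store a fresh connection is refused
// and must close that connection itself. Names are illustrative.
type pool struct {
	mu          sync.Mutex
	closed      bool
	connections map[string]string
}

// tryStore reports whether the connection was accepted.
func (p *pool) tryStore(tenant, connID string) bool {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.closed {
		return false // reconnect path must close connID itself
	}
	p.connections[tenant] = connID
	return true
}

func (p *pool) Close() {
	p.mu.Lock()
	p.closed = true
	snapshot := p.connections
	p.connections = map[string]string{}
	p.mu.Unlock()
	fmt.Println("closing", len(snapshot), "connections")
}

func main() {
	p := &pool{connections: map[string]string{}}
	fmt.Println(p.tryStore("t1", "conn-1")) // true
	p.Close()
	fmt.Println(p.tryStore("t1", "conn-2")) // false: nothing leaks past Close
}
```

Because the flag and the map mutate under the same mutex, there is no window in which a store can slip in after `Close` has taken its snapshot.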
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@commons/circuitbreaker/healthchecker.go`:
- Around line 184-188: The recovery routine drops the caller context by using
context.Background(); change attemptServiceRecovery to accept a parent context
(e.g., func (hc *healthChecker) attemptServiceRecovery(parentCtx
context.Context, serviceName string, healthCheckFn HealthCheckFunc) bool),
create the timeout from that parent context with ctx, cancel :=
context.WithTimeout(parentCtx, hc.checkTimeout) (and defer cancel()), and pass
the resulting ctx into healthCheckFn so cancellations/deadlines propagate;
update all call sites of attemptServiceRecovery to pass the caller/loop context.
In `@commons/net/http/error.go`:
- Around line 17-20: RespondError currently passes the original status into
Respond even though buildErrorResponse normalizes and sets ErrorResponse.Code;
change RespondError to call buildErrorResponse first, then pass the resulting
response's Code (e.g., resp.Code or ErrorResponse.Code) into Respond so the
normalized code from buildErrorResponse is used (locate RespondError,
buildErrorResponse and Respond symbols to implement this).
In `@commons/net/http/withCORS.go`:
- Around line 165-178: Add a defensive nil guard at the start of
buildFiberCORSConfig to handle a nil cfg consistently: check if cfg == nil and
return a safe default cors.Config (e.g., zero-value with empty slices and
AllowCredentials false) to match the pattern used by
warnCORSConfiguration/enforceCORSRuntimeConfig/guardCORSCredentials; keep
existing logic for cfg.denyAllOrigins and setting AllowOriginsFunc when present.
In `@commons/net/http/withTelemetry.go`:
- Around line 54-56: The early-return when telemetryTracer(effectiveTelemetry)
returns !ok must be removed so metrics setup still runs: instead of returning
from the middleware in the tracer-missing branch, set tracer to nil (or a no-op)
and continue to call commons.ContextWithMetricFactory(...) and
tm.collectMetrics(ctx), only skipping span creation/StartSpan logic; apply the
same change to the second branch around lines 144-146 so that when
TracerProvider is nil metrics publishing and metric-factory context propagation
still occur while span creation is bypassed.
In `@commons/opentelemetry/otel.go`:
- Around line 297-299: shutdownAll currently discards errors from
shutdownComponents, hiding cleanup failures; change shutdownAll(ctx, components)
to return the error from shutdownComponents (use errors.Join to combine multiple
errors if applicable) instead of swallowing it, and update all callers that
currently ignore shutdownAll to combine its returned error with the primary init
error (e.g., using errors.Join(fmt.Errorf("...: %w", initErr), cleanupErr)) so
cleanup failures propagate; refer to shutdownAll and shutdownComponents to
locate the changes.
In `@commons/outbox/postgres/db_test.go`:
- Around line 93-110: The test currently only checks that both
resolverClientStub.resolveFn and primaryFn ran, but not their call order; update
the test for resolvePrimaryDB to track call order (e.g., an integer counter or
slice of events) inside resolverClientStub by incrementing/recording when
resolveFn runs and when primaryFn runs, then assert that the resolver's recorded
index is less than the primary's recorded index to enforce Resolver() is called
before Primary(); reference resolvePrimaryDB and resolverClientStub when making
this change.
- Around line 113-121: The test TestResolvePrimaryDB_NilContextUsesBackground
currently calls primaryDBProviderFunc which doesn't accept a context so it never
verifies the nil-to-Background fallback; change the test to use a
resolverClientStub with its resolveFn (the stub for
resolverClient.Resolve/resolveFn) as the provider passed into resolvePrimaryDB
so you can capture the ctx argument, assert that resolveFn receives a non-nil
context (e.g., not nil and equals context.Background() or at least not nil), and
still return the expected *sql.DB and nil error; update references to Primary()
usage if any to use the resolverClientStub.resolveFn to exercise the context
propagation.
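The call-order assertion can be implemented with a small event recorder. The `recorder` type and its methods are hypothetical helpers, not part of the test file; the stubs would call `mark` when they run:

```go
package main

import "fmt"

// recorder captures the order in which stubbed dependencies run, so a test
// can assert Resolver() fires before Primary().
type recorder struct {
	events []string
}

func (r *recorder) mark(name string) { r.events = append(r.events, name) }

func (r *recorder) indexOf(name string) int {
	for i, e := range r.events {
		if e == name {
			return i
		}
	}
	return -1
}

func main() {
	r := &recorder{}
	// Simulate the code under test invoking the stubs in the required order.
	r.mark("resolve") // resolverClientStub.resolveFn
	r.mark("primary") // primaryFn

	ri, pi := r.indexOf("resolve"), r.indexOf("primary")
	if ri == -1 || pi == -1 || ri >= pi {
		panic("Resolver() must be called before Primary()")
	}
	fmt.Println("order verified")
}
```

Recording indices rather than booleans is what lets the test fail when both stubs run but in the wrong order.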
In `@commons/postgres/postgres_test.go`:
- Around line 1394-1404: The test's injected createResolver stub currently
recovers from its own panic, masking whether
newDefaultClientDeps().createResolver (the production panic-recovery wrapper
used by Connect) actually handles panics; change the stub inside
defaultClientDeps in postgres_test.go so it does not recover but instead panics
directly (i.e., remove the defer/recover block and simply panic("dbresolver
exploded")), so the real wrapper in
newDefaultClientDeps().createResolver/Connect can observe and handle the panic;
alternatively, call the wrapper helper directly from the test to assert its
recovery behavior.
In `@commons/redis/redis_test.go`:
- Around line 571-586: The reconnect closure in clientDeps currently ignores its
*Client arg and only stores a sentinel, so update the closure used in the test
to read the refreshed token from the passed *Client (e.g., access the client's
token field or getter inside the reconnect func) and call
tokenAtReconnect.Store(token) so the test verifies applyTokenAndReconnect
installed the new token before reconnect runs; then restore the
require.Eventually assertion to check tokenAtReconnect equals
"new-refreshed-token".
In `@commons/runtime/goroutine.go`:
- Around line 68-70: The nil-callback warning currently uses
context.Background() which strips request-scoped fields; change warnNilCallback
to accept a context.Context parameter and update its signature/usages so
SafeGoWithContextAndComponent calls warnNilCallback(ctx, logger, ...) to
preserve the caller context and tracing; leave SafeGo unchanged so it continues
to call warnNilCallback(context.Background(), ...) for the non-context-aware
path and update any other call sites to match the new signature (refer to
functions SafeGoWithContextAndComponent, SafeGo, and warnNilCallback).
In `@commons/server/shutdown.go`:
- Around line 347-367: The current waitForShutdownTrigger implementation stops
listening for OS signals when sm.shutdownChan is non-nil, causing the process to
ignore SIGINT/SIGTERM; change ServerManager.waitForShutdownTrigger to always
create and register an OS signal channel (signal.Notify) and include that
channel alongside sm.shutdownChan and sm.startupErrors in the select so all
three are watched simultaneously. Concretely, in waitForShutdownTrigger create
signals := make(chan os.Signal, 1) and defer signal.Stop(signals), call
signal.Notify(signals, os.Interrupt, syscall.SIGTERM), then use a single select
with cases for <-signals (return nil), <-sm.shutdownChan (return nil) and err :=
<-sm.startupErrors (return sm.logStartupError(err)); if sm.shutdownChan may be
nil, use a local variable that is nil when unset so the select still works.
- Around line 424-437: runShutdownHooks currently calls each shutdown hook
synchronously using hookCtx for cooperative cancellation, so a misbehaving hook
can block shutdown; modify runShutdownHooks to invoke each hook in a separate
goroutine, send the hook's returned error on a channel, and then select between
receiving that result and a time.After or hookCtx.Done() to enforce
sm.shutdownTimeout; on timeout log the failure with sm.logger.Log (include
"hook_index" and context.Err or a timeout message) and ensure hookCancel() is
still called to release resources, and drain/ignore the goroutine result if it
arrives after timeout.
- Around line 416-420: The logger is being synced too early: move the
sm.syncLogger() call to after sm.shutdownLicense() and after the final
sm.logInfo("Graceful shutdown completed") so that all shutdown logs (including
the completion message) are flushed; update the shutdown sequence in the
shutdown manager to call sm.shutdownLicense(), then sm.logInfo(...), then
sm.syncLogger() (referencing the methods sm.shutdownLicense, sm.logInfo, and
sm.syncLogger to locate the change).
In `@commons/tenant-manager/core/security.go`:
- Around line 64-80: The loop currently compares resolved paths against
extraAllowedDirs as-is, allowing a prefix bypass if callers pass a dir without a
trailing slash; before merging extraAllowedDirs into allowed (where
DefaultAllowedCertDirs already have trailing slashes), normalize each entry in
extraAllowedDirs by cleaning and ensuring a trailing "/" (use filepath.Clean on
the entry, then if it doesn't end with "/" append "/"), then append the
normalized entries to DefaultAllowedCertDirs and continue the existing
EvalSymlinks logic; update references to "allowed", "extraAllowedDirs", and the
loop that checks "resolved" so comparisons are always against normalized
directories.
In `@commons/tenant-manager/internal/configfetch/configfetch.go`:
- Around line 33-42: Replace string-formatted logs with structured-field
logging: instead of logger.WarnCtx(ctx, fmt.Sprintf(...)) and
logger.ErrorCtx(ctx, fmt.Sprintf(...)) call logger.WithFields(...).WarnCtx/
ErrorCtx and pass tenantID, suspended.Status and the error as fields; keep the
related libOpentelemetry.HandleSpanBusinessErrorEvent/HandleSpanError calls
but remove redundant fmt.Sprintf usage so observability uses fields
(e.g., reference the logger.WithFields, WarnCtx, ErrorCtx, suspended.Status,
tenantID, and err identifiers).
In `@commons/tenant-manager/internal/logcompat/logger.go`:
- Around line 66-80: Add a short comment above the Logger methods logArgs and
logf stating these helpers produce unstructured messages (they use
fmt.Sprint/fmt.Sprintf) and callers should prefer the structured logging APIs
such as Log(ctx, level, msg, fields...) or WithFields(...).Info(msg) for
structured observability; mention logArgs and logf by name so reviewers can find
them easily and keep the note concise.
In `@commons/tenant-manager/internal/revalidation/revalidation.go`:
- Around line 32-36: RecoverPanic currently only logs the recovered value;
change it to capture and log the stack trace as well by delegating to the
consolidated panic logger (e.g., call logPanicWithStack or build a message using
runtime/debug.Stack()) so the panic origin is recorded; update the RecoverPanic
function to obtain the stack (debug.Stack()) and include it in the logger call
(or call logPanicWithStack(logger, tenantID, recovered, stack)) to produce a
single log entry with both the recovered value and full stack trace.
In `@commons/tenant-manager/log/tenant_logger.go`:
- Around line 30-32: The tenant_id may be appended twice because
tenant_logger.go unconditionally appends tmcore.GetTenantID(ctx) into the fields
slice; update the logic (where fields is built and tmcore.GetTenantID(ctx) is
called) to first check whether a field with key "tenant_id" already exists in
fields and only append the context tenant_id if not present. Locate the block
that calls tmcore.GetTenantID(ctx) and appends log.String("tenant_id", tenantID)
and add a simple presence check over fields (by examining each field's key/name)
to deduplicate before appending.
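The presence check before appending can be sketched with a stand-in `field` type; the real log package's field type differs, but the dedup logic is the same:

```go
package main

import "fmt"

// field is a simplified stand-in for the log package's field type.
type field struct {
	key, value string
}

// appendTenantID only injects the context tenant_id when the caller has not
// already supplied one, avoiding the duplicate-key log noted above.
func appendTenantID(fields []field, tenantID string) []field {
	for _, f := range fields {
		if f.key == "tenant_id" {
			return fields // caller's explicit value wins
		}
	}
	return append(fields, field{key: "tenant_id", value: tenantID})
}

func main() {
	explicit := appendTenantID([]field{{key: "tenant_id", value: "caller"}}, "ctx")
	injected := appendTenantID([]field{{key: "op", value: "fetch"}}, "ctx")
	fmt.Println(len(explicit), explicit[0].value) // 1 caller
	fmt.Println(len(injected), injected[1].value) // 2 ctx
}
```

A linear scan is fine here because log call sites carry only a handful of fields; a map would cost more than it saves.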
In `@commons/tenant-manager/middleware/tenant.go`:
- Around line 105-113: SetAuthVerified currently sets the "auth_verified" local
but is never consulted, so WithTenantDB (and MultiPoolMiddleware) still call
ParseUnverified and accept forged bearer tokens; update WithTenantDB to check
c.Locals("auth_verified") (and any documented fallback locals) and if true call
the verified parsing path (e.g., Parse or ParseVerified) instead of
ParseUnverified, otherwise preserve existing fallback behavior; ensure the same
check is applied where tenant resolution happens and keep SetAuthVerified as the
documented helper.
In `@commons/tenant-manager/mongo/manager.go`:
- Around line 775-792: The code is holding the manager lock (p.mu) while calling
p.disconnectMongo (in the manager closed branch, cacheConnection/excess
connection branch, and evictLRU path), causing network I/O under the lock; fix
by capturing the connection value(s) (e.g., conn or cached.DB/connection entries
from p.connections) and removing/marking them from the map while still under
p.mu, then release the lock and call p.disconnectMongo(ctx, conn, ...) outside
the critical section; ensure nil-safety by checking captured connection != nil
before calling disconnect and preserve the existing logging behavior (use the
same log message and tenantID) after releasing the lock.
- Around line 1106-1112: The URI-based TLS handling path isn't validating
certificate file URIs, so config fields embedded in cfg.URI (e.g., tlsCAFile,
tlsCertificateKeyFile) can point at arbitrary files; extend the same cert-path
hardening used in buildTLSConfigFromFiles by validating any TLS-related file
fields before using them in the URI path: call core.ValidateCertPath on
cfg.TLSCAFile, cfg.TLSCertificateKeyFile (and any other TLS file fields such as
cfg.TLSCertFile/cfg.TLSKeyFile) in the function that parses/uses cfg.URI (the
URI-based TLS handler) and return an error if validation fails so the driver
never receives unvalidated file paths.
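The snapshot-then-release-then-close pattern recommended for the mongo manager can be sketched generically. The `manager`/`conn` types below are illustrative, not the manager's real fields:

```go
package main

import (
	"fmt"
	"sync"
)

// conn is a stand-in connection whose Close may perform network I/O.
type conn struct{ id string }

func (c *conn) Close() { fmt.Println("closed", c.id) }

// manager demonstrates snapshot-then-close: mutate the map under the lock,
// then perform the potentially slow Close after releasing it.
type manager struct {
	mu          sync.Mutex
	connections map[string]*conn
}

func (m *manager) evict(tenantID string) {
	m.mu.Lock()
	victim := m.connections[tenantID]
	delete(m.connections, tenantID) // map mutation stays under the lock
	m.mu.Unlock()

	if victim != nil { // nil-safe: tenant may already be gone
		victim.Close() // network I/O happens outside the critical section
	}
}

func main() {
	m := &manager{connections: map[string]*conn{"t1": {id: "t1"}}}
	m.evict("t1")
	m.evict("missing") // no-op, no panic
	fmt.Println(len(m.connections))
}
```

Because the map entry is removed before the lock is dropped, no other goroutine can hand out the connection being closed, even though the close itself runs unlocked.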
In `@commons/tenant-manager/postgres/manager.go`:
- Around line 1095-1097: The PostgreSQL DSN builder currently forwards
cfg.SSLRootCert, cfg.SSLCert, and cfg.SSLKey even when sslmode allows TLS,
leaving a filesystem-path attack surface; update the code that constructs the
DSN (the sslmode check and the subsequent block that appends
sslrootcert/sslcert/sslkey) to validate each path (cfg.SSLRootCert, cfg.SSLCert,
cfg.SSLKey) before adding: ensure the path is non-empty, points to an existing
regular file, and is readable (reject paths with traversal or other invalid
characteristics), return an error if validation fails, and apply the same
validation logic to the other occurrence noted around the 1113–1123 region so no
unvalidated filesystem paths are included in the DSN.
In `@commons/tenant-manager/rabbitmq/manager.go`:
- Around line 340-356: storeOrDiscardRabbitMQConnectionLocked currently calls
closeRabbitMQConn while holding the manager mutex, which can block; change to
the snapshot-then-close pattern used in other managers: inside
storeOrDiscardRabbitMQConnectionLocked (and the other affected blocks referenced
around the same function), determine whether to close the incoming conn or
reuse/return a cached connection while still holding p.mu, but do NOT call
closeRabbitMQConn under the lock—instead set a local variable to the connection
that needs closing, release the lock, then call closeRabbitMQConn(connToClose,
...) outside the mutex; update references to closeRabbitMQConn in the same
function and the other identified regions (the branches around returns where
closeRabbitMQConn is invoked) accordingly to avoid holding p.mu during network
I/O.
In `@Makefile`:
- Line 272: The find command that populates the dirs variable uses -not -path
'./vendor/*', which doesn't exclude vendor dirs under the scoped directory
(e.g., ./commons/vendor); update the -not -path pattern in the dirs assignment
(the find invocation that sets dirs) to exclude vendor directories anywhere
under the scoped tree (for example use a wildcard path like '*/vendor/*' or
explicitly './commons/vendor/*') so vendored files are not discovered; apply the
same pattern change to the identical find usage that appears later in the file.
In `@MIGRATION_MAP.md`:
- Around line 40-46: Update MIGRATION_MAP.md to include explicit old→new symbol
mappings for the HTTP helpers section: list each renamed/removed symbol from the
diff (e.g., old status-specific helpers → consolidated Respond, RespondStatus,
RespondError, RenderError, FiberErrorHandler; legacy reverse proxy helpers →
ServeReverseProxy and ReverseProxyPolicy) and add the CORS helper rename plus
any removed dead wrapper symbols referenced in this PR; ensure you add concrete
mappings (oldName -> newName or removed) for all touched symbols and mirror the
same entries for the other location referenced (lines 84-85) so migrations are
actionable.
---
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: e80ce8de-8e79-4fae-b5e4-ff417093d20a
📒 Files selected for processing (51)
- .golangci.yml
- MIGRATION_MAP.md
- Makefile
- REVIEW.md
- commons/circuitbreaker/healthchecker.go
- commons/circuitbreaker/manager.go
- commons/mongo/mongo.go
- commons/mongo/mongo_test.go
- commons/net/http/error.go
- commons/net/http/handler.go
- commons/net/http/handler_helpers.go
- commons/net/http/matcher_response.go
- commons/net/http/response.go
- commons/net/http/withBasicAuth.go
- commons/net/http/withCORS.go
- commons/net/http/withLogging_middleware.go
- commons/net/http/withTelemetry.go
- commons/net/http/withTelemetry_helpers.go
- commons/opentelemetry/metrics/builders.go
- commons/opentelemetry/metrics/metrics.go
- commons/opentelemetry/otel.go
- commons/outbox/config.go
- commons/outbox/dispatcher.go
- commons/outbox/postgres/db.go
- commons/outbox/postgres/db_test.go
- commons/postgres/postgres.go
- commons/postgres/postgres_test.go
- commons/rabbitmq/rabbitmq.go
- commons/redis/redis.go
- commons/redis/redis_test.go
- commons/runtime/goroutine.go
- commons/runtime/recover.go
- commons/runtime/recover_helpers.go
- commons/runtime/recover_test.go
- commons/server/shutdown.go
- commons/tenant-manager/core/security.go
- commons/tenant-manager/internal/configfetch/configfetch.go
- commons/tenant-manager/internal/configfetch/configfetch_test.go
- commons/tenant-manager/internal/logcompat/logger.go
- commons/tenant-manager/internal/revalidation/revalidation.go
- commons/tenant-manager/internal/revalidation/revalidation_test.go
- commons/tenant-manager/log/tenant_logger.go
- commons/tenant-manager/log/tenant_logger_test.go
- commons/tenant-manager/middleware/tenant.go
- commons/tenant-manager/middleware/tenant_errors.go
- commons/tenant-manager/middleware/tenant_test.go
- commons/tenant-manager/mongo/manager.go
- commons/tenant-manager/mongo/manager_test.go
- commons/tenant-manager/postgres/manager.go
- commons/tenant-manager/rabbitmq/manager.go
- commons/tenant-manager/rabbitmq/manager_test.go
💤 Files with no reviewable changes (1)
- REVIEW.md
```go
func (hc *healthChecker) attemptServiceRecovery(serviceName string, healthCheckFn HealthCheckFunc) bool {
	hc.logger.Log(context.Background(), log.LevelInfo, "attempting to heal service", log.String("service", serviceName), log.String("reason", "circuit breaker open"))

	ctx, cancel := context.WithTimeout(context.Background(), hc.checkTimeout)
	err := healthCheckFn(ctx)
```
Propagate caller context into recovery checks
attemptServiceRecovery() creates its timeout from context.Background() (Line 187), so cancellation/deadlines from the health-check loop lifecycle are dropped. This can make shutdown/cancellation less responsive under slow checks.
Suggested fix
-func (hc *healthChecker) performHealthChecks() {
+func (hc *healthChecker) performHealthChecks(ctx context.Context) {
services := hc.snapshotServices()
@@
- if hc.attemptServiceRecovery(serviceName, healthCheckFn) {
+ if hc.attemptServiceRecovery(ctx, serviceName, healthCheckFn) {
recoveredCount++
}
}
}
@@
-func (hc *healthChecker) attemptServiceRecovery(serviceName string, healthCheckFn HealthCheckFunc) bool {
+func (hc *healthChecker) attemptServiceRecovery(ctx context.Context, serviceName string, healthCheckFn HealthCheckFunc) bool {
hc.logger.Log(context.Background(), log.LevelInfo, "attempting to heal service", log.String("service", serviceName), log.String("reason", "circuit breaker open"))
- ctx, cancel := context.WithTimeout(context.Background(), hc.checkTimeout)
- err := healthCheckFn(ctx)
-
- cancel()
+ checkCtx, cancel := context.WithTimeout(ctx, hc.checkTimeout)
+ defer cancel()
+ err := healthCheckFn(checkCtx)
@@
- case <-ticker.C:
- hc.performHealthChecks()
+ case <-ticker.C:
+ hc.performHealthChecks(ctx)
@@
- hc.checkServiceHealth(serviceName)
+ hc.checkServiceHealth(ctx, serviceName)
@@
-func (hc *healthChecker) checkServiceHealth(serviceName string) {
+func (hc *healthChecker) checkServiceHealth(ctx context.Context, serviceName string) {
@@
- hc.attemptServiceRecovery(serviceName, healthCheckFn)
+ hc.attemptServiceRecovery(ctx, serviceName, healthCheckFn)
}

Based on learnings: “In all files under commons/circuitbreaker, prefer propagating context with timeouts … verify context propagation and timeout handling.”
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@commons/circuitbreaker/healthchecker.go` around lines 184 - 188, The recovery
routine drops the caller context by using context.Background(); change
attemptServiceRecovery to accept a parent context (e.g., func (hc
*healthChecker) attemptServiceRecovery(parentCtx context.Context, serviceName
string, healthCheckFn HealthCheckFunc) bool), create the timeout from that
parent context with ctx, cancel := context.WithTimeout(parentCtx,
hc.checkTimeout) (and defer cancel()), and pass the resulting ctx into
healthCheckFn so cancellations/deadlines propagate; update all call sites of
attemptServiceRecovery to pass the caller/loop context.
```go
// RespondError writes a structured error response using the ErrorResponse schema.
func RespondError(c *fiber.Ctx, status int, title, message string) error {
	return Respond(c, status, buildErrorResponse(status, title, message))
}
```
🧹 Nitpick | 🔵 Trivial
Minor: Redundant normalization – consider using the built response's code.
buildErrorResponse normalizes the status internally, setting ErrorResponse.Code to the normalized value. However, RespondError passes the original status to Respond, which normalizes it again. While functionally correct (both normalize the same way), this is semantically cleaner if the built response's code is used:
♻️ Suggested improvement
```diff
 func RespondError(c *fiber.Ctx, status int, title, message string) error {
-	return Respond(c, status, buildErrorResponse(status, title, message))
+	resp := buildErrorResponse(status, title, message)
+	return Respond(c, resp.Code, resp)
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@commons/net/http/error.go` around lines 17 - 20, RespondError currently
passes the original status into Respond even though buildErrorResponse
normalizes and sets ErrorResponse.Code; change RespondError to call
buildErrorResponse first, then pass the resulting response's Code (e.g.,
resp.Code or ErrorResponse.Code) into Respond so the normalized code from
buildErrorResponse is used (locate RespondError, buildErrorResponse and Respond
symbols to implement this).
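The normalize-once idea can be illustrated with a small self-contained sketch. `normalizeStatus` and the `errorResponse` type here are hypothetical simplifications of the real helpers; the point is that reusing the already-normalized `resp.Code` keeps the status line and body in agreement by construction:

```go
package main

import "fmt"

// normalizeStatus is a hypothetical stand-in for the normalization both
// buildErrorResponse and Respond perform: out-of-range codes collapse to 500.
func normalizeStatus(status int) int {
	if status < 100 || status > 599 {
		return 500
	}
	return status
}

type errorResponse struct {
	Code    int
	Title   string
	Message string
}

func buildErrorResponse(status int, title, message string) errorResponse {
	return errorResponse{Code: normalizeStatus(status), Title: title, Message: message}
}

func main() {
	// Normalize once, then reuse resp.Code so the HTTP status line and
	// the body can never disagree, even for out-of-range input.
	resp := buildErrorResponse(799, "internal", "boom")
	fmt.Println(resp.Code) // 500
}
```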
```diff
 func buildFiberCORSConfig(cfg *corsRuntimeConfig) cors.Config {
 	config := cors.Config{
-		AllowOrigins:     origins,
-		AllowMethods:     commons.GetenvOrDefault("ACCESS_CONTROL_ALLOW_METHODS", defaultAccessControlAllowMethods),
-		AllowHeaders:     commons.GetenvOrDefault("ACCESS_CONTROL_ALLOW_HEADERS", defaultAccessControlAllowHeaders),
-		ExposeHeaders:    commons.GetenvOrDefault("ACCESS_CONTROL_EXPOSE_HEADERS", defaultAccessControlExposeHeaders),
-		AllowCredentials: allowCredentials,
+		AllowOrigins:     cfg.origins,
+		AllowMethods:     cfg.allowMethods,
+		AllowHeaders:     cfg.allowHeaders,
+		ExposeHeaders:    cfg.exposeHeaders,
+		AllowCredentials: cfg.allowCredentials,
 	}

-	if denyAllOrigins {
+	if cfg.denyAllOrigins {
 		config.AllowOriginsFunc = func(string) bool { return false }
 	}

-	return cors.New(config)
+	return config
 }
```
🧹 Nitpick | 🔵 Trivial
Consider adding nil guard for consistency.
Unlike sibling helpers (warnCORSConfiguration, enforceCORSRuntimeConfig, guardCORSCredentials), this function lacks a nil guard. While the current call path guarantees non-nil, adding a guard would maintain defensive consistency.
🛡️ Optional: Add nil guard for consistency
```diff
 func buildFiberCORSConfig(cfg *corsRuntimeConfig) cors.Config {
+	if cfg == nil {
+		return cors.Config{}
+	}
 	config := cors.Config{
 		AllowOrigins: cfg.origins,
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@commons/net/http/withCORS.go` around lines 165 - 178, Add a defensive nil
guard at the start of buildFiberCORSConfig to handle a nil cfg consistently:
check if cfg == nil and return a safe default cors.Config (e.g., zero-value with
empty slices and AllowCredentials false) to match the pattern used by
warnCORSConfiguration/enforceCORSRuntimeConfig/guardCORSCredentials; keep
existing logic for cfg.denyAllOrigins and setting AllowOriginsFunc when present.
```go
tracer, ok := telemetryTracer(effectiveTelemetry)
if !ok {
	return c.Next()
```
Keep the metrics path active when tracing is unavailable.
Line 54 and Line 144 now start early-return branches before commons.ContextWithMetricFactory(...) and tm.collectMetrics(ctx) run. If MetricsFactory is configured but TracerProvider is nil, both exported middleware paths silently stop publishing request metrics and stop propagating metric-factory context downstream. Only skip span creation here; the metrics setup path should still execute.
As per coding guidelines, "Preserve v4 public API contracts unless a task explicitly asks for breaking changes".
Also applies to: 144-146
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@commons/net/http/withTelemetry.go` around lines 54 - 56, The early-return
when telemetryTracer(effectiveTelemetry) returns !ok must be removed so metrics
setup still runs: instead of returning from the middleware in the tracer-missing
branch, set tracer to nil (or a no-op) and continue to call
commons.ContextWithMetricFactory(...) and tm.collectMetrics(ctx), only skipping
span creation/StartSpan logic; apply the same change to the second branch around
lines 144-146 so that when TracerProvider is nil metrics publishing and
metric-factory context propagation still occur while span creation is bypassed.
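The decoupling being asked for can be sketched as follows. The `metrics` type and `handle` function are hypothetical simplifications of the middleware: metrics collection runs unconditionally, and only span creation is gated on tracer availability:

```go
package main

import "fmt"

// metrics is a hypothetical stand-in for the MetricsFactory pipeline.
type metrics struct{ requests int }

func (m *metrics) collect() { m.requests++ }

// handle models the middleware: metrics are recorded on every request,
// and tracer availability only decides whether a span is started.
func handle(m *metrics, tracerOK bool) string {
	m.collect() // metrics path always runs
	if !tracerOK {
		return "handled (no span)"
	}
	return "handled (span started)"
}

func main() {
	m := &metrics{}
	fmt.Println(handle(m, false)) // handled (no span)
	fmt.Println(handle(m, true))  // handled (span started)
	fmt.Println(m.requests)       // 2
}
```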
```diff
 func shutdownAll(ctx context.Context, components []shutdownable) {
-	for _, c := range components {
-		if isNilShutdownable(c) {
-			continue
-		}
-
-		_ = c.Shutdown(ctx)
-	}
+	_ = shutdownComponents(ctx, components)
 }
```
Propagate rollback shutdown failures.
shutdownAll now drops every error returned by shutdownComponents, so a partial exporter initialization can fail with only the primary error while hiding cleanup failures and leaked resources. Please return errors.Join(...) here and combine it with the init error at the call sites instead of discarding it.
🔧 Suggested direction
```diff
-func shutdownAll(ctx context.Context, components []shutdownable) {
-	_ = shutdownComponents(ctx, components)
+func shutdownAll(ctx context.Context, components []shutdownable) error {
+	return errors.Join(shutdownComponents(ctx, components)...)
 }
```

At the call sites:

```go
if cleanupErr := shutdownAll(ctx, cleanups); cleanupErr != nil {
	return nil, errors.Join(fmt.Errorf("can't initialize metric exporter: %w", err), cleanupErr)
}
```
Verify each finding against the current code and only fix it if needed.
In `@commons/opentelemetry/otel.go` around lines 297 - 299, shutdownAll currently
discards errors from shutdownComponents, hiding cleanup failures; change
shutdownAll(ctx, components) to return the error from shutdownComponents (use
errors.Join to combine multiple errors if applicable) instead of swallowing it,
and update all callers that currently ignore shutdownAll to combine its returned
error with the primary init error (e.g., using errors.Join(fmt.Errorf("...: %w",
initErr), cleanupErr)) so cleanup failures propagate; refer to shutdownAll and
shutdownComponents to locate the changes.
```go
func buildTLSConfigFromFiles(cfg *core.MongoDBConfig, logger log.Logger, tenantID string) (*tls.Config, error) {
	// Validate all certificate file paths before reading them.
	for _, certPath := range []string{cfg.TLSCertFile, cfg.TLSKeyFile, cfg.TLSCAFile} {
		if err := core.ValidateCertPath(certPath); err != nil {
			return nil, fmt.Errorf("certificate path validation failed: %w", err)
		}
	}
```
The cert-path hardening still misses URI-based TLS configs.
buildTLSConfigFromFiles only runs when cert and key are separate files. Combined-PEM / CA-only setups still go through the URI path, and raw cfg.URI can embed tlsCAFile / tlsCertificateKeyFile directly. Those paths never hit core.ValidateCertPath, so the driver can still be pointed at arbitrary files.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@commons/tenant-manager/mongo/manager.go` around lines 1106 - 1112, The
URI-based TLS handling path isn't validating certificate file URIs, so config
fields embedded in cfg.URI (e.g., tlsCAFile, tlsCertificateKeyFile) can point at
arbitrary files; extend the same cert-path hardening used in
buildTLSConfigFromFiles by validating any TLS-related file fields before using
them in the URI path: call core.ValidateCertPath on cfg.TLSCAFile,
cfg.TLSCertificateKeyFile (and any other TLS file fields such as
cfg.TLSCertFile/cfg.TLSKeyFile) in the function that parses/uses cfg.URI (the
URI-based TLS handler) and return an error if validation fails so the driver
never receives unvalidated file paths.
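One way to close the URI gap is to parse the URI's query string and run any embedded TLS file options through the same validation. The sketch below uses a simplified `validateCertPath` stand-in for `core.ValidateCertPath` (the real MongoDB option names `tlsCAFile` and `tlsCertificateKeyFile` are from the driver's URI options):

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// validateCertPath is a simplified stand-in for core.ValidateCertPath:
// unset paths are allowed, traversal sequences are rejected.
func validateCertPath(p string) error {
	if p == "" {
		return nil
	}
	if strings.Contains(p, "..") {
		return fmt.Errorf("cert path %q: traversal not allowed", p)
	}
	return nil
}

// validateURITLSFiles extracts TLS file options embedded in a MongoDB URI
// and runs them through the same path validation as file-based configs.
func validateURITLSFiles(uri string) error {
	u, err := url.Parse(uri)
	if err != nil {
		return err
	}
	q := u.Query()
	for _, key := range []string{"tlsCAFile", "tlsCertificateKeyFile"} {
		if err := validateCertPath(q.Get(key)); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	ok := validateURITLSFiles("mongodb://db:27017/?tlsCAFile=/etc/ssl/ca.pem")
	bad := validateURITLSFiles("mongodb://db:27017/?tlsCAFile=../../etc/shadow")
	fmt.Println(ok == nil, bad != nil) // true true
}
```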
```go
if sslmode == sslModeDisable && (cfg.SSLRootCert != "" || cfg.SSLCert != "" || cfg.SSLKey != "") {
	return "", fmt.Errorf("sslmode is %q but SSL certificate parameters are set (sslrootcert=%q, sslcert=%q, sslkey=%q); "+
		"either remove the certificate paths or use a TLS-enabled sslmode", sslmode, cfg.SSLRootCert, cfg.SSLCert, cfg.SSLKey)
```
Validate PostgreSQL TLS file paths before adding them to the DSN.
The new sslmode=disable guard still forwards sslrootcert, sslcert, and sslkey straight from tenant config into pgx/libpq. That leaves the same filesystem-path attack surface that the MongoDB/RabbitMQ changes are trying to close in this PR.
🐛 Suggested fix
```diff
 func buildConnectionString(cfg *core.PostgreSQLConfig) (string, error) {
 	if cfg == nil {
 		return "", fmt.Errorf("postgres.buildConnectionString: %w", core.ErrNilConfig)
 	}
+
+	for _, certPath := range []string{cfg.SSLRootCert, cfg.SSLCert, cfg.SSLKey} {
+		if err := core.ValidateCertPath(certPath); err != nil {
+			return "", fmt.Errorf("TLS certificate path validation failed: %w", err)
+		}
+	}

 	sslmode := cfg.SSLMode
```

Also applies to: 1113-1123
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@commons/tenant-manager/postgres/manager.go` around lines 1095 - 1097, The
PostgreSQL DSN builder currently forwards cfg.SSLRootCert, cfg.SSLCert, and
cfg.SSLKey even when sslmode allows TLS, leaving a filesystem-path attack
surface; update the code that constructs the DSN (the sslmode check and the
subsequent block that appends sslrootcert/sslcert/sslkey) to validate each path
(cfg.SSLRootCert, cfg.SSLCert, cfg.SSLKey) before adding: ensure the path is
non-empty, points to an existing regular file, and is readable (reject paths
with traversal or other invalid characteristics), return an error if validation
fails, and apply the same validation logic to the other occurrence noted around
the 1113–1123 region so no unvalidated filesystem paths are included in the DSN.
```go
func (p *Manager) storeOrDiscardRabbitMQConnectionLocked(tenantID string, conn *amqp.Connection) (*amqp.Connection, bool, error) {
	if p.closed {
		if closeErr := p.closeRabbitMQConn(conn, "failed to close RabbitMQ connection after manager closed for tenant %s", tenantID); closeErr != nil {
			p.logger.Warnf("failed to close RabbitMQ connection after manager closed for tenant %s: %v", tenantID, closeErr)
		}

		p.mu.Unlock()
		return nil, false, core.ErrManagerClosed
	}

	if cached, ok := p.connections[tenantID]; ok && !cached.IsClosed() {
		p.lastAccessed[tenantID] = time.Now()
		if closeErr := p.closeRabbitMQConn(conn, "failed to close excess RabbitMQ connection for tenant %s", tenantID); closeErr != nil {
			p.logger.Warnf("failed to close excess RabbitMQ connection for tenant %s; reusing cached connection: %v", tenantID, closeErr)
		}

		logger.Infof("RabbitMQ connection created: tenant=%s, vhost=%s", tenantID, rabbitConfig.VHost)
		return cached, true, nil
	}
```
Don’t close AMQP connections while p.mu is held.
The new helper and lifecycle paths now call closeRabbitMQConn inside the manager lock. amqp.Connection.Close can block on broker/network I/O, so one slow close serializes unrelated GetConnection/CloseConnection/shutdown work behind the mutex. This should follow the same snapshot-then-close pattern used in the MongoDB/PostgreSQL managers.
As per coding guidelines, Keep behavior nil-safe and concurrency-safe by default in exported APIs.
Also applies to: 377-385, 412-427, 441-455
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@commons/tenant-manager/rabbitmq/manager.go` around lines 340 - 356,
storeOrDiscardRabbitMQConnectionLocked currently calls closeRabbitMQConn while
holding the manager mutex, which can block; change to the snapshot-then-close
pattern used in other managers: inside storeOrDiscardRabbitMQConnectionLocked
(and the other affected blocks referenced around the same function), determine
whether to close the incoming conn or reuse/return a cached connection while
still holding p.mu, but do NOT call closeRabbitMQConn under the lock—instead set
a local variable to the connection that needs closing, release the lock, then
call closeRabbitMQConn(connToClose, ...) outside the mutex; update references to
closeRabbitMQConn in the same function and the other identified regions (the
branches around returns where closeRabbitMQConn is invoked) accordingly to avoid
holding p.mu during network I/O.
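The snapshot-then-close pattern the comment references can be reduced to a minimal sketch. The `manager` and `conn` types here are hypothetical simplifications of the AMQP manager; the essential move is collecting the connections under the lock, then doing the blocking `Close` calls after releasing it:

```go
package main

import (
	"fmt"
	"sync"
)

type conn struct{ id string }

// close stands in for amqp.Connection.Close, which may block on I/O.
func (c *conn) close() { fmt.Println("closed", c.id) }

type manager struct {
	mu    sync.Mutex
	conns map[string]*conn
}

// closeAll snapshots the connections under the lock, then performs the
// potentially blocking close calls with the lock released, so slow
// network I/O never serializes unrelated manager operations.
func (m *manager) closeAll() {
	m.mu.Lock()
	snapshot := make([]*conn, 0, len(m.conns))
	for _, c := range m.conns {
		snapshot = append(snapshot, c)
	}
	m.conns = map[string]*conn{}
	m.mu.Unlock()

	for _, c := range snapshot {
		c.close() // blocking I/O happens outside the mutex
	}
}

func main() {
	m := &manager{conns: map[string]*conn{"tenant-a": {id: "tenant-a"}}}
	m.closeAll()
	m.mu.Lock() // lock is free again immediately after snapshotting
	fmt.Println(len(m.conns)) // 0
	m.mu.Unlock()
}
```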
```diff
 	else \
 		echo "Finding packages with *_integration_test.go files..."; \
-		dirs=$$(find . -name '*_integration_test.go' -not -path './vendor/*' -exec dirname {} \; 2>/dev/null | sort -u | tr '\n' ' '); \
+		dirs=$$(find ./commons -name '*_integration_test.go' -not -path './vendor/*' -exec dirname {} \; 2>/dev/null | sort -u | tr '\n' ' '); \
```
Vendor exclusion pattern no longer matches after scoping to ./commons.
-not -path './vendor/*' won’t exclude ./commons/vendor/*, so vendored files may leak into package discovery/security pre-checks.
🔧 Proposed fix
```diff
-		dirs=$$(find ./commons -name '*_integration_test.go' -not -path './vendor/*' -exec dirname {} \; 2>/dev/null | sort -u | tr '\n' ' '); \
+		dirs=$$(find ./commons -name '*_integration_test.go' -not -path '*/vendor/*' -exec dirname {} \; 2>/dev/null | sort -u | tr '\n' ' '); \
-	@if find ./commons -name "*.go" -type f -not -path './vendor/*' | grep -q .; then \
+	@if find ./commons -name "*.go" -type f -not -path '*/vendor/*' | grep -q .; then \
```

Also applies to: 628-628
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@Makefile` at line 272, The find command that populates the dirs variable uses
-not -path './vendor/*', which doesn't exclude vendor dirs under the scoped
directory (e.g., ./commons/vendor); update the -not -path pattern in the dirs
assignment (the find invocation that sets dirs) to exclude vendor directories
anywhere under the scoped tree (for example use a wildcard path like
'*/vendor/*' or explicitly './commons/vendor/*') so vendored files are not
discovered; apply the same pattern change to the identical find usage that
appears later in the file.
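The mismatch is easy to reproduce: `-path` matches against the full path `find` prints, so `./vendor/*` can never match a file under `./commons/vendor/`, while `*/vendor/*` matches a vendor directory at any depth. (The `/tmp/findpat` layout below is an illustrative scratch tree, not part of the repo.)

```shell
# Build a scratch tree with a vendored test file under the scoped directory.
mkdir -p /tmp/findpat/commons/vendor /tmp/findpat/commons/pkg
touch /tmp/findpat/commons/vendor/dep_integration_test.go
touch /tmp/findpat/commons/pkg/real_integration_test.go
cd /tmp/findpat

# Old pattern: the vendored file leaks through, because its path is
# ./commons/vendor/..., which './vendor/*' never matches.
find ./commons -name '*_integration_test.go' -not -path './vendor/*'

# Fixed pattern: excludes vendor directories at any depth.
find ./commons -name '*_integration_test.go' -not -path '*/vendor/*'
```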
```markdown
## HTTP helpers (`commons/net/http`)

### New in v4
- Old: status-specific helper functions for each error kind
- v4: consolidated `Respond`, `RespondStatus`, `RespondError`, `RenderError`, `FiberErrorHandler`
- Old: legacy reverse proxy helpers without strict SSRF policy
- v4: `ServeReverseProxy(target, policy, res, req)` with `ReverseProxyPolicy`
```
Add explicit symbol mappings for renamed/removed APIs touched in this PR.
The map is now high-level, but it omits explicit old→new entries for renamed/removed symbols called out in this PR (for example, the CORS helper rename and removed dead wrappers). Please add those concrete symbol mappings so migrations remain actionable.
As per coding guidelines MIGRATION_MAP.md: "Update MIGRATION_MAP.md when a task touches renamed or removed v1 symbols".
Also applies to: 84-85
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@MIGRATION_MAP.md` around lines 40 - 46, Update MIGRATION_MAP.md to include
explicit old→new symbol mappings for the HTTP helpers section: list each
renamed/removed symbol from the diff (e.g., old status-specific helpers →
consolidated Respond, RespondStatus, RespondError, RenderError,
FiberErrorHandler; legacy reverse proxy helpers → ServeReverseProxy and
ReverseProxyPolicy) and add the CORS helper rename plus any removed dead wrapper
symbols referenced in this PR; ensure you add concrete mappings (oldName ->
newName or removed) for all touched symbols and mirror the same entries for the
other location referenced (lines 84-85) so migrations are actionable.
Summary
Applies comprehensive code review findings from a 7-reviewer parallel analysis (code, business logic, security, test quality, nil-safety, consequences, dead code) across all non-systemplane packages.
Tenant-Manager
- Cert path traversal prevention (`ValidateCertPath`), TLS/SSL runtime warnings, tenant ID removed from 404 responses
- `ShouldSchedule` nil-client guard, deterministic time source, `Close()` snapshot-then-cleanup pattern
- `SetAuthVerified` helper + migration guide for upstream auth assertion
- `With()`/`WithGroup()` now preserve tenant_id injection chain, nil-receiver guard
- New unit tests for `configfetch` (1) and `revalidation` (11) packages
Core Infrastructure
- Panic recovery consolidated to single `logPanicWithStack` path, removed dead `logPanic`
- CORS helper `isWildcardCORSOrigin` renamed for clarity, added `requestScopedLogger` nil guard
- `snapshotConnectState` precondition documented
Build/Docs
- `REVIEW.md` removed (superseded)
- `MIGRATION_MAP.md` updated
Testing
- `make ci` passes (lint, format, tidy, security, vet, unit tests, integration tests)