Skip to content

feat(systemplane): add standardization helpers, catalog, and hardened service layer#404

Merged
fredcamaral merged 8 commits intodevelopfrom
feat/systemplane-helpers
Mar 28, 2026
Merged

feat(systemplane): add standardization helpers, catalog, and hardened service layer#404
fredcamaral merged 8 commits intodevelopfrom
feat/systemplane-helpers

Conversation

@fredcamaral
Copy link
Copy Markdown
Member

Summary

  • New catalog package with shared key validation and registration for Postgres and Redis backends
  • Domain coercion helpers for type-safe snapshot value conversion (bool, int, float64, duration, string slice) with overflow/NaN/Inf guards and case-insensitive bool parsing
  • Fail-closed default auth adapters (deny mutations, allow reads) and default identity adapters with Actor.ID length validation
  • Service layer refactoring — manager split into reads, writes, helpers; component diff detection; ordered shutdown helpers
  • Supervisor hardening — overflow-safe phase sorting, non-mutating tenant ID merge
  • Validation improvements — BootstrapConfig default case, DelegatingAuthorizer empty-action rejection, cloneSnapshot nil-equivalence, Swagger tag deduplication

Testing

  • make ci passes (lint, format, tidy, security, vet, unit tests, integration tests)
  • All new code includes comprehensive unit tests
  • 46 files changed, 4126 insertions, 396 deletions

… service layer

Adds catalog package with shared key validation and registration for
Postgres and Redis backends. Introduces domain coercion helpers for
type-safe snapshot value conversion (bool, int, float64, duration, string
slice) with overflow/NaN/Inf guards and case-insensitive bool parsing.

Adds fail-closed default auth adapters (deny mutations, allow reads) and
default identity adapters (system identity for internal operations) with
Actor.ID length validation.

Refactors service/manager into reads, writes, and helpers files. Adds
component diff detection for identifying changed settings between
snapshots. Introduces shutdown helpers for ordered systemplane teardown.
Extracts supervisor helpers with overflow-safe phase sorting and
non-mutating tenant ID merge.

Hardens BootstrapConfig.Validate with default case, DelegatingAuthorizer
with unconditional empty-action rejection, cloneSnapshot with nil
TenantSettings preservation, and Swagger MergeInto with tag
deduplication. Includes comprehensive tests.

X-Lerian-Ref: 0x1
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Mar 26, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

This PR adds a canonical configuration catalog and validation, introduces coercion and snapshot/setting helpers, tightens KeyDef validation (env var and redact policy), and extends schema/DTOs with EnvVar and RedactPolicy. It makes the bootstrap backend registry concurrency-safe, refactors manager/snapshot/supervisor flows into helpers, adds component-diffing and shutdown sequencing, provides default authorizer/identity implementations, restructures write/read flows in the manager, and adds extensive unit tests and minor formatting fixes.

Sequence Diagram(s)

mermaid
sequenceDiagram
participant Supervisor
participant Builder
participant Reconciler
participant Store
participant BundleHolder

Supervisor->>Builder: prepareReloadBuild(snapshot input)
Builder-->>Supervisor: (candidateSnapshot, candidateBundle, strategy)
Supervisor->>Reconciler: reconcileCandidateBundle(candidateBundle)
alt reconcile success
Reconciler-->>Supervisor: success
Supervisor->>Store: Put(candidateSnapshot, candidateBundle)
Store-->>Supervisor: revision
Supervisor->>BundleHolder: commitReload(revision) -> adopt resources? / emit ReloadEvent
Supervisor->>BundleHolder: close previous bundle (record cleanup errors)
else reconcile failure
Reconciler-->>Supervisor: error
Supervisor->>Builder: discardFailedCandidate(candidateBundle, strategy)
Supervisor-->>Supervisor: return error
end

Suggested reviewers

  • qnen
  • ClaraTersi
  • gandalf-at-lerian
🚥 Pre-merge checks | ✅ 2
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically summarizes the main changes: adding a catalog package, standardization helpers, and hardening the service layer. It is concise and directly reflects the primary objectives of the PR.
Description check ✅ Passed The PR description provides a comprehensive summary of changes (catalog, coercion helpers, auth/identity adapters, service layer refactoring, supervisor hardening, validation improvements) and confirms testing (make ci passes, comprehensive unit tests, 46 files/4126 insertions). It substantially exceeds the repository template requirements.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 18

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
commons/systemplane/bootstrap/backend.go (1)

125-143: ⚠️ Potential issue | 🟡 Minor

Snapshot is taken but not used for factory lookup.

Line 125 takes a snapshot but discards the factories map. Line 140 then calls backendRegistry.factory() directly, which reads from the live registry. This creates a subtle inconsistency: init errors are checked from a snapshot, but factory lookup uses live state.

If a factory were registered after the snapshot but before the lookup, it would be used despite the snapshot intent. Consider using the snapshot consistently:

♻️ Suggested fix for consistency
 func NewBackendFromConfig(ctx context.Context, cfg *BootstrapConfig) (*BackendResources, error) {
-	_, initErrors := backendRegistry.snapshot()
+	factories, initErrors := backendRegistry.snapshot()
 	if len(initErrors) > 0 {
 		return nil, fmt.Errorf("bootstrap backend: init registration errors: %w", errors.Join(initErrors...))
 	}
 	// ... validation ...

-	factory, ok := backendRegistry.factory(cfg.Backend)
+	factory, ok := factories[cfg.Backend]
 	if !ok {
 		return nil, fmt.Errorf("%w %q (no factory registered)", errUnsupportedBackend, cfg.Backend)
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@commons/systemplane/bootstrap/backend.go` around lines 125 - 143, The
snapshot call stores initErrors but drops the snapshot of factories, causing a
live-registry lookup later; change the call to capture the snapshot (e.g.,
factories, initErrors := backendRegistry.snapshot()) and then use that captured
snapshot for the lookup instead of backendRegistry.factory (e.g., look up
cfg.Backend in the factories map or call a method on the snapshot) so the
initErrors check and factory selection are consistent; keep the existing error
handling with initErrors and errUnsupportedBackend/errNilBackendConfig as-is.
commons/systemplane/service/supervisor.go (1)

209-225: ⚠️ Potential issue | 🔴 Critical

The reload path mutates the active bundle during incremental builds while concurrent readers can access it lock-free.

BuildIncremental is contractually required to mutate the previousBundle (the currently-active bundle) by nil-ing out transferred component pointers. Although Reload() holds supervisor.mu during prepareReloadBuild and reconcileCandidateBundle, Current() reads and returns the bundle reference without acquiring the mutex—using only atomic.LoadPointer, which atomicizes the pointer itself, not the object's contents. A concurrent thread can therefore call Current(), obtain a reference to the bundle, and observe it being partially mutated by BuildIncremental.

To fix, either:

  1. Treat the active bundle as read-only during incremental builds and stage mutations on a detached copy before swap
  2. Have BuildIncremental mutate only the candidate, not the previous bundle
  3. Ensure Current() returns a defensive snapshot during reload windows

Per coding guidelines, "Keep behavior nil-safe and concurrency-safe by default in exported APIs."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@commons/systemplane/service/supervisor.go` around lines 209 - 225,
prepareReloadBuild/reconcileCandidateBundle currently pass the live active
bundle into BuildIncremental which mutates previousBundle (nil-ing transferred
pointers) while Current() may return that same object lock-free; to fix, ensure
incremental builds never mutate the live bundle: change prepareReloadBuild (and
the value passed into BuildIncremental) to create and use a detached deep copy
of the previous bundle (or otherwise ensure BuildIncremental only mutates the
candidate copy), so all pointer-nilling happens on the candidate and the active
bundle remains read-only; additionally, if making copy here is infeasible, make
Current() return a defensive snapshot during reload windows or document/guard
supervisor.mu so callers cannot observe in-place mutations (reference symbols:
BuildIncremental, prepareReloadBuild, reconcileCandidateBundle, Current,
supervisor.mu).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@commons/systemplane/bootstrap/backend_test.go`:
- Around line 152-158: The repeated snapshot/restore pattern around
backendRegistry.snapshot(), backendRegistry.factories and
backendRegistry.initErrors should be extracted into a test helper; add a
t.Helper() function (e.g. withRegistrySnapshot) that captures origFactories,
origErrors := backendRegistry.snapshot() and registers t.Cleanup to restore
backendRegistry.factories and backendRegistry.initErrors, then replace the
duplicated snapshot/cleanup blocks in tests with a call to
withRegistrySnapshot(t).

In `@commons/systemplane/bootstrap/backend.go`:
- Around line 30-79: The registry mutates and reads s.factories without
synchronization; add a sync.RWMutex field to backendRegistryState and use it to
protect all accessors/mutators: use write locks in register(), reset(),
recordInitError() and any code that mutates s.factories or s.initErrors, and use
read locks in factory() and snapshot() (upgrade snapshot to copy under a read
lock before returning). Update the package-global backendRegistry usage
accordingly so RegisterBackendFactory and NewBackendFromConfig call into these
mutex-protected methods, ensuring the exported API is concurrency-safe by
default.

In `@commons/systemplane/bootstrap/config.go`:
- Around line 76-78: Replace the plain fmt.Errorf in the default branch with a
wrapped sentinel error for consistency: add a package-level sentinel error
ErrUnhandledBackend and change the default return to wrap that sentinel with the
backend name (using fmt.Errorf with %w) so callers can compare errors (e.g.,
errors.Is(err, ErrUnhandledBackend)); reference the default case handling
cfg.Backend and the validate function's default branch when making the change.

In `@commons/systemplane/catalog/validate.go`:
- Around line 117-119: The validation only checks env var mismatches when
allowedEnvVars(sk) returns a non-empty slice, so when the canonical key
intentionally defines no env vars a product can still set pd.EnvVar and go
unnoticed. In ValidateKeyDefs, after computing expectedEnvVars :=
allowedEnvVars(sk), add a branch: if len(expectedEnvVars) == 0 && pd.EnvVar !=
"" then append a mismatch to comparisons (use mismatchForString(pd, sk,
"EnvVar", "", pd.EnvVar) or the existing expected-format) so products that set
an env var where the catalog defines none are flagged; keep existing
containsString branch for the non-empty case. Ensure you reference
allowedEnvVars, pd.EnvVar, comparisons, mismatchForString and sk when making the
change.

In `@commons/systemplane/domain/coercion_helpers.go`:
- Around line 109-119: The float-to-int boundary checks fail at the 2^63
floating-point alias; update intFromFloat64 and scaleDurationFloat64 to reject
values that are at-or-beyond the integer bounds by using math.Nextafter to
detect the next representable float beyond the integer limits. Specifically,
after truncating (or before casting), compute maxF := float64(math.MaxInt) and
minF := float64(math.MinInt) and also nextMax := math.Nextafter(maxF,
math.Inf(1)) and nextMin := math.Nextafter(minF, math.Inf(-1)); return false if
value >= nextMax || value <= nextMin (or truncated >= nextMax / <= nextMin) so
numbers that collapse to the same float as the boundary are correctly rejected
before casting in intFromFloat64 and scaleDurationFloat64.

In `@commons/systemplane/domain/key_def.go`:
- Around line 154-156: The new error in domain.KeyDef.Validate() that returns
ErrInvalidRedactPolicy for keyDef.Secret && keyDef.RedactPolicy == RedactMask
introduces a breaking validation failure; instead preserve compatibility by not
returning an error for Secret+RedactMask—either normalize the value (e.g., set
keyDef.RedactPolicy = RedactFull inside Validate() for Secret keys) or simply
skip this check so existing callers remain valid. Update the logic in
KeyDef.Validate() to perform normalization rather than returning
ErrInvalidRedactPolicy (referencing keyDef.Secret, keyDef.RedactPolicy,
RedactMask, RedactFull and ErrInvalidRedactPolicy).

In `@commons/systemplane/ports/identity_defaults_test.go`:
- Around line 99-143: Add a table-driven test case to
TestFuncIdentityResolver_FailsClosedWithoutIdentity that covers a
FuncIdentityResolver whose ActorFunc is nil but DefaultActor is a
whitespace-only string (e.g., "   "); call resolver.Actor(ctx) with a valid
context and assert that it returns domain.ErrPermissionDenied just like the
empty/default-nil cases, referencing the FuncIdentityResolver struct, its
DefaultActor field, ActorFunc, and the Actor(ctx) method to locate the code to
modify.

In `@commons/systemplane/ports/identity_defaults.go`:
- Around line 35-63: Add a godoc comment for the exported method
FuncIdentityResolver.Actor that succinctly describes its behavior and
fail-closed semantics: note that it returns domain.ErrPermissionDenied on nil
receiver or nil context, that it trims whitespace and prefers the value returned
by ActorFunc(ctx) over DefaultActor (fallback) when ActorFunc produces an empty
string, and that it enforces maxActorIDLength returning ErrPermissionDenied for
overly long IDs; include mention that it returns domain.Actor{ID: id} on
success. Reference the symbols FuncIdentityResolver.Actor, ActorFunc,
DefaultActor, maxActorIDLength, and domain.ErrPermissionDenied in the comment.
- Around line 65-80: Add a godoc comment for FuncIdentityResolver.TenantID that
documents its fail-closed behavior and nil-safety, and enforce a maximum tenant
ID length by introducing a maxTenantIDLength constant and validating the trimmed
tenantID in TenantID(ctx context.Context) to return domain.ErrPermissionDenied
when tenantID is empty or exceeds the max; update any tests or callers if needed
and keep the existing nil/context checks and empty-string handling intact.

In `@commons/systemplane/service/component_diff.go`:
- Around line 194-204: Add a short comment above effectiveValueChanged
explaining that it uses reflect.DeepEqual to compare EffectiveValue.Value and
documenting DeepEqual semantics (e.g., nil slice != empty slice, functions
always unequal) and the expected/allowed types for EffectiveValue.Value
(primitives, slices, maps) so future maintainers know the comparison caveats; if
values may include nil-vs-empty slices or functions, note that they must be
normalized before comparison or that such types are not supported by this
comparison.

In `@commons/systemplane/service/manager_writes_helpers.go`:
- Around line 103-110: The call to manager.store.Put returns a persisted
revision (revision) but if manager.applyEscalation fails the function currently
returns an empty WriteResult; change the error path so that when
applyEscalation(ctx, plan.target, plan.escalation) returns an error you still
return the persisted revision in the WriteResult (e.g. return
WriteResult{Revision: revision, /* other needed fields */}, fmt.Errorf("apply
escalation: %w", err)) so callers can reconcile/retry safely; update the return
at the applyEscalation error branch to include the revision from
manager.store.Put and any other required fields of WriteResult.

In `@commons/systemplane/service/shutdown_test.go`:
- Around line 99-101: The test currently only asserts the call order when
tt.wantOrder != nil, skipping the "all nil steps" case; change the test to
assert unconditionally by calling assert.Equal(t, tt.wantOrder, order) (or
assert.Empty for clarity) without the surrounding if so the "all nil steps" case
is explicitly verified; update the assertion near tt.wantOrder and order in
shutdown_test.go accordingly.

In `@commons/systemplane/service/snapshot_builder_test.go`:
- Line 225: The test currently asserts values["app.name"].Redacted without
verifying the key exists, which can hide a missing entry; update the test in
snapshot_builder_test.go (where values is populated, e.g., after calling
initDefaults or the test setup that builds values) to first assert the key
exists (for example, check that val, ok := values["app.name"]; ok) and then
assert val.Redacted is false—i.e., replace the direct access with a presence
check followed by the Redacted assertion referencing the retrieved value.

In `@commons/systemplane/service/supervisor_helpers.go`:
- Around line 104-108: The current error-path calls candidate.Close(ctx) which
can tear down resources shared with the active bundle for incremental builds;
instead detect incremental candidates produced by BuildIncremental and call the
incremental discard path discardFailedCandidate(ctx, candidate, previousBundle)
so shared/adopted resources are released safely, otherwise fall back to
candidate.Close(ctx); update the block that currently checks
isNilRuntimeBundle(candidate) to invoke discardFailedCandidate when a
previousBundle is present/when candidate is incremental and only call
candidate.Close for non-incremental candidates.
- Around line 49-60: In discardFailedCandidate(), replace the three silent
ignores of Discard(ctx) and Close(ctx) (the branches testing
candidate.(discarderType) and rollbackDiscarder and the final candidate.Close
call) with best-effort structured logging: call the Discard/Close, capture the
returned error, and if non-nil log it via the package's structured logger/tracer
(include fields like "candidate", "operation" ("discard" or "close"), and the
error) before returning; similarly, in commitReload() where the previous
bundle's Close(ctx) is currently discarded, capture any error and emit a
structured log/trace entry with context (e.g., "previous_bundle",
"operation":"close", error) since that function cannot propagate errors. Ensure
you use the existing logger/tracer object available in the scope and do not
change control flow or error returns from these functions.

In `@commons/systemplane/swagger/swagger.go`:
- Around line 105-107: The exported godoc for MergeInto is out of sync with
mergeTags: update the MergeInto comment to state that systemplane tags are
merged with deduplication by tag name (i.e., tags from the source with a name
that already exists in the destination are skipped) instead of describing a
plain append; reference the mergeTags semantics in the comment and clearly
document that tag order and duplicates from the source will not be preserved by
MergeInto.
- Around line 127-148: The current merge loop only checks dstArr against the
existing map, so duplicate tag names inside srcArr still get appended; update
the loop that iterates srcArr in swagger.go to mark each parsed tag.Name into
the existing map before appending (use the same anonymous struct parsing as
before) and skip append when existing[tag.Name] is true, ensuring srcArr
duplicates are deduplicated; refer to symbols dstArr, srcArr, existing and the
local var tag when making the change.

---

Outside diff comments:
In `@commons/systemplane/bootstrap/backend.go`:
- Around line 125-143: The snapshot call stores initErrors but drops the
snapshot of factories, causing a live-registry lookup later; change the call to
capture the snapshot (e.g., factories, initErrors := backendRegistry.snapshot())
and then use that captured snapshot for the lookup instead of
backendRegistry.factory (e.g., look up cfg.Backend in the factories map or call
a method on the snapshot) so the initErrors check and factory selection are
consistent; keep the existing error handling with initErrors and
errUnsupportedBackend/errNilBackendConfig as-is.

In `@commons/systemplane/service/supervisor.go`:
- Around line 209-225: prepareReloadBuild/reconcileCandidateBundle currently
pass the live active bundle into BuildIncremental which mutates previousBundle
(nil-ing transferred pointers) while Current() may return that same object
lock-free; to fix, ensure incremental builds never mutate the live bundle:
change prepareReloadBuild (and the value passed into BuildIncremental) to create
and use a detached deep copy of the previous bundle (or otherwise ensure
BuildIncremental only mutates the candidate copy), so all pointer-nilling
happens on the candidate and the active bundle remains read-only; additionally,
if making copy here is infeasible, make Current() return a defensive snapshot
during reload windows or document/guard supervisor.mu so callers cannot observe
in-place mutations (reference symbols: BuildIncremental, prepareReloadBuild,
reconcileCandidateBundle, Current, supervisor.mu).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9464126b-b240-4f6b-bcea-e4ab8c67883d

📥 Commits

Reviewing files that changed from the base of the PR and between 3154d43 and 0df2a75.

📒 Files selected for processing (46)
  • commons/systemplane/adapters/http/fiber/dto.go
  • commons/systemplane/adapters/http/fiber/dto_test.go
  • commons/systemplane/bootstrap/backend.go
  • commons/systemplane/bootstrap/backend_test.go
  • commons/systemplane/bootstrap/config.go
  • commons/systemplane/catalog/doc.go
  • commons/systemplane/catalog/keys_postgres.go
  • commons/systemplane/catalog/keys_redis.go
  • commons/systemplane/catalog/keys_shared.go
  • commons/systemplane/catalog/shared_key.go
  • commons/systemplane/catalog/validate.go
  • commons/systemplane/catalog/validate_test.go
  • commons/systemplane/domain/coercion_helpers.go
  • commons/systemplane/domain/config_helpers.go
  • commons/systemplane/domain/key_def.go
  • commons/systemplane/domain/key_def_test.go
  • commons/systemplane/domain/setting_helpers.go
  • commons/systemplane/domain/snapshot_config_helpers_test.go
  • commons/systemplane/domain/snapshot_duration_slice_helpers_test.go
  • commons/systemplane/domain/snapshot_setting_helpers_test.go
  • commons/systemplane/domain/snapshot_test_helpers_test.go
  • commons/systemplane/ports/authorizer_defaults.go
  • commons/systemplane/ports/authorizer_defaults_test.go
  • commons/systemplane/ports/identity_defaults.go
  • commons/systemplane/ports/identity_defaults_test.go
  • commons/systemplane/service/component_diff.go
  • commons/systemplane/service/component_diff_test.go
  • commons/systemplane/service/manager.go
  • commons/systemplane/service/manager_helpers.go
  • commons/systemplane/service/manager_helpers_test.go
  • commons/systemplane/service/manager_read_helpers.go
  • commons/systemplane/service/manager_reads.go
  • commons/systemplane/service/manager_reads_test.go
  • commons/systemplane/service/manager_schema_test.go
  • commons/systemplane/service/manager_test.go
  • commons/systemplane/service/manager_writes.go
  • commons/systemplane/service/manager_writes_helpers.go
  • commons/systemplane/service/shutdown.go
  • commons/systemplane/service/shutdown_test.go
  • commons/systemplane/service/snapshot_builder.go
  • commons/systemplane/service/snapshot_builder_test.go
  • commons/systemplane/service/supervisor.go
  • commons/systemplane/service/supervisor_helpers.go
  • commons/systemplane/swagger/spec.json
  • commons/systemplane/swagger/swagger.go
  • commons/systemplane/swagger/swagger_test.go
💤 Files with no reviewable changes (2)
  • commons/systemplane/service/manager_test.go
  • commons/systemplane/service/manager_reads_test.go

Base automatically changed from develop to main March 27, 2026 17:45
jeffersonrodrigues92 and others added 2 commits March 27, 2026 20:33
Sort import groups alphabetically, remove trailing blank line in test,
and normalize struct field comment alignment in Manager.
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@commons/tenant-manager/consumer/multi_tenant_optional_rabbitmq_test.go`:
- Line 13: The import order was changed so the amqp import (amqp alias for
"github.com/rabbitmq/amqp091-go") is placed before the testify imports; reorder
the import block to group external dependencies together by moving the amqp
import so all external packages (e.g., "github.com/rabbitmq/amqp091-go" as amqp,
"github.com/stretchr/testify/assert", "github.com/stretchr/testify/require")
appear together before any internal project imports in
multi_tenant_optional_rabbitmq_test.go; this is a cosmetic fix—adjust the import
list in the file's import block to follow conventional grouping.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9292cb9c-6963-4afb-8898-7193b31916a8

📥 Commits

Reviewing files that changed from the base of the PR and between 0df2a75 and bf7b5a6.

📒 Files selected for processing (3)
  • commons/tenant-manager/consumer/multi_tenant_optional_rabbitmq_test.go
  • commons/tenant-manager/core/context_test.go
  • commons/tenant-manager/postgres/manager.go
💤 Files with no reviewable changes (1)
  • commons/tenant-manager/core/context_test.go

- Add sync.RWMutex to backend registry for concurrent safety
- Use snapshot consistently for factory lookup in NewBackendFromConfig
- Extract withRegistrySnapshot test helper to reduce duplication
- Add ErrUnhandledBackend sentinel error in config validation
- Fix EnvVar validation gap when catalog defines no env vars
- Fix float-to-int boundary at 2^63 (>= instead of > for overflow check)
- Accept Secret+RedactMask in KeyDef.Validate (runtime normalizes to Full)
- Add godocs for Actor/TenantID, maxTenantIDLength, whitespace test case
- Add DeepEqual semantics comment on effectiveValueChanged
- Return persisted revision on escalation failure in persistAndApplyWrite
- Assert unconditionally in shutdown_test (covers all-nil-steps case)
- Check key existence before asserting Redacted in snapshot_builder_test
- Use discardFailedCandidate for incremental build error path
- Document concurrency safety invariant in prepareReloadBuild
- Update MergeInto godoc and deduplicate srcArr tags in mergeTags
- Auto-formatting: struct field alignment in tenant-manager packages
@fredcamaral fredcamaral changed the base branch from main to develop March 28, 2026 18:36
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

♻️ Duplicate comments (3)
commons/systemplane/service/supervisor_helpers.go (2)

49-60: ⚠️ Potential issue | 🟠 Major

Handle cleanup failures instead of discarding them.

Lines 49-60 and 192-194 still drop Discard/Close errors with _ =, so failed rollbacks and superseded bundle shutdowns can leak resources without any diagnostic signal. These are best-effort paths, but they still need structured logging/tracing with operation context before returning.

As per coding guidelines, "Do not swallow errors; return or handle errors with context" and "Use the structured log interface Log(ctx, level, msg, fields...) instead of printf-style methods".

Also applies to: 192-194

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@commons/systemplane/service/supervisor_helpers.go` around lines 49 - 60, The
code is currently swallowing errors from candidate.Discard(ctx) and
candidate.Close(ctx); update the cleanup paths (the branch checking
rollbackDiscarder and the final Close call) to handle returned errors by calling
the structured logger Log(ctx, level, msg, fields...) with contextual fields
(e.g., "candidate", "operation":"discard" or "close", and the error) instead of
assigning to _; preserve the best-effort semantics by logging at debug/warn
level and then returning, and apply the same change to the similar cleanup at
the other location (the Close/Discard calls around the 192-194 logic).

132-139: ⚠️ Potential issue | 🟠 Major

Publish snapshot and bundle through one atomic state object.

Lines 176-177 are still two independent Store calls, so concurrent readers can observe a new snapshot paired with the old bundle, or vice versa. That breaks the documented atomic swap contract and weakens the guarantee described in Lines 132-139 once AdoptResourcesFrom starts mutating the retired bundle.

Run the following to confirm the mixed-state window remains:

#!/bin/bash
# Expectation: the implementation still publishes and reads snapshot/bundle separately,
# so the pair is not swapped atomically.
sed -n '130,195p' commons/systemplane/service/supervisor_helpers.go
echo '--- separate snapshot/bundle readers ---'
rg -n -C3 'snapshot\.Load\(|bundle\.Load\(' commons/systemplane/service

As per coding guidelines, "Preserve v4 public API contracts unless a task explicitly asks for breaking changes" and "Keep behavior nil-safe and concurrency-safe by default in exported APIs".

Also applies to: 176-180

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@commons/systemplane/service/supervisor_helpers.go` around lines 132 - 139,
The code currently does two separate Store calls (snapshot.Store and
bundle.Store) which allows a reader to see a mixed pair; change to publish a
single atomic state object that contains both the snapshot and bundle and call
Store once so Current() readers always see a consistent pair; update the
loader/reader code that uses snapshot.Load and bundle.Load to use the new single
state.Load, and ensure AdoptResourcesFrom, BuildIncremental and any callers that
expect the previousBundle still work by reading the bundle out of the atomic
state (preserving read-only sharing semantics).
commons/systemplane/domain/coercion_helpers.go (1)

109-123: ⚠️ Potential issue | 🟠 Major

Restore the 32-bit upper bound in intFromFloat64.

Line 119 fixes the 2^63 alias on 64-bit, but it now hard-codes a 64-bit ceiling for an int result. On 32-bit builds, truncated values above the 32-bit int max still pass and int(truncated) becomes implementation-specific.

Suggested fix
 func intFromFloat64(value float64) (int, bool) {
 	if math.IsNaN(value) || math.IsInf(value, 0) {
 		return 0, false
 	}
 
 	truncated := math.Trunc(value)
 
-	if truncated >= 1<<63 || truncated < math.MinInt {
-		return 0, false
-	}
+	if strconv.IntSize == 32 {
+		if truncated > math.MaxInt || truncated < math.MinInt {
+			return 0, false
+		}
+	} else if truncated >= 1<<63 || truncated < math.MinInt {
+		return 0, false
+	}
 
 	return int(truncated), true
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@commons/systemplane/domain/coercion_helpers.go` around lines 109 - 123, The
function intFromFloat64 currently uses a hard-coded 64-bit ceiling (1<<63) which
breaks on 32-bit builds; replace that constant with platform-specific int
bounds: compute the platform max and min int (e.g., intMax := int(^uint(0) >> 1)
and intMin := -intMax - 1 or their int64 equivalents), convert those to float64
for comparison, and use those values when rejecting out-of-range truncated
values in intFromFloat64 (ensure you still guard against float->int boundary
aliasing by rejecting values >= float64(intMax)+1 or > float64(intMax) as
appropriate and values < float64(intMin)). Ensure you update the comparisons
that currently reference 1<<63 and math.MinInt to use these platform-specific
bounds and then return int(truncated) when in range.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@commons/systemplane/bootstrap/backend_test.go`:
- Around line 167-168: Test mutating code directly modifies
backendRegistry.factories without locking; add and use a test helper that
acquires backendRegistry.mu before mutating so tests follow the production
synchronization contract. Create helpers (e.g., setBackendFactoryLocked(name
domain.BackendType, f backendFactory) and deleteBackendFactoryLocked(name
domain.BackendType)) that lock backendRegistry.mu, perform the assignment or
delete on backendRegistry.factories, and unlock; replace direct mutations like
delete(backendRegistry.factories, domain.BackendPostgres) (and the other
occurrences cited) with calls to these locked helpers, or call them from
withRegistrySnapshot where appropriate.

In `@commons/systemplane/bootstrap/backend.go`:
- Around line 49-53: The recordInitError implementation appends nil errors,
causing initErrors to be non-empty even when no real error occurred; update
backendRegistryState.recordInitError to ignore nil (wrap existing mutex
lock/unlock and only append when err != nil) so stray RecordInitError(nil) calls
are safe—ensure the exported helper (RecordInitError) that forwards to
recordInitError continues to work unchanged but benefits from the nil guard.

In `@commons/systemplane/domain/coercion_helpers.go`:
- Around line 251-257: The type switch over raw in the duration coercion
function misses the time.Duration concrete type, so a time.Duration value won't
match case int64; update the switch (the raw variable in the function that calls
scaleDurationInt64 and scaleDurationFloat64) to include a case time.Duration
that converts v to int64 and calls scaleDurationInt64(int64(v), unit) so
time.Duration inputs are handled directly and consistently with other integer
types.

In `@commons/systemplane/ports/identity_defaults.go`:
- Around line 38-42: Update the Actor godoc to accurately describe the fallback
behavior: state that FuncIdentityResolver.Actor uses DefaultActor not only when
ActorFunc is nil or returns an empty string but also when it returns a
whitespace-only value (after strings.TrimSpace). Mention the use of ActorFunc
and DefaultActor in the comment so it aligns with the implementation of
FuncIdentityResolver.Actor.

In `@commons/systemplane/service/component_diff.go`:
- Around line 114-115: The function snapshotIsZeroValue currently only checks
Configs, GlobalSettings, and TenantSettings, allowing snapshots with non-zero
Revision or BuiltAt to be treated as zero; update snapshotIsZeroValue (in
commons/systemplane/service/component_diff.go) to also require Revision == 0 and
BuiltAt is the zero time (e.g., BuiltAt.IsZero()) as part of the conjunction so
the fast-path only triggers for a true zero-value snapshot.

In `@commons/systemplane/service/manager_writes_helpers.go`:
- Around line 98-115: The callers of persistAndApplyWrite (in particular the
function handling PatchResource/PatchSettings in manager_writes.go) currently
discard the returned partial WriteResult on error and return WriteResult{};
update those error paths to return the WriteResult value returned by
persistAndApplyWrite (e.g., propagate the returned result variable alongside the
error) so the persisted revision is preserved for callers to reconcile/retry;
apply the same change to the PatchSettings caller path as well, ensuring any
function that calls persistAndApplyWrite forwards the returned WriteResult when
err != nil instead of dropping it.

In `@commons/systemplane/swagger/swagger.go`:
- Around line 149-159: The code currently calls json.Unmarshal(raw, &tag) and
ignores errors and also allows a tag with an empty Name to be appended; change
the block around json.Unmarshal so you capture the error (err :=
json.Unmarshal(raw, &tag)) and propagate or return a contextual error instead of
continuing silently when unmarshal fails, and add a check for tag.Name == ""
that rejects/returns an error (or skips with a logged error) rather than
allowing a nameless tag to be added; update the logic that uses
existing[tag.Name] and dstArr = append(dstArr, raw) (the tag variable and
existing map usage) so the append only happens after successful unmarshal and a
non-empty tag.Name.

---

Duplicate comments:
In `@commons/systemplane/domain/coercion_helpers.go`:
- Around line 109-123: The function intFromFloat64 currently uses a hard-coded
64-bit ceiling (1<<63) which breaks on 32-bit builds; replace that constant with
platform-specific int bounds: compute the platform max and min int (e.g., intMax
:= int(^uint(0) >> 1) and intMin := -intMax - 1 or their int64 equivalents),
convert those to float64 for comparison, and use those values when rejecting
out-of-range truncated values in intFromFloat64 (ensure you still guard against
float->int boundary aliasing by rejecting values >= float64(intMax)+1 or >
float64(intMax) as appropriate and values < float64(intMin)). Ensure you update
the comparisons that currently reference 1<<63 and math.MinInt to use these
platform-specific bounds and then return int(truncated) when in range.

In `@commons/systemplane/service/supervisor_helpers.go`:
- Around line 49-60: The code is currently swallowing errors from
candidate.Discard(ctx) and candidate.Close(ctx); update the cleanup paths (the
branch checking rollbackDiscarder and the final Close call) to handle returned
errors by calling the structured logger Log(ctx, level, msg, fields...) with
contextual fields (e.g., "candidate", "operation":"discard" or "close", and the
error) instead of assigning to _; preserve the best-effort semantics by logging
at debug/warn level and then returning, and apply the same change to the similar
cleanup at the other location (the Close/Discard calls around the 192-194
logic).
- Around line 132-139: The code currently does two separate Store calls
(snapshot.Store and bundle.Store) which allows a reader to see a mixed pair;
change to publish a single atomic state object that contains both the snapshot
and bundle and call Store once so Current() readers always see a consistent
pair; update the loader/reader code that uses snapshot.Load and bundle.Load to
use the new single state.Load, and ensure AdoptResourcesFrom, BuildIncremental
and any callers that expect the previousBundle still work by reading the bundle
out of the atomic state (preserving read-only sharing semantics).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 3a6d133a-0c3e-456c-95ef-91080b60bebb

📥 Commits

Reviewing files that changed from the base of the PR and between bf7b5a6 and 2ff5629.

📒 Files selected for processing (22)
  • commons/systemplane/bootstrap/backend.go
  • commons/systemplane/bootstrap/backend_test.go
  • commons/systemplane/bootstrap/config.go
  • commons/systemplane/catalog/validate.go
  • commons/systemplane/domain/coercion_helpers.go
  • commons/systemplane/domain/key_def.go
  • commons/systemplane/domain/key_def_test.go
  • commons/systemplane/ports/identity_defaults.go
  • commons/systemplane/ports/identity_defaults_test.go
  • commons/systemplane/service/component_diff.go
  • commons/systemplane/service/manager_writes_helpers.go
  • commons/systemplane/service/shutdown_test.go
  • commons/systemplane/service/snapshot_builder_test.go
  • commons/systemplane/service/supervisor_helpers.go
  • commons/systemplane/swagger/swagger.go
  • commons/tenant-manager/client/client.go
  • commons/tenant-manager/consumer/multi_tenant_optional_rabbitmq_test.go
  • commons/tenant-manager/mongo/manager.go
  • commons/tenant-manager/mongo/manager_test.go
  • commons/tenant-manager/postgres/manager.go
  • commons/tenant-manager/rabbitmq/manager.go
  • commons/tenant-manager/rabbitmq/manager_test.go

- Guard nil init errors in RecordInitError to prevent false positives
- Propagate WriteResult.Revision on escalation failure in PatchConfigs/PatchSettings
- Include Revision and BuiltAt in zero-snapshot check for accurate diff
- Return error instead of silently skipping malformed source tags in swagger merge
- Reject empty-name tags in swagger merge to prevent spec corruption
- Route backend_test.go registry mutations through locked helpers
- Align Actor godoc with whitespace-only fallback behavior
@coderabbitai coderabbitai bot requested review from ClaraTersi and qnen March 28, 2026 20:12
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

♻️ Duplicate comments (1)
commons/systemplane/swagger/swagger.go (1)

142-156: ⚠️ Potential issue | 🟠 Major

Fail fast on empty source tag names instead of silently skipping.

On Line 154, empty source tag names are currently skipped with continue. That silently accepts invalid source metadata and contradicts the “rejected” behavior described in this block.

♻️ Proposed fix
 		if tag.Name == "" {
-			continue
+			return fmt.Errorf("swagger merge: src tag missing name")
 		}

As per coding guidelines, "Do not swallow errors; return or handle errors with context".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@commons/systemplane/swagger/swagger.go` around lines 142 - 156, The loop over
srcArr in swagger merge currently silently skips entries with empty tag.Name;
change that behavior to fail fast by returning a descriptive error instead of
using continue. In the block that unmarshals into the local tag struct (the
json.Unmarshal call and subsequent if tag.Name == "" check), replace the
continue with a return that wraps context (e.g., "swagger merge: empty src tag
name") so the caller sees invalid source metadata; reference the srcArr
iteration, the tag struct, and the json.Unmarshal location when making the
change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@commons/systemplane/bootstrap/backend_test.go`:
- Around line 408-409: The test directly reads backendRegistry.factories and
backendRegistry.initErrors without holding the registry lock; change the
assertions to use backendRegistry.snapshot() (or call snapshot() into a local
variable) after ResetBackendFactories() so the reads are performed from the
locked snapshot. Specifically, replace direct accesses to
backendRegistry.factories and backendRegistry.initErrors in the test with values
obtained from backendRegistry.snapshot(), keeping assertions (e.g., assert.Len
and assert.NotEmpty) but operating on the snapshot results.
- Line 359: The test directly accesses backendRegistry.factories[testKind]
without acquiring backendRegistry's lock, which is inconsistent with other
locked helpers; change the access to read under the registry's lock (use
backendRegistry.lock/RLock around the read) or use the registry's snapshot/read
helper (e.g., call a provided snapshot() or GetFactories() method) to obtain a
copy and then index by testKind so the test consistently uses a
locked/read-snapshot approach for backendRegistry.factories.
- Line 379: The map read of backendRegistry.factories using key kind is
performed without holding the registry lock; wrap the lookup with the same read
lock used elsewhere (e.g., use backendRegistry.mu.RLock() before accessing
backendRegistry.factories[kind] and backendRegistry.mu.RUnlock() after—or use
defer to unlock) so the access is consistent with the protected read at the
other occurrence and avoids a race on backendRegistry.

In `@commons/systemplane/ports/identity_defaults.go`:
- Around line 17-25: Comments for ActorFunc, TenantFunc, and DefaultActor are
imprecise: they mention only "" but the Actor implementation trims whitespace
(treating whitespace-only values as empty). Update the exported field godocs for
ActorFunc, TenantFunc and DefaultActor to state they must return a non-empty,
non-whitespace tenant/actor ID (or that whitespace-only strings are treated as
empty) and that DefaultActor is used when ActorFunc is nil or returns an empty
or whitespace-only string; reference ActorFunc, TenantFunc, DefaultActor and the
Actor behavior when making the change.

In `@commons/systemplane/service/component_diff.go`:
- Around line 74-80: The godoc for ChangedComponents is misleading because it
currently implies any differing key marks a component changed, but the
implementation (in ChangedComponents) skips keys whose ApplyBehavior does not
require a bundle rebuild; update the comment to state explicitly that
ChangedComponents only returns components with differences that require a bundle
rebuild, that keys with non-rebuild ApplyBehavior are ignored, and that unknown
keys/invalid apply behaviors or unclassified rebuild-triggering keys force a
full rebuild for safety; reference the ApplyBehavior check in the implementation
and domain.Snapshot to make callers aware this is not a general-purpose
snapshot-diff API.

In `@commons/systemplane/service/manager_writes.go`:
- Around line 31-32: Validation failures from manager.validateConfigOps(req.Ops)
are currently returned directly and skip span error handling and wrapping;
update this branch to call manager.HandleSpanError(span, err) and return a
wrapped error consistent with the other paths (e.g., wrap the original error
with context like "validateConfigOps: %w") before returning WriteResult{} so
observability and error formatting match the other error branches.
- Around line 74-75: The call to manager.validateSettingOps(req.Ops,
subject.Scope) in manager_writes.go does not use HandleSpanError or wrap the
error with context like other error paths (e.g., PatchConfigs); update the error
handling so that if validateSettingOps returns an error you pass it through
HandleSpanError on the current span and return a wrapped error with contextual
text (include function/operation identifiers such as validateSettingOps,
subject.Scope and the surrounding write operation) to match existing patterns
used elsewhere in this function.

---

Duplicate comments:
In `@commons/systemplane/swagger/swagger.go`:
- Around line 142-156: The loop over srcArr in swagger merge currently silently
skips entries with empty tag.Name; change that behavior to fail fast by
returning a descriptive error instead of using continue. In the block that
unmarshals into the local tag struct (the json.Unmarshal call and subsequent if
tag.Name == "" check), replace the continue with a return that wraps context
(e.g., "swagger merge: empty src tag name") so the caller sees invalid source
metadata; reference the srcArr iteration, the tag struct, and the json.Unmarshal
location when making the change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 45e14a08-9f4f-44d7-99b3-f7b9dbae1cb4

📥 Commits

Reviewing files that changed from the base of the PR and between 2ff5629 and c0f83d1.

📒 Files selected for processing (6)
  • commons/systemplane/bootstrap/backend.go
  • commons/systemplane/bootstrap/backend_test.go
  • commons/systemplane/ports/identity_defaults.go
  • commons/systemplane/service/component_diff.go
  • commons/systemplane/service/manager_writes.go
  • commons/systemplane/swagger/swagger.go

Replace two separate atomic pointers (snapshot + bundle) with a single
atomic.Pointer[supervisorState] so lock-free readers always see a
consistent (snapshot, bundle) pair. This eliminates the torn-read window
where Current() and Snapshot() could return mismatched states.

Record cleanup errors from discard/close paths to the active OTEL span
instead of silently discarding them with _ assignment.

Add 64-bit platform note to intFromFloat64 boundary constant.
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@commons/systemplane/domain/coercion_helpers.go`:
- Around line 80-83: Several coercion helpers use raw == nil while others use
IsNilValue(raw), causing inconsistent handling of interface-wrapped nil
pointers; update tryCoerceInt, tryCoerceBool, tryCoerceFloat64, and
tryCoerceDuration to use IsNilValue(raw) for their nil-checks (same
behavior/output as before), so all coercion functions use the uniform IsNilValue
pattern; reference the functions tryCoerceInt, tryCoerceBool, tryCoerceFloat64,
and tryCoerceDuration and ensure IsNilValue is available in the package.
- Around line 109-126: intFromFloat64 (and likewise scaleDurationFloat64)
currently only guards against 64-bit overflow using 1<<63 and will still
overflow on 32-bit runtimes; add a runtime IntSize check using strconv.IntSize
to also reject values outside 32-bit range (e.g. if strconv.IntSize == 32 &&
(truncated >= 1<<31 || truncated < -1<<31) return failure) so casting to
int/int64 is safe, or alternatively add a build constraint at the top of
coercion_helpers.go to restrict the file to 64-bit platforms if 32-bit support
is not intended.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 18226445-dd2d-41d6-af59-284a307dfd13

📥 Commits

Reviewing files that changed from the base of the PR and between c0f83d1 and 5af74a4.

📒 Files selected for processing (3)
  • commons/systemplane/domain/coercion_helpers.go
  • commons/systemplane/service/supervisor.go
  • commons/systemplane/service/supervisor_helpers.go

@fredcamaral fredcamaral merged commit 940f788 into develop Mar 28, 2026
3 checks passed
@fredcamaral fredcamaral deleted the feat/systemplane-helpers branch March 28, 2026 20:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants