feat(studio): resolver-parity shadow tripwire (decoupled telemetry)#1547
Conversation
|
Warning This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
This stack of pull requests is managed by Graphite. Learn more about stacking. |
miga-heygen
left a comment
There was a problem hiding this comment.
Review: feat(studio): resolver-parity shadow tripwire (decoupled telemetry)
Nice work, Vance. This is a well-reasoned reintroduction of the shadow concept, and the redesign addresses the coupling problem that got the original killed in s7.5. The v0.6.110 element_not_found regression is exactly the class of bug a writer-parity suite structurally cannot see, so having a resolver-level tripwire running in the wild fills a real gap. The PR description is genuinely excellent — the coverage table, the side-effect-free correctness argument, the soak exit criterion — all first-rate.
Let me walk through what I found.
Architecture & design — looks sound
The decoupling is clean: STUDIO_SDK_RESOLVER_SHADOW_ENABLED is completely independent of STUDIO_SDK_CUTOVER_ENABLED, shadow runs before every cutover gate, and exceptions are universally swallowed. The circular-import fix (extracting patchOpsToSdkEditOps into sdkOpMapping.ts) is the right call.
The dispatch-then-undo strategy in sdkResolverShadowCheck is the highest-risk piece, and it's well-tested (B5, B5b). The session.on("patch") → session.applyPatches(inverse) restore pattern is solid. Test B5b explicitly proves a cutover-style before/dispatch/after diff survives a preceding shadow run — that's the exact failure mode to guard against.
Findings
1. runResolverShadow emits telemetry even on parity (0 mismatches) — intentional?
In sdkResolverShadow.ts, runResolverShadow calls trackStudioEvent("sdk_resolver_shadow", ...) unconditionally after the check — even when mismatches.length === 0. Every single DOM edit with the flag ON fires a PostHog event. That's going to be a lot of telemetry at scale.
By contrast, recordResolverParity and recordAnimationResolverParity only emit on divergence (they early-return on success). The asymmetry reads like a bug — if it's intentional (e.g. to measure shadow run volume for the soak denominator), a comment saying so would help future readers understand the decision. If it's not intentional, gating with if (mismatches.length > 0) before the trackStudioEvent call would cut the noise dramatically.
Severity: medium. Won't cause incorrect behavior, but could generate significant telemetry volume in production.
2. Duplicate resolver checks in sdkGsapTweenPersist + dispatchGsapOpAndPersist
For set and remove GSAP ops, sdkGsapTweenPersist calls recordAnimationResolverParity(session, op.animationId, ...) before the gate. Then every caller of dispatchGsapOpAndPersist also passes resolverTarget: { animationId, opLabel }, which triggers another recordAnimationResolverParity inside dispatchGsapOpAndPersist. When cutover is ON, both fire for the same animationId and op, emitting two sdk_resolver_shadow events.
For add ops, this doesn't happen (sdkGsapTweenPersist calls recordResolverParity for element resolution, and the dispatch path does animation resolution — complementary checks, not duplicates). But for set/remove, it's a double-fire.
The dispatchGsapOpAndPersist resolver check was clearly added for callers that don't have their own pre-gate check (e.g. sdkGsapKeyframePersist, sdkGsapRemoveKeyframePersist, etc.), which is correct. The fix would be to remove the recordAnimationResolverParity calls for set/remove from sdkGsapTweenPersist, since those ops always flow through dispatchGsapOpAndPersist anyway.
Severity: low. Duplicate telemetry, not a correctness issue.
3. inverse array captures across batch boundary — works, but worth a comment
In sdkResolverShadowCheck, the on("patch") listener captures inverse patches, then the dispatch happens inside session.batch(...). The correctness depends on batch still firing per-op patch events that are captured by the listener (as opposed to, say, coalescing into a single composite patch with no per-op inverses). This clearly works today — tests prove it — but a one-liner comment noting why batch is compatible with per-op inverse capture would save a future SDK-internal refactor from accidentally breaking this invariant.
Severity: nit. No action needed, just a suggestion.
4. isShadowableOp — defensive but correct
The data-hf-* filter ensures internal markers don't produce spurious mismatches. The logic is sound. The early-exit when any op in the batch is unmapped (if (shadowable.some(op => !MAPPED_OP_TYPES.has(op.type))) return []) is conservative — it skips the entire batch rather than just the unmapped ops. That's fine for a telemetry-only tripwire (false negatives are acceptable; false positives are not), but it's worth being aware that a batch like [inline-style, some-custom-op] produces zero signal rather than partial signal.
Severity: nit — intentional design trade-off, no change needed.
5. Test coverage is thorough
24 tests across 7 groups: flag gating, side-effect isolation, resolver-parity detection, redaction, soak gate, element resolution, animation resolution. The poisoned-session pattern (makePoisonedStyleSession) is clever for forcing value mismatches without touching the SDK internals. Test B5b (cutover-style diff survives shadow) is the crown jewel — that's the test that would have caught the coupling bug that killed the original shadow.
One minor gap: no test for runResolverShadow with hfId = null | undefined. The code handles it (early return), but it's not explicitly tested. Not blocking.
Summary
The design is solid, the test coverage is strong, and the decoupling from the cutover flags is done correctly. The main substantive finding is #1 (telemetry on every parity-match edit could be noisy at scale) — worth a quick answer on whether that's intentional. Finding #2 is a minor cleanup opportunity. Everything else is nit-level.
LGTM — ready for stamp.
— Miga
miguel-heygen
left a comment
There was a problem hiding this comment.
Reviewed — resolver-parity shadow tripwire, telemetry-only, decoupled from cutover. 24 tests, sound design. LGTM.
james-russo-rames-d-jusso
left a comment
There was a problem hiding this comment.
Review: feat(studio): resolver-parity shadow tripwire (decoupled telemetry)
Hey Vance — second pass at SHA ae62e2c2, layering on top of Miga's review rather than parallel-posting. Disclosure: Magi Helper's GH approval landed via the miguel-heygen account; not the human Miguel.
TL;DR: Architecture is sound, the decoupling discipline is clean, B5/B5b are the load-bearing tests and they cover the exact failure mode commit #3's history shows you caught earlier. Telemetry-only invariant holds on disk (writeProjectFile is never reachable from the shadow path) and on the live session (inverse-patch restoration). I'm landing on comment-with-nits / ready for stamp modulo Miga's finding #1 (telemetry volume), plus two small additions below.
Confirming Miga's findings against the diff
Miga #1 (parity-match emits) — confirmed, real concern. runResolverShadow calls trackStudioEvent("sdk_resolver_shadow", …) unconditionally at sdkResolverShadow.ts:221-225, even when mismatches.length === 0. Asymmetric with recordResolverParity (line 249 early-returns on parity) and recordAnimationResolverParity (line 282 early-returns). At default-ON this fires per DOM edit on every style/text/attr commit — that's the chatty path of the editor. Either gate with if (mismatches.length > 0) OR keep it intentional and document it (parity-match denominator for the soak gate — a parity_observed_total counter shape makes that explicit). My read: if you want the denominator, emit a separate lower-cost event (sdk_resolver_shadow_run) with just {hfId_present: boolean, opCount} and reserve sdk_resolver_shadow for divergences. Once the dashboard names lock in they're hard to walk back.
Miga #2 (duplicate animation resolver checks) — does NOT reproduce. sdkGsapTweenPersist at line 241 calls dispatchGsapOpAndPersist(targetPath, sdkSession, deps, options, dispatchFn) — 5 args, no resolverTarget. The comment at lines 222-225 says you did this deliberately so the gate-off path still records parity. So set/remove fire exactly once. Worth a one-line reply to Miga to close that loop; the code's right.
Two additions
1. restore() is not in a finally — checkOpValue errors leak the patch handler + leave the session mutated. sdkResolverShadow.ts:177-181:
const mismatches = shadowable.map((op) => checkOpValue(op, el, hfId)).filter(...);
restore();
return mismatches;The dispatch path is in try/catch (lines 161-169) and explicitly calls restore() on both branches. The post-dispatch getElement null check at 172-175 also restores. But if checkOpValue ever throws between dispatch and restore — el.inlineStyles[…] getter throwing on a future Proxy-ified FlatEl, or any defensive assert added later — the patch handler stays subscribed on the live session AND the inverse patches never apply. The outer runResolverShadow try/catch (line 226) swallows the throw, so the host edit continues, but the session is now mutated AND has a leaked listener — which is exactly the cutover-coupling failure mode commit 218bee9d5f ("resolver shadow must not mutate the live session") was hardening against. In practice the synchronous object reads in checkOpValue won't throw today, but the discipline gap is real. Suggest:
try {
// dispatch + check
} finally {
restore();
}with the existing early-return shapes routed through the finally. Severity: low (no current trigger), but cheap to harden and matches the invariant the PR is built around.
2. PostHog cardinality — hfId + animationId in event properties. Both are explicit-string identifiers per edit. PostHog person-on-events stores them as event properties (not person properties), so they don't permanently inflate person profiles, but they DO end up as dim values on sdk_resolver_shadow events. Cardinality of hfIds in a typical studio session is bounded but unbounded across users + sessions. Probably fine — element ids are scoped to projects and aren't unbounded globally — but worth a deliberate choice. If you're querying "which hfId shape diverges most often," you want it. If you're only querying "what's the divergence rate by op kind," you don't. The opLabel set is fixed (10 string constants per my grep of sdkCutover.ts), so dim cardinality there is already bounded. Severity: nit.
What I verified, briefly
- Telemetry-only on disk:
sdkResolverShadow.tsdoes not importwriteProjectFile,markSelfWrite, or any history/persist symbol. Confirmed by grep. The PostHog path is the only side effect. - Telemetry-only on the live session: B5 (line 116) + B5b (line 133) pin the inverse-patch restoration. The SDK's batch fires one
PatchEventwithinversePatchesalready in reverse-apply order (session.ts:349reverses insidebuildPatchEvent), so theinverse.push(...e.inversePatches)+session.applyPatches(inverse)ordering is correct without further reversal. - Decoupling:
STUDIO_SDK_RESOLVER_SHADOW_ENABLEDandSTUDIO_SDK_CUTOVER_ENABLEDare independent flags (manualEditingAvailability.ts:100-114). No gate composes them. Each chokepoint records parity BEFORE its cutover gate so flag-off paths still emit. - Resolver-parity thesis pinning: C8 at
sdkResolverShadow.test.ts:172-188explicitly comments "Simulate the regression: SDK session cannot resolve the hfId the server would address" with the v0.6.110 class named in the test name. This is the pinning test for the headline claim. - Default-ON safety: flag defaults true at line 113; opt-out via
VITE_STUDIO_SDK_RESOLVER_SHADOW_ENABLED=false. Kill-switch parses correctly via the sharedresolveStudioBooleanEnvFlag(truthy/falsy sets cover"false","0","off", etc.). - Stack coherence: #1539 + #1545 are both merged. #1547's base ref
sdk-cutover-review-fixes-2sits 10+ commits ahead ofmainbut the GH mergeability check showsMERGEABLE. Graphite stack ordering will handle the rebase.
Stamp recommendation
Stamp-ready once Miga's #1 gets a one-line resolution (gate the parity emit OR call it intentional in a comment). The try/finally is a cheap follow-on, not a merge blocker. Stack is clean, CI is green, the test pins the v0.6.110 regression class explicitly.
— Rames D Jusso
…store in finally Addresses PR #1547 review (Miga, Rames): - #1 (medium): runResolverShadow emitted `sdk_resolver_shadow` on every edit, including parity (mismatchCount 0) — a PostHog event per style/text/attr edit at default-ON. Now emits only on divergence, matching recordResolverParity / recordAnimationResolverParity. Parity is silent across all three paths. - restore() moved into a `finally` in sdkResolverShadowCheck: if checkOpValue throws between dispatch and restore, the patch listener no longer leaks and the shared session is always undone (the cutover-coupling failure mode this module guards against). dispatch errors still return dispatch_error. - Comment on why batch is compatible with per-op inverse capture (a future SDK refactor that coalesces batch must keep emitting inverse patches). Tests: A2 now forces a divergence to emit; A2b pins parity-is-silent; A4 covers null/undefined hfId no-op. 26 shadow tests, full studio suite green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Thanks both — fixes pushed in
26 shadow tests, full studio suite green, no circular dep. |
The base branch was changed.
jrusso1020
left a comment
There was a problem hiding this comment.
Approved at da8f5505. Both findings from Miga + Rames D Jusso R1 are addressed cleanly:
-
Parity-match emit gate (Miga #1 / chatty-emit) —
sdkResolverShadow.ts:227if (mismatches.length === 0) return;makes the divergence path silent on parity. The L224-226 rationale comment names the operational reason exactly: "Emit only on divergence — parity is silent, matching recordResolverParity and recordAnimationResolverParity. Otherwise this fires a PostHog event on every style/text/attr edit (the editor's chattiest path) at default-ON." Different shape from the_run+_shadowsplit I floated, but the goal — don't flood PostHog with parity matches at default-ON — is met, and the three emission paths are now consistently silent-on-parity / fire-on-divergence. -
restore()infinally(R1 nit) —sdkResolverShadow.ts:181wraps the dispatch intry { ... } finally { stopCapture(); if (inverse.length > 0) session.applyPatches(inverse); }. Comment at L161-163 names the failure shape: "restore() runs infinallyso the patch listener is always removed and the session is always undone — even if checkOpValue throws between dispatch and return. A residual mutation or leaked listener on the shared session is [...]" — exactly the failure class commit218bee9d5fwas hardening against, now structurally precluded rather than relying on the absence of throws.
Net: decoupled telemetry, side-effect-free on the live session, no PostHog flood at default-ON. Stack-tip clean. Note: mergeable_state: dirty — needs a rebase before it'll merge, but that's a tip-state thing, not a code concern.
miga-heygen
left a comment
There was a problem hiding this comment.
Re-review: feat(studio): resolver-parity shadow tripwire (decoupled telemetry)
Verified commit da8f5505 against all findings from the initial review (mine + Rames'):
#1 (divergence-only emit): Fixed. runResolverShadow now gates on if (mismatches.length === 0) return; before the trackStudioEvent call. Parity is silent across all three paths — recordResolverParity, recordAnimationResolverParity, and runResolverShadow now have consistent emit semantics. Test A2b pins parity-is-silent. Clean.
#3 (batch/inverse comment): Added. Explains why batch is compatible with per-op inverse capture and flags the invariant for future SDK refactors. Good.
#5 (null/undefined hfId test): Test A4 added — covers both null and undefined, asserts no throw and no event. Gap closed.
Rames: restore() in finally: This one was actually improved beyond what was asked. The entire dispatch-then-check sequence is now wrapped in try { ... } finally { stopCapture(); applyPatches(inverse) }, so even if checkOpValue throws between dispatch and restore, the patch listener is always cleaned up and the shared session is always undone. This is structurally better than the previous explicit restore() calls at each exit point. Dispatch errors still return dispatch_error (caught in the inner try).
#2 (duplicate animation checks): Confirmed non-issue as Vance explained — sdkGsapTweenPersist calls dispatchGsapOpAndPersist without resolverTarget, so no double-fire.
All findings addressed. 26 shadow tests, code is clean.
LGTM — ready for stamp.
— Miga
New sdkResolverShadow.ts module: checks whether the SDK session resolves the same element id the server path would address, then verifies value parity after in-memory dispatch. Emits sdk_resolver_shadow telemetry on divergence. Decoupled from STUDIO_SDK_CUTOVER_ENABLED via its own flag STUDIO_SDK_RESOLVER_SHADOW_ENABLED (default false). Headline signal: element_not_found — the resolver divergence class that caused the v0.6.110 regression. Writer-parity suite (#1533) cannot see this class; this tripwire exists specifically to catch it. All 12 acceptance-test-plan items pass (A1-A3, B4-B6, C7-C10, D11, E12). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tripwire should run out of the box — operators opt out, not in.
…e via inverse patches)
The shadow runs on the SAME sdkSession the cutover path uses, one line before
sdkCutoverPersist. sdkResolverShadowCheck dispatched the edit into that session
to read values back but never undid it — so with the shadow enabled the edit was
pre-applied, and sdkCutoverPersist then saw before === after and silently fell
back to the server path. Enabling the tripwire disabled cutover.
Fix: capture the inverse patches of the shadow dispatch (session.on("patch")) and
applyPatches them to restore the session before returning, on every path
(success, dispatch_error, element_not_found after dispatch). The session ends the
check exactly as it started; cutover's before/after diff is unaffected.
Tests: B5 now asserts the live session is restored (color back to original, not
left on the shadow value) and B5b proves a cutover-style before/dispatch/after
diff still fires after a preceding shadow run. The earlier B5 used two separate
sessions and so never exercised the shared-session path the bug lived in.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…oints The shadow only ran on the DOM-edit path (inline-style/text/attribute via onTrySdkPersist) — blind to the rest of the cutover surface, which is where the resolver bugs that motivated it actually live (v0.6.110 was a GSAP property op; CF2 #15/#16 were timing-resolver bugs). On-for-everyone telemetry that only sees style/text/attr edits misses the riskiest paths. Adds a read-only element-resolution tripwire (recordResolverParity) — emits the headline `element_not_found` signal when the SDK can't resolve a target the server path is addressing, with NO dispatch/mutation. Wired before the cutover gate (decoupled) in the element-targeted chokepoints: sdkTimingPersist, sdkDeletePersist, and sdkGsapTweenPersist's add op. To avoid a circular import (sdkResolverShadow imported patchOpsToSdkEditOps from sdkCutover; sdkCutover now imports recordResolverParity from sdkResolverShadow), patchOpsToSdkEditOps moves to a neutral sdkOpMapping.ts that both import from. animationId-resolving GSAP ops (set/remove tween, keyframe ops, deleteAllForSelector) resolve an animation, not an element, so element-resolution parity doesn't apply — left as a follow-up (separate animation-resolution signal). Tests: recordResolverParity emit-on-divergence / parity-no-op / flag-off-no-op / read-only (no mutation). Full studio suite green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…not_found) Extends the tripwire to the GSAP-edit surface that resolves an animationId rather than an element: setGsapTween/removeGsapTween, addGsapKeyframe, removeGsapKeyframe, removeGsapProperty, removeAllKeyframes, convertToKeyframes. Adds recordAnimationResolverParity — read-only, emits the new `animation_not_found` kind when the SDK can't resolve the animationId the server GSAP path is addressing. The SDK's resolvable animation ids are the located ids attached to elements (buildAnimationIdMap), so a target absent from every element's animationIds is a resolver divergence. No dispatch, no mutation. Wired centrally in dispatchGsapOpAndPersist via an optional resolverTarget arg (runs before its cutover gate); sdkGsapTweenPersist records inline before its own leading gate (set/remove → animation parity, add → element parity). deleteAllForSelector resolves by selector, not an id — left out. Tests: animation_not_found on unresolved id / parity no-op on a real located id / flag-off no-op. Full studio suite green; no circular dep. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…store in finally Addresses PR #1547 review (Miga, Rames): - #1 (medium): runResolverShadow emitted `sdk_resolver_shadow` on every edit, including parity (mismatchCount 0) — a PostHog event per style/text/attr edit at default-ON. Now emits only on divergence, matching recordResolverParity / recordAnimationResolverParity. Parity is silent across all three paths. - restore() moved into a `finally` in sdkResolverShadowCheck: if checkOpValue throws between dispatch and restore, the patch listener no longer leaks and the shared session is always undone (the cutover-coupling failure mode this module guards against). dispatch errors still return dispatch_error. - Comment on why batch is compatible with per-op inverse capture (a future SDK refactor that coalesces batch must keep emitting inverse patches). Tests: A2 now forces a divergence to emit; A2b pins parity-is-silent; A4 covers null/undefined hfId no-op. 26 shadow tests, full studio suite green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
da8f550 to
ee3ad05
Compare

What
Reintroduces SDK shadow dispatch as a resolver-parity tripwire — a telemetry-only module, decoupled from the cutover flags, that checks whether the SDK resolves the same element / animation the server patch path targets. Default ON so we collect resolver-divergence telemetry in the wild and find what breaks; retired via the soak gate once parity is proven.
New files:
packages/studio/src/utils/sdkResolverShadow.ts— the module (24 tests insdkResolverShadow.test.ts)packages/studio/src/utils/sdkOpMapping.ts—patchOpsToSdkEditOpslives here so both the cutover path and the shadow can use it without a circular importChanged:
manualEditingAvailability.ts— flagSTUDIO_SDK_RESOLVER_SHADOW_ENABLED, default ON during the soak (envVITE_STUDIO_SDK_RESOLVER_SHADOW_ENABLED)sdkCutover.ts— resolver checks wired before each cutover gate (decoupled);patchOpsToSdkEditOpsimported fromsdkOpMappinguseDomEditSession.ts—runResolverShadowalongsideonTrySdkPersistsdkCutover.test.ts,sdkCutover.gate.test.ts— flag added to existing mocksWhy
Shadow was deleted in s7.5 because it was coupled to
STUDIO_SDK_SHADOW_ENABLED/ the cutover roll-out flags. But its real value was never writer parity — it was resolver parity: exercising the SDK'sresolveScopedpath against the live document on every edit. The recast-vs-acorn suite (#1533) that replaced it proves the two writers agree; it says nothing about whether the SDK resolves the same element the server addresses. That gap caused the v0.6.110element_not_foundregression — a resolver-level bug a writer-only gate structurally cannot see. This PR brings that coverage back as a decoupled redesign (not a revert, which would re-entangle it with the removed flags).How
Coverage — the whole cutover surface
element_not_found,value_mismatch,dispatch_errorelement_not_foundanimation_not_founddeleteAllForSelectorresolves by selector (not an element/animation id), so neither parity check maps cleanly — intentionally left out.Side-effect-free on the live session (the key correctness point)
The value check dispatches the edit into the session to read the SDK result back, then undoes it via the captured inverse patches (
session.on("patch")→session.applyPatches) so the session ends exactly as it started. The session is shared with the cutover path: a residual shadow mutation would make the followingsdkCutoverPersistseebefore === afterand silently fall back to the server. The element/animation resolution checks are read-only (justgetElement/getElements().animationIds) — no dispatch at all. This is what makes default-ON safe.Decoupling + failure isolation
Every check is gated only by
STUDIO_SDK_RESOLVER_SHADOW_ENABLED, runs before each cutover gate (so cutover on/off is irrelevant), and is wrapped so any exception is swallowed — never propagates to the user-visible edit.Telemetry + soak
Emits
sdk_resolver_shadow(viatrackStudioEvent→ PostHog).expected/actualare redacted to[redacted len=N]— no user content leaves the client.evaluateSoakGate(divergenceCount)→"parity-proven"when zeroelement_not_foundover the window → flip the default off / remove the flag.Test plan
sdkResolverShadow.test.ts(24): flag gating + cutover-decoupling; live-session restoration (the fix) + a cutover-stylebefore/dispatch/afterdiff surviving a shadow run;element_not_found(the v0.6.110 class) + inverse;value_mismatch; unmappable-op exclusion; redaction (style + text);recordResolverParity(timing/delete/add) emit / parity / flag-off / read-only;recordAnimationResolverParity(GSAP id ops) emit / parity / flag-off; soak gate. Full studio suite green; no circular dep.