Skip to content

Commit 2d3b2e8

Browse files
authored
feat(run-engine): flag to route getSnapshotsSince through read replica (#3423)
## Summary Adds `RUN_ENGINE_READ_REPLICA_SNAPSHOTS_SINCE_ENABLED` (default `"0"`). When enabled, the Prisma reads inside `RunEngine.getSnapshotsSince` run against the read-only replica client instead of the primary. Offloads the snapshot-polling queries fired by every running task runner off the writer. ## Why `getSnapshotsSince` is called from the managed runner's fetch-and-process loop (once per poll interval, plus on every snapshot-change notification). It runs four sequential reads per call — one `findFirst` by snapshot id, one `findMany` on snapshots with `createdAt > X`, one raw SQL against `_completedWaitpoints`, and chunked `findMany` on `waitpoint`. Per concurrent run, every few seconds. It's read-only, tolerates a small amount of staleness, and is an obvious candidate for the replica. ## Replica-lag considerations - **Step 1 "since snapshot not found"**: if the runner just received a snapshot id from the primary and asks the replica before it replicates, the function throws and the caller treats the response as an error (runner falls back to a metadata refresh). Self-correcting, not silent. - **Step 2 missing newly-created snapshots**: the next poll's `createdAt > sinceSnapshot.createdAt` filter still picks them up once the replica catches up. - **Waitpoint junction race**: the riskiest path — if a latest snapshot is replicated but its `_completedWaitpoints` join rows aren't yet, the runner could advance past that snapshot with `completedWaitpoints: []`. WAL/storage-level replication replays commits in order, so in practice both should appear atomically on the reader, but the race window is why the flag ships disabled. Aurora reader shrinks all three windows to single-digit ms in typical conditions, and its storage-level replication gives atomic visibility of committed transactions on the reader. ## Test plan - [ ] Flip the flag on in a non-prod environment, confirm snapshot polling behaves normally and `getSnapshotsSince` errors in Sentry stay flat. - [ ] Verify writer query volume drops and reader query volume rises on the snapshot-polling queries. - [ ] Keep an eye on `AuroraReplicaLag` (or equivalent) during rollout.
1 parent 7c95ee4 commit 2d3b2e8

5 files changed

Lines changed: 15 additions & 1 deletion

File tree

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,6 @@
1+
---
2+
area: webapp
3+
type: improvement
4+
---
5+
6+
Add `RUN_ENGINE_READ_REPLICA_SNAPSHOTS_SINCE_ENABLED` flag (default off) to route the Prisma reads inside `RunEngine.getSnapshotsSince` through the read-only replica client. Offloads the snapshot polling queries (fired by every running task runner) from the primary. When disabled, behavior is unchanged.

apps/webapp/app/env.server.ts

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -835,6 +835,7 @@ const EnvironmentSchema = z
835835
.enum(["log", "error", "warn", "info", "debug"])
836836
.default("info"),
837837
RUN_ENGINE_TREAT_PRODUCTION_EXECUTION_STALLS_AS_OOM: z.string().default("0"),
838+
RUN_ENGINE_READ_REPLICA_SNAPSHOTS_SINCE_ENABLED: z.string().default("0"),
838839

839840
/** How long should the presence ttl last */
840841
DEV_PRESENCE_SSE_TIMEOUT: z.coerce.number().int().default(30_000),

apps/webapp/app/v3/runEngine.server.ts

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -19,6 +19,8 @@ function createRunEngine() {
1919
logLevel: env.RUN_ENGINE_WORKER_LOG_LEVEL,
2020
treatProductionExecutionStallsAsOOM:
2121
env.RUN_ENGINE_TREAT_PRODUCTION_EXECUTION_STALLS_AS_OOM === "1",
22+
readReplicaSnapshotsSinceEnabled:
23+
env.RUN_ENGINE_READ_REPLICA_SNAPSHOTS_SINCE_ENABLED === "1",
2224
worker: {
2325
disabled: env.RUN_ENGINE_WORKER_ENABLED === "0",
2426
workers: env.RUN_ENGINE_WORKER_COUNT,

internal-packages/run-engine/src/engine/index.ts

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1633,7 +1633,8 @@ export class RunEngine {
16331633
snapshotId: string;
16341634
tx?: PrismaClientOrTransaction;
16351635
}): Promise<RunExecutionData[] | null> {
1636-
const prisma = tx ?? this.prisma;
1636+
const prisma =
1637+
tx ?? (this.options.readReplicaSnapshotsSinceEnabled ? this.readOnlyPrisma : this.prisma);
16371638

16381639
try {
16391640
const snapshots = await getExecutionSnapshotsSince(prisma, runId, snapshotId);

internal-packages/run-engine/src/engine/types.ts

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -145,6 +145,10 @@ export type RunEngineOptions = {
145145
/** Optional maximum TTL for all runs (e.g. "14d"). If set, runs without an explicit TTL
146146
* will use this as their TTL, and runs with a TTL larger than this will be clamped. */
147147
defaultMaxTtl?: string;
148+
/** When true, `getSnapshotsSince` reads through the read-only replica client instead
149+
* of the primary. Defaults to false. Callers passing an explicit `tx` always use
150+
* that client regardless of this flag. */
151+
readReplicaSnapshotsSinceEnabled?: boolean;
148152
tracer: Tracer;
149153
meter?: Meter;
150154
logger?: Logger;

0 commit comments

Comments
 (0)