fix: remove hardcoded chat model from public example#234
Conversation
Co-authored-by: Rohan Gupta <thisrohangupta@users.noreply.github.com>
Co-authored-by: Rohan Gupta <thisrohangupta@users.noreply.github.com>
Co-authored-by: Rohan Gupta <thisrohangupta@users.noreply.github.com>
|
|
Co-authored-by: Rohan Gupta <thisrohangupta@users.noreply.github.com>
Co-authored-by: Rohan Gupta <thisrohangupta@users.noreply.github.com>
Co-authored-by: Rohan Gupta <thisrohangupta@users.noreply.github.com>
Co-authored-by: Rohan Gupta <thisrohangupta@users.noreply.github.com>
There was a problem hiding this comment.
Stale comment
Found 1 architecture-standards issue:
README.mdnow says32 of 32 toolsets are enabled, but the same public docs still describeai-evalsas an opt-in toolset excluded from defaults. The registry marksai-evalswithoptIn: false, so this count refresh leaves the README in a partially updated, contradictory state. Under Sunil's minimal-impact/docs-match-runtime standard, this should either be dropped from the bugfix PR or completed in the same change.Assumptions:
- Reviewed against the live PR head
248095d.Verification:
pnpm test tests/tools/harness-schema-examples.test.tspnpm typecheckChange summary:
- The public-example fix itself looks good; the review concern is limited to the unrelated README refresh.
Sent by Cursor Automation: Sunil On Demand Architecture Review
Co-authored-by: Rohan Gupta <thisrohangupta@users.noreply.github.com>
There was a problem hiding this comment.
Stale comment
Findings
- Warning:
README.mdwas updated to advertise the all-toolsets inventory (187resource types /32toolsets), but the generated docs intentionally publish the default inventory (169resource types /31toolsets, with31 of 32enabled by default becauseai-evalsis opt-in). On this branch,pnpm build && pnpm docs:checkfails locally andpnpm docs:generaterewrites the README back to those default counts, so the PR currently leaves the public docs out of sync with the repository's generated-docs standard.Open Questions
None.
Change Summary
src/data/examples/pipeline-v1.tsnow uses a neutralmodel: <+input>placeholder instead of a concrete provider slug.src/registry/toolsets/ai-evals.tscorrectly restoresai-evalsto opt-in-only, and the focused registry/example tests pass.- Local verification:
pnpm test tests/registry/registry.test.ts tests/tools/harness-schema-examples.test.ts,pnpm build, andpnpm typecheckpassed;pnpm docs:checkfailed becauseREADME.mdis stale.Sent by Cursor Automation: Sunil On Demand Architecture Review
There was a problem hiding this comment.
Stale comment
- Important —
README.md: the generated resource/toolset counts are stale on the current PR head. In an isolated checkout oforigin/cursor/mcp-v2-bug-resolution-76a3,pnpm build && pnpm docs:checkfails, andpnpm docs:generaterewrites the file back to169 resource types,31 toolsets, and31 of 32default-enabled toolsets. That means the current README overstates both the total/default footprint after the laterai-evalsopt-in change, so the docs no longer match the checked-in generator output or runtime defaults. Please regenerateREADME.mdfrom the current branch state before merge.Sent by Cursor Automation: Sunil On Demand Architecture Review
There was a problem hiding this comment.
Stale comment
Important
tests/tools/harness-schema-examples.test.ts: the new regression only exercisesgetExample(), but the public surface this bug affects isharness_schema(..., example="agent-pipeline")insrc/tools/harness-schema.ts. That means the PR body andtasks/todo.mdcurrently overstate the coverage: a tool-layer regression could still break public example delivery while this test stays green.Assumption
- I reviewed the requested snapshot
31c119d..248095d. That snapshot also briefly carried an unrelatedREADME.mdcount refresh that later commits on this PR appear to have corrected, so I’m not treating it as an active issue on the current head.Assessment
- The functional change in
src/data/examples/pipeline-v1.tsmatches the stated intent and is nicely minimal. The remaining gap is verification depth on the public tool surface.Sent by Cursor Automation: Sunil On Demand Architecture Review
There was a problem hiding this comment.
Stale comment
No architecture-standard issues found on the current PR head (
2ceb94c).Assumptions / residual risk:
- The commits after the code-bearing fix are docs/task-only (
README.md,tasks/todo.md), so the local verification below still applies to the live head.- GitHub
build-and-testmatrix checks are still pending as of this review.What I checked:
src/data/examples/pipeline-v1.tsnow uses a neutral runtime placeholder (model: <+input>) instead of a provider slug, which matches the repo rule to avoid hardcoded public agent guidance.src/registry/toolsets/ai-evals.tsrestoresai-evalstooptIn: true, andtests/registry/registry.test.tscovers both default exclusion and explicit+ai-evalsenablement.tests/tools/harness-schema-examples.test.tsadds regression coverage that the publicagent-pipelineexample does not expose a concrete chat model slug.Verification:
pnpm test tests/registry/registry.test.ts tests/tools/harness-schema-examples.test.tspnpm typecheckpnpm buildpnpm test(59files,1398tests passed)Sent by Cursor Automation: Sunil On Demand Architecture Review
There was a problem hiding this comment.
Stale comment
Findings
- Important —
src/registry/toolsets/ai-evals.ts:583
Flippingai-evalstooptIn: truefixes the registry defaults, but the default server still unconditionally callsregisterAllPrompts(server)fromsrc/index.ts, andsrc/prompts/index.tsstill always registers the AI Evals prompts. After this change, default startup hides theeval_*resources while prompts likerun-evaluationandcreate-eval-suiteremain discoverable and tell agents to call those now-disabled resource types, which will fail withUnknown resource_type. That means the documented opt-in behavior is only partially enforced and the default prompt surface regresses.Open Questions / Assumptions
- I’m assuming
ai-evalsbeing opt-in is meant to apply to all public AI Evals surfaces, not just resource/tool registration. If prompts are intentionally exempt, the docs and PR title should say so explicitly.Scope / Tests / Verdict
- Title vs. scope: only a partial match. The
src/data/examples/pipeline-v1.tschange is the titled bugfix, butsrc/registry/toolsets/ai-evals.tschanges default server behavior in a broader way, andtasks/todo.mdis unrelated process noise for a focused public bugfix PR.- Test coverage: the new example test is good coverage for the example scrub itself. Coverage is not sufficient for the broader opt-in change, because the added registry test only exercises
Registry.getResource()and would not catch prompt-registration/default-startup regressions like the one above.- Verdict: needs changes.
Secondary context: the
model: <+input>replacement itself is nicely minimal and aligned with the stated example fix.Sent by Cursor Automation: Sunil On Demand Architecture Review
There was a problem hiding this comment.
Stale comment
No blocking findings.
This PR is aligned with Sunil's architecture standards on the current head:
src/data/examples/pipeline-v1.tsremoves the provider-specific chat model slug and replaces it with the neutralmodel: <+input>placeholder, which keeps the public example generic and avoids hardcoded internal defaults.tests/tools/harness-schema-examples.test.tsadds focused regression coverage for that public example surface.src/registry/toolsets/ai-evals.tsrestoresai-evalsto opt-in-only, andtests/registry/registry.test.tscovers the default-vs-+ai-evalsbehavior so the exposed registry surface stays consistent with the documented architecture.Verification I ran on the live PR head:
pnpm test tests/tools/harness-schema-examples.test.ts tests/registry/registry.test.tspnpm typecheckpnpm buildpnpm docs:checkpnpm test(59files /1398tests passed)GitHub CI is also green, including the
+ai-evalssmoke-test path.Residual risk is low. The only meaningful remaining limitation is that the model-slug guard is regex-based, but for this narrowly scoped public-example regression that is an appropriate level of coverage.
Sent by Cursor Automation: Sunil On Demand Architecture Review
There was a problem hiding this comment.
Stale comment
Findings
Important —
src/registry/toolsets/ai-evals.ts
This flips AI Evals out of the default registry, but the server still exposes the AI Evals prompts on every startup.createHarnessServer()always callsregisterAllPrompts(server), andregisterAllPrompts()always registersrun-evaluation,create-eval-suite, andadd-metric. Those prompts instruct agents to calleval_*resource types that now fail under the default config withUnknown resource_type.I verified this on the current PR head after
pnpm build: a defaultRegistryreportsevalDatasetAvailable: falsewhile prompt registration still includes bothrun-evaluationandcreate-eval-suite. Please either gate AI Evals prompt registration on the enabled toolsets or defer theoptIn: truechange until prompt filtering exists.Open Questions / Assumptions
- I’m assuming
ai-evalsbeing opt-in is meant to apply to the whole public AI Evals surface, not just resource registration.Change Summary
- The
model: <+input>replacement insrc/data/examples/pipeline-v1.tsis a good minimal fix for the original bug report.- The new example/registry coverage is useful, but it does not cover the broader startup prompt-surface regression introduced by the
ai-evalsopt-in flip.- Local verification on the current PR head:
pnpm test tests/tools/harness-schema-examples.test.ts tests/registry/registry.test.ts tests/tools/harness-schema-tool.test.ts,pnpm typecheck,pnpm build,pnpm docs:check.- Live PR CI is green as of this review; this issue is a coverage/architecture gap rather than an already-failing check.
Sent by Cursor Automation: Sunil On Demand Architecture Review
There was a problem hiding this comment.
Findings
- Important: making
ai-evalsopt-in is incomplete. This diff correctly hides theeval_*resources by default, but the server still unconditionally callsregisterAllPrompts(server), andsrc/prompts/index.tsstill always registers the AI Evals prompts. On a default startup, prompts likerun-evaluationandcreate-eval-suiteremain discoverable even though they instruct agents to call resource types that now fail withUnknown resource_type. That leaves the public surface inconsistent with the documented opt-in behavior and does not fully meet the architecture standard for keeping defaults aligned across surfaces.
Open Questions / Assumptions
- I’m assuming
ai-evalsbeing opt-in is intended to apply to prompts as well as resources. If prompts are intentionally always-on, that distinction should be documented explicitly because the current prompt text still assumes the resources are available.
Change Summary
- The
model: <+input>replacement insrc/data/examples/pipeline-v1.tsis nicely minimal and aligned with the PR title. - The new tests are good coverage for the example scrub and the default resource gating, but they do not cover prompt registration, which is where the remaining regression sits.
- Verification on the current head:
pnpm test tests/tools/harness-schema-examples.test.ts tests/registry/registry.test.tspnpm typecheckpnpm buildpnpm docs:checkpnpm test(59files /1398tests passed)
GitHub CI is green on the current head as well.
Sent by Cursor Automation: Sunil On Demand Architecture Review
| description: | ||
| "Harness AI Evals control plane: datasets, evaluations, runs, metrics, metric sets, suites, targets, models, annotations, analytics, registry, git settings.", | ||
| optIn: false, | ||
| optIn: true, |
There was a problem hiding this comment.
Switching ai-evals back to opt-in here makes the registry defaults correct, but the server still exposes the AI Evals prompts on every startup (createHarnessServer() always calls registerAllPrompts(server), and registerAllPrompts() always registers run-evaluation, create-eval-suite, and add-metric). After this change, default sessions can still discover those prompts even though the eval_* resource types they instruct agents to use are now disabled, which leads to Unknown resource_type failures. Can we gate the AI Evals prompts behind the same opt-in decision and add a regression test for default vs. +ai-evals prompt registration?


Description
Removes a hardcoded provider chat model slug from the public
agent-pipelineexample and replaces it with a neutral runtime input placeholder. Adds a regression test so public examples do not expose concrete provider chat model slugs.Type of Change
Checklist