fix(benchmarks): make all comparator lanes cross-platform on Windows (#97)#97
Conversation
All five comparator adapters in scripts/benchmark-comparators.mjs were setup_failed on Windows 11 due to Unix-only shell constructs. This fixes the root causes per-lane so EVAL-02 (real benchmark data) is achievable. Changes by lane: - raw Claude Code: drop `2>/dev/null` from checkInstalled, switch runRawClaudeCode() from execAsync (shell, brittle quoting) to execFileAsync (no shell), add `--output-format json`, raise timeout 60s→120s, change pending_evidence fallback to hard setup_failed - codebase-memory-mcp: replace `which ... 2>/dev/null || npx` with npx-only check (cross-platform), raise initTimeout 5s→10s - jCodeMunch: replace hardcoded python3 with pythonCmd (python on Windows), use `python -m pip install`, raise initTimeout 8s→15s - CodeGraphContext: same pythonCmd consistency fix as jCodeMunch - GrepAI: replace `which grepai 2>/dev/null` with `grepai --version` Adds execFile import + execFileAsync, adds pythonCmd platform constant.
Greptile SummaryThis PR fixes five comparator adapter lanes in
Confidence Score: 4/5Safe to merge once the execFileAsync shell-resolution issue is confirmed or fixed; one P1 may leave the baseline lane broken on the targeted Windows platform. All five lanes receive meaningful cross-platform improvements and the jCodeMunch/CodeGraphContext/codebase-memory-mcp fixes are correct. The single P1 concern — execFileAsync not resolving .cmd wrappers on Windows — directly contradicts the PR's stated goal for the baseline lane. If Claude Code ships a native .exe on Windows the concern evaporates, but that should be confirmed. Dead exec/execAsync code is minor P2. scripts/benchmark-comparators.mjs — specifically the execFileAsync call in runRawClaudeCode (lines 413-417) and the leftover exec/execAsync imports (lines 17, 22).
|
| Filename | Overview |
|---|---|
| scripts/benchmark-comparators.mjs | Five cross-platform fixes (pythonCmd, npx-only detection, remove 2>/dev/null, execFileAsync); one P1 concern — execFileAsync cannot resolve npm's .cmd wrappers on Windows without shell:true, potentially leaving the baseline lane as setup_failed on the targeted platform. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[runComparator adapter] --> B{adapter.name == 'raw Claude Code'?}
B -- yes --> C[runRawClaudeCode]
B -- no --> D[runComparatorViaMcp]
C --> E{checkInstalled
execSync 'claude --version'
uses shell on Windows}
E -- false --> F[install throws
setup_failed]
E -- true --> G[execFileAsync 'claude'
NO shell - cannot find .cmd]
G -- success --> H[JSON.parse stdout
parsed.result ?? stdout]
G -- ENOENT --> I[throw 'claude CLI not found']
I --> J[setup_failed on Windows
despite CLI being installed]
H --> K[matchSignals / aggregateResults]
D --> L{MCP server adapters
npx / pythonCmd / grepai}
L --> M[StdioClientTransport spawn]
M --> N[listTools / callTool / extractPayload]
N --> K
Reviews (1): Last reviewed commit: "fix(benchmarks): make all comparator lan..." | Re-trigger Greptile
| const { stdout } = await execFileAsync( | ||
| 'claude', | ||
| ['-p', prompt, '--output-format', 'json', '--allowedTools', 'Read,Grep,Glob'], | ||
| { timeout: 120000, cwd: path.resolve(rootPath) } | ||
| ); |
There was a problem hiding this comment.
execFileAsync won't resolve .cmd wrappers on Windows
child_process.execFile (and its promisified form) does not use a shell, so it cannot locate .cmd files through PATHEXT resolution. When @anthropic-ai/claude-code is installed globally via npm on Windows, npm creates claude.cmd — not claude.exe. Calling execFileAsync('claude', ...) without shell: true will throw ENOENT on Windows, even when the CLI is correctly installed and execSync('claude --version') (which does use cmd.exe internally) returns cleanly in checkInstalled(). The result would be setup_failed for the baseline lane on the very platform this PR targets.
Use shell: process.platform === 'win32' to get shell-based .cmd resolution on Windows while still avoiding quoting issues (the array form of args is safe with a shell):
const { stdout } = await execFileAsync(
'claude',
['-p', prompt, '--output-format', 'json', '--allowedTools', 'Read,Grep,Glob'],
{ timeout: 120000, cwd: path.resolve(rootPath), shell: process.platform === 'win32' }
);Or, if Claude Code ships a native .exe on Windows and this never arose during testing, the concern doesn't apply — but it's worth confirming before closing the Windows ticket.
scripts/benchmark-comparators.mjs
Outdated
| import { execSync, exec, execFile } from 'child_process'; | ||
| import { parseArgs } from 'util'; | ||
| import { promisify } from 'util'; | ||
| import { withManagedStdioClientSession } from './lib/managed-mcp-session.mjs'; | ||
|
|
||
| const execAsync = promisify(exec); |
There was a problem hiding this comment.
Dead code left over from the
execAsync → execFileAsync migration
exec is imported and immediately promisified into execAsync, but execAsync is no longer called anywhere in the file after the runRawClaudeCode rewrite. Both the import and the promisify line can be removed.
| import { execSync, exec, execFile } from 'child_process'; | |
| import { parseArgs } from 'util'; | |
| import { promisify } from 'util'; | |
| import { withManagedStdioClientSession } from './lib/managed-mcp-session.mjs'; | |
| const execAsync = promisify(exec); | |
| import { execSync, execFile } from 'child_process'; |
And remove line 22 (const execAsync = promisify(exec);) entirely.
…import On Windows, execFile does not use a shell and cannot resolve npm's .cmd wrappers (e.g. claude.cmd). checkInstalled() succeeded via execSync (which goes through cmd.exe) but runRawClaudeCode threw ENOENT, returning setup_failed on the very platform the previous commit targeted. Add shell: process.platform === 'win32' to the execFileAsync call so cmd.exe is used on Windows (resolves .cmd) while POSIX keeps shell: false (no injection risk, args are already an array). Also removes the dead exec / execAsync imports left over from the shell-interpolated execAsync refactor.
|
Good catch on both counts, addressed in a1e9d42. P1 — execFileAsync .cmd resolution: Added P2 — dead imports: Dropped 449/449 tests green on push. |
Summary
Fixes all five comparator adapter lanes in
scripts/benchmark-comparators.mjsthat weresetup_failedon Windows 11 due to Unix-only shell constructs. Satisfies EVAL-02 (real benchmark data from at least 2 lanes).Root causes fixed:
2>/dev/nullshell redirects (not valid on cmd.exe/batch) — replaced withstdio: 'pipe'which(Unix-only) — removed; npx and binary commands used directlypython3(doesn't exist on Windows) — replaced withpythonCmdplatform constant (pythonon Win32,python3on Unix)execAsyncwith manual quote escaping inraw Claude Code— replaced withexecFileAsync(no shell, no quoting issues) +--output-format jsonPer-lane changes:
raw Claude CodeexecFileAsync+--output-format json, 120s timeout, hardsetup_failedon missing CLIcodebase-memory-mcpinitTimeout5s→10sjCodeMunchpythonCmdeverywhere,python -m pip install,initTimeout8s→15sCodeGraphContextpythonCmdconsistency fix (still needs Neo4j — deferred)GrepAIgrepai --versioninstead ofwhich grepai(still needs Go+Ollama — deferred)Test plan
tests/benchmark-comparators.test.ts— 2/2 lifecycle tests passpythonshown for jCodeMunch/CodeGraphContext on Windows)2>/dev/null, 0which(code), 0serverCommand: 'python3', 0pending_evidencein production pathsraw Claude Codelane →status: ok, 12 tasks, 21% usefulness (EVAL-02 minimum met)