diff --git a/.codex/skills/runnable-libcrypto-canonical-eval/SKILL.md b/.codex/skills/runnable-libcrypto-canonical-eval/SKILL.md new file mode 100644 index 000000000..534f827b5 --- /dev/null +++ b/.codex/skills/runnable-libcrypto-canonical-eval/SKILL.md @@ -0,0 +1,108 @@ +--- +name: runnable-libcrypto-canonical-eval +description: Run or audit the canonical libcrypto evaluation contract for SYM-20 using the repo-local 2026-04-28 ground-truth bundle, fresh runnable-cmp-eval compares, and explicit address mapping. Use when a task mentions SYM-20, canonical libcrypto eval, gtBlock.pb, the 2026-04-28 ground truth, or when historical CSV/coverage-sidecar results must be distinguished from the authoritative compare path. +--- + +# Runnable Libcrypto Canonical Eval + +Use this skill when the task is specifically about the authoritative libcrypto benchmark contract, not just generic Runnable compare metrics. + +## Canonical Contract + +For `SYM-20`, the authoritative repo-local contract is: + +1. Canonical binary: + - `python3 scripts/libcrypto_bench_paths.py binary --must-exist` +2. Canonical ground truth protobuf: + - `python3 scripts/libcrypto_bench_paths.py groundtruth-pb --must-exist` +3. Address mapping: + - derive `.text` start from the ELF + - quick check: `python3 scripts/libcrypto_bench_paths.py text-start` + - keep Runnable rebase at `0x50000000` unless the user explicitly provides another contract +4. Compare path: + - `python3 scripts/validate_libcrypto_ground_truth.py cmp --ll /abs/path/to/file.ll --out-dir /abs/path/to/out` + +The wrapper records fresh `HIT`, `MISMATCH`, `OBJ_ONLY`, `LL_ONLY`, `FALSE_NEGATIVE`, `FALSE_POSITIVE`, `precision`, and `recall`. 
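The address mapping in item 3 can be illustrated with a minimal sketch. It assumes the mapping is a plain additive rebase (ELF virtual address plus Runnable base). That assumption is consistent with the `entry_0x500cef80` artifact names and the `.text_start=0xcef80` / `runnable_base=0x50000000` pair listed under "Current Repo Evidence" in this skill, but the skill itself does not spell out the formula. `to_runnable` and `to_elf` are hypothetical helper names, not functions from the repo scripts.

```python
RUNNABLE_BASE = 0x50000000  # canonical rebase named in this skill


def to_runnable(elf_addr: int, base: int = RUNNABLE_BASE) -> int:
    """Map an ELF virtual address into the rebased Runnable address space.

    Assumption (not confirmed by the skill): the mapping is purely additive.
    """
    return base + elf_addr


def to_elf(runnable_addr: int, base: int = RUNNABLE_BASE) -> int:
    """Inverse of to_runnable; rejects addresses that lie below the base."""
    if runnable_addr < base:
        raise ValueError(f"{runnable_addr:#x} is below base {base:#x}")
    return runnable_addr - base


text_start = 0xCEF80  # value recorded under "Current Repo Evidence"
print(hex(to_runnable(text_start)))  # 0x500cef80
```

If a run was produced with a different base (for example the historical `0x50400000`), passing that base explicitly makes the mismatch visible instead of silently shifting every address.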
+ +## Quick Start + +Gap-audit the canonical GT bundle: + +```bash +python3 scripts/validate_libcrypto_ground_truth.py gap-audit \ + --out-dir runs/groundtruth_validation/canonical_gap_audit +``` + +Compare a lift under the canonical contract: + +```bash +python3 scripts/validate_libcrypto_ground_truth.py cmp \ + --ll /abs/path/to/libcrypto.ll \ + --out-dir runs/groundtruth_validation/canonical_cmp +``` + +If you already know the exact binary and want the lower-level wrapper: + +```bash +python3 .codex/skills/runnable-cmp-eval/scripts/run_cmp_eval.py \ + --binary /abs/path/to/libcrypto.so.3 \ + --ll /abs/path/to/libcrypto.ll \ + --text-start 0x... \ + --runnable-base 0x50000000 +``` + +## Historical / Non-Canonical Paths + +Do **not** report the following as the canonical `SYM-20` result without an explicit label: + +- `coverage_sidecar_union_threadpool16` +- CSV-only GT flows such as `ground_truth.csv` +- `rebase_base=0x50400000` +- old libcrypto binaries whose SHA differs from the repo-local canonical `2026-04-28` bundle + +Those paths are useful for forensics and side-by-side experiments, but they are not the authoritative compare contract. 
+ +When historical artifacts are involved, report them as: + +- `historical sidecar-union / non-canonical` +- `old-binary compare / non-canonical` +- `canonical gtBlock.pb compare / authoritative` + +## Required Reporting + +Always include: + +- absolute binary path +- absolute GT path +- absolute `.ll` path +- binary SHA when contract mismatch is possible +- `.text` start +- Runnable base +- `HIT`, `MISMATCH`, `OBJ_ONLY`, `LL_ONLY` +- `FALSE_NEGATIVE`, `FALSE_POSITIVE` +- `precision`, `recall` +- whether the result is `canonical` or `non-canonical` + +## Current Repo Evidence + +Use this note when you need the latest repo-local evidence and caveats: + +- `docs/exp/2026-05-10-libcrypto-canonical-eval-contract.md` + +Current repo-local comparable pair: + +- baseline: + - `runs/runnable-dev-2026-0429-textstart-serial/libcrypto.so.3.entry_0x500cef80.ll` +- optimized: + - `runs/runnable-dev-2026-0501-textstart-dynamic-reapfix/libcrypto.so.3.entry_0x500cef80.dynamic.ll` + +Both already compare against the canonical `2026-04-28` binary with: + +- `.text_start=0xcef80` +- `runnable_base=0x50000000` + +Prefer these over old `2026-04-14` baseline artifacts when the goal is to produce a same-contract baseline/optimized comparison for `SYM-20`. + +## Caveat + +The canonical `gtBlock.pb` bundle still has documented coverage gaps. Run the gap audit and report those counts instead of assuming objdump and GT are identical. 
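The canonical versus non-canonical labeling described in this skill can be sketched as a small check. This is an illustration only: `CompareRun`, its fields, and both functions are hypothetical names, and the rules encode one reading of the "Historical / Non-Canonical Paths" list. In particular, the sketch treats any base other than `0x50000000` as non-canonical, even though the skill allows a different base when the user explicitly provides another contract.

```python
from dataclasses import dataclass

CANONICAL_BASE = 0x50000000
CANONICAL_GT = "gtBlock.pb"


@dataclass
class CompareRun:
    """Hypothetical record of one compare run (field names are assumptions)."""
    gt_source: str            # e.g. "gtBlock.pb" or "ground_truth.csv"
    runnable_base: int        # rebase base used for the lift
    binary_sha_matches: bool  # True if the SHA equals the canonical bundle's


def non_canonical_reasons(run: CompareRun) -> list[str]:
    """Collect every reason the run deviates from the canonical contract."""
    reasons = []
    if run.gt_source != CANONICAL_GT:
        reasons.append(f"GT source is {run.gt_source}, not {CANONICAL_GT}")
    if run.runnable_base != CANONICAL_BASE:
        reasons.append(f"rebase base is {run.runnable_base:#x}")
    if not run.binary_sha_matches:
        reasons.append("binary SHA differs from the canonical bundle")
    return reasons


def label(run: CompareRun) -> str:
    """Return the coarse label that Required Reporting asks for."""
    return "non-canonical" if non_canonical_reasons(run) else "canonical"


old = CompareRun("ground_truth.csv", 0x50400000, binary_sha_matches=True)
print(label(old), non_canonical_reasons(old))
```

Returning the reasons alongside the label keeps the report explicit about why a historical artifact was not treated as authoritative.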
diff --git a/.codex/skills/runnable-libcrypto-canonical-eval/agents/openai.yaml b/.codex/skills/runnable-libcrypto-canonical-eval/agents/openai.yaml new file mode 100644 index 000000000..2f86c9b63 --- /dev/null +++ b/.codex/skills/runnable-libcrypto-canonical-eval/agents/openai.yaml @@ -0,0 +1,4 @@ +interface: + display_name: "Runnable Libcrypto Canonical Eval" + short_description: "Evaluate libcrypto under the SYM-20 canonical contract" + default_prompt: "Use $runnable-libcrypto-canonical-eval to run or audit the canonical SYM-20 libcrypto evaluation contract, distinguish it from historical sidecar-union metrics, and report fresh HIT/FN/FP/precision/recall counters." diff --git a/.codex/skills/runnable-symphony-followup/SKILL.md b/.codex/skills/runnable-symphony-followup/SKILL.md new file mode 100644 index 000000000..21ef7424b --- /dev/null +++ b/.codex/skills/runnable-symphony-followup/SKILL.md @@ -0,0 +1,93 @@ +--- +name: runnable-symphony-followup +description: Draft or submit Runnable follow-up or backlog Linear issues discovered during implementation under the current Symphony workflow. Use when the main task reveals out-of-scope work that should become a separate Runnable issue with clear title, description, acceptance criteria, and validation. +--- + +# Runnable Symphony Follow-Up + +Use this skill when a new issue should be split out from current work instead of expanding the active ticket. + +This repo's `WORKFLOW.md` explicitly requires out-of-scope findings to become separate `Backlog` issues with clear scope. This skill standardizes that split. + +## Default Goal + +Create a follow-up issue draft that: + +- is clearly separate from the current ticket, +- is small enough to be executed independently, +- is suitable for `Backlog` by default, +- explains why it was split out instead of handled now. + +## Default Behavior + +- Default target state: `Backlog`. 
+- Default relationship intent: mark it as related to the current issue when the available Linear tool supports verified relation creation. +- If the follow-up cannot proceed until the current issue lands, note that it should also be linked with `blockedBy` when the available tool supports it. +- Keep the issue narrowly scoped to one concrete gap. + +If a Linear tool is available in-session, prefer creating the follow-up issue instead of only drafting it, but only create `related` / `blockedBy` links when the tool exposes a verified relation operation. + +## Required Structure + +```md +Title: + +Reason for split: +- + +## Summary + + + +## Problem + +- +- +- + +## Scope + +- In scope: +- Out of scope: + +## Acceptance Criteria + +- [ ] +- [ ] + +## Validation + +- [ ] +``` + +## When To Use + +- During implementation you notice a real bug or missing capability not required to complete the current issue. +- The current issue would become much harder to review if the new work were included. +- A cleanup, metrics extension, robustness pass, or doc correction deserves separate tracking. + +## When Not To Use + +- The new work is required to satisfy the current issue's acceptance criteria. +- The problem is only speculative and has no concrete evidence yet. +- The issue is so large that it should be split into multiple follow-ups instead of one. + +## Runnable-Specific Guidance + +- Mention concrete evidence paths when available: run directories, docs, scripts, tests, logs. +- For experiment-related follow-ups, specify the expected report artifact path. +- For code follow-ups, specify the likely code/test/doc surfaces but keep implementation flexibility. +- If you know the current issue identifier, mention that the new issue should be linked as `related` when supported by the available Linear tool. +- If the dependency is real, explicitly say the new issue should be linked as `blockedBy` the current issue when supported by the available Linear tool. 
+- If you are inside a Symphony issue session, prefer creating the follow-up under the same project/team as the current issue. +- If relation creation is not supported by the available Linear tool in-session, still create or draft the issue and report the intended `related` / `blockedBy` linkage explicitly instead of guessing mutation syntax. + +## Output Style + +Prefer returning: + +1. a final proposed title, +2. the draft issue body, +3. a one-line note saying `Suggested initial state: Backlog`. + +Do not bloat the follow-up with execution plans that belong in the future workpad comment. diff --git a/.codex/skills/runnable-symphony-followup/agents/openai.yaml b/.codex/skills/runnable-symphony-followup/agents/openai.yaml new file mode 100644 index 000000000..e53a7e817 --- /dev/null +++ b/.codex/skills/runnable-symphony-followup/agents/openai.yaml @@ -0,0 +1,4 @@ +interface: + display_name: "Runnable Symphony Follow-Up" + short_description: "Split out Symphony follow-up / backlog issues for Runnable" + default_prompt: "Use $runnable-symphony-followup to turn an out-of-scope finding into a separate Runnable Linear backlog issue with clear scope, acceptance criteria, and validation." diff --git a/.codex/skills/runnable-symphony-issue/SKILL.md b/.codex/skills/runnable-symphony-issue/SKILL.md new file mode 100644 index 000000000..01151e668 --- /dev/null +++ b/.codex/skills/runnable-symphony-issue/SKILL.md @@ -0,0 +1,132 @@ +--- +name: runnable-symphony-issue +description: Create, submit, or refine Runnable Linear issues intended to be executed through Symphony. Use when the user wants to file or submit a new issue, rewrite an issue draft, or standardize ticket content so Runnable work can be picked up cleanly by the current Symphony workflow in `WORKFLOW.md`. +--- + +# Runnable Symphony Issue + +Use this skill when the task is to write a new Runnable issue for the Symphony + Linear workflow, not to implement the issue itself. 
+ +This repo's issue workflow is defined in `WORKFLOW.md`. Follow that contract instead of inventing a generic ticket format. + +## Default Goal + +Produce issue text that is immediately usable by the current Runnable Symphony flow: + +- scoped narrowly enough for one agent run, +- concrete enough to reproduce, +- explicit about acceptance criteria, +- explicit about validation, +- safe to start from `Todo` and move through `In Progress` -> `Human Review` -> `Merging` -> `Done`. + +When the user asks to actually create the issue in Linear, prefer doing that in-session if a Linear MCP tool or Symphony `linear_graphql` tool is available. + +## Runnable-Specific Rules + +- Prefer one issue per concrete outcome. Do not pack multiple experiments or refactors into one ticket. +- Write for this repo's real workflows: `scripts/`, `docs/exp/`, `docs/reference/`, `tests/`, `openspec/changes/`, and libcrypto validation/eval paths. +- If the task is exploratory, define the expected artifact clearly: for example a report under `docs/exp/...`, a script under `scripts/...`, or a bounded code change plus validation. +- If the task could sprawl, split it now and keep the current issue to the smallest reviewable slice. +- If the task is discovered while doing other work and is out of scope, prefer the follow-up skill `runnable-symphony-followup`. + +## Required Structure + +Unless the user asks for a different format, draft the issue with these sections: + +```md +## Summary + +<1 short paragraph describing the problem and intended outcome> + +## Problem + +- +- +- + +## Scope + +- In scope: +- Out of scope: + +## Acceptance Criteria + +- [ ] +- [ ] + +## Validation + +- [ ] +- [ ] + +## Notes + +- +``` + +## Writing Guidance + +- `Summary` should say what will be true after the issue is complete. +- `Problem` should describe the current failure mode, missing capability, or quality gap. 
+- `Scope` should constrain the execution so Symphony does not grow the task during implementation. +- `Acceptance Criteria` must be reviewer-visible outcomes, not implementation steps. +- `Validation` must name concrete commands, scripts, or doc artifacts whenever possible. +- `Notes` is optional and should hold paths, run IDs, comparison baselines, or dependencies. + +## State Guidance + +- New work intended for immediate execution should usually be created in `Todo`. +- Follow-up work discovered during implementation should usually be created in `Backlog`. +- Do not tell Symphony to start from `Backlog`; `WORKFLOW.md` explicitly treats it as out of scope until a human moves it. + +## Submission Mode + +If the user asks to "submit", "create", or "file" the issue: + +1. Draft the final title and body first. +2. If a Linear tool is available, create the issue instead of stopping at prose. +3. If no Linear tool is available, return a ready-to-paste title/body/state package. + +Prefer these target states: + +- `Todo` for new work intended to be executed soon by Symphony. +- `Backlog` only when the user explicitly wants deferred work or the task is an out-of-scope follow-up. + +## Linear / Symphony Guardrails + +- Treat Runnable's tracker project slug as `runnable-e97c680b3b79`. +- If you are already inside a Symphony issue session, prefer reusing the current issue's `project.id` and `team.id` when creating a sibling or follow-up issue. +- If you have the current issue context, use the verified `ResolveStateId` and `CreateIssue` GraphQL patterns from `WORKFLOW.md`. +- `WORKFLOW.md` does not define a verified GraphQL mutation for issue relations, so treat `related` / `blockedBy` creation as best-effort via trusted tooling only. +- If you do not have current issue context and do not have a trusted project/team lookup helper in-session, do not invent unknown GraphQL schema fields just to force submission. Fall back to a ready-to-submit draft. 
+- If the user asks for links or dependencies between issues, add the relation only when the available Linear tooling already exposes a verified way to do it in-session. + +## Good Runnable Issue Shapes + +- Tight code fix plus a targeted test. +- One bounded experiment run plus a report artifact. +- One metrics or validation improvement with explicit before/after proof. +- One documentation or workflow correction tied to a concrete misleading behavior. + +## Avoid + +- Multi-week epics disguised as one issue. +- Acceptance criteria like "investigate" with no artifact. +- Validation like "make sure it works" with no command or output path. +- Mixing implementation, benchmark campaign, paper-writing, and cleanup into one ticket. +- Telling the future agent to ask humans for missing details unless there is a real auth or permission blocker. + +## Runnable Defaults + +When the user gives only a rough intent, bias toward: + +- clear file/path anchors, +- reproducible commands, +- explicit output artifacts, +- narrow scope that can plausibly complete in one Symphony ticket cycle. + +If helpful, also read: + +- `WORKFLOW.md` for the active state machine and workpad rules. +- [issue-templates.md](references/issue-templates.md) for ready-to-use Runnable ticket templates. +- [runnable-linear-context.md](references/runnable-linear-context.md) for repo-specific Linear/Symphony defaults. diff --git a/.codex/skills/runnable-symphony-issue/agents/openai.yaml b/.codex/skills/runnable-symphony-issue/agents/openai.yaml new file mode 100644 index 000000000..184b071c9 --- /dev/null +++ b/.codex/skills/runnable-symphony-issue/agents/openai.yaml @@ -0,0 +1,4 @@ +interface: + display_name: "Runnable Symphony Issue" + short_description: "Write Symphony-executable Linear issues for Runnable" + default_prompt: "Use $runnable-symphony-issue to draft or rewrite a Runnable Linear issue so it matches the current Symphony workflow in WORKFLOW.md, including clear scope, acceptance criteria, and validation." 
diff --git a/.codex/skills/runnable-symphony-issue/references/issue-templates.md b/.codex/skills/runnable-symphony-issue/references/issue-templates.md new file mode 100644 index 000000000..a36ad22b4 --- /dev/null +++ b/.codex/skills/runnable-symphony-issue/references/issue-templates.md @@ -0,0 +1,106 @@ +# Runnable Symphony Issue Templates + +Use these templates when the user wants a draft quickly. Adapt them to the specific task; do not leave placeholders vague. + +## 1. Code Fix + +```md +## Summary + +Fix in `` so that . + +## Problem + +- Current behavior: +- Evidence: +- Impact: + +## Scope + +- In scope: fix the behavior in `` +- In scope: add or update targeted coverage for the regression +- Out of scope: unrelated refactors or broader optimization + +## Acceptance Criteria + +- [ ] +- [ ] regression coverage exists for the failure mode + +## Validation + +- [ ] `` +- [ ] `` + +## Notes + +- Related paths: ``, `` +``` + +## 2. Experiment / Validation Run + +```md +## Summary + +Run a bounded experiment for and record the result in ``. + +## Problem + +- We currently do not know whether holds. +- Existing evidence: + +## Scope + +- In scope: run +- In scope: summarize metrics and interpretation in `` +- Out of scope: unrelated repair work unless required to complete the planned run + +## Acceptance Criteria + +- [ ] experiment completes or fails with a clearly documented blocker +- [ ] resulting metrics are written to `` +- [ ] summary states whether the hypothesis was supported + +## Validation + +- [ ] `` +- [ ] `test -f ` or equivalent artifact existence check +``` + +## 3. Documentation / Workflow Fix + +```md +## Summary + +Correct `` so it matches the current Runnable behavior for . 
+ +## Problem + +- Current documentation or workflow text is misleading in `` +- Evidence: + +## Scope + +- In scope: update the relevant docs or workflow text +- In scope: align example commands and expected artifacts +- Out of scope: changing runtime behavior unless required for correctness + +## Acceptance Criteria + +- [ ] the corrected doc/workflow matches current commands and paths +- [ ] stale or misleading guidance is removed or replaced + +## Validation + +- [ ] review `` against `` +- [ ] if commands are changed, run the command with a safe dry-run or equivalent proof +``` + +## 4. Issue Rewrite + +Use when the user already has a rough draft and wants it cleaned up: + +1. Keep the original intent. +2. Replace vague verbs with observable outcomes. +3. Move implementation details out of `Acceptance Criteria` and into `Scope` or `Notes`. +4. Add concrete `Validation` items. +5. Split the issue if it still contains more than one independently reviewable outcome. diff --git a/.codex/skills/runnable-symphony-issue/references/runnable-linear-context.md b/.codex/skills/runnable-symphony-issue/references/runnable-linear-context.md new file mode 100644 index 000000000..071d0060c --- /dev/null +++ b/.codex/skills/runnable-symphony-issue/references/runnable-linear-context.md @@ -0,0 +1,32 @@ +# Runnable Linear / Symphony Context + +Repo-specific defaults extracted from `WORKFLOW.md`: + +- Tracker kind: `linear` +- Tracker project slug: `runnable-e97c680b3b79` +- Active states: + - `Todo` + - `In Progress` + - `Human Review` + - `Merging` + - `Rework` +- Terminal states: + - `Closed` + - `Cancelled` + - `Canceled` + - `Duplicate` + - `Done` + +Issue creation defaults: + +- Use `Todo` for fresh work intended for Symphony execution. +- Use `Backlog` for follow-up or deferred work. +- Keep issue body concise and reviewer-oriented. +- Include explicit `Acceptance Criteria` and `Validation`. 
+ +If creating a follow-up from an active issue: + +- keep it in the same project, +- add `related` / `blockedBy` relations only when the available Linear tool exposes a verified relation operation, +- otherwise record the intended linkage explicitly in the issue body or workpad, +- do not expand the current ticket just because related work was discovered. diff --git a/WORKFLOW.md b/WORKFLOW.md new file mode 100644 index 000000000..2a6be6ed0 --- /dev/null +++ b/WORKFLOW.md @@ -0,0 +1,455 @@ +--- +tracker: + kind: linear + project_slug: "runnable-e97c680b3b79" + active_states: + - Todo + - In Progress + - Human Review + - Merging + - Rework + terminal_states: + - Closed + - Cancelled + - Canceled + - Duplicate + - Done +polling: + interval_ms: 30000 +workspace: + root: /hdd/code/runnable +hooks: + after_create: | + git clone --depth 1 git@github.com:GRIN2021/Runnable-Rewriting.git . + before_remove: null +agent: + max_concurrent_agents: 10 + max_turns: 6 +codex: + command: codex --dangerously-bypass-approvals-and-sandbox --config shell_environment_policy.inherit=all --config 'model="gpt-5.4"' --config model_reasoning_effort=xhigh app-server + approval_policy: never + thread_sandbox: danger-full-access + turn_sandbox_policy: + type: dangerFullAccess +--- + +You are working on a Linear ticket `{{ issue.identifier }}` + +{% if attempt %} +Continuation context: + +- This is retry attempt #{{ attempt }} because the ticket is still in an active state. +- Resume from the current workspace state instead of restarting from scratch. +- Do not repeat already-completed investigation or validation unless needed for new code changes. +- Do not end the turn while the issue remains in an active state unless you are blocked by missing required permissions/secrets. 
+ {% endif %} + +Issue context: +Identifier: {{ issue.identifier }} +Title: {{ issue.title }} +Current status: {{ issue.state }} +Labels: {{ issue.labels }} +URL: {{ issue.url }} + +Description: +{% if issue.description %} +{{ issue.description }} +{% else %} +No description provided. +{% endif %} + +Instructions: + +1. This is an unattended orchestration session. Never ask a human to perform follow-up actions. +2. Only stop early for a true blocker (missing required auth/permissions/secrets). If blocked, record it in the workpad and move the issue according to workflow. +3. Final message must report completed actions and blockers only. Do not include "next steps for user". + +Work only in the provided repository copy. Do not touch any other path. + +## Prerequisite: Linear MCP or `linear_graphql` tool is available + +The agent should be able to talk to Linear, either via a configured Linear MCP server or injected `linear_graphql` tool. If none are present, stop and ask the user to configure Linear. + +## Linear GraphQL Guardrails + +When using `linear_graphql`, follow these rules exactly: + +- Treat `{{ issue.id }}` as the canonical current issue ID. +- Treat `{{ issue.identifier }}` only as human-readable display text. Do not use `identifier` inside `IssueFilter`. +- Treat `tracker.project_slug` as a Linear `slugId`. When filtering by project, use `project: { slugId: { eq: ... } }`, not `slug`. +- Prefer `issue(id: $issueId)` over broad list queries whenever you are operating on the current ticket. +- Do not introspect the full schema. Use the fixed query and mutation templates below. + +Use these exact GraphQL patterns: + +Current issue and project/team context: + +```graphql +query CurrentIssue($issueId: String!) 
{ + issue(id: $issueId) { + id + identifier + title + state { + id + name + } + project { + id + name + slugId + } + team { + id + name + } + } +} +``` + +Find a state ID by name for the current issue's team: + +```graphql +query ResolveStateId($issueId: String!, $stateName: String!) { + issue(id: $issueId) { + team { + states(filter: { name: { eq: $stateName } }, first: 1) { + nodes { + id + name + } + } + } + } +} +``` + +Read current issue comments when searching for `## Codex Workpad`: + +```graphql +query IssueComments($issueId: String!) { + issue(id: $issueId) { + comments(first: 50) { + nodes { + id + body + createdAt + } + } + } +} +``` + +Move the current issue to a new state: + +```graphql +mutation UpdateIssueState($issueId: String!, $stateId: String!) { + issueUpdate(id: $issueId, input: { stateId: $stateId }) { + success + } +} +``` + +Create or update the persistent workpad comment: + +```graphql +mutation CreateComment($issueId: String!, $body: String!) { + commentCreate(input: { issueId: $issueId, body: $body }) { + success + } +} +``` + +Create follow-up issues in the same team/project: + +```graphql +mutation CreateIssue( + $teamId: String! + $projectId: String! + $stateId: String! + $title: String! + $description: String! +) { + issueCreate( + input: { + teamId: $teamId + projectId: $projectId + stateId: $stateId + title: $title + description: $description + } + ) { + success + issue { + id + identifier + url + } + } +} +``` + +Do not invent alternate filter fields or alternate mutation shapes when the templates above fit the task. + +No verified GraphQL mutation for issue relations is defined in this workflow. +If a follow-up should be linked as `related` or `blockedBy`, only create that +relation through a trusted Linear tool that already exposes a verified +operation; otherwise create the follow-up issue and record the intended linkage +explicitly in the issue body or workpad. 
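As a rough illustration of how the fixed templates above fit together, the sketch below builds a request body for `ResolveStateId` and pulls the state ID out of a response. Two things are assumptions rather than facts from this workflow: that the `linear_graphql` tool accepts the conventional `{"query": ..., "variables": ...}` body, and that its result is wrapped in the usual GraphQL `data` envelope. `build_request` and `extract_state_id` are hypothetical helper names.

```python
import json

# Copied from the ResolveStateId template in this workflow.
RESOLVE_STATE_ID = """
query ResolveStateId($issueId: String!, $stateName: String!) {
  issue(id: $issueId) {
    team {
      states(filter: { name: { eq: $stateName } }, first: 1) {
        nodes {
          id
          name
        }
      }
    }
  }
}
"""


def build_request(query: str, variables: dict) -> str:
    """Serialize a query plus variables (assumed request body shape)."""
    return json.dumps({"query": query, "variables": variables})


def extract_state_id(response: dict) -> str:
    """Return the first matching state ID, or fail loudly if none matched."""
    nodes = response["data"]["issue"]["team"]["states"]["nodes"]
    if not nodes:
        raise LookupError("no workflow state with that name on the issue's team")
    return nodes[0]["id"]


body = build_request(
    RESOLVE_STATE_ID,
    {"issueId": "issue-id-placeholder", "stateName": "In Progress"},
)

# Hand-written stand-in for a tool response, not real Linear data.
fake_response = {
    "data": {"issue": {"team": {"states": {"nodes": [
        {"id": "state-id-placeholder", "name": "In Progress"}
    ]}}}}
}
print(extract_state_id(fake_response))
```

The extracted ID would then be passed as `$stateId` to the `UpdateIssueState` template. Failing on an empty `nodes` list avoids sending a guessed state ID to the mutation.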
+ +## Default posture + +- Start by determining the ticket's current status, then follow the matching flow for that status. +- Start every task by opening the tracking workpad comment and bringing it up to date before doing new implementation work. +- Spend extra effort up front on planning and verification design before implementation. +- Reproduce first: always confirm the current behavior/issue signal before changing code so the fix target is explicit. +- Keep ticket metadata current (state, checklist, acceptance criteria, links). +- Treat a single persistent Linear comment as the source of truth for progress. +- Use that single workpad comment for all progress and handoff notes; do not post separate "done"/summary comments. +- Treat any ticket-authored `Validation`, `Test Plan`, or `Testing` section as non-negotiable acceptance input: mirror it in the workpad and execute it before considering the work complete. +- When meaningful out-of-scope improvements are discovered during execution, + file a separate Linear issue instead of expanding scope. The follow-up issue + must include a clear title, description, and acceptance criteria, be placed in + `Backlog`, and be assigned to the same project as the current issue. +- If the available Linear tool exposes a verified relation operation, link the + current issue as `related` and use `blockedBy` when the follow-up depends on + the current issue. +- Otherwise, record the intended `related` / `blockedBy` linkage explicitly in + the follow-up issue body or workpad and do not guess GraphQL relation schema. +- Move status only when the matching quality bar is met. +- Operate autonomously end-to-end unless blocked by missing requirements, secrets, or permissions. +- Use the blocked-access escape hatch only for true external blockers (missing required tools/auth) after exhausting documented fallbacks. + +## Related skills + +- `linear`: interact with Linear. 
+- `commit`: produce clean, logical commits during implementation. +- `push`: keep remote branch current and publish updates. +- `pull`: keep branch updated with latest `origin/main` before handoff. +- `land`: when ticket reaches `Merging`, explicitly open and follow `.codex/skills/land/SKILL.md`, which includes the `land` loop. + +## Status map + +- `Backlog` -> out of scope for this workflow; do not modify. +- `Todo` -> queued; immediately transition to `In Progress` before active work. + - Special case: if a PR is already attached, treat as feedback/rework loop (run full PR feedback sweep, address or explicitly push back, revalidate, return to `Human Review`). +- `In Progress` -> implementation actively underway. +- `Human Review` -> PR is attached and validated; waiting on human approval. +- `Merging` -> approved by human; execute the `land` skill flow (do not call `gh pr merge` directly). +- `Rework` -> reviewer requested changes; planning + implementation required. +- `Done` -> terminal state; no further action required. + +## Step 0: Determine current ticket state and route + +1. Fetch the issue by explicit ticket ID. +2. Read the current state. +3. Route to the matching flow: + - `Backlog` -> do not modify issue content/state; stop and wait for human to move it to `Todo`. + - `Todo` -> immediately move to `In Progress`, then ensure bootstrap workpad comment exists (create if missing), then start execution flow. + - If PR is already attached, start by reviewing all open PR comments and deciding required changes vs explicit pushback responses. + - `In Progress` -> continue execution flow from current scratchpad comment. + - `Human Review` -> wait and poll for decision/review updates. + - `Merging` -> on entry, open and follow `.codex/skills/land/SKILL.md`; do not call `gh pr merge` directly. + - `Rework` -> run rework flow. + - `Done` -> do nothing and shut down. +4. Check whether a PR already exists for the current branch and whether it is closed. 
+ - If a branch PR exists and is `CLOSED` or `MERGED`, treat prior branch work as non-reusable for this run. + - Create a fresh branch from `origin/main` and restart execution flow as a new attempt. +5. For `Todo` tickets, do startup sequencing in this exact order: + - `update_issue(..., state: "In Progress")` + - find/create `## Codex Workpad` bootstrap comment + - only then begin analysis/planning/implementation work. +6. Add a short comment if state and issue content are inconsistent, then proceed with the safest flow. + +## Step 1: Start/continue execution (Todo or In Progress) + +1. Find or create a single persistent scratchpad comment for the issue: + - Search existing comments for a marker header: `## Codex Workpad`. + - Ignore resolved comments while searching; only active/unresolved comments are eligible to be reused as the live workpad. + - If found, reuse that comment; do not create a new workpad comment. + - If not found, create one workpad comment and use it for all updates. + - Persist the workpad comment ID and only write progress updates to that ID. +2. If arriving from `Todo`, do not delay on additional status transitions: the issue should already be `In Progress` before this step begins. +3. Immediately reconcile the workpad before new edits: + - Check off items that are already done. + - Expand/fix the plan so it is comprehensive for current scope. + - Ensure `Acceptance Criteria` and `Validation` are current and still make sense for the task. +4. Start work by writing/updating a hierarchical plan in the workpad comment. +5. Ensure the workpad includes a compact environment stamp at the top as a code fence line: + - Format: `:@` + - Example: `devbox-01:/home/dev-user/code/symphony-workspaces/MT-32@7bdde33bc` + - Do not include metadata already inferable from Linear issue fields (`issue ID`, `status`, `branch`, `PR link`). +6. Add explicit acceptance criteria and TODOs in checklist form in the same comment. 
+ - If changes are user-facing, include a UI walkthrough acceptance criterion that describes the end-to-end user path to validate. + - If changes touch app files or app behavior, add explicit app-specific flow checks to `Acceptance Criteria` in the workpad (for example: launch path, changed interaction path, and expected result path). + - If the ticket description/comment context includes `Validation`, `Test Plan`, or `Testing` sections, copy those requirements into the workpad `Acceptance Criteria` and `Validation` sections as required checkboxes (no optional downgrade). +7. Run a principal-style self-review of the plan and refine it in the comment. +8. Before implementing, capture a concrete reproduction signal and record it in the workpad `Notes` section (command/output, screenshot, or deterministic UI behavior). +9. Run the `pull` skill to sync with latest `origin/main` before any code edits, then record the pull/sync result in the workpad `Notes`. + - Include a `pull skill evidence` note with: + - merge source(s), + - result (`clean` or `conflicts resolved`), + - resulting `HEAD` short SHA. +10. Compact context and proceed to execution. + +## PR feedback sweep protocol (required) + +When a ticket has an attached PR, run this protocol before moving to `Human Review`: + +1. Identify the PR number from issue links/attachments. +2. Gather feedback from all channels: + - Top-level PR comments (`gh pr view --comments`). + - Inline review comments (`gh api repos///pulls//comments`). + - Review summaries/states (`gh pr view --json reviews`). +3. Treat every actionable reviewer comment (human or bot), including inline review comments, as blocking until one of these is true: + - code/test/docs updated to address it, or + - explicit, justified pushback reply is posted on that thread. +4. Update the workpad plan/checklist to include each feedback item and its resolution status. +5. Re-run validation after feedback-driven changes and push updates. +6. 
Repeat this sweep until there are no outstanding actionable comments. + +## Blocked-access escape hatch (required behavior) + +Use this only when completion is blocked by missing required tools or missing auth/permissions that cannot be resolved in-session. + +- GitHub is **not** a valid blocker by default. Always try fallback strategies first (alternate remote/auth mode, then continue publish/review flow). +- Do not move to `Human Review` for GitHub access/auth until all fallback strategies have been attempted and documented in the workpad. +- If a non-GitHub required tool is missing, or required non-GitHub auth is unavailable, move the ticket to `Human Review` with a short blocker brief in the workpad that includes: + - what is missing, + - why it blocks required acceptance/validation, + - exact human action needed to unblock. +- Keep the brief concise and action-oriented; do not add extra top-level comments outside the workpad. + +## Step 2: Execution phase (Todo -> In Progress -> Human Review) + +1. Determine current repo state (`branch`, `git status`, `HEAD`) and verify the kickoff `pull` sync result is already recorded in the workpad before implementation continues. +2. If current issue state is `Todo`, move it to `In Progress`; otherwise leave the current state unchanged. +3. Load the existing workpad comment and treat it as the active execution checklist. + - Edit it liberally whenever reality changes (scope, risks, validation approach, discovered tasks). +4. Implement against the hierarchical TODOs and keep the comment current: + - Check off completed items. + - Add newly discovered items in the appropriate section. + - Keep parent/child structure intact as scope evolves. + - Update the workpad immediately after each meaningful milestone (for example: reproduction complete, code change landed, validation run, review feedback addressed). + - Never leave completed work unchecked in the plan. 
+ - For tickets that started as `Todo` with an attached PR, run the full PR feedback sweep protocol immediately after kickoff and before new feature work. +5. Run validation/tests required for the scope. + - Mandatory gate: execute all ticket-provided `Validation`/`Test Plan`/ `Testing` requirements when present; treat unmet items as incomplete work. + - Prefer a targeted proof that directly demonstrates the behavior you changed. + - You may make temporary local proof edits to validate assumptions (for example: tweak a local build input for `make`, or hardcode a UI account / response path) when this increases confidence. + - Revert every temporary proof edit before commit/push. + - Document these temporary proof steps and outcomes in the workpad `Validation`/`Notes` sections so reviewers can follow the evidence. + - If app-touching, run `launch-app` validation and capture/upload media via `github-pr-media` before handoff. +6. Re-check all acceptance criteria and close any gaps. +7. Before every `git push` attempt, run the required validation for your scope and confirm it passes; if it fails, address issues and rerun until green, then commit and push changes. +8. Attach PR URL to the issue (prefer attachment; use the workpad comment only if attachment is unavailable). + - Ensure the GitHub PR has label `symphony` (add it if missing). +9. Merge latest `origin/main` into branch, resolve conflicts, and rerun checks. +10. Update the workpad comment with final checklist status and validation notes. + - Mark completed plan/acceptance/validation checklist items as checked. + - Add final handoff notes (commit + validation summary) in the same workpad comment. + - Do not include PR URL in the workpad comment; keep PR linkage on the issue via attachment/link fields. + - Add a short `### Confusions` section at the bottom when any part of task execution was unclear/confusing, with concise bullets. + - Do not post any additional completion summary comment. +11. 
Before moving to `Human Review`, poll PR feedback and checks: + - Read the PR `Manual QA Plan` comment (when present) and use it to sharpen UI/runtime test coverage for the current change. + - Run the full PR feedback sweep protocol. + - Confirm PR checks are passing (green) after the latest changes. + - Confirm every required ticket-provided validation/test-plan item is explicitly marked complete in the workpad. + - Repeat this check-address-verify loop until no outstanding comments remain and checks are fully passing. + - Re-open and refresh the workpad before state transition so `Plan`, `Acceptance Criteria`, and `Validation` exactly match completed work. +12. Only then move issue to `Human Review`. + - Exception: if blocked by missing required non-GitHub tools/auth per the blocked-access escape hatch, move to `Human Review` with the blocker brief and explicit unblock actions. +13. For `Todo` tickets that already had a PR attached at kickoff: + - Ensure all existing PR feedback was reviewed and resolved, including inline review comments (code changes or explicit, justified pushback response). + - Ensure branch was pushed with any required updates. + - Then move to `Human Review`. + +## Step 3: Human Review and merge handling + +1. When the issue is in `Human Review`, do not code or change ticket content. +2. Poll for updates as needed, including GitHub PR review comments from humans and bots. +3. If review feedback requires changes, move the issue to `Rework` and follow the rework flow. +4. If approved, human moves the issue to `Merging`. +5. When the issue is in `Merging`, open and follow `.codex/skills/land/SKILL.md`, then run the `land` skill in a loop until the PR is merged. Do not call `gh pr merge` directly. +6. After merge is complete, move the issue to `Done`. + +## Step 4: Rework handling + +1. Treat `Rework` as a full approach reset, not incremental patching. +2. 
Re-read the full issue body and all human comments; explicitly identify what will be done differently this attempt. +3. Close the existing PR tied to the issue. +4. Remove the existing `## Codex Workpad` comment from the issue. +5. Create a fresh branch from `origin/main`. +6. Start over from the normal kickoff flow: + - If current issue state is `Todo`, move it to `In Progress`; otherwise keep the current state. + - Create a new bootstrap `## Codex Workpad` comment. + - Build a fresh plan/checklist and execute end-to-end. + +## Completion bar before Human Review + +- Step 1/2 checklist is fully complete and accurately reflected in the single workpad comment. +- Acceptance criteria and required ticket-provided validation items are complete. +- Validation/tests are green for the latest commit. +- PR feedback sweep is complete and no actionable comments remain. +- PR checks are green, branch is pushed, and PR is linked on the issue. +- Required PR metadata is present (`symphony` label). +- If app-touching, runtime validation/media requirements from `App runtime validation (required)` are complete. + +## Guardrails + +- If the branch PR is already closed/merged, do not reuse that branch or prior implementation state for continuation. +- For closed/merged branch PRs, create a new branch from `origin/main` and restart from reproduction/planning as if starting fresh. +- If issue state is `Backlog`, do not modify it; wait for human to move to `Todo`. +- Do not edit the issue body/description for planning or progress tracking. +- Use exactly one persistent workpad comment (`## Codex Workpad`) per issue. +- If comment editing is unavailable in-session, use the update script. Only report blocked if both MCP editing and script-based editing are unavailable. +- Temporary proof edits are allowed only for local verification and must be reverted before commit. 
+- If out-of-scope improvements are found, create a separate Backlog issue rather + than expanding current scope, and include a clear + title/description/acceptance criteria and same-project assignment. +- Add `related` / `blockedBy` relations in-session only when the available + Linear tool exposes a verified relation operation. +- Otherwise, record the intended linkage explicitly in the follow-up issue body + or workpad instead of guessing relation mutation syntax. +- Do not move to `Human Review` unless the `Completion bar before Human Review` is satisfied. +- In `Human Review`, do not make changes; wait and poll. +- If state is terminal (`Done`), do nothing and shut down. +- Keep issue text concise, specific, and reviewer-oriented. +- If blocked and no workpad exists yet, add one blocker comment describing blocker, impact, and next unblock action. + +## Workpad template + +Use this exact structure for the persistent workpad comment and keep it updated in place throughout execution: + +````md +## Codex Workpad + +```text +:@ +``` + +### Plan + +- [ ] 1\. Parent task + - [ ] 1.1 Child task + - [ ] 1.2 Child task +- [ ] 2\. 
Parent task + +### Acceptance Criteria + +- [ ] Criterion 1 +- [ ] Criterion 2 + +### Validation + +- [ ] targeted tests: `` + +### Notes + +- + +### Confusions + +- +```` diff --git a/docs/exp/2026-05-10-libcrypto-canonical-eval-contract.md b/docs/exp/2026-05-10-libcrypto-canonical-eval-contract.md new file mode 100644 index 000000000..7d109d729 --- /dev/null +++ b/docs/exp/2026-05-10-libcrypto-canonical-eval-contract.md @@ -0,0 +1,326 @@ +# 2026-05-10 libcrypto canonical eval contract + +## Goal + +Fix the evaluation-contract mismatch for `libcrypto` in `SYM-20`, and record separately the current repo-local authoritative contract, the historical non-canonical paths, and the results that can be verified directly in this workspace. + +## Inputs + +- Canonical binary: + - `archives/groundtruth/libcrypto_groudtruth_20260428/libcrypto.so.3` +- Canonical GT: + - `archives/groundtruth/libcrypto_groudtruth_20260428/libcrypto.gtBlock.pb` +- Historical old-binary baseline: + - `archives/experiments/libcrypto_master_test_20260414/libcrypto.so.3` + - `runs/validate-libcrypto-llm-agent-current/baseline/cmp.repaired.json` + - `runs/validate-libcrypto-llm-agent-current/baseline/summary.json` +- Historical SYM-19 note: + - `/hdd/code/runnable/SYM-19/docs/exp/2026-05-02-sym-19-libcrypto-eval.md` +- Canonical new-GT serial run: + - `runs/runnable-dev-2026-0429-newgt-serial/libcrypto.so.3` + - `runs/runnable-dev-2026-0429-newgt-serial/libcrypto.so.3.entry_0x500cf000.ll` + +## Environment + +- Workspace root: `/home/iskindar/Project/runnable` +- Canonical path resolver: + - `scripts/libcrypto_bench_paths.py` +- Canonical GT audit / compare wrapper: + - `scripts/validate_libcrypto_ground_truth.py` +- Fresh compare wrapper: + - `.codex/skills/runnable-cmp-eval/scripts/run_cmp_eval.py` + +## Commands + +Resolve canonical assets: + +```bash +python3 scripts/libcrypto_bench_paths.py binary --must-exist +python3 scripts/libcrypto_bench_paths.py groundtruth-pb --must-exist +``` + +Gap-audit canonical GT coverage: + +```bash +python3 scripts/validate_libcrypto_ground_truth.py gap-audit \ + --out-dir
runs/groundtruth_validation/canonical_gap_audit +``` + +Show that the old baseline `.ll` is not comparable to canonical GT: + +```bash +python3 scripts/validate_libcrypto_ground_truth.py cmp \ + --ll runs/validate-libcrypto-llm-agent-current/docker_exec/baseline/libcrypto.so.3.entry_0x500cf000.ll \ + --out-dir runs/groundtruth_validation/canonical_baseline_cmp \ + --examples 10 +``` + +Re-run a comparable canonical baseline using the repo-local `newgt-serial` artifact: + +```bash +python3 scripts/validate_libcrypto_ground_truth.py cmp \ + --binary runs/runnable-dev-2026-0429-newgt-serial/libcrypto.so.3 \ + --groundtruth archives/groundtruth/libcrypto_groudtruth_20260428/libcrypto.gtBlock.pb \ + --ll runs/runnable-dev-2026-0429-newgt-serial/libcrypto.so.3.entry_0x500cf000.ll \ + --out-dir runs/groundtruth_validation/newgt_serial_cmp \ + --examples 10 +``` + +Binary identity checks: + +```bash +sha256sum \ + /hdd/code/runnable/SYM-19/test/openssl_data/libcrypto.so.3 \ + archives/experiments/libcrypto_master_test_20260414/libcrypto.so.3 \ + archives/groundtruth/libcrypto_groudtruth_20260428/libcrypto.so.3 + +readelf -WS archives/experiments/libcrypto_master_test_20260414/libcrypto.so.3 +readelf -WS archives/groundtruth/libcrypto_groudtruth_20260428/libcrypto.so.3 +``` + +## Artifacts + +- Canonical GT gap audit: + - `runs/groundtruth_validation/canonical_gap_audit/gap.summary.json` + - `runs/groundtruth_validation/canonical_gap_audit/gap.summary.txt` +- Canonical compare against old baseline `.ll`: + - `runs/groundtruth_validation/canonical_baseline_cmp/cmp.json` + - `runs/groundtruth_validation/canonical_baseline_cmp/cmp.verdict.txt` +- Canonical comparable serial baseline: + - `runs/groundtruth_validation/newgt_serial_cmp/cmp.json` + - `runs/groundtruth_validation/newgt_serial_cmp/cmp.txt` + - `runs/groundtruth_validation/newgt_serial_cmp/cmp.verdict.txt` +- Canonical textstart baseline: + - `runs/groundtruth_validation/textstart_serial_cmp/cmp.json` + - 
`runs/groundtruth_validation/textstart_serial_cmp/cmp.txt` + - `runs/groundtruth_validation/textstart_serial_cmp/cmp.verdict.txt` +- Canonical textstart optimized: + - `runs/groundtruth_validation/textstart_dynamic_cmp/cmp.json` + - `runs/groundtruth_validation/textstart_dynamic_cmp/cmp.verdict.txt` +- Historical old-binary baseline: + - `runs/validate-libcrypto-llm-agent-current/baseline/cmp.repaired.json` + - `runs/validate-libcrypto-llm-agent-current/baseline/summary.json` + +## Results + +### Canonical authoritative contract + +- Binary: + - `archives/groundtruth/libcrypto_groudtruth_20260428/libcrypto.so.3` + - sha256 `2d4faaa94bb53b5f92a7d8d0b581eea1ad0446c30f34a8ddf4baef713f744d04` +- GT: + - `archives/groundtruth/libcrypto_groudtruth_20260428/libcrypto.gtBlock.pb` +- Address mapping: + - ELF `.text` start: `0xcef80` + - Runnable base: `0x50000000` +- Compare path: + - fresh `runnable-cmp-eval` + - do not use cached `.result` + - do not use `coverage_sidecar_union_threadpool16` as final authority + +### Canonical GT coverage caveat + +`gap-audit` shows the current `gtBlock.pb` bundle is not a byte-for-byte mirror of `objdump -d -j .text`: + +- `objdump_real_instruction_count=707311` +- `groundtruth_instruction_count=670703` +- `unseen_instruction_count=36610` +- `unseen_ratio_over_groundtruth=0.05458451803555374` +- `instruction_category_counts.outside_gt_coverage=35257` +- `instruction_category_counts.padding=1353` + +This caveat must be reported with canonical results instead of assuming GT and objdump are identical. + +### Why the SYM-19 `recall=0.445502995810004` result is not directly comparable + +`SYM-19` used a different binary, different GT representation, different address contract, and different evaluator: + +1. 
Binary generation mismatch: + - `/hdd/code/runnable/SYM-19/test/openssl_data/libcrypto.so.3` + - `archives/experiments/libcrypto_master_test_20260414/libcrypto.so.3` + - both share sha256 `932923d4498c83f75c60a7a02404267297df9f7a2844ed6f3f8d154041b558db` + - canonical `2026-04-28` binary has sha256 `2d4faaa94bb53b5f92a7d8d0b581eea1ad0446c30f34a8ddf4baef713f744d04` +2. `.text` layout mismatch: + - old binary `.text` starts at `0xcf000` + - canonical binary `.text` starts at `0xcef80` +3. GT representation mismatch: + - `SYM-19`: CSV GT bundle under `out/sym-19/gt/ground_truth.csv` + - canonical: `gtBlock.pb` +4. Evaluator mismatch: + - `SYM-19`: `coverage_sidecar_union_threadpool16` + - canonical: fresh `runnable-cmp-eval` +5. Address-domain mismatch: + - `SYM-19`: `rebase_base=0x50400000` + - canonical compare: `runnable_base=0x50000000` + +So `SYM-19`’s: + +- baseline `precision=0.9885320776927604`, `recall=0.445502995810004` +- optimized `precision=0.9885531253892196`, `recall=0.4463316579715496` + +must be labeled `historical sidecar-union / non-canonical`, not compared numerically against canonical `gtBlock.pb` results. + +### Historical old-binary baseline under the old contract + +From `runs/validate-libcrypto-llm-agent-current/baseline/cmp.repaired.json`: + +- binary: old `2026-04-14` binary +- `.text` start: `0xcf000` +- `runnable_base=0x50000000` +- `obj_count=932388` +- `ll_count=847589` +- `HIT=787564` +- `MISMATCH=1414` +- `OBJ_ONLY=143410` +- `LL_ONLY=58611` +- `FALSE_NEGATIVE=144824` +- `FALSE_POSITIVE=60025` +- `precision=0.929181478287236` +- `recall=0.8446741056298451` + +This is a valid old-binary compare, but not the `SYM-20` canonical authority. 
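As a sanity check on how the reported counts fit together, the sketch below recomputes `FALSE_NEGATIVE`, `FALSE_POSITIVE`, `precision`, and `recall` from the old-binary baseline counts listed above. The relationships it uses (`FALSE_NEGATIVE = MISMATCH + OBJ_ONLY`, `FALSE_POSITIVE = MISMATCH + LL_ONLY`, `precision = HIT / (HIT + FALSE_POSITIVE)`, `recall = HIT / (HIT + FALSE_NEGATIVE)`) are inferred from the numbers in this note, not taken from the evaluator source, and the helper name `precision_recall` is hypothetical.

```python
from __future__ import annotations


def precision_recall(hit: int, false_positive: int, false_negative: int) -> tuple[float, float]:
    """Assumed definitions, inferred from the reported counts (not from the evaluator)."""
    precision = hit / (hit + false_positive) if hit + false_positive else 0.0
    recall = hit / (hit + false_negative) if hit + false_negative else 0.0
    return precision, recall


# Counts copied from the historical old-binary baseline listed above.
counts = {
    "obj_count": 932388,
    "ll_count": 847589,
    "HIT": 787564,
    "MISMATCH": 1414,
    "OBJ_ONLY": 143410,
    "LL_ONLY": 58611,
}

# Assumption: a MISMATCH counts against both precision and recall.
false_negative = counts["MISMATCH"] + counts["OBJ_ONLY"]
false_positive = counts["MISMATCH"] + counts["LL_ONLY"]

precision, recall = precision_recall(counts["HIT"], false_positive, false_negative)

print(f"FALSE_NEGATIVE={false_negative}")
print(f"FALSE_POSITIVE={false_positive}")
print(f"precision={precision}")
print(f"recall={recall}")
```

Under these assumed definitions, `HIT + FALSE_POSITIVE` equals `ll_count` and `HIT + FALSE_NEGATIVE` equals `obj_count`, so the same check can be applied to any other `cmp.json` in this note to confirm that its counts and ratios are internally consistent.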
+ +### Proof that the old baseline `.ll` fails under the canonical contract + +Comparing the old baseline `.ll` against canonical `2026-04-28` GT produces a catastrophic mismatch: + +- `obj_count=679506` +- `ll_count=847618` +- `HIT=24175` +- `MISMATCH=112631` +- `OBJ_ONLY=542700` +- `LL_ONLY=710812` +- `FALSE_NEGATIVE=655331` +- `FALSE_POSITIVE=823443` +- `precision=0.028521102666531385` +- `recall=0.03557731646225346` + +This is the direct proof that “old `.ll` + canonical GT” is not a legitimate comparison target. + +### Corrected comparable baseline available in the current workspace + +The current workspace does contain one directly comparable canonical baseline artifact: + +- binary: `runs/runnable-dev-2026-0429-newgt-serial/libcrypto.so.3` +- ll: `runs/runnable-dev-2026-0429-newgt-serial/libcrypto.so.3.entry_0x500cf000.ll` +- binary sha256 matches canonical `2026-04-28` GT binary + +Fresh canonical compare result: + +- `obj_count=679506` +- `ll_count=593206` +- `HIT=532849` +- `MISMATCH=1466` +- `OBJ_ONLY=145191` +- `LL_ONLY=58891` +- `FALSE_NEGATIVE=146657` +- `FALSE_POSITIVE=60357` +- `precision=0.8982528834839837` +- `recall=0.7841711478633007` + +### Corrected comparable baseline / optimized pair for SYM-20 + +The current workspace also contains one full corrected pair that satisfies the ticket's comparability rule: + +- same canonical binary: + - `archives/groundtruth/libcrypto_groudtruth_20260428/libcrypto.so.3` +- same canonical GT: + - `archives/groundtruth/libcrypto_groudtruth_20260428/libcrypto.gtBlock.pb` +- same address mapping: + - `.text_start=0xcef80` + - `runnable_base=0x50000000` +- same compare path: + - fresh `runnable-cmp-eval` through `scripts/validate_libcrypto_ground_truth.py cmp` +- same artifact contract: + - `cmp.json` + - `cmp.txt` + - `cmp.verdict.txt` + +Baseline: + +- ll: `runs/runnable-dev-2026-0429-textstart-serial/libcrypto.so.3.entry_0x500cef80.ll` +- `obj_count=679506` +- `ll_count=554051` +- `HIT=527641` +- `MISMATCH=1044` 
+- `OBJ_ONLY=150821` +- `LL_ONLY=25366` +- `FALSE_NEGATIVE=151865` +- `FALSE_POSITIVE=26410` +- `precision=0.9523329079813952` +- `recall=0.7765067563788988` + +Optimized: + +- ll: `runs/runnable-dev-2026-0501-textstart-dynamic-reapfix/libcrypto.so.3.entry_0x500cef80.dynamic.ll` +- `obj_count=679506` +- `ll_count=569562` +- `HIT=529848` +- `MISMATCH=1217` +- `OBJ_ONLY=148441` +- `LL_ONLY=38497` +- `FALSE_NEGATIVE=149658` +- `FALSE_POSITIVE=39714` +- `precision=0.9302727358917905` +- `recall=0.7797547041527227` + +Delta (`optimized - baseline`): + +- `HIT=+2207` +- `MISMATCH=+173` +- `OBJ_ONLY=-2380` +- `LL_ONLY=+13131` +- `FALSE_NEGATIVE=-2207` +- `FALSE_POSITIVE=+13304` +- `precision=-0.022060172089604757` +- `recall=+0.003247947773823978` + +Verdict under the corrected canonical contract: + +- recall `improved` +- precision `regressed` +- overall verdict: `mixed`, not a clean improvement + +### Remaining evidence gap + +I still could not find the raw artifact behind the older historical claim: + +- accepted baseline `precision=0.960596`, `recall=0.845340` +- historical llm `precision=0.966030`, `recall=0.841228` + +The historical “accepted baseline `0.960596 / 0.845340` and llm `0.966030 / 0.841228`” numbers currently appear in OpenSpec design text, but no raw compare artifact for them was found in this workspace or the scanned `/hdd/code/runnable/SYM-20` tree. 
+ +So the truthful current state is: + +- canonical comparable baseline: reproduced +- canonical comparable optimized: reproduced +- corrected comparable verdict: reproduced +- historical non-canonical sidecar-union metrics: explained and fenced off +- older accepted 0.960596 / 0.845340 claim: still lacks raw artifact provenance in the current workspace + +## Conclusion + +`SYM-20`’s main bug is now explicit: + +- historical `SYM-19` numbers were reported from a different binary generation and a different evaluator contract +- current repo scripts also carried forward a stale `0xcf000` default that belongs to the old binary generation, not the canonical `2026-04-28` GT bundle + +The repo-local canonical contract is: + +- canonical `2026-04-28` binary +- canonical `gtBlock.pb` +- ELF-derived `.text` start +- `runnable_base=0x50000000` +- fresh `runnable-cmp-eval` + +Under that corrected contract, the currently reproducible pair in this workspace is: + +- baseline: `precision=0.9523329079813952`, `recall=0.7765067563788988` +- optimized: `precision=0.9302727358917905`, `recall=0.7797547041527227` +- verdict: recall improves slightly, precision regresses materially, so the result is `mixed` + +## Next Step + +1. If the historical accepted `0.960596 / 0.845340` pair still matters, recover its raw compare artifact or regenerate it under a documented run directory +2. Decide whether the repo should standardize on the `textstart-*` pair above or on a separately recovered accepted baseline lineage +3. Keep future libcrypto claims on the canonical skill + `validate_libcrypto_ground_truth.py cmp` path only diff --git a/docs/exp/topics.md b/docs/exp/topics.md new file mode 100644 index 000000000..ab6abc40a --- /dev/null +++ b/docs/exp/topics.md @@ -0,0 +1,41 @@ +# Experiment Topics + +This page groups experiment-related documents by topic so that new runs can be found without knowing the exact date-based filename. 
+ +## libcrypto + +Use this group for `libcrypto.so.3` lift, agent/classifier, and policy experiments. + +- [2026-04-22 libcrypto boundary audit](2026-04-22-libcrypto-boundary-audit.md) + - normalizes current illegal-entry candidates and emits deterministic boundary actions +- [2026-04-22 libcrypto old classifier validation](2026-04-22-libcrypto-old-classifier-validation.md) + - validates old `raw` and old `agent` classifiers on current `libcrypto.so.3` + - includes wrong-proxy failure and fixed-proxy rerun +- [2026-05-10 libcrypto canonical eval contract](2026-05-10-libcrypto-canonical-eval-contract.md) + - fixes the SYM-20 evaluation contract and separates canonical GT compare from historical sidecar-union metrics +- [../design/libcrypto/README.md](../design/libcrypto/README.md) + - method/design navigation for the `libcrypto` track +- [../reference/RUNNABLE_CORE_AND_METRICS.md](../reference/RUNNABLE_CORE_AND_METRICS.md) + - background on how `--llm-policy` fits into Runnable + +## Rewrite Validation + +Use this group for serial vs parallel rewrite checks and behavior-validation summaries. + +- [../results/2026-04-22-coreutils-serial-parallel-validation.md](../results/2026-04-22-coreutils-serial-parallel-validation.md) + - serial/parallel behavior comparison on `coreutils` + +## GT / Embedded Data + +Use this group for ground-truth methodology and embedded-data interpretation. + +- [../reference/README-embedata.md](../reference/README-embedata.md) + - concrete embedded-data examples +- [../design/2026-04-02-evaluation-wrapup-design.md](../design/2026-04-02-evaluation-wrapup-design.md) + - GT repair and stripped-OpenSSL design context + +## Notes + +- A single experiment can appear under multiple topics. +- Date-based experiment files should stay under `docs/exp/`. +- Broader validation summaries that are more like stable result notes can remain under `docs/results/` and still be linked here. 
diff --git a/scripts/libcrypto_bench_paths.py b/scripts/libcrypto_bench_paths.py new file mode 100644 index 000000000..30b0b28ff --- /dev/null +++ b/scripts/libcrypto_bench_paths.py @@ -0,0 +1,173 @@ +#!/usr/bin/env python3 +"""Resolve canonical libcrypto benchmark assets from repo-local or external roots.""" + +from __future__ import annotations + +import argparse +import os +import re +import subprocess +import sys +from pathlib import Path +from typing import Iterable, List + + +ROOT_DIR = Path(__file__).resolve().parents[1] +BENCH_ROOT_ENV = "RUNNABLE_LIBCRYPTO_BENCH_ROOT" +GROUND_TRUTH_ENV = "RUNNABLE_LIBCRYPTO_GROUND_TRUTH" +CANONICAL_GT_DIRNAME = "libcrypto_groudtruth_20260428" +LEGACY_GT_DIRNAME = "libcrypto_master_test_20260414" +CANONICAL_GT_BINARY_REL = ( + Path("archives") / "groundtruth" / CANONICAL_GT_DIRNAME / "libcrypto.so.3" +) +LEGACY_GT_BINARY_REL = ( + Path("archives") / "experiments" / LEGACY_GT_DIRNAME / "libcrypto.so.3" +) +READELF_TEXT_RE = re.compile(r"^\s*\[\s*\d+\]\s+(\S+)\s+\S+\s+([0-9a-fA-F]+)\s") + + +def _unique_paths(paths: Iterable[Path]) -> List[Path]: + unique: List[Path] = [] + seen: set[str] = set() + for path in paths: + key = os.path.normpath(str(path.expanduser())) + if key in seen: + continue + seen.add(key) + unique.append(Path(key)) + return unique + + +def configured_bench_roots(repo_root: Path = ROOT_DIR) -> List[Path]: + candidates: List[Path] = [] + env_root = os.environ.get(BENCH_ROOT_ENV) + if env_root: + candidates.append(Path(env_root).expanduser()) + candidates.append(repo_root.expanduser()) + return _unique_paths(candidates) + + +def _binary_candidates_from_root(root: Path) -> List[Path]: + root = root.expanduser() + if root.name == "libcrypto.so.3" or root.is_file(): + return [root] + + candidates = [ + root / CANONICAL_GT_BINARY_REL, + root / LEGACY_GT_BINARY_REL, + root / CANONICAL_GT_DIRNAME / "libcrypto.so.3", + root / LEGACY_GT_DIRNAME / "libcrypto.so.3", + root / "libcrypto.so.3", + ] + if root.name == 
"groundtruth": + candidates.insert(0, root / CANONICAL_GT_DIRNAME / "libcrypto.so.3") + if root.name == "experiments": + candidates.insert(0, root / LEGACY_GT_DIRNAME / "libcrypto.so.3") + return _unique_paths(candidates) + + +def binary_candidates(repo_root: Path = ROOT_DIR) -> List[Path]: + exact = os.environ.get(GROUND_TRUTH_ENV) + candidates: List[Path] = [] + if exact: + candidates.append(Path(exact).expanduser()) + for root in configured_bench_roots(repo_root): + candidates.extend(_binary_candidates_from_root(root)) + return _unique_paths(candidates) + + +def default_ground_truth_binary(repo_root: Path = ROOT_DIR) -> Path: + candidates = binary_candidates(repo_root) + for candidate in candidates: + if candidate.exists(): + return candidate.resolve() + return candidates[0] + + +def bench_root_for_binary(binary: Path, repo_root: Path = ROOT_DIR) -> Path: + resolved_binary = binary.resolve() + for root in configured_bench_roots(repo_root): + expanded_root = root.expanduser() + if expanded_root.is_file(): + if expanded_root.resolve() == resolved_binary: + return expanded_root.parent.resolve() + continue + try: + resolved_binary.relative_to(expanded_root.resolve()) + return expanded_root.resolve() + except ValueError: + continue + return resolved_binary.parent.resolve() + + +def _groundtruth_pb_candidates(binary: Path) -> List[Path]: + candidates: List[Path] = [] + if ".so" in binary.name: + base = binary.name.split(".so", 1)[0] + candidates.append(binary.with_name(f"{base}.gtBlock.pb")) + candidates.append(Path(str(binary) + ".gtBlock.pb")) + candidates.append(binary.with_name(f"{binary.name}.gtBlock.pb")) + return _unique_paths(candidates) + + +def default_groundtruth_pb(repo_root: Path = ROOT_DIR) -> Path: + binary = default_ground_truth_binary(repo_root) + candidates = _groundtruth_pb_candidates(binary) + for candidate in candidates: + if candidate.exists(): + return candidate.resolve() + return candidates[0] + + +def 
parse_text_start_from_readelf_output(output: str) -> int: + for line in output.splitlines(): + match = READELF_TEXT_RE.match(line) + if match and match.group(1) == ".text": + return int(match.group(2), 16) + raise RuntimeError("cannot detect .text start from readelf output") + + +def detect_text_start(binary: Path) -> int: + result = subprocess.run( + ["readelf", "-WS", str(binary)], + check=True, + capture_output=True, + text=True, + ) + return parse_text_start_from_readelf_output(result.stdout) + + +def build_parser() -> argparse.ArgumentParser: + parser = argparse.ArgumentParser( + description="Resolve canonical libcrypto benchmark asset paths." + ) + parser.add_argument( + "kind", + choices=["bench-root", "binary", "groundtruth-pb", "text-start"], + ) + parser.add_argument("--repo-root", type=Path, default=ROOT_DIR) + parser.add_argument("--must-exist", action="store_true") + return parser + + +def main(argv: list[str] | None = None) -> int: + args = build_parser().parse_args(argv) + repo_root = args.repo_root.resolve() + if args.kind == "bench-root": + value = configured_bench_roots(repo_root)[0] + elif args.kind == "binary": + value = default_ground_truth_binary(repo_root) + elif args.kind == "groundtruth-pb": + value = default_groundtruth_pb(repo_root) + else: + value = hex(detect_text_start(default_ground_truth_binary(repo_root))) + + if args.kind != "text-start" and args.must_exist and not value.exists(): + print(str(value), file=sys.stderr) + return 2 + print(str(value)) + return 0 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/scripts/validate_libcrypto_ground_truth.py b/scripts/validate_libcrypto_ground_truth.py new file mode 100644 index 000000000..9492a2cef --- /dev/null +++ b/scripts/validate_libcrypto_ground_truth.py @@ -0,0 +1,533 @@ +#!/usr/bin/env python3 +"""Validate the refreshed libcrypto ground truth and compare lift outputs safely.""" + +from __future__ import annotations + +import argparse +import bisect +import collections 
+import importlib.util +import json +import re +import subprocess +import sys +from dataclasses import dataclass +from pathlib import Path +from typing import Dict, Iterable, List, Sequence, Tuple + +from libcrypto_bench_paths import default_ground_truth_binary, default_groundtruth_pb + + +ROOT_DIR = Path(__file__).resolve().parents[1] +SCRIPT_DIR = Path(__file__).resolve().parent +DEFAULT_BINARY = default_ground_truth_binary(ROOT_DIR) +DEFAULT_GROUNDTRUTH = default_groundtruth_pb(ROOT_DIR) +DEFAULT_RUN_CMP_EVAL = ( + ROOT_DIR / ".codex" / "skills" / "runnable-cmp-eval" / "scripts" / "run_cmp_eval.py" +) +DEFAULT_VENDOR_BLOCKS_PB2 = SCRIPT_DIR / "_vendor" / "blocks_pb2.py" +DEFAULT_OUT_DIR = ROOT_DIR / "runs" / "groundtruth_validation" +DEFAULT_RUNNABLE_BASE = 0x50000000 +DEFAULT_MIN_PRECISION = 0.80 +DEFAULT_MIN_RECALL = 0.80 +READELF_TEXT_RE = re.compile(r"^\s*\[\s*\d+\]\s+(\S+)\s+\S+\s+([0-9a-fA-F]+)\s") +OBJDUMP_INST_RE = re.compile(r"^\s*([0-9a-fA-F]+):\s+((?:[0-9a-fA-F]{2}\s)+)\s*(.*)$") + + +@dataclass +class CmpVerdict: + ok: bool + reasons: List[str] + + +def parse_int(value: str) -> int: + return int(value, 0) + + +def ensure_file(path: Path, label: str) -> None: + if not path.is_file(): + raise FileNotFoundError(f"{label} not found: {path}") + + +def ensure_dir(path: Path) -> None: + path.mkdir(parents=True, exist_ok=True) + + +def run_cmd(cmd: Sequence[str], *, check: bool = True) -> subprocess.CompletedProcess: + return subprocess.run(cmd, check=check, text=True, capture_output=True) + + +def read_json(path: Path) -> Dict[str, object]: + return json.loads(path.read_text(encoding="utf-8")) + + +def write_json(path: Path, payload: Dict[str, object]) -> None: + path.write_text(json.dumps(payload, indent=2) + "\n", encoding="utf-8") + + +def write_text(path: Path, lines: Iterable[str]) -> None: + path.write_text("\n".join(lines) + "\n", encoding="utf-8") + + +def detect_text_start(binary: Path) -> int: + out = run_cmd(["readelf", "-WS", 
str(binary)]).stdout + for line in out.splitlines(): + match = READELF_TEXT_RE.match(line) + if match and match.group(1) == ".text": + return int(match.group(2), 16) + raise RuntimeError(f"cannot detect .text start for {binary}") + + +def resolve_default_groundtruth_path(binary: Path) -> Path: + candidates: List[Path] = [] + if ".so" in binary.name: + base = binary.name.split(".so", 1)[0] + candidates.append(binary.with_name(f"{base}.gtBlock.pb")) + candidates.append(Path(str(binary) + ".gtBlock.pb")) + candidates.append(binary.with_name(f"{binary.name}.gtBlock.pb")) + + seen = set() + for candidate in candidates: + if candidate in seen: + continue + seen.add(candidate) + if candidate.exists(): + return candidate + raise FileNotFoundError( + f"could not infer groundtruth protobuf next to {binary}; tried: " + + ", ".join(str(path) for path in candidates) + ) + + +def load_blocks_pb2(path: Path): + spec = importlib.util.spec_from_file_location("groundtruth_blocks_pb2", path) + if spec is None or spec.loader is None: + raise RuntimeError(f"cannot load protobuf module from {path}") + module = importlib.util.module_from_spec(spec) + spec.loader.exec_module(module) + return module + + +def merge_ranges(ranges: Iterable[Tuple[int, int]]) -> List[Tuple[int, int]]: + sorted_ranges = sorted(ranges) + merged: List[List[int]] = [] + for start, end in sorted_ranges: + if not merged or start > merged[-1][1]: + merged.append([start, end]) + continue + merged[-1][1] = max(merged[-1][1], end) + return [(start, end) for start, end in merged] + + +def in_ranges(addr: int, ranges: Sequence[Tuple[int, int]], starts: Sequence[int]) -> bool: + idx = bisect.bisect_right(starts, addr) - 1 + if idx < 0: + return False + start, end = ranges[idx] + return start <= addr < end + + +def parse_groundtruth(gt_path: Path, blocks_pb2_path: Path) -> Tuple[set[int], List[Tuple[int, int]], List[Tuple[int, int]]]: + blocks_pb2 = load_blocks_pb2(blocks_pb2_path) + module = blocks_pb2.module() + 
module.ParseFromString(gt_path.read_bytes()) + + gt_inst_addrs: set[int] = set() + covered_ranges: List[Tuple[int, int]] = [] + padding_ranges: List[Tuple[int, int]] = [] + for func in module.fuc: + for bb in func.bb: + for inst in bb.instructions: + gt_inst_addrs.add(int(inst.va)) + va = int(bb.va) + size = int(bb.size) + padding = int(bb.padding) + covered_end = va + size - padding + if covered_end > va: + covered_ranges.append((va, covered_end)) + if padding > 0: + pad_start = covered_end + pad_end = va + size + if pad_end > pad_start: + padding_ranges.append((pad_start, pad_end)) + + return gt_inst_addrs, merge_ranges(covered_ranges), merge_ranges(padding_ranges) + + +def parse_objdump_instructions(binary: Path) -> List[Tuple[int, int, str]]: + out = run_cmd(["objdump", "-d", "-j", ".text", str(binary)]).stdout + instructions: List[Tuple[int, int, str]] = [] + for line in out.splitlines(): + match = OBJDUMP_INST_RE.match(line) + if not match: + continue + addr = int(match.group(1), 16) + bytes_field = match.group(2).strip() + asm = match.group(3).strip() + if not asm: + continue + size = len(bytes_field.split()) + instructions.append((addr, size, asm)) + return instructions + + +def merge_unseen_ranges(unseen: Sequence[Tuple[int, int, str]]) -> List[Tuple[int, int, int, List[str]]]: + if not unseen: + return [] + ranges: List[Tuple[int, int, int, List[str]]] = [] + start = unseen[0][0] + end = unseen[0][0] + unseen[0][1] + count = 1 + sample = [unseen[0][2]] + for addr, size, asm in unseen[1:]: + if addr == end: + end = addr + size + count += 1 + if len(sample) < 3: + sample.append(asm) + continue + ranges.append((start, end, count, sample)) + start = addr + end = addr + size + count = 1 + sample = [asm] + ranges.append((start, end, count, sample)) + return ranges + + +def analyze_groundtruth_gap( + binary: Path, + groundtruth: Path, + blocks_pb2_path: Path, +) -> Dict[str, object]: + gt_inst_addrs, covered_ranges, padding_ranges = parse_groundtruth( + 
groundtruth, blocks_pb2_path + ) + covered_starts = [start for start, _ in covered_ranges] + padding_starts = [start for start, _ in padding_ranges] + + obj_insts = parse_objdump_instructions(binary) + unseen = [ + (addr, size, asm) for addr, size, asm in obj_insts if addr not in gt_inst_addrs + ] + unseen_ranges = merge_unseen_ranges(unseen) + + instruction_category_counts = collections.Counter() + range_category_counts = collections.Counter() + instruction_examples = {"padding": [], "outside_gt_coverage": []} + range_examples = {"padding": [], "outside_gt_coverage": []} + + for addr, size, asm in unseen: + category = ( + "padding" + if in_ranges(addr, padding_ranges, padding_starts) + else "outside_gt_coverage" + ) + instruction_category_counts[category] += 1 + if len(instruction_examples[category]) < 20: + instruction_examples[category].append( + {"addr": hex(addr), "size": size, "asm": asm} + ) + + unseen_range_items = [] + for start, end, inst_count, sample_asm in unseen_ranges: + category = ( + "padding" + if in_ranges(start, padding_ranges, padding_starts) + else "outside_gt_coverage" + ) + range_category_counts[category] += 1 + item = { + "start": hex(start), + "end": hex(end), + "inst_count": inst_count, + "category": category, + "sample_asm": sample_asm, + } + unseen_range_items.append(item) + if len(range_examples[category]) < 20: + range_examples[category].append(item) + + unseen_ratio = len(unseen) / len(gt_inst_addrs) if gt_inst_addrs else 0.0 + return { + "binary": str(binary), + "groundtruth": str(groundtruth), + "blocks_pb2": str(blocks_pb2_path), + "objdump_real_instruction_count": len(obj_insts), + "groundtruth_instruction_count": len(gt_inst_addrs), + "unseen_instruction_count": len(unseen), + "unseen_ratio_over_groundtruth": unseen_ratio, + "covered_range_count": len(covered_ranges), + "padding_range_count": len(padding_ranges), + "unseen_contiguous_range_count": len(unseen_ranges), + "instruction_category_counts": 
dict(instruction_category_counts), + "range_category_counts": dict(range_category_counts), + "instruction_examples": instruction_examples, + "range_examples": range_examples, + "unseen_ranges": unseen_range_items, + } + + +def write_gap_outputs(out_dir: Path, summary: Dict[str, object]) -> Tuple[Path, Path]: + ensure_dir(out_dir) + json_path = out_dir / "gap.summary.json" + txt_path = out_dir / "gap.summary.txt" + write_json(json_path, summary) + + lines = [ + f"binary: {summary['binary']}", + f"groundtruth: {summary['groundtruth']}", + f"blocks_pb2: {summary['blocks_pb2']}", + f"objdump_real_instruction_count: {summary['objdump_real_instruction_count']}", + f"groundtruth_instruction_count: {summary['groundtruth_instruction_count']}", + f"unseen_instruction_count: {summary['unseen_instruction_count']}", + "unseen_ratio_over_groundtruth: " + f"{float(summary['unseen_ratio_over_groundtruth']):.12f}", + f"unseen_contiguous_range_count: {summary['unseen_contiguous_range_count']}", + "", + "[instruction_category_counts]", + ] + for key, value in summary["instruction_category_counts"].items(): + lines.append(f"{key}: {value}") + lines.append("") + lines.append("[range_category_counts]") + for key, value in summary["range_category_counts"].items(): + lines.append(f"{key}: {value}") + write_text(txt_path, lines) + return json_path, txt_path + + +def assess_cmp_payload( + payload: Dict[str, object], + *, + min_precision: float, + min_recall: float, +) -> CmpVerdict: + reasons: List[str] = [] + precision = float(payload.get("precision", 0.0)) + recall = float(payload.get("recall", 0.0)) + + if precision < min_precision: + reasons.append(f"precision {precision:.6f} < {min_precision:.6f}") + if recall < min_recall: + reasons.append(f"recall {recall:.6f} < {min_recall:.6f}") + if precision < 0.20 and recall < 0.20: + reasons.append( + "metrics are catastrophically low; the .ll likely comes from a different binary build" + ) + return CmpVerdict(ok=not reasons, reasons=reasons) + 
+ +def write_cmp_verdict(out_dir: Path, payload: Dict[str, object], verdict: CmpVerdict) -> Path: + verdict_path = out_dir / "cmp.verdict.txt" + lines = [ + f"binary: {payload.get('binary', '')}", + f"ll: {payload.get('ll', '')}", + f"text_start: 0x{int(payload.get('text_start', 0)):x}", + f"runnable_base: 0x{int(payload.get('runnable_base', 0)):x}", + f"precision: {float(payload.get('precision', 0.0)):.6f}", + f"recall: {float(payload.get('recall', 0.0)):.6f}", + f"ok: {str(verdict.ok).lower()}", + ] + if verdict.reasons: + lines.append("") + lines.append("[reasons]") + lines.extend(verdict.reasons) + write_text(verdict_path, lines) + return verdict_path + + +def run_cmp( + *, + binary: Path, + ll_path: Path, + out_dir: Path, + run_cmp_eval: Path, + text_start: int | None, + runnable_base: int, + min_precision: float, + min_recall: float, + examples: int, +) -> Tuple[Dict[str, object], CmpVerdict, Path, Path, Path]: + ensure_file(binary, "binary") + ensure_file(ll_path, "ll") + ensure_file(run_cmp_eval, "run_cmp_eval.py") + ensure_dir(out_dir) + + resolved_text_start = detect_text_start(binary) if text_start is None else text_start + json_out = out_dir / "cmp.json" + text_out = out_dir / "cmp.txt" + cmd = [ + "python3", + str(run_cmp_eval), + "--binary", + str(binary), + "--ll", + str(ll_path), + "--text-start", + hex(resolved_text_start), + "--runnable-base", + hex(runnable_base), + "--examples", + str(examples), + "--json-out", + str(json_out), + "--text-out", + str(text_out), + ] + result = run_cmd(cmd, check=False) + if result.returncode != 0: + raise RuntimeError( + "run_cmp_eval.py failed:\n" + f"stdout:\n{result.stdout}\n" + f"stderr:\n{result.stderr}" + ) + payload = read_json(json_out) + verdict = assess_cmp_payload( + payload, + min_precision=min_precision, + min_recall=min_recall, + ) + verdict_path = write_cmp_verdict(out_dir, payload, verdict) + return payload, verdict, json_out, text_out, verdict_path + + +def print_gap_summary(summary: Dict[str, 
object], txt_path: Path) -> None: + print(f"gap_summary={txt_path}") + print( + "gap_counts=" + f"unseen={summary['unseen_instruction_count']} " + f"outside_gt_coverage={summary['instruction_category_counts'].get('outside_gt_coverage', 0)} " + f"padding={summary['instruction_category_counts'].get('padding', 0)}" + ) + + +def print_cmp_summary( + payload: Dict[str, object], + verdict: CmpVerdict, + verdict_path: Path, +) -> None: + print(f"cmp_verdict={verdict_path}") + print( + "cmp_metrics=" + f"precision={float(payload.get('precision', 0.0)):.6f} " + f"recall={float(payload.get('recall', 0.0)):.6f} " + f"text_start=0x{int(payload.get('text_start', 0)):x} " + f"runnable_base=0x{int(payload.get('runnable_base', 0)):x}" + ) + if verdict.reasons: + for reason in verdict.reasons: + print(f"cmp_reason={reason}", file=sys.stderr) + + +def add_shared_binary_args(parser: argparse.ArgumentParser) -> None: + parser.add_argument("--binary", type=Path, default=DEFAULT_BINARY) + parser.add_argument("--groundtruth", type=Path, default=None) + parser.add_argument("--blocks-pb2", type=Path, default=DEFAULT_VENDOR_BLOCKS_PB2) + parser.add_argument("--out-dir", type=Path, default=DEFAULT_OUT_DIR) + + +def build_parser() -> argparse.ArgumentParser: + parser = argparse.ArgumentParser( + description="Validate the refreshed libcrypto ground truth and compare lift outputs." 
+ ) + subparsers = parser.add_subparsers(dest="command", required=True) + + gap = subparsers.add_parser("gap-audit", help="Check binary .text addresses against gtBlock.pb") + add_shared_binary_args(gap) + + cmp_parser = subparsers.add_parser("cmp", help="Run runnable-cmp-eval with safe defaults") + add_shared_binary_args(cmp_parser) + cmp_parser.add_argument("--ll", type=Path, required=True) + cmp_parser.add_argument("--run-cmp-eval", type=Path, default=DEFAULT_RUN_CMP_EVAL) + cmp_parser.add_argument("--text-start", default="elf") + cmp_parser.add_argument("--runnable-base", default=hex(DEFAULT_RUNNABLE_BASE)) + cmp_parser.add_argument("--min-precision", type=float, default=DEFAULT_MIN_PRECISION) + cmp_parser.add_argument("--min-recall", type=float, default=DEFAULT_MIN_RECALL) + cmp_parser.add_argument("--examples", type=int, default=10) + cmp_parser.add_argument("--allow-low-metrics", action="store_true") + + all_parser = subparsers.add_parser("all", help="Run gap audit and then compare a lift") + add_shared_binary_args(all_parser) + all_parser.add_argument("--ll", type=Path, required=True) + all_parser.add_argument("--run-cmp-eval", type=Path, default=DEFAULT_RUN_CMP_EVAL) + all_parser.add_argument("--text-start", default="elf") + all_parser.add_argument("--runnable-base", default=hex(DEFAULT_RUNNABLE_BASE)) + all_parser.add_argument("--min-precision", type=float, default=DEFAULT_MIN_PRECISION) + all_parser.add_argument("--min-recall", type=float, default=DEFAULT_MIN_RECALL) + all_parser.add_argument("--examples", type=int, default=10) + all_parser.add_argument("--allow-low-metrics", action="store_true") + return parser + + +def resolve_groundtruth(binary: Path, groundtruth: Path | None) -> Path: + return groundtruth.resolve() if groundtruth else resolve_default_groundtruth_path(binary.resolve()) + + +def parse_text_start_arg(value: str) -> int | None: + if value == "elf": + return None + return parse_int(value) + + +def cmd_gap_audit(args: argparse.Namespace) -> 
int: + binary = args.binary.resolve() + groundtruth = resolve_groundtruth(binary, args.groundtruth) + blocks_pb2 = args.blocks_pb2.resolve() + ensure_file(binary, "binary") + ensure_file(groundtruth, "groundtruth") + ensure_file(blocks_pb2, "blocks_pb2") + + summary = analyze_groundtruth_gap(binary, groundtruth, blocks_pb2) + _, txt_path = write_gap_outputs(args.out_dir.resolve(), summary) + print_gap_summary(summary, txt_path) + return 0 + + +def cmd_cmp(args: argparse.Namespace) -> int: + binary = args.binary.resolve() + groundtruth = resolve_groundtruth(binary, args.groundtruth) + ensure_file(groundtruth, "groundtruth") + payload, verdict, _, _, verdict_path = run_cmp( + binary=binary, + ll_path=args.ll.resolve(), + out_dir=args.out_dir.resolve(), + run_cmp_eval=args.run_cmp_eval.resolve(), + text_start=parse_text_start_arg(args.text_start), + runnable_base=parse_int(args.runnable_base), + min_precision=args.min_precision, + min_recall=args.min_recall, + examples=args.examples, + ) + print_cmp_summary(payload, verdict, verdict_path) + if verdict.ok or args.allow_low_metrics: + return 0 + return 3 + + +def cmd_all(args: argparse.Namespace) -> int: + gap_rc = cmd_gap_audit(args) + if gap_rc != 0: + return gap_rc + return cmd_cmp(args) + + +def main(argv: Sequence[str] | None = None) -> int: + parser = build_parser() + args = parser.parse_args(argv) + try: + if args.command == "gap-audit": + return cmd_gap_audit(args) + if args.command == "cmp": + return cmd_cmp(args) + if args.command == "all": + return cmd_all(args) + except Exception as exc: + print(str(exc), file=sys.stderr) + return 2 + parser.error(f"unknown command: {args.command}") + return 2 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/tests/test_libcrypto_bench_paths.py b/tests/test_libcrypto_bench_paths.py new file mode 100644 index 000000000..da0b7a9ab --- /dev/null +++ b/tests/test_libcrypto_bench_paths.py @@ -0,0 +1,42 @@ +import importlib.util +import sys +import unittest +from 
pathlib import Path + + +SCRIPT_PATH = Path(__file__).resolve().parents[1] / "scripts" / "libcrypto_bench_paths.py" + + +def load_module(): + spec = importlib.util.spec_from_file_location("libcrypto_bench_paths", SCRIPT_PATH) + module = importlib.util.module_from_spec(spec) + assert spec.loader is not None + sys.modules[spec.name] = module + spec.loader.exec_module(module) + return module + + +class LibcryptoBenchPathsTests(unittest.TestCase): + def test_parse_text_start_from_readelf_output(self): + module = load_module() + output = """ + [11] .init PROGBITS 00000000000ce5c0 0ce5c0 00001c 00 AX 0 0 4 + [12] .plt PROGBITS 00000000000ce5e0 0ce5e0 000990 10 AX 0 0 16 + [13] .text PROGBITS 00000000000cef80 0cef80 2e306e 00 AX 0 0 64 +""" + + self.assertEqual(module.parse_text_start_from_readelf_output(output), 0xCEF80) + + def test_parse_text_start_from_readelf_output_raises_when_missing(self): + module = load_module() + + with self.assertRaises(RuntimeError): + module.parse_text_start_from_readelf_output("[ 1] .data PROGBITS 00000000 000000 000000 00 WA 0 0 1") + + def test_build_parser_supports_text_start_kind(self): + module = load_module() + parser = module.build_parser() + + args = parser.parse_args(["text-start"]) + + self.assertEqual(args.kind, "text-start") diff --git a/tests/test_validate_libcrypto_ground_truth.py b/tests/test_validate_libcrypto_ground_truth.py new file mode 100644 index 000000000..36608539d --- /dev/null +++ b/tests/test_validate_libcrypto_ground_truth.py @@ -0,0 +1,103 @@ +import importlib.util +import sys +import tempfile +import unittest +from pathlib import Path + + +SCRIPT_PATH = ( + Path(__file__).resolve().parents[1] / "scripts" / "validate_libcrypto_ground_truth.py" +) + + +def load_module(): + spec = importlib.util.spec_from_file_location( + "validate_libcrypto_ground_truth", SCRIPT_PATH + ) + module = importlib.util.module_from_spec(spec) + assert spec.loader is not None + sys.modules[spec.name] = module + 
spec.loader.exec_module(module) + return module + + +class ValidateLibcryptoGroundTruthTests(unittest.TestCase): + def test_resolve_default_groundtruth_prefers_libcrypto_gtblock_name(self): + module = load_module() + with tempfile.TemporaryDirectory() as tmpdir: + root = Path(tmpdir) + binary = root / "libcrypto.so.3" + binary.write_bytes(b"\x7fELF") + expected = root / "libcrypto.gtBlock.pb" + expected.write_bytes(b"pb") + + resolved = module.resolve_default_groundtruth_path(binary) + + self.assertEqual(resolved, expected) + + def test_assess_cmp_payload_fails_low_metrics(self): + module = load_module() + payload = { + "binary": "/tmp/libcrypto.so.3", + "ll": "/tmp/libcrypto.so.3.ll", + "text_start": 0xCEF80, + "runnable_base": 0x50000000, + "obj_count": 679479, + "ll_count": 847589, + "hit": 24175, + "mismatch": 112625, + "obj_only": 542679, + "ll_only": 710789, + "false_negative": 655304, + "false_positive": 823414, + "precision": 0.028522, + "recall": 0.035579, + } + + verdict = module.assess_cmp_payload( + payload, + min_precision=0.80, + min_recall=0.80, + ) + + self.assertFalse(verdict.ok) + self.assertTrue( + any("precision" in reason for reason in verdict.reasons), + verdict.reasons, + ) + self.assertTrue( + any("recall" in reason for reason in verdict.reasons), + verdict.reasons, + ) + + def test_assess_cmp_payload_accepts_strong_metrics(self): + module = load_module() + payload = { + "binary": "/tmp/libcrypto.so.3", + "ll": "/tmp/libcrypto.so.3.ll", + "text_start": 0xCF000, + "runnable_base": 0x50000000, + "obj_count": 932819, + "ll_count": 847589, + "hit": 787556, + "mismatch": 1420, + "obj_only": 143843, + "ll_only": 58613, + "false_negative": 145263, + "false_positive": 60033, + "precision": 0.929172, + "recall": 0.844275, + } + + verdict = module.assess_cmp_payload( + payload, + min_precision=0.80, + min_recall=0.80, + ) + + self.assertTrue(verdict.ok) + self.assertEqual(verdict.reasons, []) + + +if __name__ == "__main__": + unittest.main()
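A note on the two `assess_cmp_payload` fixtures above: their `precision` and `recall` values agree with the usual definitions, precision = hit / (hit + false_positive) and recall = hit / (hit + false_negative). The sketch below checks that arithmetic against the fixture counts. It assumes `run_cmp_eval.py` derives its metrics this way, which this diff does not show; `metrics_from_counts` and `below_thresholds` are hypothetical helper names and are not part of the patch.

```python
from typing import Dict, List


def metrics_from_counts(hit: int, false_positive: int, false_negative: int) -> Dict[str, float]:
    """Assumed definitions; run_cmp_eval.py may compute these differently."""
    predicted = hit + false_positive  # everything the lift reported
    actual = hit + false_negative     # everything the ground truth expects
    return {
        "precision": hit / predicted if predicted else 0.0,
        "recall": hit / actual if actual else 0.0,
    }


def below_thresholds(
    metrics: Dict[str, float],
    min_precision: float = 0.80,
    min_recall: float = 0.80,
) -> List[str]:
    """Return the names of the metrics that fall under their threshold."""
    failed: List[str] = []
    if metrics["precision"] < min_precision:
        failed.append("precision")
    if metrics["recall"] < min_recall:
        failed.append("recall")
    return failed


# Counts copied from the two test payloads in this diff.
strong = metrics_from_counts(hit=787556, false_positive=60033, false_negative=145263)
weak = metrics_from_counts(hit=24175, false_positive=823414, false_negative=655304)

for label, metrics in (("strong", strong), ("weak", weak)):
    print(
        f"{label}: precision={metrics['precision']:.6f} "
        f"recall={metrics['recall']:.6f} "
        f"failed={below_thresholds(metrics)}"
    )
```

If the assumption holds, a payload whose stored `precision` or `recall` disagrees with its own counts would point to a stale or hand-edited `cmp.json` rather than a fresh compare.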