feat: add Agent Skill Eval GitHub Action by zpzjzj · Pull Request #100 · alibaba/skill-up

zpzjzj · 2026-06-09T09:47:50Z

What

Adds a GitHub Action that runs skill-up behavioral evals on Agent Skills in CI — usable as uses: alibaba/skill-up@<tag> and publishable to the GitHub Marketplace. The differentiator vs existing skill-eval actions is multi-engine: the same eval.yaml is checked across claude_code / codex / qodercli.

Additive only — no changes to existing CLI files.

action.yml                                # root metadata + branding (Marketplace needs root + single action)
action/Dockerfile|entry.sh|main.py        # Docker container action: image bakes skill-up + 3 engines (public npm/Releases)
.github/workflows/build-runner-image.yml  # builds & pushes the runner image to ghcr (GITHUB_TOKEN, SHA-pinned, multi-arch)

main.py is adapted from the internal Aone CI component: drops the internal-config fetch and ghproxy, resolves paths against $GITHUB_WORKSPACE, exports results to $GITHUB_OUTPUT. Engine credential routing (api-key → ANTHROPIC/OPENAI/QODER env, codex provider/model ref) is preserved.
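The path resolution and output export mentioned above could be sketched roughly as follows. This is an illustration under stated assumptions, not the actual action/main.py: the helper names `resolve_in_workspace` and `export_outputs` and the output keys in the usage note are hypothetical. Only the `GITHUB_WORKSPACE` and `GITHUB_OUTPUT` environment variables come from GitHub Actions itself.

```python
import os
from pathlib import Path


def resolve_in_workspace(target: str, env=os.environ) -> Path:
    """Resolve a possibly relative input path against $GITHUB_WORKSPACE."""
    workspace = Path(env.get("GITHUB_WORKSPACE", "."))
    path = Path(target)
    # Absolute inputs are kept as-is; relative ones are anchored at the workspace.
    return path if path.is_absolute() else workspace / path


def export_outputs(outputs: dict, env=os.environ) -> None:
    """Append name=value lines to the file named by $GITHUB_OUTPUT."""
    output_file = env.get("GITHUB_OUTPUT")
    if not output_file:
        return  # not running under GitHub Actions; nothing to export
    with open(output_file, "a", encoding="utf-8") as fh:
        for name, value in outputs.items():
            # This sketch handles single-line values only.
            fh.write(f"{name}={value}\n")
```

For example, `resolve_in_workspace("evals/eval.yaml")` would anchor the `skill-target` input from the usage snippet at the checked-out repository, and `export_outputs({"passed": "4", "total": "5"})` would make those (hypothetical) keys readable by later workflow steps.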

Usage

- uses: actions/checkout@v4
- uses: alibaba/skill-up@v0.1.0
  with:
    engine: claude_code        # or codex / qodercli; empty = let eval.yaml decide
    api-key: ${{ secrets.ANTHROPIC_API_KEY }}
    skill-target: evals/eval.yaml

Maintainer decisions needed

- Runner image namespace. action.yml currently points at a public staging image, ghcr.io/zpzjzj/skill-up-runner:v0.1.0, so the action works the moment this merges. To self-host under this repo's org, run build-runner-image.yml once (this needs the org to allow Actions to create packages) and repoint runs.image. Flagged in the action.yml header.
- Marketplace publishing requires the org to accept the Marketplace Developer Agreement (org owner, one-time) before the "Publish to Marketplace" checkbox is enabled. Merging makes the action usable; Marketplace searchability is that extra step.
- Listing README = repo root README (the CLI's), since Marketplace pulls the root README. No action-specific listing page is possible in this layout.

Validation

- python3 -m py_compile action/main.py ✓; engine routing / argv assembly / api-key masking smoke-tested
- YAML validated; actions SHA-pinned to match repo convention
- The identical image (same Dockerfile) already builds multi-arch (amd64+arm64) and is anonymously pullable, which confirms that the three engines and skill-up install cleanly from public registries

🤖 Generated with Claude Code

Add a Docker container action (root action.yml) that runs skill-up behavioral evals on Agent Skills in CI, across claude_code / codex / qodercli engines. It is usable as `uses: alibaba/skill-up@<tag>` and publishable to the GitHub Marketplace (root action.yml + branding).

- action.yml: inputs/outputs + branding, references a prebuilt ghcr runner image; engine credential routing handled in action/main.py
- action/: Docker context. Dockerfile bakes skill-up + the three engine CLIs from public registries; main.py resolves paths against GITHUB_WORKSPACE and exports results to GITHUB_OUTPUT
- .github/workflows/build-runner-image.yml: builds & pushes the runner image to ghcr via GITHUB_TOKEN (SHA-pinned actions, multi-arch)

Additive only; no changes to existing CLI files.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Document the Agent Skill Eval action in the root README (which also backs the Marketplace listing) and mirror it in README.zh.md. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add a workflow that runs the new action (uses: ./) against this repo's skill-upper skill on PRs touching the action or that skill — the action's end-to-end integration test. Model traffic pinned to dashscope qwen3.6-plus (Anthropic-compat) like the existing e2e jobs; fork PRs without DASHSCOPE_API_KEY skip cleanly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Trigger build-runner-image on PRs/pushes touching the action (not just manual dispatch) so the image is published to ghcr.io/<org>/skill-up-runner by this repo itself. Fork PRs build without pushing (no package-write). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

build-runner-image.yml now publishes ghcr.io/<org>/skill-up-runner; point action.yml's runs.image at it instead of the personal staging image. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Two failures surfaced by the dogfood run:

- agent_judge (a separate claude_code call) ignored --model and fell back to a real-Claude model the dashscope endpoint rejects (400). Set the engine-scoped ANTHROPIC_MODEL/OPENAI_MODEL from the model input so every hop honors it, and pin ANTHROPIC_MODEL in the self-eval workflow too.
- upload-artifact hit EACCES on root-owned report files (Docker action runs as root); chmod -R a+rX the workspace before upload.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
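The model pin from the first fix might look roughly like the sketch below. The function name and the engine-to-variable table are assumptions inferred from the commit message, not the code in action/main.py. The message does not say what, if anything, is set for qodercli, so the sketch leaves that engine untouched.

```python
# Assumed mapping from engine to the engine-scoped model variable
# named in the commit message (ANTHROPIC_MODEL / OPENAI_MODEL).
MODEL_ENV_BY_ENGINE = {
    "claude_code": "ANTHROPIC_MODEL",
    "codex": "OPENAI_MODEL",
}


def model_env(engine: str, model: str) -> dict:
    """Return the environment overrides that pin `model` for `engine`."""
    var = MODEL_ENV_BY_ENGINE.get(engine)
    if not var or not model:
        return {}  # empty model input or unmapped engine: leave the env alone
    return {var: model}
```

Because the pin travels through the environment rather than a `--model` flag, a child process such as the judge call would inherit it without any extra wiring, which is the behavior the fix is after.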

Expand the self-eval into a matrix over claude_code / codex / qodercli so every engine the action supports is exercised end-to-end against skill-upper. Judge hops (agent_judge = a claude_code call regardless of case engine) get anthropic-protocol creds + model pin at the workflow level, mirroring the e2e full-LLM job; qodercli uses QODER_ACCESS_TOKEN. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The npm @qoder-ai/qodercli package is a different release line (1.x) whose CLI silently no-ops under skill-up's adapter invocation (--permission-mode=bypass_permissions -p, built against the installer's 0.x line) — every dogfood case returned empty output in ~1s with exit 0. Install via qoder.com/install and symlink into /usr/local/bin since $HOME differs at Actions runtime (/github/home vs /root at build). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- Make build-runner-image callable (workflow_call) and have the self-eval workflow run it as a prerequisite job, so dogfood always pulls an image built from the same commit instead of racing the parallel build.
- Switch the codex matrix entry (and OPENAI_MODEL judge pin) to qwen3-coder-plus: dashscope's qwen3.6-plus rejects codex tool calls ("function.arguments must be in JSON format"); the coder line is the combination already validated in the internal CI.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

qodercli's bun-compiled binary crashes under qemu arm64 emulation (its --version self-check exits 1), failing the multi-arch build. GitHub cloud runners are x86_64; drop arm64 until native arm runners matter. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Run the dogfood also on pushes to main touching the action or skill-upper (post-merge regression), and cancel superseded in-flight rounds on the same PR — LLM evals are expensive. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

qodercli 1.x exits 0 even when not logged in / on a bad PAT, so skill-up sees only empty output and the failure mode is invisible. Probe the real auth state (token length only + the qodercli -p response) so the log shows definitively whether the token, not the wiring, is the problem. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
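A length-only token probe of the kind described here could be as small as the following sketch. `describe_token` is a hypothetical helper, not the diagnostic that was actually committed; the point is only that the log line reveals whether a token is present and how long it is, never its value.

```python
def describe_token(token) -> str:
    """Summarize a secret for logs without revealing any of its characters."""
    if not token:
        return "token: missing"
    # Only the length is reported, so the secret itself never reaches the log.
    return f"token: set (length {len(token)})"
```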

…5 turns) The multi-file scaffolding cases were timed for a fast engine; codex on a coder model engages but can't finish the scaffolding within 300s, timing out at 0/5. A timeout is a ceiling — claude_code/qodercli still finish well under it — so raising it lets slower coding agents complete without weakening any assertion. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Keep wall-clock bounded now that the per-case timeout is 600s. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

skill-upper's cases pin an anthropic judge model and skill-up runs agent_judge on the case engine, so codex (OpenAI protocol) can't drive the judge — an upstream limitation (#104), not an action defect. The codex agent itself works end-to-end with an OpenAI-protocol judge (verified locally). Mark codex continue-on-error so claude_code (4/5) and qodercli (5/5) gate the PR; flip back once #104 lands. Also drop the qoder auth diagnostic now that the token is confirmed valid. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

skill-up exits 1 on any case failure, but agent evals on CI's stand-in models are non-deterministic (claude_code is reliably 4/5, with the failing case rotating). Gating on 5/5 would block the PR permanently. Make the eval step continue-on-error and add a real gate that parses report.json and requires >=60% of cases to pass — tolerant of the occasional flaky case, but still catches a broadly broken run. codex (0%, structural #104) stays non-blocking via the job-level setting. Per-case status is printed and uploaded as an artifact. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
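The threshold gate can be sketched as below. This is a minimal illustration, not the step in skill-eval-self.yml: the report.json schema is not shown in this PR, so the `cases` list with `name` and `passed` fields is an assumed shape, and `pass_rate` and `gate` are hypothetical names.

```python
import json


def pass_rate(report: dict) -> float:
    """Fraction of passing cases; an empty report counts as 0."""
    cases = report.get("cases", [])  # assumed schema, see lead-in
    if not cases:
        return 0.0
    passed = sum(1 for case in cases if case.get("passed"))
    return passed / len(cases)


def gate(report_path: str, threshold: float = 0.6) -> int:
    """Print per-case status and return a process exit code for the gate."""
    with open(report_path, encoding="utf-8") as fh:
        report = json.load(fh)
    for case in report.get("cases", []):
        status = "PASS" if case.get("passed") else "FAIL"
        print(f"{status} {case.get('name', '<unnamed>')}")
    rate = pass_rate(report)
    print(f"pass rate: {rate:.0%} (threshold {threshold:.0%})")
    return 0 if rate >= threshold else 1
```

A workflow step could call `sys.exit(gate("report.json"))`. With the default threshold, a 4/5 run exits 0 and a 0/5 run exits 1, which matches the intent of tolerating a single flaky case while still failing a broadly broken run.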

… indexing

Address CodeQL findings on skill-eval-self.yml:

- Add a top-level 'permissions: contents: read' block (the image job keeps its packages: write override) so GITHUB_TOKEN is least-privilege.
- Replace secrets[matrix.api-key-secret] (which CodeQL flags as exposing the whole secrets context) with literal-named references selected by engine: qodercli -> QODER_ACCESS_TOKEN, else DASHSCOPE_API_KEY.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

codex can't drive skill-upper's anthropic-pinned agent_judge (upstream limitation #104). Previously codex ran the threshold gate, failed at 0%, and relied on job-level continue-on-error — which still surfaces a red, noisy check every round. Instead, skip the threshold step for codex so its job is green: it still runs and uploads its report for visibility, but its gate is explicitly waived (logged) pending #104. claude_code / qodercli keep the real threshold gate. Re-enable by removing the skip once #104 lands. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

zpzjzj requested a review from hittyt as a code owner June 9, 2026 09:47

zpzjzj and others added 3 commits June 9, 2026 17:53

docs: add GitHub Action section to README (en + zh)

64b5538


github-advanced-security AI found potential problems Jun 9, 2026


Comment thread .github/workflows/skill-eval-self.yml Fixed

zpzjzj and others added 3 commits June 10, 2026 09:03

feat: reference the runner image under the org namespace

8052feb


github-advanced-security AI found potential problems Jun 10, 2026


Comment thread .github/workflows/skill-eval-self.yml Fixed

Comment thread .github/workflows/skill-eval-self.yml Fixed

Comment thread .github/workflows/skill-eval-self.yml Fixed

zpzjzj and others added 2 commits June 10, 2026 09:29

github-advanced-security AI found potential problems Jun 10, 2026


Comment thread .github/workflows/skill-eval-self.yml Fixed

zpzjzj and others added 5 commits June 10, 2026 09:38

ci: run dogfood cases with parallelism 3

e131ab8


github-advanced-security AI found potential problems Jun 10, 2026


Comment thread .github/workflows/skill-eval-self.yml Fixed

zpzjzj and others added 2 commits June 10, 2026 21:57

github-advanced-security AI found potential problems Jun 10, 2026


Comment thread .github/workflows/skill-eval-self.yml Fixed

zpzjzj and others added 2 commits June 11, 2026 16:15

zpzjzj wants to merge 18 commits into main from feat/skill-eval-action


2 participants
