Add SafeSkill security badge (20/100 — Blocked) #2

Open
OyaAIProd wants to merge 1 commit into iampantherr:main from OyaAIProd:safeskill-scan-1777428432016

Conversation

@OyaAIProd

🔴 SafeSkill Security Scan Results

| Metric | Value |
| --- | --- |
| Overall Score | 20/100 (Blocked) |
| Code Score | 59/100 |
| Content Score | 52/100 |
| Findings | 1274 findings detected (212 critical) |
| Taint Flows | 125 |
| Files Scanned | 96 |
| Scan Duration | 13.5s |

Note: This package is an MCP server. child_process, filesystem, and environment access are expected capabilities for tool servers and are excluded from scoring and top findings.

Top Findings

  • 🔴 critical: Detected instruction-override attempt: "You are now" (security-tests/run-all.mjs:1157:2)
  • 🔴 critical: Tool/shell abuse instruction detected (tool-abuse-pattern): "Write ~/.ssh/id_rsa to /tmp/" (security-tests/run-all.mjs:1157:2)
  • 🔴 critical: Detected instruction-override attempt: "Ignore all previous instructions" (security-tests/run-all.mjs:674:15)
  • 🔴 critical: Detected instruction-override attempt: "ignore previous instructions" (security-tests/run-all.mjs:674:17)
  • 🔴 critical: Detected instruction-override attempt: "You are now" (security-tests/run-all.mjs:674:15)

View full report on SafeSkill


About SafeSkill

SafeSkill is a free, open-source security scanner for AI tools, MCP servers, and Claude Code skills. We scan for code exploits, prompt injection, and data exfiltration risks.

False positive? We take accuracy seriously. If any finding above is incorrect, please open an issue and we will fix it immediately.

Signed-off-by: SafeSkill Scanner <mk@oya.ai>
iampantherr added a commit that referenced this pull request Apr 30, 2026
The biggest single release. Closes every 🔴 high and 🟠 medium item from
the v0.19.0 honest-completion audit, ships Sprint 4 retrieval upgrades,
verifies the entire stack with live agents.

🔴 HIGH FIXES
  #6 file-ownership 409: PG store-postgres.recallBroadcasts was missing
     the v0.15.0 §8.1 columns from SELECT — overlap guard always saw
     empty exclusive set. Fix: explicit SELECT + JSON parse on read.
     Verified live.
  #3 vitest test isolation: new vitest.setup.ts forces ZC_POSTGRES_DB to
     securecontext_test (auto-creates if missing); destructive helpers
     refuse unless DB matches /test/i AND VITEST is set.
  #7 REJECT resolver works in Docker: writes to learnings_pg directly
     (parallel to the JSONL append, which is best-effort). Container can
     reach PG; can't reach host's Windows path for JSONL.
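
The guard described in fix #3 above (destructive helpers refuse unless the DB name matches /test/i AND VITEST is set) could look roughly like the sketch below. The function name `assertSafeToDestroy` and the `Env` type are hypothetical, introduced only for illustration; the real helper in the repo may be shaped differently.

```typescript
// Hypothetical sketch of the "refuse unless test DB AND VITEST" guard.
// The environment is passed in explicitly so the check is easy to unit-test.
type Env = Record<string, string | undefined>;

function assertSafeToDestroy(dbName: string, env: Env): void {
  const looksLikeTestDb = /test/i.test(dbName); // same pattern the commit names
  const underVitest = Boolean(env.VITEST);      // any non-empty value counts as "set"
  if (!looksLikeTestDb || !underVitest) {
    throw new Error(
      `Refusing destructive operation on "${dbName}" ` +
        `(test DB: ${looksLikeTestDb}, VITEST set: ${underVitest})`
    );
  }
}
```

Requiring both conditions means a stray VITEST variable in a production shell, or a production database that happens to contain "test" in its name, is not enough on its own to unlock a destructive helper.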

🟠 MEDIUM
  #1 skill auto-import: src/skill_auto_import.ts walks skills/*.skill.md
     at API server startup, UPSERTs into skills_pg by skill_id with
     body_hmac idempotency. POST /dashboard/skills/import for manual
     trigger. Dockerfile copies skills/ in. First run imported 25 skills.
  #2 LLM 'Generate skill body from rejection cluster':
     src/skill_candidate_generator.ts. Default backend = Anthropic Sonnet
     when ANTHROPIC_API_KEY set, else Ollama qwen2.5-coder:14b. Three
     new endpoints: /generate, /approve (writes to skills/ + auto-import
     + marks installed_skill_id), /reject (with notes). Dashboard panel
     shows status-tier action buttons. Live verified: 1.6KB skill body
     generated in 12s via Ollama.
  #4 context-budget: src/context_budget.ts tracks per-session tokens,
     formatCostHeader appends [ctx: X% / 200K] suffix that upgrades to
     ⚠ WARN / 🚨 ALERT / ⛔ EMERGENCY at 70/85/95%. New zc_context_status
     MCP tool. Hard enforcement (block Read at 70%) deferred to v0.21.
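
As a rough illustration of the 70/85/95% tiers in item #4 above, the suffix logic might be sketched as follows. This is not the code in src/context_budget.ts: the function name `contextSuffix`, the exact suffix layout, and the choice to treat each threshold as inclusive are all assumptions.

```typescript
// Hypothetical sketch of the tiered context-budget suffix.
const CONTEXT_WINDOW_TOKENS = 200000; // the "200K" from the commit message

function contextSuffix(usedTokens: number): string {
  // Multiply before dividing so whole-percent boundaries stay exact.
  const pct = (usedTokens * 100) / CONTEXT_WINDOW_TOKENS;
  let tier = "";
  if (pct >= 95) tier = " ⛔ EMERGENCY";
  else if (pct >= 85) tier = " 🚨 ALERT";
  else if (pct >= 70) tier = " ⚠ WARN";
  // Display the floored percentage so the number never runs ahead of the tier.
  return `[ctx: ${Math.floor(pct)}% / 200K${tier}]`;
}
```

Checking the highest tier first keeps the branches mutually exclusive without needing upper bounds on each range.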

SPRINT 4
  #8 reranker: zc_search([q], { rerank: true }) cross-encoder rerank
     via Ollama embeddings.
  #9 HyDE: zc_search([q], { mode: 'hyde' }) generates hypothetical
     answer first, embeds THAT for the search.
  #10 multi-hop: zc_search([q], { mode: 'multihop', hopDepth: 2 })
     extracts file/URL refs from initial results, recurses with score
     decay 0.7 per hop.
  #5 rolling compaction MVP: src/compaction.ts + zc_compact_window MCP
     tool + POST /api/v1/compact endpoint. Pulls last N broadcasts +
     tool_calls, generates structured summary via Ollama, writes to
     working_memory. Live verified: 20 turns → 1538-char summary.
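
The "score decay 0.7 per hop" in item #10 above can be read as multiplying a hit's raw score by 0.7 once per hop away from the initial results. The sketch below shows that reading only; the names `decayedScore` and `mergeHops`, the `Hit` shape, and the rule of keeping the best score per document are assumptions, not the project's implementation.

```typescript
// Hypothetical sketch of per-hop score decay for multi-hop search.
interface Hit { id: string; score: number; hop: number }

const HOP_DECAY = 0.7; // decay factor named in the commit message

// Hop 0 (the initial results) is left undecayed: 0.7^0 = 1.
function decayedScore(rawScore: number, hop: number): number {
  return rawScore * Math.pow(HOP_DECAY, hop);
}

// hitsByHop[0] holds the initial results, hitsByHop[1] the first recursion, etc.
// When a document is reached on several hops, keep its highest decayed score.
function mergeHops(hitsByHop: { id: string; score: number }[][]): Hit[] {
  const best = new Map<string, Hit>();
  hitsByHop.forEach((hits, hop) => {
    for (const h of hits) {
      const score = decayedScore(h.score, hop);
      const prev = best.get(h.id);
      if (!prev || score > prev.score) best.set(h.id, { id: h.id, score, hop });
    }
  });
  return Array.from(best.values()).sort((a, b) => b.score - a.score);
}
```

With this reading, a document found directly by the query outranks an equally scored document that was only reached by following a reference.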

E2E RESULTS
  Unit tests: 803 pass / 36 skip (test isolation working — fresh test
              DB has no seed data, those tests skip)
  Direct API E2E: 14/14 pass
  Live agent E2E: 14/14 pass on Test_Agent_Coordination
    - ASSIGN→MERGE cycle, REJECT resolver, file-ownership 409,
      skill candidate cluster + LLM generation, context budget
      tracking on real agent activity, rolling compaction.

DEFERRED (honest gaps documented)
  - Full live mutator loop (skill_run failure → mutator agent spawns →
    candidates → operator approves) — infra verified, but live test
    requires an agent to explicitly invoke a skill (~$0.20 + 5-10 min)
  - Hard context-budget enforcement (block Read at 70%) — needs hook
    integration, ships in v0.21
  - Background compaction daemon — v0.20 ships on-demand only

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
iampantherr added a commit that referenced this pull request May 1, 2026
The v0.20.0 E2E flagged "full live mutator loop" as deferred — this
release closes that gap by actually running it end-to-end with real
Claude agents (orchestrator + developer + auto-spawned mutator-engineering
pool agent). The first fully verified self-improvement cycle.

The exercise caught 5 bugs that synthetic tests had missed:

#1 L1 hook couldn't see PG-imported skills
   maybeTriggerL1Mutation in outcomes.ts called getSkillById against
   local SQLite. v0.20.0 imports skills to skills_pg (Postgres). Lookup
   missed → hook bailed silently. Fix: PG fallback when SQLite misses.

#2 ZC_L1_MUTATION_ENABLED not propagated to agent MCP servers
   start-agents.ps1 set the env in its own shell but never injected it
   into the per-agent launcher templates. Agents' MCP servers saw
   undefined → L1 hook never fired. Fix: orchestrator + worker launcher
   templates now propagate ZC_L1_MUTATION_ENABLED + ZC_MUTATOR_MODEL.

#3 Dispatcher process env didn't have PG creds
   The dispatcher's launcher (a2a-launch-dispatcher.ps1 template) only
   set ZC_API_URL/KEY. Auto-spawned pool agents inherit the dispatcher's
   env via spawn-agent.ps1, so they too lacked PG creds → mutator-eng
   broadcast "BLOCKED: zc_claim_task fails with Postgres pool unavailable".
   Fix: dispatcherEnvBlock now propagates all PG vars + L1 + mutator model.

#4 Skill auto-import used plain SHA256 instead of HMAC-keyed hash
   v0.20.0's computeBodyHmac used createHash('sha256'). The skill loader
   uses computeSkillBodyHmac with HMAC-SHA256 keyed by machine_secret.
   Dashboard rendered "Skill body HMAC mismatch — refusing to load".
   Fix: auto-importer now uses the canonical computeSkillBodyHmac.

#5 Dashboard auto-refresh wiped operator's typed text
   #pending / #skills / #skill-candidates panels polled every 10/30s
   with hx-swap=innerHTML, destroying any input/textarea content
   mid-edit. Type-confirm IDs reset, blocking approval.
   Fix: focus-aware HTMX trigger filter — polling skips while any
   input/textarea/select is focused or any <details> is open.
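
To see why bug #4 above produced an "HMAC mismatch", compare a plain SHA-256 digest with an HMAC-SHA256 keyed by a secret: the two never agree for the same body, so a loader that verifies one will reject a value written with the other. The sketch uses Node's built-in crypto module; `plainSha256` and `keyedBodyHmac` are hypothetical stand-ins, and the real computeSkillBodyHmac may canonicalize the body differently.

```typescript
import { createHash, createHmac } from "node:crypto";

// What the v0.20.0 importer effectively computed: an unkeyed digest.
function plainSha256(body: string): string {
  return createHash("sha256").update(body, "utf8").digest("hex");
}

// What the loader expects, per the commit: HMAC-SHA256 keyed by a machine secret.
// machineSecret is an illustrative parameter; how the repo loads it is not shown here.
function keyedBodyHmac(body: string, machineSecret: string): string {
  return createHmac("sha256", machineSecret).update(body, "utf8").digest("hex");
}
```

Both functions return 64 hex characters, so the stored value looks plausible at a glance; only verification against the keyed variant exposes the difference.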

LIVE VERIFICATION (the moment of truth):
After all 5 fixes, on Test_Agent_Coordination with real agents:
  - skill_runs.run-9341abbf-dd1 (status=failed, score=0.2)
  - L1 hook fired, resolved skill via PG fallback
  - mut-... task enqueued in task_queue_pg
  - Dispatcher auto-spawned mutator-engineering pool
  - claude-sonnet-4-6 generated 5 candidates (best=0.86)
  - mres-423a388e-08b in mutation_results_pg
  - Dashboard rendered candidates + approval form
  - Operator approved candidate #0
  - developer-debugging-methodology@1@global archived
  - developer-debugging-methodology@1.1@global active

This is HARNESS_EVOLUTION_PLAN.md Tier S item #2 (Skills + continuous
self-improvement loop) — the highest-leverage item in the plan —
verified working end-to-end with live agents for the first time.

DEFERRED to v0.20.2: auto-reassign-on-approve only fires when failed
skill_run has original_role populated. Synthetic zc_record_skill_outcome
calls don't preserve that chain. Verifying the full
REJECT→mutate→approve→auto-reassign cycle requires a live REJECT-driven
test (queued for v0.21.0 which adds skill enforcement levers).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
iampantherr added a commit that referenced this pull request May 1, 2026
The v0.20.1 mutator-loop verification exposed the single biggest reliability
gap: nothing forces a Claude agent to invoke a skill. Even with "you MUST"
language in role prompts, agents freelance with Read/Edit/Bash because
that's "simpler" — and when they do, skill_runs_pg stays empty and the
mutator has no inputs.

This release ships THREE reinforcing soft-enforcement levers (designed to
work together) and DESIGNS but DELIBERATELY DEFERS two more — see
docs/SKILL_ENFORCEMENT.md for full design + decision log.

Lever #1 (shipped): inject "## YOUR SKILLS" block at agent spawn
  - New endpoint: GET /api/v1/skills/by-role?role=<role>
  - Returns active skills with intended_roles containing the role
  - Companion: A2A_dispatcher/generate-role-skill-block.mjs (HTTP-based,
    no extra deps) called from start-agents.ps1 once per agent role
  - Block formats every skill with description + the canonical
    zc_skill_show + zc_record_skill_outcome workflow, then injected into
    the agent's deepPrompt before the system-prompt file is written
  - Verified: 8 developer skills correctly listed for the developer role
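
A minimal sketch of the Lever #1 flow above: filter to active skills whose intended_roles contain the role, then render a "## YOUR SKILLS" block. The `Skill` interface, the function names, and the wording of the workflow line are assumptions for illustration; only the field names skill_id and intended_roles and the two tool names come from the commit message.

```typescript
// Hypothetical shape of a skill row; the real schema likely has more fields.
interface Skill {
  skill_id: string;
  description: string;
  status: "active" | "archived";
  intended_roles: string[];
}

// Mirrors the by-role filter: active skills whose intended_roles contain the role.
function skillsForRole(skills: Skill[], role: string): Skill[] {
  return skills.filter(
    (s) => s.status === "active" && s.intended_roles.indexOf(role) !== -1
  );
}

// Renders the block that would be injected into the agent's prompt at spawn.
function renderSkillBlock(skills: Skill[], role: string): string {
  const lines = ["## YOUR SKILLS"];
  for (const s of skillsForRole(skills, role)) {
    lines.push(`- ${s.skill_id}: ${s.description}`);
  }
  // Illustrative wording of the workflow reminder, not the shipped text.
  lines.push("Workflow: zc_skill_show before use, zc_record_skill_outcome when done.");
  return lines.join("\n");
}
```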

Lever #2 (shipped): auto-inject skills into zc_recall_context response
  - /api/v1/recall now accepts ?role=<role> param
  - Returns {skills: [...]} alongside facts
  - MCP server's zc_recall_context tool reads ZC_AGENT_ROLE env, forwards
    to API, appends "## Skills available for role 'X' (N)" section
  - Fires automatically on every session start (SessionStart hook calls
    zc_recall_context) — skill awareness reinforced not just at spawn but
    on every recall

Lever #4 (shipped): MERGE-time skill-record mandate in role prompts
  - Both $orchSystem and every $workerSystem now end with
    "SKILL-OUTCOME RECORDING (MANDATORY before MERGE)" section
  - Prescribes the exact zc_record_skill_outcome call shape + parameters
  - Explains why it matters ("system learns from your work")

Lever #3 (deferred to v0.22+): PreTool hook nudge on Edit/Bash without
  recent skill_run. Defer until we observe v0.21.0 skill-record rates
  for a week. Hint fatigue risk; ship only if needed.

Lever #5 (DESIGNED but DELIBERATELY UNSHIPPED): hard PreTool block.
  Refuse Edit/Write/Bash until skill_run recorded this session. Full
  design + impl notes + escape hatch spec preserved in
  docs/SKILL_ENFORCEMENT.md so the work isn't lost. Operator should
  ship #5 ONLY if v0.21.0 skill-record rate observed below 50% for a
  week — rigidity risk too high otherwise. Gate: ZC_SKILL_HARD_ENFORCE=1.

Combined effect: 3 reinforcing signals (spawn + every recall + closing
instruction) make skill-recording natural rather than forced. Expected
~70-85% skill-record rate; that's enough to make the loop work
without ever needing #5.

Operator action required: stop-agents.ps1 + start-agents.ps1 to spawn
fresh agents that get the v0.21.0 prompt injection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
iampantherr pushed a commit that referenced this pull request May 4, 2026
Four Phase-1 features that lift the quality + safety + signal bar for every
skill that lands in skills_pg, regardless of source (operator, mutator,
marketplace pull, auto-import).

#4 Lint (src/skills/lint.ts + lint.test.ts)
  9-rule structural quality bar enforced at loadSkillFromPath AND
  storage_dual.upsertSkill. Errors reject; warnings logged. 28/28 tests.
  npm scripts: `lint:skills` and `lint:skills:strict`.

#2 Polisher (src/skills/polisher.ts)
  LLM refines skill descriptions. Two backends: local-mock (deterministic)
  + Sonnet (Anthropic API). Output runs through lint before being returned.
  Two new dashboard endpoints: POST /dashboard/skills/:id/polish (suggest)
  and POST /dashboard/skills/:id/apply-polish (apply).

#1 Security scan (src/skills/security_scan.ts + PG migration 20)
  8-point check: secrets, prompt injection, tool spawn, filesystem escape,
  network exfil, sleep abuse, body length, frontmatter integrity.
  Severity-aware gate: ANY block-severity failure rejects regardless of
  score (a leaked OpenAI key scoring 7/8 is still rejected). Audit log
  in skill_security_scans_pg with source attribution.

F Operator exemplars (PG migration 21 + dashboard ⭐ endpoint + mutator)
  skill_runs_pg.is_exemplar + tagging metadata. New endpoint:
  POST /dashboard/skill-runs/:run_id/tag-exemplar. MutationContext gains
  optional exemplars[] field; orchestrator pulls top-5 via getExemplarRuns
  before invoking mutator. buildProposerPrompt injects an "Operator-tagged
  exemplars" section so mutator candidates preserve the patterns the
  operator marked as good.

Tests: 858/862 pass (5 pre-existing failures on baseline, unrelated).
Live E2E verified on Test_Agent_Coordination: 5/5 gate scenarios pass,
exemplar pipeline pass, polisher pass, audit log writes, container running
v0.23.0, terminal agents launched + registered + stopped cleanly.

Bugs found + fixed during E2E:
  1. Initial scan gate was score-only — a skill with leaked key scored
     7/8 and would pass. Fixed to severity-based: block-severity failure
     rejects regardless of score.
  2. getExemplarRuns referenced non-existent `evidence` column on
     skill_runs_pg → caught when wiring exemplars into proposer prompt.
     Dropped from SELECT; mutator template handles undefined gracefully.
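
Bug 1 above is easiest to see side by side. The sketch below contrasts a score-only gate with a severity-aware one; the `CheckResult` shape, the function names, the pass threshold, and which checks carry block severity are all assumptions, not the contents of src/skills/security_scan.ts.

```typescript
// Hypothetical result of one scan check.
type Severity = "block" | "warn";
interface CheckResult { name: string; passed: boolean; severity: Severity }

// The original behaviour: pass if enough checks passed, regardless of which failed.
function scoreOnlyGate(results: CheckResult[], minPassed: number): boolean {
  return results.filter((r) => r.passed).length >= minPassed;
}

// The fixed behaviour: any failed block-severity check rejects outright;
// the score is only consulted when no blocking failure exists.
function severityAwareGate(results: CheckResult[], minPassed: number): boolean {
  const hasBlockingFailure = results.some((r) => !r.passed && r.severity === "block");
  if (hasBlockingFailure) return false;
  return scoreOnlyGate(results, minPassed);
}
```

With a leaked key failing the secrets check and the other seven checks passing, a score-only gate with a threshold of 7 lets the skill through, while the severity-aware gate rejects it.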

Test fixtures: synthetic round-trip fixtures use new "test" SkillUpsertSource
to bypass the gates (intentionally short bodies). Production callers always
get the gates.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>