Merged
44 changes: 24 additions & 20 deletions AGENTS.md
@@ -17,14 +17,16 @@ When running evals or testing skills, create all workspaces in a temp location:

**Why:** Eval artifacts — branches, commits, local git config — leak into the real repo history and are painful to clean up. The skill source lives in a git repo; eval output does not belong here.

## Per-Skill Evals

Every repo-managed skill must include its own `evals/evals.json` file at `skills/<name>/evals/evals.json`.

- Treat this as a required artifact for every first-party skill in this repo
- Eval entries may include an optional `files` array of skill-relative fixture paths such as `evals/files/example.md`
- When `files` is present, keep the paths relative to `skills/<name>/` and stage those fixtures into the temp eval workspace for both `with_skill` and `without_skill` runs
- Run evals **per skill**, not as one shared repo-level eval file
- Run evals from a temp workspace such as `$env:TEMP/<skill-name>-workspace/`, never from inside this repository
- When creating or modifying a repo-managed skill, run both `with_skill` and `without_skill` comparison executions from that temp workspace before the work is considered complete
- For a brand-new skill, the baseline is `without_skill`; for an existing skill, use either `without_skill` or the previous/original skill version as the baseline, matching the `skill-creator` benchmark flow
- Generate the human-review artifacts too: aggregate the comparison into `benchmark.json` and launch `eval-viewer/generate_review.py` from the installed Anthropic `skill-creator` copy (typically under `~/.agents/skills/skill-creator/` or `~/.claude/skills/skill-creator/`) so the user can inspect `Outputs` and `Benchmark` before sign-off
- Deterministic scaffold/template skills must keep local deterministic validators as well; evals supplement validators, they do not replace them
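The fixture-staging flow described above can be sketched as follows. This is a minimal illustration, not shipped tooling; `stage_eval_workspace` and the workspace naming are assumptions:

```python
import json
import shutil
import tempfile
from pathlib import Path

def stage_eval_workspace(skill_dir: Path) -> Path:
    """Create a temp workspace and copy any evals.json fixture files into it."""
    evals = json.loads((skill_dir / "evals" / "evals.json").read_text(encoding="utf-8"))
    workspace = Path(tempfile.mkdtemp(prefix=f"{evals['skill_name']}-workspace-"))
    for entry in evals["evals"]:
        # "files" is optional; paths are relative to skills/<name>/
        for rel in entry.get("files", []):
            src = skill_dir / rel
            dst = workspace / rel
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)
    return workspace
```

The same staged workspace can then feed both the `with_skill` and `without_skill` runs so they see identical inputs.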
@@ -80,20 +82,22 @@ After changing any repo-managed skill, sync the touched files across the repo co
Every skill follows this layout:

```
skills/<name>/
├── SKILL.md # Required — the skill definition (loaded by Claude)
├── FORMS.md # Optional — structured form fields for parameter collection
├── assets/ # Optional — file templates, fonts, icons used in output
│ └── <variant>/ # Group by variant when a skill supports multiple (e.g. library/, app/)
├── scripts/ # Optional — executable code (Python, Bash, etc.)
├── references/ # Optional — detailed reference docs the agent consults during generation
└── evals/ # Required for repo-managed skills — per-skill eval prompts and expectations
└── files/ # Optional — input fixtures referenced by evals/evals.json files[]
```

- `SKILL.md` is the entry point — it contains the workflow, conventions, and step-by-step instructions
- `assets/` holds file templates, fonts, icons, and other static content used in output (the agent reads and substitutes placeholders)
- `references/` holds detailed specs that `SKILL.md` references but are too long to inline
- `evals/` holds the per-skill `evals.json` definitions used to verify that the skill still works after changes
- `evals/files/` holds optional skill-local fixture inputs referenced by `evals/evals.json` when a benchmark needs attached source material
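A scaffold for this layout might look like the sketch below; `scaffold_skill` is illustrative only, and the placeholder front matter is an assumption, not the repo's template:

```python
from pathlib import Path

def scaffold_skill(root: Path, name: str) -> Path:
    """Create the minimal layout shown above; evals/ is required, the rest optional."""
    skill = root / "skills" / name
    for sub in ("assets", "scripts", "references", "evals/files"):
        (skill / sub).mkdir(parents=True, exist_ok=True)
    # SKILL.md front matter needs at least name and description
    (skill / "SKILL.md").write_text(
        f"---\nname: {name}\ndescription: TODO\n---\n", encoding="utf-8")
    (skill / "evals" / "evals.json").write_text(
        '{"skill_name": "%s", "evals": []}' % name, encoding="utf-8")
    return skill
```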

## Template Files Are Literal

23 changes: 22 additions & 1 deletion CHANGELOG.md
@@ -6,6 +6,26 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),

## [Unreleased]

## [0.3.2] - 2026-03-23

This is a minor release introducing the markdown-illustrator skill for visualization-first document analysis, along with expanded repository branding, comprehensive skill documentation, and foundational eval fixture infrastructure across the skill suite.

### Added

- `markdown-illustrator` skill that reads markdown files and generates a document-wide Visual Brief plus one compiled diffusion-ready prompt, with zero follow-up questions and inferred visual strategy defaults (hero-focused cinematic editorial by default, steerable toward whiteboard, blackboard, isometric, or blueprint styles).
- Hero image assets for repository branding at `/assets/hero.jpg` and for individual skills (`trunk-first-repo/assets/hero.jpg`).
- Optional `files` array support in eval infrastructure (`evals/evals.json`) to stage skill-relative fixture paths into temporary eval workspaces for both `with_skill` and `without_skill` runs.
- Eval fixtures for `markdown-illustrator` with real-world examples (microservices architecture, product launch, transformers explanation).
- Benchmark contract reference documentation in `skill-creator-agnostic` with fixture guidance patterns.

### Changed

- Enhanced README with markdown-illustrator installation snippet and comprehensive "Why markdown-illustrator?" section explaining visual-brief anchoring, inferred defaults, good trigger examples, and reference visual directions for users.
- Extended AGENTS.md with detailed eval fixture file documentation, explaining the optional `files` property and fixture staging workflow for skill evaluation.
- Updated CONTRIBUTING.md with eval fixture guidance and temp-workspace isolation setup instructions.
- Improved validation script (`validate-skill-templates.ps1`) to enforce fixture file path checks and consistency across skills.
- Applied fixture guidance pattern to `skill-creator-agnostic` with benchmark contract examples and reference documentation.

## [0.3.1] - 2026-03-19

This is a patch release introducing three new NuGet-focused skills and runner-agnostic benchmark tooling, with enhanced release automation, comprehensive documentation standardization, and skill refinements.
@@ -82,7 +102,8 @@ This is a minor release that introduces two complementary git workflow skills, e

- Improved scaffold fidelity with hidden `.bot` asset preservation, explicit UTF-8 and BOM handling, and checks aimed at preventing mojibake or incomplete generated output.

[Unreleased]: https://github.com/codebeltnet/agentic/compare/v0.3.2...HEAD
[0.3.2]: https://github.com/codebeltnet/agentic/compare/v0.3.1...v0.3.2
[0.3.1]: https://github.com/codebeltnet/agentic/compare/v0.3.0...v0.3.1
[0.3.0]: https://github.com/codebeltnet/agentic/compare/v0.2.0...v0.3.0
[0.2.0]: https://github.com/codebeltnet/agentic/compare/v0.1.0...v0.2.0
34 changes: 19 additions & 15 deletions CONTRIBUTING.md
@@ -68,18 +68,21 @@ The `description` is the most important field — it's how the AI decides to loa

Evals let you verify the skill works and measure improvement over a baseline. Every repo-managed skill in this repository must include `evals/evals.json`:

```json
{
"skill_name": "your-skill-name",
"evals": [
{
"id": 0,
"prompt": "The user message to test against",
"expected_output": "What a correct response looks like — used for manual or automated grading",
"files": ["evals/files/example.md"]
}
]
}
```

`files` is optional. When present, list one or more fixture files relative to `skills/<name>/`. A common pattern is to store those fixtures under `evals/files/` so benchmark runners can copy or attach the same source inputs for both `with_skill` and `without_skill` runs.
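A runner consuming this schema could resolve the `files` entries like the sketch below; `fixture_paths` is a hypothetical helper, not part of any published runner:

```python
import json
from pathlib import Path

def fixture_paths(skill_dir: Path) -> list:
    """Resolve every optional files[] entry in evals.json against the skill folder."""
    data = json.loads((skill_dir / "evals" / "evals.json").read_text(encoding="utf-8"))
    resolved = []
    for entry in data["evals"]:
        for rel in entry.get("files", []):  # "files" is optional per eval
            path = skill_dir / rel
            if not path.is_file():
                raise FileNotFoundError(f"{data['skill_name']}: missing fixture {rel}")
            resolved.append(path)
    return resolved
```

Failing fast on a missing fixture keeps the `with_skill` and `without_skill` runs from silently diverging on their inputs.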

Aim for 3–5 evals that cover distinct scenarios: happy path, edge cases, and cases where the skill should *not* do something.

@@ -130,9 +133,10 @@ powershell -NoProfile -ExecutionPolicy Bypass -File .\scripts\validate-skill-tem
- [ ] `SKILL.md` has valid front matter with `name` and `description`
- [ ] Skill is stack-agnostic (or clearly scoped to a specific tech in the name/description)
- [ ] Examples are generic — no personal emails, usernames, or project-specific identifiers
- [ ] At least one eval in `evals/evals.json`
- [ ] The skill's `evals/evals.json` exists and its `skill_name` matches the folder/frontmatter name
- [ ] Any optional `files` entries in `evals/evals.json` point to real fixture files under the same skill folder
- [ ] Skill changes were benchmarked from a temp workspace with both `with_skill` and `without_skill` runs
- [ ] `benchmark.json` and `eval-viewer/generate_review.py` from the installed Anthropic `skill-creator` copy were used so a human could compare `Outputs` and `Benchmark`
- [ ] `scripts/validate-skill-templates.ps1` passes for the current working tree when changing scaffold or template behavior
- [ ] If CI is enabled for the branch, the GitHub Actions validation job passes too
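The eval-related checklist items above could be automated with a small check like this sketch (`check_skill_evals` is illustrative, not an existing script in this repo):

```python
import json
from pathlib import Path

def check_skill_evals(skill_dir: Path) -> None:
    """Assert the eval-related checklist items for one skill folder."""
    evals_path = skill_dir / "evals" / "evals.json"
    assert evals_path.is_file(), "evals/evals.json is required"
    data = json.loads(evals_path.read_text(encoding="utf-8"))
    assert data["skill_name"] == skill_dir.name, "skill_name must match the skill folder name"
    assert data["evals"], "at least one eval is required"
    for entry in data["evals"]:
        # optional files[] entries must point at real fixtures under the skill folder
        for rel in entry.get("files", []):
            assert (skill_dir / rel).is_file(), f"missing fixture: {rel}"
```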