Merged
2 changes: 1 addition & 1 deletion .claude-plugin/marketplace.json
@@ -13,7 +13,7 @@
"name": "agentops-accelerator",
"source": "../../plugins/agentops",
"description": "Copilot agent skills for running standardized evaluation workflows with AgentOps Toolkit and Microsoft Foundry agents.",
"version": "0.3.8",
"version": "0.3.12",
"keywords": [
"agentops",
"evaluation",
2 changes: 1 addition & 1 deletion .github/plugin/marketplace.json
@@ -13,7 +13,7 @@
"name": "agentops-accelerator",
"source": "../../plugins/agentops",
"description": "Copilot agent skills for running standardized evaluation workflows with AgentOps Toolkit and Microsoft Foundry agents.",
"version": "0.3.8",
"version": "0.3.12",
"keywords": [
"agentops",
"evaluation",
26 changes: 26 additions & 0 deletions CHANGELOG.md
@@ -3,6 +3,32 @@
All notable changes to this project will be documented in this file.
This format follows [Keep a Changelog](https://keepachangelog.com/) and adheres to [Semantic Versioning](https://semver.org/).

## [Unreleased]

## [0.3.12] - 2026-06-09

### Added
- **Foundry observability readiness now spans eval, Doctor, Cockpit, and release evidence.**
`agentops.yaml` supports `dataset_kind`, `rubrics`, and `observability`
metadata for multi-turn coverage, rubric evaluator gates, trace sampling, and
replay/evaluation/dataset links. Doctor and Cockpit surface the readiness
state without mutating cloud resources, and release evidence records the same
signals for reviewers.
- **Trace promotion preserves evaluation lineage.** `agentops eval
promote-traces` now carries operation/span IDs, source system, agent version,
replay/evaluation URLs, sampling policy, and multi-turn message fields into
candidate datasets and their manifest.

### Changed
- **Rubric evaluators are executed through the azd backend.** When `rubrics:`
is configured, `agentops eval init` includes those evaluator names in the azd
recipe and `agentops eval run` fails closed outside `execution: azd`, so rubric
scores cannot be treated as evidence unless Foundry / azd actually ran them.
- **Tutorials now carry rubric and observability proof into evaluation and CI/CD.**
The Travel Agent flow keeps the existing smoke recording through step 10, then
upgrades the gate to multi-turn dataset rows, rubric thresholds, trace
sampling/replay lineage, and CI/CD workflows that reuse the same eval contract.

## [0.3.11] - 2026-06-08

### Fixed
16 changes: 13 additions & 3 deletions docs/tutorial-end-to-end.md
@@ -444,6 +444,13 @@ Foundry through `agentops eval run`, so AgentOps can enforce thresholds and writ
repo-side evidence. AgentOps keeps the local path for hosted endpoints, models,
unsupported evaluator mappings, and fallback cases.

When the quality gate uses a task-specific rubric, choose the azd runner instead
of local execution. Add `rubrics:` to `agentops.yaml`, set
`rubrics[].evaluator` to the Foundry / azd evaluator name, set
`execution: azd`, and run `agentops eval init --force`. AgentOps then passes the
rubric evaluator into the generated azd recipe and fails closed if someone tries
to run that rubric gate with the local backend.
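
The fail-closed rule in the paragraph above can be sketched in Python. This is a minimal illustration under stated assumptions, not AgentOps' actual implementation: the function name `assert_rubric_backend`, the `RubricGateError` exception, the plain-dict config shape, and the `local` default are all hypothetical. Only the `rubrics` and `execution` keys come from the text.

```python
class RubricGateError(RuntimeError):
    """Hypothetical error raised when a rubric gate cannot be honored."""


def assert_rubric_backend(config: dict) -> None:
    """Refuse to continue when rubrics are declared but the backend is not azd."""
    rubrics = config.get("rubrics") or []
    # Assumed default for illustration; the real default may differ.
    execution = config.get("execution", "local")
    if rubrics and execution != "azd":
        names = ", ".join(r.get("evaluator", "<unnamed>") for r in rubrics)
        raise RubricGateError(
            f"rubric evaluators ({names}) require execution: azd, got {execution!r}"
        )
```

Raising instead of silently skipping the rubric is the point of the design: a run that cannot execute the evaluator never produces a score that could be mistaken for evidence.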

## 5. Run the first eval

For hosted agents or local fallback:
@@ -651,7 +658,9 @@ agentops workflow generate `

The generated workflows are intentionally boring:

- PR gate: evaluate and publish report/evidence.
- PR gate: evaluate and publish report/evidence. If `agentops.yaml` declares
rubric evaluators, this is the same azd/Foundry rubric gate you ran locally;
the PR does not downgrade to a plain smoke test.
- Dev/QA/Prod: deploy with azd or placeholders, then run readiness checks.
- Optional Doctor cadence: generate `--kinds doctor` separately if you want a
scheduled readiness run outside PRs.
@@ -698,10 +707,11 @@ Use this loop in the video:
| Signal | Foundry or Azure Monitor action | AgentOps handoff |
|---|---|---|
| App Insights connection | In Foundry, open the project or agent **Traces** view and connect an App Insights resource. Verify it under project connected resources. | Doctor checks whether telemetry wiring is discoverable. |
| Trace sampling | Configure the project's trace sampling policy in Foundry or the hosted-agent observability settings your team owns. Keep the policy name in `agentops.yaml` under `observability.trace_sampling`. | Doctor/evidence can show reviewers that live-quality sampling exists before traces are promoted. |
| Live trace | Run one playground prompt for a Prompt Agent, or call the hosted endpoint a few times. Open the agent **Traces** tab, wait 2-5 minutes if needed, and click the Trace ID. In the modal, inspect spans plus the **Input + Output** and **Metadata** tabs. | Evidence and Cockpit link reviewers back to the runtime view. |
| Operate summary | Switch to **Operate** -> **Overview**, select the same subscription/project, wait for metrics to sync, and use **Ask AI** for dashboard-level questions such as `Help me identify any issues or anomalies in my agent metrics.` | The summary informs the release discussion; AgentOps does not rewrite it. |
| Eval context | From a Foundry eval run, inspect row-level explanations and, when available, the trace attached to the interaction. | The repo keeps the exact target, dataset, gate, and evidence together. |
| Trace learning | Export or curate traces that represent real issues. | `agentops eval promote-traces` turns reviewed traces into regression candidates. |
| Eval context | From a Foundry eval run, inspect row-level explanations, rubric scores, and, when available, the trace attached to the interaction. | The repo keeps the exact target, dataset, rubric gate, and evidence together. |
| Trace learning | Export or curate traces that represent real issues, including conversation turns when present. | `agentops eval promote-traces` turns reviewed traces into regression candidates and preserves replay/evaluation lineage. |

For the screen recording, make the Foundry side visible before opening AgentOps
Cockpit:
15 changes: 15 additions & 0 deletions docs/tutorial-hosted-agent-quickstart.md
@@ -648,6 +648,14 @@ This is the core AgentOps loop for hosted endpoints: keep a stable dataset,
compare a changed runtime against the last known result, fix the agent, and
rerun the same gate before a PR or release.

If this hosted endpoint is backed by a Foundry / azd eval recipe, you can use
the same rubric contract as the prompt-agent Travel Agent tutorial before you
generate CI: set `execution: azd`, add `dataset_kind: multi-turn`, declare
`rubrics[].evaluator` in `agentops.yaml`, run `agentops eval init --force`, and
then run `agentops eval run`. AgentOps will require the azd backend whenever
rubrics are configured, so a passing hosted-agent gate means the rubric evaluator
actually ran instead of being recorded as metadata only.

## 10. Generate CI and Doctor evidence

Generate both the PR and dev deploy workflows with `--doctor-gate critical`
@@ -666,6 +674,13 @@ code .agentops\agent\report.md
code .agentops\release\latest\evidence.md
```

The generated PR gate reuses the same `agentops.yaml` contract. If you promoted
the hosted endpoint to an azd/Foundry eval recipe with rubrics, CI runs that
recipe and blocks on the rubric thresholds; otherwise it runs the local hosted
endpoint gate and normalized thresholds. In both cases Doctor and the evidence
pack surface multi-turn coverage, trace sampling readiness, replay/evaluation
links, and trace-to-dataset lineage when those signals exist.

> **`--deploy-mode prompt-agent` does not apply to hosted endpoints.**
> That mode is specific to Foundry prompt agents (the stage-prompt-as-
> candidate flow). For hosted endpoints, `agentops workflow generate`
71 changes: 69 additions & 2 deletions docs/tutorial-prompt-agent-quickstart.md
@@ -803,6 +803,66 @@ You should see `execution: azd` and `Threshold status: PASSED`. The raw azd run
details are kept under `.agentops/results/latest/` alongside AgentOps'
normalized `results.json` and `report.md`.

Before generating CI, turn the Travel Agent gate from a basic smoke test into
the proof you want reviewers to see later. Keep the recording you already made
through this step: the smoke run above proves the workspace works. The next
commands only harden the same gate.

Create a small conversation-shaped dataset. It still keeps `input` and
`expected` so AgentOps and azd can route the row, but it also carries the
conversation turns that multi-turn evaluators and trace-derived rows use:

```powershell
@'
{"input":"Plan a three-day Rome trip for a family with kids. Ask one clarification if needed.","expected":"The agent should preserve the family-with-kids constraint, propose a practical three-day Rome itinerary, include transit/rest pacing, and avoid claiming it can book live reservations.","messages":[{"role":"user","content":"We want to visit Rome with two kids."},{"role":"assistant","content":"How many days do you have and what pace do you prefer?"},{"role":"user","content":"Three days, moderate pace, museums and food."}]}
{"input":"Help me choose between Lisbon and Seattle for a low-budget food weekend.","expected":"The agent should compare both destinations, mention budget tradeoffs, food activities, transit/weather notes, and avoid unsupported price or booking claims.","messages":[{"role":"user","content":"I need a low-budget food weekend."},{"role":"assistant","content":"Are you choosing between specific cities?"},{"role":"user","content":"Lisbon or Seattle."}]}
'@ | Set-Content -Encoding utf8 .agentops\data\travel-conversations.jsonl
```
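
Before pointing the gate at the new file, it can help to confirm that every row has the shape described above. The following is a minimal sketch, assuming only the three keys shown in the sample rows (`input`, `expected`, `messages`); the function name `validate_rows` and the allowed role set are illustrative assumptions, not part of AgentOps.

```python
import json

REQUIRED_KEYS = ("input", "expected", "messages")
ALLOWED_ROLES = {"user", "assistant", "system"}  # assumed role set


def validate_rows(lines):
    """Return a list of problems; an empty list means the rows look well-formed."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            row = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"row {number}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(row, dict):
            problems.append(f"row {number}: expected a JSON object")
            continue
        for key in REQUIRED_KEYS:
            if key not in row:
                problems.append(f"row {number}: missing '{key}'")
        messages = row.get("messages", [])
        if not isinstance(messages, list):
            problems.append(f"row {number}: 'messages' must be a list")
            continue
        for index, message in enumerate(messages):
            valid = (
                isinstance(message, dict)
                and message.get("role") in ALLOWED_ROLES
                and bool(message.get("content"))
            )
            if not valid:
                problems.append(f"row {number}: message {index} needs a role and content")
    return problems
```

To check the file itself, read it with `encoding="utf-8-sig"` and pass `text.splitlines()` to `validate_rows`. Windows PowerShell 5.1 writes a byte-order mark with `-Encoding utf8`, and `utf-8-sig` strips it if it is present.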

Then update the evaluation contract in `agentops.yaml`. The important part is
that `rubrics[].evaluator` names the rubric evaluator that Foundry / azd will
run. If your Foundry Observe flow generated a different rubric evaluator name,
use that exact name here.

```yaml
dataset: .agentops/data/travel-conversations.jsonl
dataset_kind: multi-turn

rubrics:
  - name: travel-concierge-quality
    evaluator: travel-concierge-quality
    description: Scores the Travel Agent against the intended product behavior.
    dimensions:
      - name: task_success
        description: Completes the user's travel-planning goal across the conversation.
        weight: 0.5
      - name: constraint_following
        description: Carries user constraints such as kids, budget, duration, and pace.
        weight: 0.3
      - name: safe_booking_behavior
        description: Avoids claiming live bookings, confirmations, or prices it cannot verify.
        weight: 0.2

thresholds:
  task_success: ">=4"
  constraint_following: ">=4"
  safe_booking_behavior: ">=4"
```

Re-run init so the azd recipe includes the rubric evaluator in the actual
evaluation, not only in documentation:

```powershell
agentops eval init --force
agentops eval run
```

If the rubric evaluator name is wrong or missing in Foundry, the run should fail
closed. That is intentional: a green gate must mean the rubric really ran. When
it passes, `results.json` records `execution: azd`, the evaluator list, the
rubric metadata from `agentops.yaml`, and threshold results for the rubric
dimensions.
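
To make the threshold semantics concrete, here is a minimal Python sketch of how expressions such as `">=4"` could be applied to per-dimension scores. It is an illustration only: the function name `check_thresholds`, the flat score dictionary, and every operator other than `>=` are assumptions, and AgentOps' own parser and score scale may differ.

```python
import operator
import re

# Only ">=" appears in the tutorial; the other operators are assumed for illustration.
_OPERATORS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
}
_PATTERN = re.compile(r"^\s*(>=|<=|==|>|<)\s*(-?\d+(?:\.\d+)?)\s*$")


def check_thresholds(scores, thresholds):
    """Map each threshold name to True (passed) or False (failed or missing)."""
    results = {}
    for name, expression in thresholds.items():
        match = _PATTERN.match(expression)
        if match is None:
            raise ValueError(f"unsupported threshold expression: {expression!r}")
        compare = _OPERATORS[match.group(1)]
        limit = float(match.group(2))
        # A dimension with no score fails, mirroring the fail-closed intent.
        results[name] = name in scores and compare(scores[name], limit)
    return results
```

Treating a missing score as a failure keeps the sketch in line with the tutorial's rule that a green gate must mean the rubric really ran.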

## 11. Generate the PR + dev deploy workflows

> **Pipeline ownership.** This tutorial uses `agentops workflow generate`
@@ -846,7 +906,11 @@ The PR workflow now has two jobs:
`.agentops/deployments/agentops.candidate.yaml` pointing at the
staged candidate.
2. **`eval`** — runs `agentops eval run` against the candidate, then
runs Doctor with `--severity-fail critical`.
runs Doctor with `--severity-fail critical`. Because the previous step
moved the gate to `execution: azd` with `rubrics:`, the workflow is not
just checking a smoke response: it runs the Foundry / azd evaluation recipe,
applies the Travel Agent rubric dimensions as thresholds, and writes the
normalized rubric evidence to `.agentops/results/latest/results.json`.

> **Why does the PR workflow stage in dev, not sandbox?** The PR gate
> must evaluate the same target the deploy workflow will use. Sandbox
@@ -859,6 +923,9 @@ The PR workflow now has two jobs:
The dev deploy workflow stages a candidate (same logic), evaluates it,
summarizes the deployment via `prompt_deploy summarize`, and uploads
`.agentops/deployments/foundry-agent.json` as a workflow artifact.
The deploy gate uses the same rubric-aware `agentops eval run`, so the candidate
that lands in dev has already passed the conversation/rubric gate reviewers saw
on the PR.

The `--doctor-gate critical` flag controls the Doctor severity floor in
the PR template. The table below summarizes the three values:
@@ -1327,7 +1394,7 @@ deploys, explicit thresholds, or red-team/governance evidence. Treat those as th
hardening backlog. The eval gates and the dev deploy loop are
production-ready.

If you want to show the Build 2026 governance story in the video, keep it as a
If you want to show the governance evidence path in the video, keep it as a
short optional callout:

```powershell
2 changes: 1 addition & 1 deletion plugins/agentops/package.json
@@ -2,7 +2,7 @@
"name": "agentops-accelerator",
"displayName": "AgentOps Accelerator — Skills for GitHub Copilot",
"description": "Copilot agent skills for running standardized evaluation workflows with AgentOps Accelerator and Microsoft Foundry agents.",
"version": "0.3.8",
"version": "0.3.12",
"publisher": "AgentOpsAccelerator",
"icon": "icon.png",
"license": "MIT",
2 changes: 1 addition & 1 deletion plugins/agentops/plugin.json
@@ -1,7 +1,7 @@
{
"name": "agentops-accelerator",
"description": "Copilot agent skills for running standardized evaluation workflows with AgentOps Accelerator and Microsoft Foundry agents.",
"version": "0.3.8",
"version": "0.3.12",
"author": {
"name": "AgentOps Accelerator",
"url": "https://github.com/Azure/agentops"
2 changes: 2 additions & 0 deletions src/agentops/agent/analyzer.py
@@ -12,6 +12,7 @@
from agentops.agent.checks.foundry_config import run_foundry_config_check
from agentops.agent.checks.governance import run_governance_check
from agentops.agent.checks.latency import run_latency_check
from agentops.agent.checks.observability import run_observability_check
from agentops.agent.checks.opex_workspace import run_opex_workspace_check
from agentops.agent.checks.opex import run_opex_check
from agentops.agent.checks.posture import run_posture_check
@@ -146,6 +147,7 @@ def analyze(
findings.extend(run_posture_check(resources, posture_config))
findings.extend(run_opex_workspace_check(workspace))
findings.extend(run_governance_check(workspace))
findings.extend(run_observability_check(workspace))
findings.extend(run_opex_check(history, config.checks.opex))
findings.extend(run_release_readiness_check(workspace, history, foundry))
findings.extend(
56 changes: 56 additions & 0 deletions src/agentops/agent/checks/catalog.py
@@ -141,6 +141,18 @@
"safety.config.continuous_eval_disabled": (
"https://learn.microsoft.com/azure/ai-foundry/how-to/online-evaluation"
),
"observability.multiturn_coverage_missing": (
"https://learn.microsoft.com/azure/foundry/concepts/observability"
),
"observability.rubric_missing": (
"https://learn.microsoft.com/azure/foundry/concepts/observability"
),
"observability.trace_sampling_missing": (
"https://learn.microsoft.com/azure/foundry/concepts/observability"
),
"observability.trace_replay_missing": (
"https://learn.microsoft.com/azure/foundry/concepts/observability"
),
}


@@ -199,6 +211,28 @@ def is_llm_judged(self) -> bool:
requires=("results_history",),
flags=("dynamic_id",),
),
CheckSpec(
id="observability.multiturn_coverage_missing",
category=Category.QUALITY,
title="Multi-turn evaluation coverage is not declared yet",
summary=(
"The workspace does not declare multi-turn dataset coverage or "
"trace-derived conversation rows for Foundry multi-turn evals."
),
severities=(Severity.INFO,),
requires=("workspace",),
),
CheckSpec(
id="observability.rubric_missing",
category=Category.QUALITY,
title="No context-specific rubric evaluator is declared",
summary=(
"The workspace does not declare a Foundry rubric evaluator or "
"rubric dimensions that can be bound to release thresholds."
),
severities=(Severity.INFO,),
requires=("workspace",),
),
# ------------------------------------------------------------------
# Performance
# ------------------------------------------------------------------
@@ -438,6 +472,28 @@ def is_llm_judged(self) -> bool:
severities=(Severity.WARNING,),
requires=("foundry_control",),
),
CheckSpec(
id="observability.trace_sampling_missing",
category=Category.OPERATIONAL_EXCELLENCE,
title="Intelligent trace sampling is not evidence-ready",
summary=(
"The workspace does not declare Foundry trace sampling and the "
"trace-regression manifest does not include sampling lineage."
),
severities=(Severity.WARNING,),
requires=("workspace",),
),
CheckSpec(
id="observability.trace_replay_missing",
category=Category.OPERATIONAL_EXCELLENCE,
title="Trace replay link is not captured in release evidence",
summary=(
"The workspace has no trace replay URL in agentops.yaml or in "
"trace-derived dataset lineage."
),
severities=(Severity.INFO,),
requires=("workspace",),
),
CheckSpec(
id="opex.results_not_gitignored",
category=Category.OPERATIONAL_EXCELLENCE,
Expand Down
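
The four `observability.*` catalog entries above describe what the new check reports. The body of `run_observability_check` is not part of this diff, so the following is only a rough sketch of the idea. It assumes a plain-dict view of `agentops.yaml`; the function name `missing_observability_signals` and the `replay_url` key are hypothetical; and it ignores the trace-regression manifest and trace-derived lineage that the catalog summaries also mention. The check IDs and severities are taken from the catalog entries.

```python
def missing_observability_signals(config: dict) -> list[tuple[str, str]]:
    """Return (check_id, severity) pairs for readiness signals that are absent."""
    observability = config.get("observability") or {}
    findings = []
    if config.get("dataset_kind") != "multi-turn":
        findings.append(("observability.multiturn_coverage_missing", "info"))
    if not config.get("rubrics"):
        findings.append(("observability.rubric_missing", "info"))
    if not observability.get("trace_sampling"):
        findings.append(("observability.trace_sampling_missing", "warning"))
    # "replay_url" is a hypothetical key name used only for this sketch.
    if not observability.get("replay_url"):
        findings.append(("observability.trace_replay_missing", "info"))
    return findings
```

The sketch only reads the configuration, consistent with the changelog's statement that Doctor surfaces readiness without mutating cloud resources.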