Azure · placerda · Jun 9, 2026
diff --git a/.claude-plugin/marketplace.json b/.claude-plugin/marketplace.json
@@ -13,7 +13,7 @@
       "name": "agentops-accelerator",
       "source": "../../plugins/agentops",
       "description": "Copilot agent skills for running standardized evaluation workflows with AgentOps Toolkit and Microsoft Foundry agents.",
-      "version": "0.3.8",
+      "version": "0.3.12",
       "keywords": [
         "agentops",
         "evaluation",

diff --git a/.github/plugin/marketplace.json b/.github/plugin/marketplace.json
@@ -13,7 +13,7 @@
       "name": "agentops-accelerator",
       "source": "../../plugins/agentops",
       "description": "Copilot agent skills for running standardized evaluation workflows with AgentOps Toolkit and Microsoft Foundry agents.",
-      "version": "0.3.8",
+      "version": "0.3.12",
       "keywords": [
         "agentops",
         "evaluation",

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -3,6 +3,32 @@
 All notable changes to this project will be documented in this file.
 This format follows [Keep a Changelog](https://keepachangelog.com/) and adheres to [Semantic Versioning](https://semver.org/).
 
+## [Unreleased]
+
+## [0.3.12] - 2026-06-09
+
+### Added
+- **Foundry observability readiness now spans eval, Doctor, Cockpit, and release evidence.**
+  `agentops.yaml` supports `dataset_kind`, `rubrics`, and `observability`
+  metadata for multi-turn coverage, rubric evaluator gates, trace sampling, and
+  replay/evaluation/dataset links. Doctor and Cockpit surface the readiness
+  state without mutating cloud resources, and release evidence records the same
+  signals for reviewers.
+- **Trace promotion preserves evaluation lineage.** `agentops eval
+  promote-traces` now carries operation/span IDs, source system, agent version,
+  replay/evaluation URLs, sampling policy, and multi-turn message fields into
+  candidate datasets and their manifest.
+
+### Changed
+- **Rubric evaluators are executed through the azd backend.** When `rubrics:`
+  is configured, `agentops eval init` includes those evaluator names in the azd
+  recipe and `agentops eval run` fails closed outside `execution: azd`, so rubric
+  scores cannot be treated as evidence unless Foundry / azd actually ran them.
+- **Tutorials now carry rubric and observability proof into evaluation and CI/CD.**
+  The Travel Agent flow keeps the existing smoke recording through step 10, then
+  upgrades the gate to multi-turn dataset rows, rubric thresholds, trace
+  sampling/replay lineage, and CI/CD workflows that reuse the same eval contract.
+
 ## [0.3.11] - 2026-06-08
 
 ### Fixed

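Reviewer note: the "fails closed outside `execution: azd`" behavior in the changelog entry above can be pictured with a minimal sketch. This is not the code in this PR. `EvalConfig`, `RubricBackendError`, and `ensure_rubrics_can_run` are hypothetical names, and the config is reduced to the two `agentops.yaml` keys the entry mentions (`execution` and `rubrics`).

```python
from dataclasses import dataclass, field


class RubricBackendError(RuntimeError):
    """Raised when rubric evaluators are configured but cannot actually run."""


@dataclass
class EvalConfig:
    # Hypothetical, simplified view of the relevant agentops.yaml keys.
    execution: str = "local"
    rubrics: list[dict] = field(default_factory=list)


def rubric_evaluator_names(config: EvalConfig) -> list[str]:
    """Collect the evaluator names declared under rubrics[].evaluator."""
    return [r["evaluator"] for r in config.rubrics if r.get("evaluator")]


def ensure_rubrics_can_run(config: EvalConfig) -> list[str]:
    """Fail closed: refuse to continue when rubrics are set without the azd backend.

    Returns the evaluator names that would be passed into the azd recipe.
    """
    if config.rubrics and config.execution != "azd":
        raise RubricBackendError(
            "rubrics are configured but execution is "
            f"{config.execution!r}; set execution: azd so the rubric evaluators run"
        )
    return rubric_evaluator_names(config)
```

The point of raising instead of warning is that a rubric score can never appear in evidence unless the backend that computes it was actually selected.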
diff --git a/docs/tutorial-end-to-end.md b/docs/tutorial-end-to-end.md
@@ -444,6 +444,13 @@ Foundry through `agentops eval run`, so AgentOps can enforce thresholds and writ
 repo-side evidence. AgentOps keeps the local path for hosted endpoints, models,
 unsupported evaluator mappings, and fallback cases.
 
+When the quality gate uses a task-specific rubric, choose the azd runner instead
+of local execution. Add `rubrics:` to `agentops.yaml`, set
+`rubrics[].evaluator` to the Foundry / azd evaluator name, set
+`execution: azd`, and run `agentops eval init --force`. AgentOps then passes the
+rubric evaluator into the generated azd recipe and fails closed if someone tries
+to run that rubric gate with the local backend.
+
 ## 5. Run the first eval
 
 For hosted agents or local fallback:
@@ -651,7 +658,9 @@ agentops workflow generate `
 
 The generated workflows are intentionally boring:
 
-- PR gate: evaluate and publish report/evidence.
+- PR gate: evaluate and publish report/evidence. If `agentops.yaml` declares
+  rubric evaluators, this is the same azd/Foundry rubric gate you ran locally;
+  the PR does not downgrade to a plain smoke test.
 - Dev/QA/Prod: deploy with azd or placeholders, then run readiness checks.
 - Optional Doctor cadence: generate `--kinds doctor` separately if you want a
   scheduled readiness run outside PRs.
@@ -698,10 +707,11 @@ Use this loop in the video:
 | Signal | Foundry or Azure Monitor action | AgentOps handoff |
 |---|---|---|
 | App Insights connection | In Foundry, open the project or agent **Traces** view and connect an App Insights resource. Verify it under project connected resources. | Doctor checks whether telemetry wiring is discoverable. |
+| Trace sampling | Configure the project's trace sampling policy in Foundry or the hosted-agent observability settings your team owns. Keep the policy name in `agentops.yaml` under `observability.trace_sampling`. | Doctor/evidence can show reviewers that live-quality sampling exists before traces are promoted. |
 | Live trace | Run one playground prompt for a Prompt Agent, or call the hosted endpoint a few times. Open the agent **Traces** tab, wait 2-5 minutes if needed, and click the Trace ID. In the modal, inspect spans plus the **Input + Output** and **Metadata** tabs. | Evidence and Cockpit link reviewers back to the runtime view. |
 | Operate summary | Switch to **Operate** -> **Overview**, select the same subscription/project, wait for metrics to sync, and use **Ask AI** for dashboard-level questions such as `Help me identify any issues or anomalies in my agent metrics.` | The summary informs the release discussion; AgentOps does not rewrite it. |
-| Eval context | From a Foundry eval run, inspect row-level explanations and, when available, the trace attached to the interaction. | The repo keeps the exact target, dataset, gate, and evidence together. |
-| Trace learning | Export or curate traces that represent real issues. | `agentops eval promote-traces` turns reviewed traces into regression candidates. |
+| Eval context | From a Foundry eval run, inspect row-level explanations, rubric scores, and, when available, the trace attached to the interaction. | The repo keeps the exact target, dataset, rubric gate, and evidence together. |
+| Trace learning | Export or curate traces that represent real issues, including conversation turns when present. | `agentops eval promote-traces` turns reviewed traces into regression candidates and preserves replay/evaluation lineage. |
 
 For the screen recording, make the Foundry side visible before opening AgentOps
 Cockpit:

diff --git a/docs/tutorial-hosted-agent-quickstart.md b/docs/tutorial-hosted-agent-quickstart.md
@@ -648,6 +648,14 @@ This is the core AgentOps loop for hosted endpoints: keep a stable dataset,
 compare a changed runtime against the last known result, fix the agent, and
 rerun the same gate before a PR or release.
 
+If this hosted endpoint is backed by a Foundry / azd eval recipe, you can use
+the same rubric contract as the prompt-agent Travel Agent tutorial before you
+generate CI: set `execution: azd`, add `dataset_kind: multi-turn`, declare
+`rubrics[].evaluator` in `agentops.yaml`, run `agentops eval init --force`, and
+then run `agentops eval run`. AgentOps will require the azd backend whenever
+rubrics are configured, so a passing hosted-agent gate means the rubric evaluator
+actually ran instead of being recorded as metadata only.
+
 ## 10. Generate CI and Doctor evidence
 
 Generate both the PR and dev deploy workflows with `--doctor-gate critical`
@@ -666,6 +674,13 @@ code .agentops\agent\report.md
 code .agentops\release\latest\evidence.md
 ```
 
+The generated PR gate reuses the same `agentops.yaml` contract. If you promoted
+the hosted endpoint to an azd/Foundry eval recipe with rubrics, CI runs that
+recipe and blocks on the rubric thresholds; otherwise it runs the local
+hosted-endpoint gate and enforces the normalized thresholds. In both cases,
+Doctor and the evidence pack surface multi-turn coverage, trace sampling
+readiness, replay/evaluation links, and trace-to-dataset lineage when those signals exist.
+
 > **`--deploy-mode prompt-agent` does not apply to hosted endpoints.**
 > That mode is specific to Foundry prompt agents (the stage-prompt-as-
 > candidate flow). For hosted endpoints, `agentops workflow generate`

diff --git a/docs/tutorial-prompt-agent-quickstart.md b/docs/tutorial-prompt-agent-quickstart.md
@@ -803,6 +803,66 @@ You should see `execution: azd` and `Threshold status: PASSED`. The raw azd run
 details are kept under `.agentops/results/latest/` alongside AgentOps'
 normalized `results.json` and `report.md`.
 
+Before generating CI, turn the Travel Agent gate from a basic smoke test into
+the proof you want reviewers to see later. Keep the recording you already made
+through this step: the smoke run above proves the workspace works. The next
+commands only harden the same gate.
+
+Create a small conversation-shaped dataset. It still keeps `input` and
+`expected` so AgentOps and azd can route the row, but it also carries the
+conversation turns that multi-turn evaluators and trace-derived rows use:
+
+```powershell
+@'
+{"input":"Plan a three-day Rome trip for a family with kids. Ask one clarification if needed.","expected":"The agent should preserve the family-with-kids constraint, propose a practical three-day Rome itinerary, include transit/rest pacing, and avoid claiming it can book live reservations.","messages":[{"role":"user","content":"We want to visit Rome with two kids."},{"role":"assistant","content":"How many days do you have and what pace do you prefer?"},{"role":"user","content":"Three days, moderate pace, museums and food."}]}
+{"input":"Help me choose between Lisbon and Seattle for a low-budget food weekend.","expected":"The agent should compare both destinations, mention budget tradeoffs, food activities, transit/weather notes, and avoid unsupported price or booking claims.","messages":[{"role":"user","content":"I need a low-budget food weekend."},{"role":"assistant","content":"Are you choosing between specific cities?"},{"role":"user","content":"Lisbon or Seattle."}]}
+'@ | Set-Content -Encoding utf8 .agentops\data\travel-conversations.jsonl
+```
+
+Then update the evaluation contract in `agentops.yaml`. The important part is
+that `rubrics[].evaluator` names the rubric evaluator that Foundry / azd will
+run. If your Foundry Observe flow generated a different rubric evaluator name,
+use that exact name here.
+
+```yaml
+dataset: .agentops/data/travel-conversations.jsonl
+dataset_kind: multi-turn
+
+rubrics:
+  - name: travel-concierge-quality
+    evaluator: travel-concierge-quality
+    description: Scores the Travel Agent against the intended product behavior.
+    dimensions:
+      - name: task_success
+        description: Completes the user's travel-planning goal across the conversation.
+        weight: 0.5
+      - name: constraint_following
+        description: Carries user constraints such as kids, budget, duration, and pace.
+        weight: 0.3
+      - name: safe_booking_behavior
+        description: Avoids claiming live bookings, confirmations, or prices it cannot verify.
+        weight: 0.2
+
+thresholds:
+  task_success: ">=4"
+  constraint_following: ">=4"
+  safe_booking_behavior: ">=4"
+```
+
+Re-run init so the azd recipe includes the rubric evaluator in the actual
+evaluation, not only in documentation:
+
+```powershell
+agentops eval init --force
+agentops eval run
+```
+
+If the rubric evaluator name is wrong or missing in Foundry, the run should fail
+closed. That is intentional: a green gate must mean the rubric really ran. When
+it passes, `results.json` records `execution: azd`, the evaluator list, the
+rubric metadata from `agentops.yaml`, and threshold results for the rubric
+dimensions.
+
 ## 11. Generate the PR + dev deploy workflows
 
 > **Pipeline ownership.** This tutorial uses `agentops workflow generate`
@@ -846,7 +906,11 @@ The PR workflow now has two jobs:
    `.agentops/deployments/agentops.candidate.yaml` pointing at the
    staged candidate.
 2. **`eval`** — runs `agentops eval run` against the candidate, then
-   runs Doctor with `--severity-fail critical`.
+   runs Doctor with `--severity-fail critical`. Because the previous step
+   moved the gate to `execution: azd` with `rubrics:`, the workflow is not
+   just checking a smoke response: it runs the Foundry / azd evaluation recipe,
+   applies the Travel Agent rubric dimensions as thresholds, and writes the
+   normalized rubric evidence to `.agentops/results/latest/results.json`.
 
 > **Why does the PR workflow stage in dev, not sandbox?** The PR gate
 > must evaluate the same target the deploy workflow will use. Sandbox
@@ -859,6 +923,9 @@ The PR workflow now has two jobs:
 The dev deploy workflow stages a candidate (same logic), evaluates it,
 summarizes the deployment via `prompt_deploy summarize`, and uploads
 `.agentops/deployments/foundry-agent.json` as a workflow artifact.
+The deploy gate uses the same rubric-aware `agentops eval run`, so the candidate
+that lands in dev has already passed the conversation/rubric gate reviewers saw
+on the PR.
 
 The `--doctor-gate critical` flag controls the Doctor severity floor in
 the PR template. The table below summarizes the three values:
@@ -1327,7 +1394,7 @@ deploys, explicit thresholds, or red-team/governance evidence. Treat those as th
 hardening backlog. The eval gates and the dev deploy loop are
 production-ready.
 
-If you want to show the Build 2026 governance story in the video, keep it as a
+If you want to show the governance evidence path in the video, keep it as a
 short optional callout:
 
 ```powershell

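Reviewer note: the rubric contract added to the prompt-agent tutorial (three weighted dimensions plus `">=4"` thresholds) can be sanity-checked before a run with a small sketch like the one below. It rests on assumptions the diff does not state: that dimension weights are expected to sum to 1.0, and that threshold strings are a comparison operator followed by a number. Only the `>=` form appears in the tutorial. All function names are hypothetical.

```python
import math
import operator
import re

# Assumed threshold grammar: an operator followed by a number, e.g. ">=4".
_OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
}
_THRESHOLD = re.compile(r"^\s*(>=|<=|==|>|<)\s*(-?\d+(?:\.\d+)?)\s*$")


def parse_threshold(expr: str):
    """Split a threshold string into (comparison function, numeric limit)."""
    match = _THRESHOLD.match(expr)
    if match is None:
        raise ValueError(f"unsupported threshold expression: {expr!r}")
    return _OPS[match.group(1)], float(match.group(2))


def check_contract(rubric: dict, thresholds: dict) -> list[str]:
    """Return human-readable problems; an empty list means the contract looks consistent."""
    problems = []
    dimensions = rubric.get("dimensions", [])
    total = sum(d.get("weight", 0.0) for d in dimensions)
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        problems.append(f"dimension weights sum to {total}, expected 1.0")
    for dim in dimensions:
        if dim["name"] not in thresholds:
            problems.append(f"no threshold declared for dimension {dim['name']!r}")
    return problems


def gate(scores: dict, thresholds: dict) -> dict:
    """Evaluate each threshold; a dimension with no score counts as a failure."""
    results = {}
    for name, expr in thresholds.items():
        compare, limit = parse_threshold(expr)
        results[name] = name in scores and compare(scores[name], limit)
    return results
```

Treating a missing score as a failure mirrors the fail-closed intent described in the tutorial: a dimension that was never scored should not pass the gate.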
diff --git a/plugins/agentops/package.json b/plugins/agentops/package.json
@@ -2,7 +2,7 @@
   "name": "agentops-accelerator",
   "displayName": "AgentOps Accelerator — Skills for GitHub Copilot",
   "description": "Copilot agent skills for running standardized evaluation workflows with AgentOps Accelerator and Microsoft Foundry agents.",
-  "version": "0.3.8",
+  "version": "0.3.12",
   "publisher": "AgentOpsAccelerator",
   "icon": "icon.png",
   "license": "MIT",

diff --git a/plugins/agentops/plugin.json b/plugins/agentops/plugin.json
@@ -1,7 +1,7 @@
 {
   "name": "agentops-accelerator",
   "description": "Copilot agent skills for running standardized evaluation workflows with AgentOps Accelerator and Microsoft Foundry agents.",
-  "version": "0.3.8",
+  "version": "0.3.12",
   "author": {
     "name": "AgentOps Accelerator",
     "url": "https://github.com/Azure/agentops"

diff --git a/src/agentops/agent/analyzer.py b/src/agentops/agent/analyzer.py
@@ -12,6 +12,7 @@
 from agentops.agent.checks.foundry_config import run_foundry_config_check
 from agentops.agent.checks.governance import run_governance_check
 from agentops.agent.checks.latency import run_latency_check
+from agentops.agent.checks.observability import run_observability_check
 from agentops.agent.checks.opex_workspace import run_opex_workspace_check
 from agentops.agent.checks.opex import run_opex_check
 from agentops.agent.checks.posture import run_posture_check
@@ -146,6 +147,7 @@ def analyze(
     findings.extend(run_posture_check(resources, posture_config))
     findings.extend(run_opex_workspace_check(workspace))
     findings.extend(run_governance_check(workspace))
+    findings.extend(run_observability_check(workspace))
     findings.extend(run_opex_check(history, config.checks.opex))
     findings.extend(run_release_readiness_check(workspace, history, foundry))
     findings.extend(

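Reviewer note: the body of `run_observability_check` is not part of this excerpt. The sketch below only illustrates how the four catalog IDs added in `catalog.py` might map onto `agentops.yaml` keys. It is an assumption-heavy illustration: `Finding` is a simplified stand-in for the project's real type, the function takes a plain dict rather than the workspace object, `observability.replay_url` is a hypothetical key name, and the manifest and trace-derived dataset lineage mentioned in the catalog summaries are ignored.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Finding:
    # Simplified stand-in for the project's finding type (hypothetical).
    check_id: str
    severity: str


def sketch_observability_findings(config: dict[str, Any]) -> list[Finding]:
    """Map missing agentops.yaml observability metadata to catalog check IDs."""
    observability = config.get("observability") or {}
    findings: list[Finding] = []

    if config.get("dataset_kind") != "multi-turn":
        findings.append(Finding("observability.multiturn_coverage_missing", "info"))
    if not config.get("rubrics"):
        findings.append(Finding("observability.rubric_missing", "info"))
    if not observability.get("trace_sampling"):
        findings.append(Finding("observability.trace_sampling_missing", "warning"))
    # "replay_url" is an assumed key; the diff only says replay links are supported.
    if not observability.get("replay_url"):
        findings.append(Finding("observability.trace_replay_missing", "info"))

    return findings
```

The check only reads local configuration, which is consistent with the changelog statement that Doctor surfaces readiness without mutating cloud resources.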
diff --git a/src/agentops/agent/checks/catalog.py b/src/agentops/agent/checks/catalog.py
@@ -141,6 +141,18 @@
     "safety.config.continuous_eval_disabled": (
         "https://learn.microsoft.com/azure/ai-foundry/how-to/online-evaluation"
     ),
+    "observability.multiturn_coverage_missing": (
+        "https://learn.microsoft.com/azure/foundry/concepts/observability"
+    ),
+    "observability.rubric_missing": (
+        "https://learn.microsoft.com/azure/foundry/concepts/observability"
+    ),
+    "observability.trace_sampling_missing": (
+        "https://learn.microsoft.com/azure/foundry/concepts/observability"
+    ),
+    "observability.trace_replay_missing": (
+        "https://learn.microsoft.com/azure/foundry/concepts/observability"
+    ),
 }
 
 
@@ -199,6 +211,28 @@ def is_llm_judged(self) -> bool:
         requires=("results_history",),
         flags=("dynamic_id",),
     ),
+    CheckSpec(
+        id="observability.multiturn_coverage_missing",
+        category=Category.QUALITY,
+        title="Multi-turn evaluation coverage is not declared yet",
+        summary=(
+            "The workspace does not declare multi-turn dataset coverage or "
+            "trace-derived conversation rows for Foundry multi-turn evals."
+        ),
+        severities=(Severity.INFO,),
+        requires=("workspace",),
+    ),
+    CheckSpec(
+        id="observability.rubric_missing",
+        category=Category.QUALITY,
+        title="No context-specific rubric evaluator is declared",
+        summary=(
+            "The workspace does not declare a Foundry rubric evaluator or "
+            "rubric dimensions that can be bound to release thresholds."
+        ),
+        severities=(Severity.INFO,),
+        requires=("workspace",),
+    ),
     # ------------------------------------------------------------------
     # Performance
     # ------------------------------------------------------------------
@@ -438,6 +472,28 @@ def is_llm_judged(self) -> bool:
         severities=(Severity.WARNING,),
         requires=("foundry_control",),
     ),
+    CheckSpec(
+        id="observability.trace_sampling_missing",
+        category=Category.OPERATIONAL_EXCELLENCE,
+        title="Intelligent trace sampling is not evidence-ready",
+        summary=(
+            "The workspace does not declare Foundry trace sampling and the "
+            "trace-regression manifest does not include sampling lineage."
+        ),
+        severities=(Severity.WARNING,),
+        requires=("workspace",),
+    ),
+    CheckSpec(
+        id="observability.trace_replay_missing",
+        category=Category.OPERATIONAL_EXCELLENCE,
+        title="Trace replay link is not captured in release evidence",
+        summary=(
+            "The workspace has no trace replay URL in agentops.yaml or in "
+            "trace-derived dataset lineage."
+        ),
+        severities=(Severity.INFO,),
+        requires=("workspace",),
+    ),
     CheckSpec(
         id="opex.results_not_gitignored",
         category=Category.OPERATIONAL_EXCELLENCE,