140 changes: 140 additions & 0 deletions docs/architecture/adr-unify-foundry-agent-config.md
@@ -0,0 +1,140 @@
# ADR: Unify Foundry agent configuration in `azure.yaml`

**Status:** Proposed

**Date:** 2026-06-08

### Context

The `azure.ai.agents` extension currently models a hosted Foundry agent across
**three files**: `azure.yaml`, `agent.yaml` (an `AgentDefinition`), and
`agent.manifest.yaml` (an `AgentManifest`). This creates several problems:

1. **Three files, overlapping data.** The agent name appears in three places,
container resources in two, the model deployment name in three. Two
templating syntaxes (`{{param}}` and `${ENV}`) overlap. Issue
[#7901](https://github.com/Azure/azure-dev/issues/7901) is a symptom — `init`
fails when run in a directory that already contains the manifest.
2. **Scope conflation.** `services.<agent>.config` mixes agent-scoped concerns
(container resources, env, startup command) with project-scoped resources
(model deployments, connections, toolboxes).
3. **No sharing across agents.** Project-scoped resources are nested under a
single agent, so a second agent that wants the same toolbox must redeclare
it. There is nowhere to say "this toolbox belongs to the project."
4. **Divergent tooling.** The Foundry Toolkit for VS Code parses `agent.yaml`
directly while `azd ai agent` works off an `AgentManifest`; the experiences
diverge.
5. **The manifest layer carries no weight.** `agent.manifest.yaml` was designed
for an agent catalog that was never built. Its templating isn't paying for
itself, and abstracting concrete values with `${ENV_VAR}` blurs the line
between a definition and a template.

We want `azure.yaml` to be the single source of truth: it describes **what
exists in the Foundry project** and **how the agent runs**; agent code defines
**what the agent does**; the azd environment carries deployment-target values;
Bicep is opt-in for developers who need full IaC reproducibility.

### Decision

Consolidate all hosted-agent configuration into `azure.yaml` and split it by
scope across two host kinds.

1. **`host: azure.ai.agent`** describes the agent runtime only. Its `config:`
block maps to the Foundry create-agent API: `kind`, `description`,
`metadata`, `protocols`, `container.resources`, `env`, `startupCommand`. The
project-scoped fields (`deployments`, `resources`, `toolConnections`,
`toolboxes`, `connections`) are removed from this schema.
2. **`host: azure.ai.project`** (new) owns all project-scoped Foundry
data-plane state that can't be modeled in ARM/Bicep — today toolboxes,
connections, and model deployments; future additions (eval datasets, vector
indexes, knowledge sources) go here too (`additionalProperties: true`). It is
a service **without source code**: no `project:` directory, no build, no
artifact. One `azure.ai.project` service per Foundry project.
3. **Agents reference the project via the existing `uses:` field**
(`uses: [foundry-project]`). This is azd's existing inter-service dependency
primitive on `ServiceConfig` — it orders deploys and injects the
dependency's outputs as env vars. **No `dependsOn`, no core schema change.**
4. **Container mode reuses the existing `docker:` block** (`docker.path`,
`docker.remoteBuild`) to build the image from a Dockerfile. An
already-published image is referenced via the existing top-level `image:`
field on `ServiceConfig` (e.g. `myregistry.azurecr.io/my-agent:v1`) — no
build, no new field. No new top-level `dockerfile` field — that would
conflict with the existing `DockerProjectOptions` on `ServiceConfig`.
5. **Code-deploy mode uses a typed `runtime: { stack, version }` block**,
following the existing azure.yaml runtime precedent — not a bare string.
6. **Deploy source is explicit and exactly one of three.** `image:` present →
use the existing pre-built image; `docker:` present → build the image from a
Dockerfile; `runtime:` present → zip the project for code-deploy. Zero or
more than one present → validation error. No silent defaults.
7. **`container.resources` (`cpu`/`memory`) applies to every deploy mode**, not
just container mode. The Foundry create-agent API carries `cpu` and `memory`
for both code-deploy and container/image agents; when omitted the extension
applies defaults (today `cpu: "1"`, `memory: "2Gi"`). So `container.resources`
stays in the agent `config:` regardless of whether `image:`, `docker:`, or
`runtime:` is used.
8. **`${VAR}` env-var expansion** uses the same syntax `azure.yaml` already
supports. The extension performs the expansion inside `config:` blocks
(the core framework does not expand `config:`).
9. **`init` populates `config:`** the way it currently populates `agent.yaml`,
and stops emitting `agent.yaml` / `agent.manifest.yaml`. A deprecation window
keeps reading the old files (with a warning + telemetry) before removal.
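
Items 6 to 8 above describe rules the extension would enforce. The following is a minimal Python sketch of those rules, not the extension's implementation. The helper names (`select_deploy_source`, `effective_resources`, `expand_env`) are hypothetical, the service is modeled as a plain dict, and expanding an unset variable to an empty string is an assumption.

```python
import re

# The three deploy sources named in item 6.
DEPLOY_SOURCES = ("image", "docker", "runtime")
# Defaults quoted in item 7 (described there as today's values).
DEFAULT_RESOURCES = {"cpu": "1", "memory": "2Gi"}

_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def select_deploy_source(service: dict) -> str:
    """Return the single deploy source declared on a service, else raise."""
    present = [key for key in DEPLOY_SOURCES if key in service]
    if len(present) != 1:
        raise ValueError(
            f"expected exactly one of {DEPLOY_SOURCES}, found {present or 'none'}"
        )
    return present[0]


def effective_resources(config: dict) -> dict:
    """Overlay declared container.resources on the defaults, for any deploy mode."""
    declared = config.get("container", {}).get("resources", {})
    return {**DEFAULT_RESOURCES, **declared}


def expand_env(value, env: dict):
    """Recursively expand ${VAR} in every string nested inside a config: block."""
    if isinstance(value, str):
        # Assumption: an unset variable expands to an empty string.
        return _VAR.sub(lambda m: env.get(m.group(1), ""), value)
    if isinstance(value, dict):
        return {k: expand_env(v, env) for k, v in value.items()}
    if isinstance(value, list):
        return [expand_env(v, env) for v in value]
    return value


service = {
    "host": "azure.ai.agent",
    "runtime": {"stack": "python", "version": "3.13"},
    "config": {"container": {"resources": {"cpu": "0.5"}}},
}
print(select_deploy_source(service))           # runtime
print(effective_resources(service["config"]))  # {'cpu': '0.5', 'memory': '2Gi'}
```

Raising on zero or multiple sources, rather than picking one by precedence, is what "no silent defaults" amounts to in this sketch.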

**Phasing (locked):**

- **Phase 1** — Consolidate files. Retire `agent.yaml` / `agent.manifest.yaml`,
move their content into `azure.yaml`, and ship the new `azure.ai.project`
service target. Provisioning continues to use existing Bicep/ARM.
- **Phase 2** — Bicep-less provisioning: the extension carries built-in Bicep
templates and generates ARM in memory during `azd provision` (similar to
`azd compose`), with opt-in eject to Bicep on disk. Gated on the separate
Bicep-less RFC ([#8065](https://github.com/Azure/azure-dev/issues/8065)) and
on a real multi-agent sample that needs shared resources.

### Open Questions (to resolve before locking the schema)

This ADR records the agreed direction; the following provision/deploy boundary
questions (raised in the issue's framework review) must be answered before the
`azure.ai.project` schema is finalized:

1. **What exactly does `azure.ai.project` Deploy create vs. what stays in
provision/Bicep?** Toolboxes are clearly data-plane (Deploy). Model
deployments and connections are ambiguous — today deployments are serialized
to env vars and consumed by Bicep, and connection creation involves
ARM-level resources plus credential handling. We need a clear inventory of
the provision (ARM) vs. deploy (Foundry API) split.
2. **Idempotency and state.** On repeated `azd deploy`, does the project target
diff declared config against existing Foundry state and apply incremental
changes, or recreate from scratch? How are removals handled (a toolbox
deleted from config — does Deploy delete it from Foundry)? Relates to
[#8350](https://github.com/Azure/azure-dev/issues/8350) and
[#8349](https://github.com/Azure/azure-dev/issues/8349).
3. **Error recovery across the `uses` chain.** If `azure.ai.project` Deploy
fails partway (toolboxes created, a connection fails), what is the recovery
story for downstream agents that depend on it?
4. **Service-level `runtime:` typing.** The typed `runtime: { stack, version }`
block exists in the schema today only under the compose `appServiceResource`
definition, **not under `services.<name>`**, and `ServiceConfig` has no
`Runtime` field. Phase 1 can capture it via `AdditionalProperties` (untyped,
not validated by the core schema), or core can promote `runtime` to a
first-class service field. Decide whether the typing/validation cost
justifies a core change.
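
To make question 2 concrete, here is a minimal Python sketch of the "diff declared config against existing state" option. It illustrates one possible shape only and is not a decided behavior. `plan_changes` is a hypothetical helper, and both inputs are assumed to be name-keyed dicts of toolbox definitions.

```python
def plan_changes(declared: dict, existing: dict) -> dict:
    """Compare declared resources (keyed by name) with those already in Foundry."""
    create = sorted(name for name in declared if name not in existing)
    delete = sorted(name for name in existing if name not in declared)
    update = sorted(
        name
        for name in declared
        if name in existing and declared[name] != existing[name]
    )
    return {"create": create, "update": update, "delete": delete}


declared = {"agent-toolbox": {"tools": [{"type": "web_search"}]}}
existing = {
    "agent-toolbox": {"tools": []},
    "old-toolbox": {"tools": [{"type": "code_interpreter"}]},
}
print(plan_changes(declared, existing))
# {'create': [], 'update': ['agent-toolbox'], 'delete': ['old-toolbox']}
```

Whether the `delete` list is ever applied, or only reported, is exactly the removal question raised above.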

### Proposed `azure.yaml` shape

The proposed end-state of `azure.yaml` is captured as standalone, illustrative
examples alongside this ADR, one self-contained file per scenario, under
[`azure-yaml-examples/`](./azure-yaml-examples/):

| Scenario | File | Deploy source |
|---|---|---|
| New project, zip/code-deploy | [`code-deploy.yaml`](./azure-yaml-examples/code-deploy.yaml) | `runtime:` |
| New project, build from Dockerfile | [`container-build.yaml`](./azure-yaml-examples/container-build.yaml) | `docker:` |
| New project, existing pre-built image (ACR) | [`existing-image.yaml`](./azure-yaml-examples/existing-image.yaml) | `image:` |
| Existing project, reuse existing model | [`brownfield-existing-model.yaml`](./azure-yaml-examples/brownfield-existing-model.yaml) | `runtime:` |
| Existing project, create new model | [`brownfield-new-model.yaml`](./azure-yaml-examples/brownfield-new-model.yaml) | `runtime:` |

Each file pairs a single `host: azure.ai.project` service (project-scoped
data-plane resources — model deployments, connections, toolboxes) with a
`host: azure.ai.agent` service that references it via `uses:`. The three deploy
sources (`runtime:` / `docker:` / `image:`) are mutually exclusive; exactly one
is required. The brownfield files use the `resourceId:` field introduced by the
Bicep-less RFC [#8065](https://github.com/Azure/azure-dev/issues/8065) (Phase 2).
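
The deploy ordering implied by `uses:` can be sketched as a topological sort. This is a simplified Python illustration, not azd's implementation. `deploy_order` is a hypothetical helper, and `uses:` is assumed to reference only other services declared in the same file.

```python
from graphlib import TopologicalSorter


def deploy_order(services: dict) -> list:
    """Order services so that everything a service `uses` is deployed before it."""
    graph = {}
    for name, svc in services.items():
        deps = set(svc.get("uses", []))
        unknown = deps - set(services)
        if unknown:
            raise ValueError(f"{name} uses undeclared service(s): {sorted(unknown)}")
        graph[name] = deps  # node -> set of predecessors
    # static_order() raises graphlib.CycleError if `uses` forms a cycle.
    return list(TopologicalSorter(graph).static_order())


services = {
    "my-agent": {"host": "azure.ai.agent", "uses": ["foundry-project"]},
    "foundry-project": {"host": "azure.ai.project"},
}
print(deploy_order(services))  # ['foundry-project', 'my-agent']
```

With several agents sharing one project service, the project still sorts first; the relative order of the agents is not significant in this sketch.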
38 changes: 38 additions & 0 deletions docs/architecture/azure-yaml-examples/brownfield-existing-model.yaml
@@ -0,0 +1,38 @@
# Scenario: Brownfield (existing Foundry project), reuse an existing model.
#
# NOTE: the `resourceId:` field is introduced by the Bicep-less provisioning RFC
# (https://github.com/Azure/azure-dev/issues/8065), i.e. Phase 2. Its presence
# flips the project service to "existing project" mode: the Foundry project ARM
# resource is omitted from the synthesized Bicep and only referenced. Until
# Phase 2 lands, brownfield is handled via the USE_EXISTING_AI_PROJECT /
# AZURE_AI_PROJECT_ID azd environment variables. Illustrative, not runnable.

name: my-foundry-agents

services:
  foundry-project:
    host: azure.ai.project
    resourceId: ${AZURE_AI_PROJECT_ID} # presence -> existing project mode
    config:
      # No `deployments:` block: the model already exists in the project; the
      # agent only references its deployment name below.
      toolboxes:
        agent-toolbox:
          tools:
            - { type: web_search }

  my-agent:
    project: src/my-agent
    host: azure.ai.agent
    uses: [foundry-project]
    runtime:
      stack: python
      version: "3.13"
    config:
      kind: hosted
      description: A basic agent hosted by Foundry.
      protocols:
        - { protocol: responses, version: 1.0.0 }
      env:
        AZURE_AI_MODEL_DEPLOYMENT_NAME: gpt-4.1-mini # reference only
      startupCommand: python main.py
44 changes: 44 additions & 0 deletions docs/architecture/azure-yaml-examples/brownfield-new-model.yaml
@@ -0,0 +1,44 @@
# Scenario: Brownfield (existing Foundry project), create a NEW model on it.
#
# NOTE: the `resourceId:` field is introduced by the Bicep-less provisioning RFC
# (https://github.com/Azure/azure-dev/issues/8065), i.e. Phase 2. `resourceId:`
# references the existing project (not created), but the deployment declared
# under `config:` is still synthesized (ARM-backed) against that project.
# Illustrative end-state, not a runnable template.

name: my-foundry-agents

services:
  foundry-project:
    host: azure.ai.project
    resourceId: ${AZURE_AI_PROJECT_ID} # existing project (not created)
    config:
      deployments: # but this model IS created by azd
        - name: gpt-4.1-mini2
          model:
            format: OpenAI
            name: gpt-4.1-mini
            version: "2025-04-14"
          sku:
            name: Standard
            capacity: 10
      toolboxes:
        agent-toolbox:
          tools:
            - { type: web_search }

  my-agent:
    project: src/my-agent
    host: azure.ai.agent
    uses: [foundry-project]
    runtime:
      stack: python
      version: "3.13"
    config:
      kind: hosted
      description: A basic agent hosted by Foundry.
      protocols:
        - { protocol: responses, version: 1.0.0 }
      env:
        AZURE_AI_MODEL_DEPLOYMENT_NAME: gpt-4.1-mini2
      startupCommand: python main.py
59 changes: 59 additions & 0 deletions docs/architecture/azure-yaml-examples/code-deploy.yaml
@@ -0,0 +1,59 @@
# Scenario: Code-deploy (zip), new Foundry project.
#
# A `runtime:` block is present, so azd zips the project directory and Foundry
# schedules it on the appropriate managed base image. Illustrative end-state,
# not a runnable template.

name: my-foundry-agents

services:
  # Project-scoped service: all Foundry data-plane resources in one place.
  # No source directory, no build artifact: pure declarative state.
  foundry-project:
    host: azure.ai.project
    config:
      deployments:
        - name: gpt-4.1-mini
          model:
            format: OpenAI
            name: gpt-4.1-mini
            version: "2025-04-14"
          sku:
            name: GlobalBatch
            capacity: 10
      connections:
        - name: github-mcp-conn
          category: CustomKeys
          target: https://api.githubcopilot.com/mcp
          authType: ApiKey
      toolboxes:
        agent-toolbox:
          tools:
            - { type: web_search }
            - { type: code_interpreter }
            - { type: mcp, connection: "${GITHUB_MCP_CONN}" }

  # Agent-scoped service: the runtime. References the project via `uses`.
  my-agent:
    project: src/my-agent
    host: azure.ai.agent
    uses: [foundry-project]
    # Deploy source: runtime present -> zip/code-deploy.
    runtime:
      stack: python
      version: "3.13"
    config:
      kind: hosted
      description: A basic agent hosted by Foundry.
      protocols:
        - { protocol: responses, version: 1.0.0 }
      # cpu/memory apply to code-deploy too; defaults cpu:"1", memory:"2Gi".
      container:
        resources: { cpu: "0.5", memory: 1Gi }
      env:
        AZURE_AI_MODEL_DEPLOYMENT_NAME: gpt-4.1-mini
      startupCommand: python main.py

# No infra: block needed by default. azd provision uses built-in Bicep templates
# internally (like azd compose). Opt in to Bicep on disk via `azd infra gen` or
# equivalent (separate RFC: https://github.com/Azure/azure-dev/issues/8065).
48 changes: 48 additions & 0 deletions docs/architecture/azure-yaml-examples/container-build.yaml
@@ -0,0 +1,48 @@
# Scenario: Container build from Dockerfile, new Foundry project.
#
# A `docker:` block is present, so azd builds and pushes the image via the
# existing docker.path / docker.remoteBuild fields. Illustrative end-state, not
# a runnable template.

name: my-foundry-agents

services:
  foundry-project:
    host: azure.ai.project
    config:
      deployments:
        - name: gpt-4.1-mini
          model:
            format: OpenAI
            name: gpt-4.1-mini
            version: "2025-04-14"
          sku:
            name: GlobalBatch
            capacity: 10
      toolboxes:
        agent-toolbox:
          tools:
            - { type: web_search }
            - { type: code_interpreter }

  my-agent:
    project: src/my-agent
    host: azure.ai.agent
    uses: [foundry-project]
    # Deploy source: docker present -> build the image from a Dockerfile.
    docker:
      path: Dockerfile
      remoteBuild: true
    config:
      kind: hosted
      description: A basic agent hosted by Foundry.
      protocols:
        - { protocol: responses, version: 1.0.0 }
      container:
        resources: { cpu: "0.25", memory: 0.5Gi }
      env:
        AZURE_AI_MODEL_DEPLOYMENT_NAME: gpt-4.1-mini

# No infra: block needed by default. azd provision uses built-in Bicep templates
# internally (like azd compose). Opt in to Bicep on disk via `azd infra gen` or
# equivalent (separate RFC: https://github.com/Azure/azure-dev/issues/8065).
45 changes: 45 additions & 0 deletions docs/architecture/azure-yaml-examples/existing-image.yaml
@@ -0,0 +1,45 @@
# Scenario: Existing pre-built image (e.g. ACR), new Foundry project.
#
# An `image:` reference is present (no build, no zip). azd skips package/publish
# and Foundry pulls the already-published image directly. The top-level `image:`
# field is the existing ServiceConfig field (also used by containerapp), and it
# supports ${VAR} expansion. Illustrative end-state, not a runnable template.

name: my-foundry-agents

services:
  foundry-project:
    host: azure.ai.project
    config:
      deployments:
        - name: gpt-4.1-mini
          model:
            format: OpenAI
            name: gpt-4.1-mini
            version: "2025-04-14"
          sku:
            name: GlobalBatch
            capacity: 10
      toolboxes:
        agent-toolbox:
          tools:
            - { type: web_search }

  my-agent:
    host: azure.ai.agent
    uses: [foundry-project]
    # Deploy source: image present -> use the existing pre-built image as-is.
    image: myregistry.azurecr.io/my-agent:v1
    config:
      kind: hosted
      description: A basic agent hosted by Foundry.
      protocols:
        - { protocol: responses, version: 1.0.0 }
      container:
        resources: { cpu: "0.25", memory: 0.5Gi }
      env:
        AZURE_AI_MODEL_DEPLOYMENT_NAME: gpt-4.1-mini

# No infra: block needed by default. azd provision uses built-in Bicep templates
# internally (like azd compose). Opt in to Bicep on disk via `azd infra gen` or
# equivalent (separate RFC: https://github.com/Azure/azure-dev/issues/8065).