Add a telemetry coverage test to lock the azure.ai.agents data-contract field set #8613

@therealjohn

Summary

The azure.ai.agents extension's leadership metrics (onboarding success rate, attempts-to-success, time-to-success, top errors, and the cross-client funnel) are computed by the Foundry Growth data-science team from a fixed set of telemetry fields emitted by azd core. There is currently nothing that fails CI when one of those fields is renamed, removed, or retyped, so a breaking change ships silently and the downstream KQL quietly returns wrong numbers.

This is not hypothetical -- it already happened. The original onboarding.kql filtered successful deploys with tobool(Props['cmd.exit.success']) == true, but cmd.exit.success does not exist in azd telemetry. Across 365 days of cmd.up + cmd.deploy events, 0 rows matched, so the success funnel never counted a single successful deploy. Success is actually signaled by the top-level Success boolean / ResultCode. A coverage test would have caught a field-name drift like this before release.

The field set the data contract depends on

These azd-defined attribute keys are the brittle surface. All are declared in cli/azd/internal/tracing/fields/fields.go:

Command-grain (root command span):

  • CmdEntry -> cmd.entry
  • ProjectServiceHostsKey -> project.service.hosts
  • ProjectServiceTargetsKey -> project.service.targets
  • DevDeviceIdKey -> machine.devdeviceid
  • SubscriptionIdKey -> ad.subscription.id
  • ProjectNameKey -> project.name

Step-grain (exegraph.step child span):

  • ExeGraphStepNameKey -> exegraph.step.name
  • ExeGraphStepTagsKey -> exegraph.step.tags

Span status (out of scope for the field-constant test, listed for completeness): the top-level Success boolean and ResultCode come from the OTEL span status mapping, not from the fields package, so they are comparatively stable. The azd-defined keys above are the part that needs a guard.

Proposed guard

Extend the existing contract test at cli/azd/cmd/telemetry_coverage_test.go -- it already follows this exact pattern in TestTelemetryFieldConstants ("if a field constant is removed or renamed, this test will fail, catching regressions in the telemetry schema"). Add a focused subtest (e.g. AgentDataContractFields) that asserts each constant above resolves to its exact attribute key string, for example:

require.Equal(t, "cmd.entry", string(fields.CmdEntry.Key))
require.Equal(t, "project.service.hosts", string(fields.ProjectServiceHostsKey.Key))
require.Equal(t, "project.service.targets", string(fields.ProjectServiceTargetsKey.Key))
require.Equal(t, "machine.devdeviceid", string(fields.DevDeviceIdKey.Key))
require.Equal(t, "ad.subscription.id", string(fields.SubscriptionIdKey.Key))
require.Equal(t, "project.name", string(fields.ProjectNameKey.Key))
require.Equal(t, "exegraph.step.name", string(fields.ExeGraphStepNameKey.Key))
require.Equal(t, "exegraph.step.tags", string(fields.ExeGraphStepTagsKey.Key))

A renamed constant breaks the build; a changed key string breaks the assertion. Either way the change is caught in CI before it reaches customers and the DS queries.

Acceptance criteria

  • A test in cli/azd/cmd fails if any of the listed constants is renamed or removed, or if its attribute key string changes.
  • The test references the azure.ai.agents data contract so a future editor understands why these specific fields are locked.
