Document GatewayTimeout from Cognitive Services RP and add deployment-state poll to verify-claude-code #37

Merged
achandmsft merged 1 commit into main from document-gateway-timeout on Jun 11, 2026
@achandmsft

Collaborator

Closes #36.

What

azd up sometimes returns a GatewayTimeout from Microsoft.CognitiveServices while a Claude model deployment is still in the Creating state. The Cognitive Services RP usually keeps provisioning after ARM gives up, and the deployment reaches Succeeded minutes later. Today the kit doesn't document this and gives users no easy way to confirm the actual server-side outcome without re-running azd up (which can collide with the in-flight LRO).

{
    "status": "Failed",
    "error": {
        "code": "GatewayTimeout",
        "message": "The gateway did not receive a response from 'Microsoft.CognitiveServices' within the specified time period."
    }
}

Why

This is an ARM-layer poll timeout on a long-running operation, not a deployment failure. First-time Claude deployments on a fresh resource tend to be the slowest path, and provisioning time varies by region and family. Documenting the symptom and shipping a safe recovery path keeps users from accidentally racing the RP with a re-run.
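
A minimal sketch of checking the server-side outcome directly, rather than re-running azd up. It only wraps az cognitiveservices account deployment list; the function name and the account and resource-group arguments are placeholders for illustration, not part of the kit.

```shell
#!/usr/bin/env bash
# Illustrative helper (hypothetical name): print "<deployment>\t<provisioningState>"
# for every deployment on a Foundry / Cognitive Services account.
# $1 = account name, $2 = resource group (both supplied by the caller).
list_deployment_states() {
  local account="$1" resource_group="$2"
  az cognitiveservices account deployment list \
    --name "$account" \
    --resource-group "$resource_group" \
    --query "[].[name, properties.provisioningState]" \
    --output tsv
}
```

A row that still shows Creating after the GatewayTimeout suggests the RP is still working, so waiting is likely safer than re-running azd up.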

Changes

  • README.md — new troubleshooting row describing the symptom, why it happens, and the safe recovery path (verifier + az cognitiveservices account deployment list + azd env refresh).
  • skills/claude-on-foundry/SKILL.md — matching row in the DIAGNOSE Provisioning failures table so AI assistants reading the skill surface the same guidance.
  • scripts/verify-claude-code.ps1 — new step 4b queries the RP for each ANTHROPIC_DEFAULT_<FAMILY>_MODEL value, reports provisioningState, and (with the new -WaitForDeployment switch) polls every 30 s until each deployment reaches Succeeded / Failed / Canceled or -WaitTimeoutSeconds (default 1800) elapses. Skipped gracefully when az is missing, the resource isn't visible, or no families are set.
  • scripts/verify-claude-code.sh — sibling parity with --wait-for-deployment and --wait-timeout <sec> flags.
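
The polling behavior described for step 4b can be sketched roughly as follows. This is not the code shipped in the scripts: the function names are hypothetical, and the state lookup is passed in as a command so the loop stays independent of how the state is fetched. The 1800 s and 30 s defaults mirror the values stated above.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; names are hypothetical, not taken from the verifier.

# Succeeded, Failed, and Canceled are the terminal states named in this PR.
is_terminal_state() {
  case "$1" in
    Succeeded|Failed|Canceled) return 0 ;;
    *) return 1 ;;
  esac
}

# wait_for_deployment <fetch_cmd> [timeout_seconds] [interval_seconds]
# <fetch_cmd> is any command or function that prints one provisioningState.
# Returns 0 on Succeeded, 1 on Failed/Canceled, 2 when the timeout elapses.
wait_for_deployment() {
  local fetch_cmd="$1" timeout="${2:-1800}" interval="${3:-30}"
  local elapsed=0 state
  while :; do
    state="$("$fetch_cmd")"
    if is_terminal_state "$state"; then
      echo "$state"
      [ "$state" = "Succeeded" ]
      return
    fi
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "TimedOut (last state: ${state:-unknown})"
      return 2
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
}
```

The loop only reads state and never issues a write, so it cannot collide with the in-flight LRO.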

Usage after a GatewayTimeout

pwsh -File scripts/verify-claude-code.ps1 -WaitForDeployment

POSIX:

bash scripts/verify-claude-code.sh --wait-for-deployment

Verification

  • pwsh -File scripts/verify-claude-code.ps1 -SkipClaudeCall runs cleanly on a workspace where the original Foundry account has already been torn down. The new check skips gracefully: it prints "cannot poll - az not available, Foundry resource not visible, or no families set" only when -WaitForDeployment is explicitly set, and stays silent otherwise.
  • PowerShell parses cleanly via Get-Command -Syntax.
  • Bash parses cleanly via bash -n.
  • No changes to IaC, env-var contract, or hook behavior.
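
For illustration, two of the skip conditions mentioned above (az missing, no families set) could be detected along these lines. The function names and messages are hypothetical, and the resource-visibility check is left out because it needs a live subscription.

```shell
#!/usr/bin/env bash
# Illustrative sketch; not the verifier's actual implementation.

# Print the value of every exported ANTHROPIC_DEFAULT_<FAMILY>_MODEL variable.
configured_deployments() {
  local var
  for var in $(compgen -e); do
    if [[ "$var" =~ ^ANTHROPIC_DEFAULT_[A-Z0-9]+_MODEL$ ]]; then
      if [ -n "${!var}" ]; then
        printf '%s\n' "${!var}"
      fi
    fi
  done
}

# Return 0 when polling is possible; otherwise explain on stderr and return 1.
can_poll() {
  command -v az >/dev/null 2>&1 || { echo "skip: az not available" >&2; return 1; }
  [ -n "$(configured_deployments)" ] || { echo "skip: no families set" >&2; return 1; }
}
```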

Notes

  • No new dependencies. Reuses the az invocation that the verifier already requires for the tenant check.
  • The PowerShell variant uses native objects (ConvertFrom-Json); the bash variant uses Python for JSON parsing to avoid a hard jq dependency (mirroring the existing pattern in the script).
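
As a rough sketch of the jq-free pattern, the JSON from az can be piped through python3 as shown below. The function name is hypothetical and the exact parsing in the script may differ. It assumes the JSON array shape returned by az cognitiveservices account deployment list --output json, where each item carries name and properties.provisioningState.

```shell
#!/usr/bin/env bash
# Illustrative sketch: read a JSON array of deployments on stdin and print
# "<name>\t<provisioningState>" per item, using only the Python standard library.
parse_deployment_states() {
  python3 -c '
import json, sys
for d in json.load(sys.stdin):
    props = d.get("properties") or {}
    print(d.get("name", ""), props.get("provisioningState", "Unknown"), sep="\t")
'
}
```

Typical use would be az cognitiveservices account deployment list ... --output json | parse_deployment_states.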

…-state poll to verify-claude-code (#36)

A GatewayTimeout from Microsoft.CognitiveServices during the model-deployment step of azd up is an ARM-layer poll timeout on a long-running operation, not a real failure -- the RP usually keeps provisioning and the deployment reaches Succeeded minutes later.

Changes:

- README troubleshooting: new row describing the symptom, why it happens, and the safe recovery path.

- skills/claude-on-foundry/SKILL.md: matching row in the DIAGNOSE provisioning-failures table.

- scripts/verify-claude-code.ps1: new 4b check that queries 'az cognitiveservices account deployment list' for each ANTHROPIC_DEFAULT_<FAMILY>_MODEL value, reports provisioningState, and (with new -WaitForDeployment switch) polls until every deployment reaches a terminal state. Configurable via -WaitTimeoutSeconds (default 1800).

- scripts/verify-claude-code.sh: sibling parity with --wait-for-deployment and --wait-timeout flags.
achandmsft merged commit 26a390b into main on Jun 11, 2026
5 checks passed
achandmsft deleted the document-gateway-timeout branch on June 11, 2026 00:01
Development

Successfully merging this pull request may close these issues.

Document GatewayTimeout from Microsoft.CognitiveServices and add a server-side deployment-state check to verify-claude-code

1 participant