Document GatewayTimeout from Cognitive Services RP and add deployment-state poll to verify-claude-code #37
Merged
…-state poll to verify-claude-code (#36)

A GatewayTimeout from Microsoft.CognitiveServices during the model-deployment step of azd up is an ARM-layer poll timeout on a long-running operation, not a real failure -- the RP usually keeps provisioning and the deployment reaches Succeeded minutes later.

Changes:
- README troubleshooting: new row describing the symptom, why it happens, and the safe recovery path.
- skills/claude-on-foundry/SKILL.md: matching row in the DIAGNOSE provisioning-failures table.
- scripts/verify-claude-code.ps1: new 4b check that queries 'az cognitiveservices account deployment list' for each ANTHROPIC_DEFAULT_<FAMILY>_MODEL value, reports provisioningState, and (with new -WaitForDeployment switch) polls until every deployment reaches a terminal state. Configurable via -WaitTimeoutSeconds (default 1800).
- scripts/verify-claude-code.sh: sibling parity with --wait-for-deployment and --wait-timeout flags.
Closes #36.
What
`azd up` sometimes returns a `GatewayTimeout` from `Microsoft.CognitiveServices` while a Claude model deployment is still in the `Creating` state. The Cognitive Services RP usually keeps provisioning after ARM gives up, and the deployment reaches `Succeeded` minutes later. Today the kit doesn't document this and gives users no easy way to confirm the actual server-side outcome without re-running `azd up` (which can collide with the in-flight LRO).
Why
This is an ARM-layer poll timeout on a long-running operation, not a deployment failure. First-time Claude deployments on a fresh resource tend to be the slowest path, and provisioning time varies by region and family. Documenting the symptom and shipping a safe recovery path keeps users from accidentally racing the RP with a re-run.
Changes
- `README.md`: new troubleshooting row describing the symptom, why it happens, and the safe recovery path (verifier + `az cognitiveservices account deployment list` + `azd env refresh`).
- `skills/claude-on-foundry/SKILL.md`: matching row in the DIAGNOSE Provisioning failures table so AI assistants reading the skill surface the same guidance.
- `scripts/verify-claude-code.ps1`: new step 4b queries the RP for each `ANTHROPIC_DEFAULT_<FAMILY>_MODEL` value, reports `provisioningState`, and (with the new `-WaitForDeployment` switch) polls every 30 s until each deployment reaches `Succeeded`/`Failed`/`Canceled` or `-WaitTimeoutSeconds` (default `1800`) elapses. Skipped gracefully when `az` is missing, the resource isn't visible, or no families are set.
- `scripts/verify-claude-code.sh`: sibling parity with `--wait-for-deployment` and `--wait-timeout <sec>` flags.
Usage after a `GatewayTimeout`
POSIX:
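The wait behaviour of step 4b (poll every 30 s until each deployment is `Succeeded`, `Failed`, or `Canceled`, or the timeout elapses) can be sketched in bash roughly as follows. This is an illustration, not the code in `scripts/verify-claude-code.sh`: the function names, the `FOUNDRY_ACCOUNT` and `RESOURCE_GROUP` variables, the `TimedOut` marker, and the exit codes are all assumptions.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; names and exit codes are hypothetical.

# True when the state is one of the terminal provisioning states.
is_terminal() {
  case "$1" in
    Succeeded|Failed|Canceled) return 0 ;;
    *) return 1 ;;
  esac
}

# Assumed lookup: read provisioningState for one deployment from the RP.
# FOUNDRY_ACCOUNT and RESOURCE_GROUP are placeholder variable names.
get_state() {
  az cognitiveservices account deployment list \
    --name "$FOUNDRY_ACCOUNT" --resource-group "$RESOURCE_GROUP" \
    --query "[?name=='$1'].properties.provisioningState | [0]" -o tsv
}

# wait_for_deployment <deployment> <timeout_sec> [interval_sec]
# Prints the final state. Returns 0 on Succeeded, 1 on Failed/Canceled,
# 2 if the timeout elapses before a terminal state is seen.
wait_for_deployment() {
  local name="$1" timeout="$2" interval="${3:-30}"
  local elapsed=0 state
  while :; do
    state="$(get_state "$name")"
    if is_terminal "$state"; then
      echo "$state"
      [ "$state" = "Succeeded" ]
      return
    fi
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "TimedOut"
      return 2
    fi
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
}
```

With those definitions sourced, `wait_for_deployment <deployment-name> 1800` would mirror the default timeout mentioned above. Because the loop only reads state, it does not race the in-flight LRO the way re-running `azd up` can.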
Verification
- `pwsh -File scripts/verify-claude-code.ps1 -SkipClaudeCall` runs cleanly on a workspace where the original Foundry account has already been torn down. The new check skips gracefully, printing `cannot poll - az not available, Foundry resource not visible, or no families set` only when `-WaitForDeployment` is explicitly set and staying silent otherwise.
- …`Get-Command -Syntax`.
- …`bash -n`.
Notes
- …`az` invocation that the verifier already requires for the tenant check.
- …`ConvertFrom-Json`); the bash variant uses Python for JSON parsing to avoid a hard `jq` dependency (mirroring the existing pattern in the script).
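As a rough illustration of the `jq`-free approach, a bash helper can pipe the deployment list JSON through an inline Python snippet. The helper name `parse_states` and the assumed JSON shape (an array of objects with `name` and `properties.provisioningState`) are hypothetical and may not match what the script actually does.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; parse_states is a hypothetical helper name.

# Reads deployment-list JSON on stdin and prints one
# "name<TAB>provisioningState" line per deployment.
# Assumes each entry carries name and properties.provisioningState.
parse_states() {
  python3 -c '
import json, sys
for d in json.load(sys.stdin):
    props = d.get("properties") or {}
    print(d.get("name", ""), props.get("provisioningState", "Unknown"), sep="\t")
'
}
```

It could be fed from `az cognitiveservices account deployment list ... -o json | parse_states`, leaving `python3` as the only parsing dependency.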