From ac63cd89bb9c0f80fa40168ba9c711cbb18231b7 Mon Sep 17 00:00:00 2001 From: Arun Sekhar Date: Wed, 10 Jun 2026 16:56:39 -0700 Subject: [PATCH] Document GatewayTimeout from Cognitive Services RP and add deployment-state poll to verify-claude-code (#36) A GatewayTimeout from Microsoft.CognitiveServices during the model-deployment step of azd up is an ARM-layer poll timeout on a long-running operation, not a real failure -- the RP usually keeps provisioning and the deployment reaches Succeeded minutes later. Changes: - README troubleshooting: new row describing the symptom, why it happens, and the safe recovery path. - skills/claude-on-foundry/SKILL.md: matching row in the DIAGNOSE provisioning-failures table. - scripts/verify-claude-code.ps1: new 4b check that queries 'az cognitiveservices account deployment list' for each ANTHROPIC_DEFAULT__MODEL value, reports provisioningState, and (with new -WaitForDeployment switch) polls until every deployment reaches a terminal state. Configurable via -WaitTimeoutSeconds (default 1800). - scripts/verify-claude-code.sh: sibling parity with --wait-for-deployment and --wait-timeout flags. --- README.md | 1 + scripts/verify-claude-code.ps1 | 105 ++++++++++++++++++++++++++++-- scripts/verify-claude-code.sh | 71 ++++++++++++++++++-- skills/claude-on-foundry/SKILL.md | 1 + 4 files changed, 167 insertions(+), 11 deletions(-) diff --git a/README.md b/README.md index b6e2930..a4d0c80 100644 --- a/README.md +++ b/README.md @@ -473,6 +473,7 @@ claude/ | `Error occurred when subscribing to Marketplace: Marketplace Subscription purchase eligibility check failed` | Your subscription cannot purchase the Anthropic offer (no entitlement, sandbox sub, paid-offer policy denial, etc.). Either use a subscription with Claude-on-Foundry entitlement, or pre-accept the agreement explicitly with `az term accept --publisher anthropic --product anthropic--offer --plan anthropic--plan-new`. | | Opaque `400 715-123420 "An error occurred. 
Please reach out to support for additional assistance."` on the Terraform deployment step (RG / Foundry account / project all succeed) | **Insufficient quota.** Terraform's `azapi_resource` bypasses ARM preflight validation and the Cognitive Services RP returns this generic code instead of `InsufficientQuota`. **Fix:** check `az cognitiveservices usage list -l --query "[?contains(name.value,'')]"` — if `currentValue + requestedCapacity > limit`, lower `CLAUDE_SONNET_CAPACITY` / `CLAUDE_HAIKU_CAPACITY` / `CLAUDE_OPUS_CAPACITY` via `azd env set`, delete unused deployments to free capacity, or request a quota increase in the Foundry portal. **Also check for soft-deleted accounts** still holding quota — see [Free quota held by soft-deleted accounts](#free-quota-held-by-soft-deleted-accounts). To confirm it really is quota, re-run on the Bicep variant which surfaces the clearer `InsufficientQuota` error. | | Bicep: `InsufficientQuota: This operation require N new capacity in quota Tokens Per Minute (thousands) - Claude , which is bigger than the current available capacity X. The current quota usage is U and the quota limit is L.` | Same root cause as `715-123420` above, just with a clear message because Bicep goes through ARM preflight. Lower the capacity env var(s) or free up quota. | +| `GatewayTimeout: The gateway did not receive a response from 'Microsoft.CognitiveServices' within the specified time period.` during the model deployment step, often with the deployment stuck in `Creating` | **ARM-layer poll timeout on a slow long-running operation, not a real failure.** The Cognitive Services RP keeps working after ARM gives up; the model deployment can still reach `Succeeded` minutes later. First-time Claude provisioning on a fresh resource is the slowest combination, and times vary by region and family. 
**Do not re-run `azd up` blindly — it can collide with the in-flight LRO.** Check the server-side state first: run `pwsh -File scripts/verify-claude-code.ps1 -WaitForDeployment` (POSIX: `bash scripts/verify-claude-code.sh --wait-for-deployment`), which polls `az cognitiveservices account deployment list` and waits while any deployment is still `Creating`. Or check directly: `az cognitiveservices account deployment list -g -n `. If state is already `Succeeded`, run `azd env refresh` to repopulate outputs and you're done. | | Preflight: `Marketplace offer ... not found` | `CLAUDE_MODEL_NAME` is misspelled, the model isn't in the Anthropic-on-Foundry catalog yet, or Anthropic changed the plan-name convention. | | Preflight: `Quota insufficient` (exit 6) | Requested `CLAUDE_*_CAPACITY` plus existing usage exceeds the per-region quota limit. Lower the requested capacity, free up quota by deleting unused deployments, or [purge soft-deleted accounts](#free-quota-held-by-soft-deleted-accounts) that may still be holding TPM. | | Quota looks full but you have no live deployments (`az cognitiveservices usage list` shows `currentValue > 0`, deployment still fails with `715-123420` / `InsufficientQuota`) | **Soft-deleted Cognitive Services accounts still reserve quota for 48 h.** A previous `azd down` (or any RG / account delete) puts the AIServices account in a recoverable state that keeps holding TPM. **Fix:** list and purge them: `az cognitiveservices account list-deleted -o table` then `az cognitiveservices account purge --name --location --resource-group ` for each. See [Free quota held by soft-deleted accounts](#free-quota-held-by-soft-deleted-accounts). | diff --git a/scripts/verify-claude-code.ps1 b/scripts/verify-claude-code.ps1 index 6e37430..e9b367b 100644 --- a/scripts/verify-claude-code.ps1 +++ b/scripts/verify-claude-code.ps1 @@ -12,12 +12,19 @@ schema the Claude Code VS Code extension reads. 3. 
`az` is logged in and the current token tenant matches the tenant that owns the Foundry resource (a mismatch is the #1 cause of 401s). - 4. The Claude Code CLI is on PATH. If not, the script prints the install + 4. Each Claude model deployment on the Foundry account has reached a + terminal `provisioningState`. If `azd up` returned a `GatewayTimeout` + from `Microsoft.CognitiveServices`, that is an ARM-layer poll timeout, + not a deployment failure — the RP often keeps going for many + more minutes. Re-running this verifier (optionally with + `-WaitForDeployment`) is the safe way to confirm the actual outcome + without colliding with the in-flight long-running operation. + 5. The Claude Code CLI is on PATH. If not, the script prints the install hint (or runs the official installer when `-AutoInstall` is set, the same gate as `CLAUDE_CODE_AUTO_INSTALL` in the postprovision hook). - 5. (Default) A non-interactive `claude -p` round trip against each + 6. (Default) A non-interactive `claude -p` round trip against each deployed family. Skips this step with `-SkipClaudeCall`. - 6. (Opt-in) A `python src/hello_claude.py` round trip exercising the + 7. (Opt-in) A `python src/hello_claude.py` round trip exercising the Anthropic SDK + Entra ID code path. Enable with `-RunPythonSample`. .PARAMETER RepoRoot @@ -35,6 +42,18 @@ Requires `.env.local` populated via `azd env get-values` and a venv with `pip install -r requirements.txt`. +.PARAMETER WaitForDeployment + If any Claude model deployment is still in a non-terminal state (e.g. + `Creating`), poll the Cognitive Services RP until every deployment + reaches `Succeeded` / `Failed` / `Canceled` or `-WaitTimeoutSeconds` + elapses. Use this after a `GatewayTimeout` from `azd up` to confirm + whether the deployment actually finished server-side. + +.PARAMETER WaitTimeoutSeconds + Maximum seconds to wait for non-terminal deployments when + `-WaitForDeployment` is set. Default: 1800 (30 min). 
Set to 0 for a + single status check with no polling. + .EXAMPLE pwsh -File scripts/verify-claude-code.ps1 # All checks + live claude -p round trip per deployed family. @@ -46,13 +65,20 @@ .EXAMPLE pwsh -File scripts/verify-claude-code.ps1 -RunPythonSample # Adds a Python Entra ID round trip on top of the standard checks. + +.EXAMPLE + pwsh -File scripts/verify-claude-code.ps1 -WaitForDeployment + # Use this after a GatewayTimeout from `azd up` to wait for the RP to + # finish provisioning the model deployment(s). #> [CmdletBinding()] param( [string] $RepoRoot, [switch] $AutoInstall, [switch] $SkipClaudeCall, - [switch] $RunPythonSample + [switch] $RunPythonSample, + [switch] $WaitForDeployment, + [int] $WaitTimeoutSeconds = 1800 ) $ErrorActionPreference = 'Stop' @@ -172,7 +198,8 @@ if (-not $azCmd) { $accountsJson = & az cognitiveservices account list -o json 2>$null if ($accountsJson) { $accounts = $accountsJson | ConvertFrom-Json - $found = $accounts | Where-Object { $_.name -eq $foundryResource } | Select-Object -First 1 + $script:foundryAccount = $accounts | Where-Object { $_.name -eq $foundryResource } | Select-Object -First 1 + $found = $script:foundryAccount } } catch { } if ($found) { @@ -187,6 +214,74 @@ if (-not $azCmd) { } } +# --------------------------------------------------------------------------- +# 4b. Model deployment provisioning state. +# +# A `GatewayTimeout` from `Microsoft.CognitiveServices` during `azd up` +# is an ARM-layer poll timeout, not a deployment failure -- the RP +# often keeps provisioning for many more minutes. This check asks the +# RP directly so we can confirm the actual outcome without re-running +# `azd up` (which can collide with the in-flight LRO). 
+# --------------------------------------------------------------------------- +if ($azCmd -and $script:foundryAccount -and $deployedFamilies.Count -gt 0) { + $rgName = $script:foundryAccount.resourceGroup + $expectedNames = @($deployedFamilies | ForEach-Object { $_.Deployment }) + $terminalStates = @('Succeeded', 'Failed', 'Canceled') + $deadline = (Get-Date).AddSeconds([math]::Max(0, $WaitTimeoutSeconds)) + $pollIntervalSec = 30 + $firstPass = $true + + while ($true) { + $deployments = @() + try { + $depsJson = & az cognitiveservices account deployment list -g $rgName -n $foundryResource -o json 2>$null + if ($depsJson) { $deployments = @($depsJson | ConvertFrom-Json) } + } catch { } + + $statuses = @() + $stillCreating = @() + foreach ($name in $expectedNames) { + $d = $deployments | Where-Object { $_.name -eq $name } | Select-Object -First 1 + if (-not $d) { + $statuses += [pscustomobject]@{ Name = $name; State = '' } + continue + } + $state = $d.properties.provisioningState + $statuses += [pscustomobject]@{ Name = $name; State = $state } + if ($state -and $terminalStates -notcontains $state) { + $stillCreating += $name + } + } + + if ($firstPass -or $stillCreating.Count -eq 0 -or -not $WaitForDeployment -or (Get-Date) -ge $deadline) { + foreach ($s in $statuses) { + $checkName = "Deployment '$($s.Name)'" + switch ($s.State) { + 'Succeeded' { Add-Result $checkName 'PASS' 'provisioningState=Succeeded' } + 'Failed' { Add-Result $checkName 'FAIL' 'provisioningState=Failed' } + 'Canceled' { Add-Result $checkName 'FAIL' 'provisioningState=Canceled' } + '' { Add-Result $checkName 'WARN' 'not found on Foundry account - may still be creating, or activator is stale' } + default { + $hint = if ($WaitForDeployment) { "still $($s.State) after waiting $WaitTimeoutSeconds s" } else { "provisioningState=$($s.State); rerun with -WaitForDeployment to poll" } + Add-Result $checkName 'WARN' $hint + } + } + } + } + + if (-not $WaitForDeployment -or $stillCreating.Count -eq 0 -or 
(Get-Date) -ge $deadline) { + break + } + + $remaining = [int]($deadline - (Get-Date)).TotalSeconds + Write-Host (" ... {0} deployment(s) still provisioning ({1}); polling again in {2}s (timeout in {3}s)" -f $stillCreating.Count, ($stillCreating -join ', '), $pollIntervalSec, $remaining) -ForegroundColor DarkGray + Start-Sleep -Seconds $pollIntervalSec + $firstPass = $false + } +} elseif ($WaitForDeployment) { + Add-Result 'Model deployment state' 'WARN' 'cannot poll - az not available, Foundry resource not visible, or no families set' +} + # --------------------------------------------------------------------------- # 5. Claude Code CLI on PATH (optional auto-install). # --------------------------------------------------------------------------- diff --git a/scripts/verify-claude-code.sh b/scripts/verify-claude-code.sh index a7b8aab..e042b96 100644 --- a/scripts/verify-claude-code.sh +++ b/scripts/verify-claude-code.sh @@ -7,6 +7,9 @@ # bash scripts/verify-claude-code.sh --skip-claude-call # config checks only, no token cost # bash scripts/verify-claude-code.sh --auto-install # install claude CLI if missing # bash scripts/verify-claude-code.sh --run-python-sample # also run python src/hello_claude.py +# bash scripts/verify-claude-code.sh --wait-for-deployment # poll RP while any deployment is still Creating +# (use after a GatewayTimeout from `azd up`) +# bash scripts/verify-claude-code.sh --wait-timeout 1800 # cap on --wait-for-deployment (default 1800s) # # Exit codes: # 0 all checks passed (warnings allowed) @@ -17,15 +20,19 @@ repo_root="" auto_install=0 skip_claude=0 run_python=0 +wait_deployment=0 +wait_timeout=1800 while [[ $# -gt 0 ]]; do case "$1" in - --repo-root) repo_root="$2"; shift 2 ;; - --auto-install) auto_install=1; shift ;; - --skip-claude-call) skip_claude=1; shift ;; - --run-python-sample) run_python=1; shift ;; - -h|--help) sed -n '2,15p' "$0"; exit 0 ;; - *) echo "Unknown flag: $1" >&2; exit 2 ;; + --repo-root) repo_root="$2"; shift 2 ;; + 
--auto-install) auto_install=1; shift ;; + --skip-claude-call) skip_claude=1; shift ;; + --run-python-sample) run_python=1; shift ;; + --wait-for-deployment) wait_deployment=1; shift ;; + --wait-timeout) wait_timeout="$2"; shift 2 ;; + -h|--help) sed -n '2,15p' "$0"; exit 0 ;; + *) echo "Unknown flag: $1" >&2; exit 2 ;; esac done @@ -142,6 +149,7 @@ else loc=$(az cognitiveservices account list -o tsv --query "[?name=='$foundry_resource'].location | [0]" 2>/dev/null || echo '') if [[ -n "$rg" ]]; then add_result PASS "Foundry resource reachable" "$foundry_resource (rg: $rg, location: $loc)" + foundry_rg="$rg" else add_result WARN "Foundry resource reachable" "$foundry_resource not visible to current az login - wrong tenant/subscription?" fi @@ -149,6 +157,57 @@ else fi fi +# 4b. Model deployment provisioning state. +# +# A `GatewayTimeout` from `Microsoft.CognitiveServices` during `azd up` +# is an ARM-layer poll timeout, not a deployment failure -- the RP +# often keeps provisioning for many more minutes. Ask the RP directly +# so we can confirm the actual outcome without re-running `azd up`. +foundry_rg="${foundry_rg:-}" +if command -v az >/dev/null 2>&1 && [[ -n "$foundry_rg" && ${#deployed_families[@]} -gt 0 ]]; then + poll_interval=30 + deadline=$(( $(date +%s) + (wait_timeout > 0 ? 
wait_timeout : 0) )) + first_pass=1 + while :; do + deps_json=$(az cognitiveservices account deployment list -g "$foundry_rg" -n "$foundry_resource" -o json 2>/dev/null || echo '[]') + still_creating=() + for entry in "${deployed_families[@]}"; do + name="${entry##*|}" + state=$(echo "$deps_json" | python -c "import json,sys; data=json.load(sys.stdin); m=[d for d in data if d.get('name')==sys.argv[1]]; print(m[0]['properties']['provisioningState'] if m else '')" "$name" 2>/dev/null || echo '') + case "$state" in + Succeeded|Failed|Canceled|''|'') : ;; + *) still_creating+=("$name") ;; + esac + if [[ $first_pass -eq 1 || ${#still_creating[@]} -eq 0 || $wait_deployment -eq 0 || $(date +%s) -ge $deadline ]]; then + case "$state" in + Succeeded) add_result PASS "Deployment '$name'" "provisioningState=Succeeded" ;; + Failed) add_result FAIL "Deployment '$name'" "provisioningState=Failed" ;; + Canceled) add_result FAIL "Deployment '$name'" "provisioningState=Canceled" ;; + '') add_result WARN "Deployment '$name'" "not found on Foundry account - may still be creating, or activator is stale" ;; + '') add_result WARN "Deployment '$name'" "could not parse deployment list (jq/python missing?)" ;; + *) + if [[ $wait_deployment -eq 1 ]]; then + add_result WARN "Deployment '$name'" "still $state after waiting ${wait_timeout}s" + else + add_result WARN "Deployment '$name'" "provisioningState=$state; rerun with --wait-for-deployment to poll" + fi + ;; + esac + fi + done + + if [[ $wait_deployment -eq 0 || ${#still_creating[@]} -eq 0 || $(date +%s) -ge $deadline ]]; then + break + fi + remaining=$(( deadline - $(date +%s) )) + printf " ${C_DIM}... 
%d deployment(s) still provisioning (%s); polling again in %ds (timeout in %ds)${C_RST}\n" "${#still_creating[@]}" "$(IFS=,; echo "${still_creating[*]}")" "$poll_interval" "$remaining" + sleep "$poll_interval" + first_pass=0 + done +elif [[ $wait_deployment -eq 1 ]]; then + add_result WARN "Model deployment state" "cannot poll - az not available, Foundry resource not visible, or no families set" +fi + # 5. Claude Code CLI on PATH. auto_install_env="${CLAUDE_CODE_AUTO_INSTALL:-}" auto_install_env_on=0 diff --git a/skills/claude-on-foundry/SKILL.md b/skills/claude-on-foundry/SKILL.md index 2a83883..ea00caa 100644 --- a/skills/claude-on-foundry/SKILL.md +++ b/skills/claude-on-foundry/SKILL.md @@ -116,6 +116,7 @@ Match the customer's exact error string to a row. Verify the diagnostic command | `Marketplace offer ... not found` (from preflight, exit 4) | `CLAUDE_*_MODEL` value is misspelled or that SKU isn't in the catalog. | `./Get-ClaudeCatalog.ps1` and grep the family. | Set `CLAUDE__MODEL` to a name from the catalog. | | `Quota insufficient` (from preflight, exit 6) | Requested capacity + existing usage > per-region limit. | `az cognitiveservices usage list -l --query "[?contains(name.value,'claude-')]"` | Lower `CLAUDE__CAPACITY`, free quota (see soft-delete row), or request a quota bump in the Foundry portal. | | Bicep: `InsufficientQuota: This operation require N new capacity in quota Tokens Per Minute (thousands) - Claude ` | Same as above; Bicep gets the clear message because it goes through ARM preflight. | Same diagnostic. | Same fix. | +| `GatewayTimeout: The gateway did not receive a response from 'Microsoft.CognitiveServices' within the specified time period.` — deployment stuck in `Creating` | ARM-layer poll timeout on a slow LRO, **not** a real failure. The RP keeps provisioning after ARM gives up; deployment usually reaches `Succeeded` minutes later. More likely on first-time deploys; varies by region and family. 
| `az cognitiveservices account deployment list -g -n -o table` — check `provisioningState`. | If `Succeeded`: run `azd env refresh` and proceed. If still `Creating`: wait it out with `pwsh -File scripts/verify-claude-code.ps1 -WaitForDeployment` (POSIX: `--wait-for-deployment`), which polls until terminal state. **Do not re-run `azd up`** — it can collide with the in-flight LRO. | | Terraform: opaque `400 715-123420 "An error occurred. Please reach out to support for additional assistance."` | **Almost always insufficient quota.** Terraform's `azapi_resource` skips ARM preflight so the RP returns this generic code. | `az cognitiveservices usage list -l --query "[?contains(name.value,'')].{quota:name.value, used:currentValue, limit:limit}" -o table` | If `used + requested > limit`: lower capacity OR purge soft-deleted accounts (next row). Re-run on Bicep variant if you need a clearer error. | | Quota looks full but no live deployments exist | Soft-deleted Cognitive Services accounts hold quota for up to 48 h. | `az cognitiveservices account list-deleted -o table` | **Confirm with user first**, then for each: `az cognitiveservices account purge --name --location --resource-group `. The original RG name is in the deleted-account id field 9. | | `Marketplace Subscription purchase eligibility check failed` | Subscription can't purchase the Anthropic offer (no entitlement / sandbox / paid-offer policy). | Confirm sub type (see [PLAN](#plan--before-azd-up)). | Either use a Claude-eligible sub, or pre-accept explicitly: `az term accept --publisher anthropic --product anthropic--offer --plan anthropic--plan-new`. |
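
The state handling that the 4b check applies to the output of `az cognitiveservices account deployment list -o json` can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code from the patch: the helper names `classify` and `still_provisioning` and the deployment names in the sample data are hypothetical. Only the `properties.provisioningState` field, the three terminal states, and the PASS / FAIL / WARN mapping are taken from the scripts in this change.

```python
import json

# Terminal states named in the patch; anything else is treated as in flight.
TERMINAL_STATES = {"Succeeded", "Failed", "Canceled"}


def classify(deployments_json, expected_names):
    """Map each expected deployment name to a (result, state) pair.

    `deployments_json` is assumed to be the JSON array printed by
    `az cognitiveservices account deployment list -o json`.
    """
    by_name = {d.get("name"): d for d in json.loads(deployments_json)}
    report = {}
    for name in expected_names:
        deployment = by_name.get(name)
        if deployment is None:
            # Not listed on the account: may still be creating.
            report[name] = ("WARN", None)
            continue
        state = (deployment.get("properties") or {}).get("provisioningState")
        if state == "Succeeded":
            report[name] = ("PASS", state)
        elif state in TERMINAL_STATES:
            # Failed or Canceled: terminal, but not usable.
            report[name] = ("FAIL", state)
        else:
            # Non-terminal (for example Creating): worth polling again.
            report[name] = ("WARN", state)
    return report


def still_provisioning(report):
    """Names whose state is known and not yet terminal."""
    return [
        name
        for name, (_, state) in report.items()
        if state and state not in TERMINAL_STATES
    ]


# Hypothetical sample data for illustration only.
sample = json.dumps([
    {"name": "sonnet-deployment", "properties": {"provisioningState": "Succeeded"}},
    {"name": "haiku-deployment", "properties": {"provisioningState": "Creating"}},
    {"name": "opus-deployment", "properties": {"provisioningState": "Failed"}},
])
expected = ["sonnet-deployment", "haiku-deployment", "opus-deployment", "absent-deployment"]
report = classify(sample, expected)
pending = still_provisioning(report)
```

A polling loop would call `classify` on each fresh listing and stop once `still_provisioning` returns an empty list or the timeout elapses. Collecting the whole report before deciding whether to emit results keeps the per-deployment output from being printed more than once across polls.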