235 changes: 235 additions & 0 deletions .agent/GPU_TEE_DEPLOYMENT.md
@@ -0,0 +1,235 @@
# GPU TEE Deployment Guide

Learnings from deploying GPU workloads to Phala Cloud TEE infrastructure.

## Instance Types

Query available instance types:
```bash
curl -s "https://cloud-api.phala.network/api/v1/instance-types" | jq
```
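
If you only need the GPU types, the output can be filtered with `jq`. A rough sketch; the top-level array shape and the `name` field are assumptions about the response format, not documented:

```bash
# Sketch: keep only H200 instance types.
# Assumes the response is a JSON array of objects with a "name" field.
curl -s "https://cloud-api.phala.network/api/v1/instance-types" \
  | jq '[.[] | select(.name | startswith("h200"))]'
```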

### CPU-only (Intel TDX)
- `tdx.small` through `tdx.8xlarge`

### GPU (H200 + TDX)
- `h200.small` — Single H200 GPU, suitable for inference
- `h200.16xlarge` — Multi-GPU for larger workloads
- `h200.8xlarge` — High-memory configuration

## Deployment Commands

### GPU Deployment
```bash
phala deploy -n my-app -c docker-compose.yaml \
  --instance-type h200.small \
  --region US-EAST-1 \
  --image dstack-nvidia-dev-0.5.4.1
```

Key flags:
- `--instance-type h200.small` — Required for GPU access
- `--image dstack-nvidia-dev-0.5.4.1` — NVIDIA development image with GPU drivers
- `--region US-EAST-1` — Region with GPU nodes (gpu-use2)

### Debugging
```bash
# Check CVM status
phala cvms list

# View serial logs (boot + container output)
phala cvms serial-logs <app_id> --tail 100

# Delete CVM
phala cvms delete <name-or-id> --force
```

## Docker Compose GPU Configuration

GPU devices must be explicitly reserved in docker-compose.yaml:

```yaml
services:
  my-gpu-app:
    image: my-image
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

Without the `deploy.resources.reservations.devices` section, the container will fail with:
```
libcuda.so.1: cannot open shared object file: No such file or directory
```

## vLLM Example

Working docker-compose.yaml for vLLM inference:

```yaml
services:
  vllm:
    image: vllm/vllm-openai:latest
    volumes:
      - /var/run/dstack.sock:/var/run/dstack.sock
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - HF_TOKEN=${HF_TOKEN:-}
    ports:
      - "8000:8000"
    command: >
      --model Qwen/Qwen2.5-1.5B-Instruct
      --host 0.0.0.0
      --port 8000
      --max-model-len 4096
      --gpu-memory-utilization 0.8
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

## Endpoint URLs

After deployment, the app is accessible at:
```
https://<app_id>-<port>.dstack-pha-<region>.phala.network
```

Example for vLLM on port 8000:
```bash
# List models
curl https://<app_id>-8000.dstack-pha-use2.phala.network/v1/models

# Chat completion
curl -X POST https://<app_id>-8000.dstack-pha-use2.phala.network/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "Qwen/Qwen2.5-1.5B-Instruct", "messages": [{"role": "user", "content": "Hello"}]}'
```
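
For a scripted smoke test, the assistant's reply can be pulled out with `jq`. A minimal sketch, assuming the standard OpenAI-compatible response shape that vLLM returns (`<app_id>` is a placeholder):

```bash
# Sketch: send a prompt and print only the reply text.
# Assumes an OpenAI-style response: choices[0].message.content.
curl -s -X POST https://<app_id>-8000.dstack-pha-use2.phala.network/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-1.5B-Instruct", "messages": [{"role": "user", "content": "Hello"}]}' \
  | jq -r '.choices[0].message.content'
```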

## vllm-proxy (Response Signing)

vllm-proxy provides response signing and attestation for vLLM inference. It sits between clients and vLLM, signing responses with TEE-derived keys.

### Configuration

**IMPORTANT**: The authentication environment variable is `TOKEN`, not `AUTH_TOKEN`.

```yaml
services:
  vllm:
    image: vllm/vllm-openai:latest
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    command: >
      --model Qwen/Qwen2.5-1.5B-Instruct
      --host 0.0.0.0
      --port 8000
      --max-model-len 4096
      --gpu-memory-utilization 0.8
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  proxy:
    image: phalanetwork/vllm-proxy:v0.2.18
    volumes:
      - /var/run/dstack.sock:/var/run/dstack.sock  # Required for TEE key derivation
    environment:
      - VLLM_BASE_URL=http://vllm:8000
      - MODEL_NAME=Qwen/Qwen2.5-1.5B-Instruct
      - TOKEN=your-secret-token  # NOT AUTH_TOKEN
    ports:
      - "8000:8000"
    depends_on:
      - vllm
```

### API Endpoints

```bash
# List models (no auth required)
curl https://<endpoint>/v1/models

# Chat completion (requires auth)
curl -X POST https://<endpoint>/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your-secret-token" \
-d '{"model": "Qwen/Qwen2.5-1.5B-Instruct", "messages": [{"role": "user", "content": "Hello"}]}'

# Get response signature
curl https://<endpoint>/v1/signature/<chat_id> \
-H "Authorization: Bearer your-secret-token"

# Attestation report
curl https://<endpoint>/v1/attestation/report \
-H "Authorization: Bearer your-secret-token"
```
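
The signature lookup needs the chat id returned by the completion call. A sketch of chaining the two requests, assuming the completion response carries an OpenAI-style `id` field (the shape of the signature response itself is not shown here):

```bash
# Sketch: request a completion, capture its id, then fetch the signature for it.
# Assumes the completion response includes an OpenAI-style "id" field.
CHAT_ID=$(curl -s -X POST https://<endpoint>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-secret-token" \
  -d '{"model": "Qwen/Qwen2.5-1.5B-Instruct", "messages": [{"role": "user", "content": "Hello"}]}' \
  | jq -r '.id')

curl -s https://<endpoint>/v1/signature/$CHAT_ID \
  -H "Authorization: Bearer your-secret-token" | jq
```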

### Tested Configuration

- Image: `phalanetwork/vllm-proxy:v0.2.18`
- Instance: `h200.small`
- Region: `US-EAST-1`
- Model: `Qwen/Qwen2.5-1.5B-Instruct`

### vllm-proxy Issues

**"Invalid token" error**:
- Check that you're using the `TOKEN` environment variable, not `AUTH_TOKEN`
- Verify the token value matches your request header

**"All connection attempts failed" from proxy**:
- vLLM is still loading the model (takes 1-2 minutes after container starts)
- Wait for vLLM to show "Uvicorn running on" in serial logs

**NVML error on attestation**:
- GPU confidential computing attestation may not be fully available
- This doesn't affect inference or response signing

## Common Issues

### "No available resources match your requirements"
- GPU nodes are limited. Wait for other CVMs to finish or try a different region.
- Ensure you're using the correct instance type (`h200.small`).

### Container crashes with GPU errors
- Add the `deploy.resources.reservations.devices` section to docker-compose.yaml.
- Verify you're using an NVIDIA development image (`dstack-nvidia-dev-*`).

### Image pull takes too long
- Large images (5GB+ for vLLM) take 3-5 minutes to download and extract.
- Check serial logs for progress.

## Testing Workflow

1. Deploy: `phala deploy -n test -c docker-compose.yaml --instance-type h200.small --region US-EAST-1 --image dstack-nvidia-dev-0.5.4.1`
2. Wait for status: `phala cvms list` (wait for "running"; a polling sketch follows this list)
3. Check logs: `phala cvms serial-logs <app_id> --tail 100`
4. Test API: `curl https://<app_id>-<port>.dstack-pha-use2.phala.network/...`
5. Cleanup: `phala cvms delete <name> --force`
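
Steps 2 and 3 can be scripted with a simple polling loop. A rough sketch, assuming `phala cvms list` prints one line per CVM with its name and status, and that vLLM logs "Uvicorn running on" once the API server is up (both are assumptions about the output, not documented guarantees):

```bash
# Sketch: wait for the CVM named "test" to reach the running state.
# Assumes `phala cvms list` output can be grepped for the name and a "running" status.
until phala cvms list | grep "test" | grep -qi "running"; do
  echo "Waiting for CVM to start..."
  sleep 30
done

# Assumes vLLM prints "Uvicorn running on" to the serial log when ready.
until phala cvms serial-logs <app_id> --tail 200 | grep -q "Uvicorn running on"; do
  echo "Waiting for vLLM to finish loading the model..."
  sleep 30
done
```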

## GPU Wrapper Script

For repeated GPU deployments, use a wrapper script:

```bash
#!/bin/bash
# phala-gpu.sh
source "$(dirname "$0")/.env"
export PHALA_CLOUD_API_KEY=$PHALA_CLOUD_API_GPU
phala "$@"
```

This allows maintaining separate API keys for CPU and GPU workspaces.
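
Usage sketch, assuming `.env` sits next to the script and defines `PHALA_CLOUD_API_GPU`:

```bash
chmod +x phala-gpu.sh

# Any subcommand works as usual, but runs against the GPU workspace key.
./phala-gpu.sh cvms list
./phala-gpu.sh deploy -n my-app -c docker-compose.yaml \
  --instance-type h200.small --region US-EAST-1 --image dstack-nvidia-dev-0.5.4.1
```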
137 changes: 137 additions & 0 deletions .agent/WRITING_GUIDE.md
@@ -0,0 +1,137 @@
# Documentation Writing Guide

Guidelines for writing dstack documentation, README, and marketing content.

## Writing Style

- **Don't over-explain** why a framework is needed — assert the solution, hint at alternatives being insufficient
- **Avoid analogies as taglines** (e.g., "X for Y") — if it's a new category, don't frame it as a better version of something else
- **Problem → Solution flow** without explicit labels like "The problem:" or "The solution:"
- **Demonstrate features through actions**, not parenthetical annotations
- Bad: "Generates quotes (enabling *workload identity*)"
- Good: "Generates TDX attestation quotes so users can verify exactly what's running"

## Procedural Documentation (Guides & Tutorials)

### Test Before You Document
- **Run every command** before documenting it — reading code is not enough
- Commands may prompt for confirmation, require undocumented env vars, or fail silently
- Create a test environment and execute the full flow end-to-end

### Show What Success Looks Like
- **Add sample outputs** after commands so users can verify they're on track
- For deployment commands, show the key values users need to note (addresses, IDs)
- For validation commands, show both success and failure outputs

### Environment Variables
- **List all required env vars explicitly** — don't assume users will discover them
- If multiple tools use similar-but-different var names, clarify which is which
- Show the export pattern once, then reference it in subsequent commands

### Avoid Expert Blind Spots
- If you say "add the hash", explain how to compute the hash
- If you reference a file, explain where to find it
- If a value comes from a previous step, remind users which step

### Cross-Reference Related Docs
- Link to prerequisite guides (don't repeat content)
- Link to detailed guides for optional deep-dives
- Use anchor links for specific sections when possible

## Security Documentation

### Trust Model Framing

**Distinguish trust from verification:**
- "Trust" = cannot be verified, must assume correct (e.g., hardware)
- "Verify" = can be cryptographically proven (e.g., measured software)

**Correct framing:**
- Bad: "You must trust the OS" (when it's verifiable)
- Good: "The OS is measured during boot and recorded in the attestation quote. You verify it by..."

### Limitations: Be Honest, Not Alarmist

State limitations plainly without false mitigations:
- Bad: "X is a single point of failure. Mitigate by running your own X."
- Good: "X is protected by [mechanism]. Like all [category] systems, [inherent limitation]. We are developing [actual solution] to address this."

Don't suggest mitigations that don't actually help. If something is an inherent limitation of the technology, say so.

## Documentation Quality Checklist

From doc-requirements.md:

1. **No bullet point walls** — Max 3-5 bullets before breaking with prose
2. **No redundancy** — Don't present same info from opposite perspectives
3. **Conversational language** — Write like explaining to a peer
4. **Short paragraphs** — Max 4 sentences per paragraph
5. **Lead with key takeaway** — First sentence tells reader why this matters
6. **Active voice** — "TEE encrypts memory" not "Memory is encrypted by TEE"
7. **Minimal em-dashes** — Max 1-2 per page, replace with "because", "so", or separate sentences

### Redundancy Patterns to Avoid

These often say the same thing:
- "What we protect against" + "What you don't need to trust"
- "Security guarantees" + "What attestation proves"

Combine into single sections. One detailed explanation, brief references elsewhere.

## README Structure

### Order Matters
- **Quick Start before Prerequisites** — Lead with what it does, not setup
- **How It Works after Quick Start** — Users want to run it first, understand later
- Cleanup at the end, Further Reading last

### Don't Duplicate
- Link to conceptual docs instead of repeating content
- If an overview README duplicates an example README, cut the overview
- One detailed explanation, brief references elsewhere

### Remove Unrealistic Sections
- If most users can't actually do something (e.g., run locally without special hardware), don't include it
- Don't document workflows that require resources users don't have

### Match the Workflow to the User
- Use tools your audience already knows (e.g., Jupyter for ML practitioners)
- Prefer official/existing images when they exist — don't reinvent
- Make the correct path the default, mention alternatives briefly

## Code Examples

### Question Every Snippet
- Does this code actually demonstrate something meaningful?
- Would a reader understand what it does without the prose?
- `do_thing(b"magic-string")` means nothing — show real use or remove it

### Diagrams
- Mermaid over ASCII art — GitHub renders it nicely
- Keep diagrams simple — 3-5 nodes max
- Label edges with actions, not just arrows

## Conciseness

### Less is More
- 30 lines beats 150 if it says the same thing
- Cut sections that don't help users accomplish their goal
- Tables for reference, prose for explanation — don't over-table

### Performance and Benchmarks
- One memorable number + link to full report
- Don't overwhelm with data the reader didn't ask for

### Reader-First Writing
- Ask "what does the reader want to know?" not "what do I want to say?"
- If a section answers a question nobody asked, cut it

## Maintenance

### Consistency Checks
- After terminology changes, grep for related terms across all files
- Use correct industry/vendor terminology (e.g., "Confidential Computing" not "Encrypted Computing")

### Clean Up Old Files
- When approach changes, delete orphaned files (old scripts, Dockerfiles)
- Don't leave artifacts from previous implementations
1 change: 1 addition & 0 deletions .gitignore
@@ -9,3 +9,4 @@ node_modules/
/tmp
.claude/settings.local.json
__pycache__
.planning/
6 changes: 6 additions & 0 deletions CLAUDE.md
@@ -224,3 +224,9 @@ RPC definitions use `prpc` framework with Protocol Buffers:
- Design decisions: `docs/design-and-hardening-decisions.md`

When need more detailed info, try to use deepwiki mcp.

## Agent Resources

The `.agent/` directory contains AI assistant resources:
- `WRITING_GUIDE.md` — Documentation and README writing guidelines (messaging, style, audiences)
- `GPU_TEE_DEPLOYMENT.md` — GPU deployment to Phala Cloud (instance types, docker-compose config, debugging)
3 changes: 1 addition & 2 deletions Cargo.lock

Some generated files are not rendered by default.