235 changes: 235 additions & 0 deletions .agent/GPU_TEE_DEPLOYMENT.md
@@ -0,0 +1,235 @@
# GPU TEE Deployment Guide

Learnings from deploying GPU workloads to Phala Cloud TEE infrastructure.

## Instance Types

Query available instance types:
```bash
curl -s "https://cloud-api.phala.network/api/v1/instance-types" | jq
```
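
If you only need the GPU types, the output can be filtered with `jq`. A rough sketch; the top-level array shape and the `name` field are assumptions about the response format, not documented:

```bash
# Sketch: keep only H200 instance types.
# Assumes the response is a JSON array of objects with a "name" field.
curl -s "https://cloud-api.phala.network/api/v1/instance-types" \
  | jq '[.[] | select(.name | startswith("h200"))]'
```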

### CPU-only (Intel TDX)
- `tdx.small` through `tdx.8xlarge`

### GPU (H200 + TDX)
- `h200.small` — Single H200 GPU, suitable for inference
- `h200.16xlarge` — Multi-GPU for larger workloads
- `h200.8xlarge` — High-memory configuration

## Deployment Commands

### GPU Deployment
```bash
phala deploy -n my-app -c docker-compose.yaml \
  --instance-type h200.small \
  --region US-EAST-1 \
  --image dstack-nvidia-dev-0.5.4.1
```

Key flags:
- `--instance-type h200.small` — Required for GPU access
- `--image dstack-nvidia-dev-0.5.4.1` — NVIDIA development image with GPU drivers
- `--region US-EAST-1` — Region with GPU nodes (gpu-use2)

### Debugging
```bash
# Check CVM status
phala cvms list

# View serial logs (boot + container output)
phala cvms serial-logs <app_id> --tail 100

# Delete CVM
phala cvms delete <name-or-id> --force
```

## Docker Compose GPU Configuration

GPU devices must be explicitly reserved in docker-compose.yaml:

```yaml
services:
  my-gpu-app:
    image: my-image
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

Without the `deploy.resources.reservations.devices` section, the container will fail with:
```
libcuda.so.1: cannot open shared object file: No such file or directory
```

## vLLM Example

Working docker-compose.yaml for vLLM inference:

```yaml
services:
  vllm:
    image: vllm/vllm-openai:latest
    volumes:
      - /var/run/dstack.sock:/var/run/dstack.sock
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - HF_TOKEN=${HF_TOKEN:-}
    ports:
      - "8000:8000"
    command: >
      --model Qwen/Qwen2.5-1.5B-Instruct
      --host 0.0.0.0
      --port 8000
      --max-model-len 4096
      --gpu-memory-utilization 0.8
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

## Endpoint URLs

After deployment, the app is accessible at:
```
https://<app_id>-<port>.dstack-pha-<region>.phala.network
```

Example for vLLM on port 8000:
```bash
# List models
curl https://<app_id>-8000.dstack-pha-use2.phala.network/v1/models

# Chat completion
curl -X POST https://<app_id>-8000.dstack-pha-use2.phala.network/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "Qwen/Qwen2.5-1.5B-Instruct", "messages": [{"role": "user", "content": "Hello"}]}'
```
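
For a scripted smoke test, the assistant's reply can be pulled out with `jq`. A minimal sketch, assuming the standard OpenAI-compatible response shape that vLLM returns (`<app_id>` is a placeholder):

```bash
# Sketch: send a prompt and print only the reply text.
# Assumes an OpenAI-style response: choices[0].message.content.
curl -s -X POST https://<app_id>-8000.dstack-pha-use2.phala.network/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-1.5B-Instruct", "messages": [{"role": "user", "content": "Hello"}]}' \
  | jq -r '.choices[0].message.content'
```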

## vllm-proxy (Response Signing)

vllm-proxy provides response signing and attestation for vLLM inference. It sits between clients and vLLM, signing responses with TEE-derived keys.

### Configuration

**IMPORTANT**: The authentication environment variable is `TOKEN`, not `AUTH_TOKEN`.

```yaml
services:
  vllm:
    image: vllm/vllm-openai:latest
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    command: >
      --model Qwen/Qwen2.5-1.5B-Instruct
      --host 0.0.0.0
      --port 8000
      --max-model-len 4096
      --gpu-memory-utilization 0.8
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  proxy:
    image: phalanetwork/vllm-proxy:v0.2.18
    volumes:
      - /var/run/dstack.sock:/var/run/dstack.sock  # Required for TEE key derivation
    environment:
      - VLLM_BASE_URL=http://vllm:8000
      - MODEL_NAME=Qwen/Qwen2.5-1.5B-Instruct
      - TOKEN=your-secret-token  # NOT AUTH_TOKEN
    ports:
      - "8000:8000"
    depends_on:
      - vllm
```

### API Endpoints

```bash
# List models (no auth required)
curl https://<endpoint>/v1/models

# Chat completion (requires auth)
curl -X POST https://<endpoint>/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your-secret-token" \
-d '{"model": "Qwen/Qwen2.5-1.5B-Instruct", "messages": [{"role": "user", "content": "Hello"}]}'

# Get response signature
curl https://<endpoint>/v1/signature/<chat_id> \
-H "Authorization: Bearer your-secret-token"

# Attestation report
curl https://<endpoint>/v1/attestation/report \
-H "Authorization: Bearer your-secret-token"
```
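
The signature lookup needs the chat id returned by the completion call. A sketch of chaining the two requests, assuming the completion response carries an OpenAI-style `id` field (the shape of the signature response itself is not shown here):

```bash
# Sketch: request a completion, capture its id, then fetch the signature for it.
# Assumes the completion response includes an OpenAI-style "id" field.
CHAT_ID=$(curl -s -X POST https://<endpoint>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-secret-token" \
  -d '{"model": "Qwen/Qwen2.5-1.5B-Instruct", "messages": [{"role": "user", "content": "Hello"}]}' \
  | jq -r '.id')

curl -s https://<endpoint>/v1/signature/$CHAT_ID \
  -H "Authorization: Bearer your-secret-token" | jq
```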

### Tested Configuration

- Image: `phalanetwork/vllm-proxy:v0.2.18`
- Instance: `h200.small`
- Region: `US-EAST-1`
- Model: `Qwen/Qwen2.5-1.5B-Instruct`

### vllm-proxy Issues

**"Invalid token" error**:
- Check that you're using the `TOKEN` environment variable, not `AUTH_TOKEN`
- Verify the token value matches your request header

**"All connection attempts failed" from proxy**:
- vLLM is still loading the model (takes 1-2 minutes after container starts)
- Wait for vLLM to show "Uvicorn running on" in serial logs

**NVML error on attestation**:
- GPU confidential computing attestation may not be fully available
- This doesn't affect inference or response signing

## Common Issues

### "No available resources match your requirements"
- GPU nodes are limited. Wait for other CVMs to finish or try a different region.
- Ensure you're using the correct instance type (`h200.small`).

### Container crashes with GPU errors
- Add the `deploy.resources.reservations.devices` section to docker-compose.yaml.
- Verify you're using an NVIDIA development image (`dstack-nvidia-dev-*`).

### Image pull takes too long
- Large images (5GB+ for vLLM) take 3-5 minutes to download and extract.
- Check serial logs for progress.

## Testing Workflow

1. Deploy: `phala deploy -n test -c docker-compose.yaml --instance-type h200.small --region US-EAST-1 --image dstack-nvidia-dev-0.5.4.1`
2. Wait for status: `phala cvms list` (wait for "running"; a polling sketch follows this list)
3. Check logs: `phala cvms serial-logs <app_id> --tail 100`
4. Test API: `curl https://<app_id>-<port>.dstack-pha-use2.phala.network/...`
5. Cleanup: `phala cvms delete <name> --force`
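
Steps 2 and 3 can be scripted with a simple polling loop. A rough sketch, assuming `phala cvms list` prints one line per CVM with its name and status, and that vLLM logs "Uvicorn running on" once the API server is up (both are assumptions about the output, not documented guarantees):

```bash
# Sketch: wait for the CVM named "test" to reach the running state.
# Assumes `phala cvms list` output can be grepped for the name and a "running" status.
until phala cvms list | grep "test" | grep -qi "running"; do
  echo "Waiting for CVM to start..."
  sleep 30
done

# Assumes vLLM prints "Uvicorn running on" to the serial log when ready.
until phala cvms serial-logs <app_id> --tail 200 | grep -q "Uvicorn running on"; do
  echo "Waiting for vLLM to finish loading the model..."
  sleep 30
done
```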

## GPU Wrapper Script

For repeated GPU deployments, use a wrapper script:

```bash
#!/bin/bash
# phala-gpu.sh
source "$(dirname "$0")/.env"
export PHALA_CLOUD_API_KEY=$PHALA_CLOUD_API_GPU
phala "$@"
```

This allows maintaining separate API keys for CPU and GPU workspaces.
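
Usage sketch, assuming `.env` sits next to the script and defines `PHALA_CLOUD_API_GPU`:

```bash
chmod +x phala-gpu.sh

# Any subcommand works as usual, but runs against the GPU workspace key.
./phala-gpu.sh cvms list
./phala-gpu.sh deploy -n my-app -c docker-compose.yaml \
  --instance-type h200.small --region US-EAST-1 --image dstack-nvidia-dev-0.5.4.1
```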
137 changes: 137 additions & 0 deletions .agent/WRITING_GUIDE.md
@@ -0,0 +1,137 @@
# Documentation Writing Guide

Guidelines for writing dstack documentation, README, and marketing content.

## Writing Style

- **Don't over-explain** why a framework is needed — assert the solution, hint at alternatives being insufficient
- **Avoid analogies as taglines** (e.g., "X for Y") — if it's a new category, don't frame it as a better version of something else
- **Problem → Solution flow** without explicit labels like "The problem:" or "The solution:"
- **Demonstrate features through actions**, not parenthetical annotations
- Bad: "Generates quotes (enabling *workload identity*)"
- Good: "Generates TDX attestation quotes so users can verify exactly what's running"

## Procedural Documentation (Guides & Tutorials)

### Test Before You Document
- **Run every command** before documenting it — reading code is not enough
- Commands may prompt for confirmation, require undocumented env vars, or fail silently
- Create a test environment and execute the full flow end-to-end

### Show What Success Looks Like
- **Add sample outputs** after commands so users can verify they're on track
- For deployment commands, show the key values users need to note (addresses, IDs)
- For validation commands, show both success and failure outputs

### Environment Variables
- **List all required env vars explicitly** — don't assume users will discover them
- If multiple tools use similar-but-different var names, clarify which is which
- Show the export pattern once, then reference it in subsequent commands

### Avoid Expert Blind Spots
- If you say "add the hash", explain how to compute the hash
- If you reference a file, explain where to find it
- If a value comes from a previous step, remind users which step

### Cross-Reference Related Docs
- Link to prerequisite guides (don't repeat content)
- Link to detailed guides for optional deep-dives
- Use anchor links for specific sections when possible

## Security Documentation

### Trust Model Framing

**Distinguish trust from verification:**
- "Trust" = cannot be verified, must assume correct (e.g., hardware)
- "Verify" = can be cryptographically proven (e.g., measured software)

**Correct framing:**
- Bad: "You must trust the OS" (when it's verifiable)
- Good: "The OS is measured during boot and recorded in the attestation quote. You verify it by..."

### Limitations: Be Honest, Not Alarmist

State limitations plainly without false mitigations:
- Bad: "X is a single point of failure. Mitigate by running your own X."
- Good: "X is protected by [mechanism]. Like all [category] systems, [inherent limitation]. We are developing [actual solution] to address this."

Don't suggest mitigations that don't actually help. If something is an inherent limitation of the technology, say so.

## Documentation Quality Checklist

From doc-requirements.md:

1. **No bullet point walls** — Max 3-5 bullets before breaking with prose
2. **No redundancy** — Don't present same info from opposite perspectives
3. **Conversational language** — Write like explaining to a peer
4. **Short paragraphs** — Max 4 sentences per paragraph
5. **Lead with key takeaway** — First sentence tells reader why this matters
6. **Active voice** — "TEE encrypts memory" not "Memory is encrypted by TEE"
7. **Minimal em-dashes** — Max 1-2 per page, replace with "because", "so", or separate sentences

### Redundancy Patterns to Avoid

These often say the same thing:
- "What we protect against" + "What you don't need to trust"
- "Security guarantees" + "What attestation proves"

Combine into single sections. One detailed explanation, brief references elsewhere.

## README Structure

### Order Matters
- **Quick Start before Prerequisites** — Lead with what it does, not setup
- **How It Works after Quick Start** — Users want to run it first, understand later
- Cleanup at the end, Further Reading last

### Don't Duplicate
- Link to conceptual docs instead of repeating content
- If an overview README duplicates an example README, cut the overview
- One detailed explanation, brief references elsewhere

### Remove Unrealistic Sections
- If most users can't actually do something (e.g., run locally without special hardware), don't include it
- Don't document workflows that require resources users don't have

### Match the Workflow to the User
- Use tools your audience already knows (e.g., Jupyter for ML practitioners)
- Prefer official/existing images when they exist — don't reinvent
- Make the correct path the default, mention alternatives briefly

## Code Examples

### Question Every Snippet
- Does this code actually demonstrate something meaningful?
- Would a reader understand what it does without the prose?
- `do_thing(b"magic-string")` means nothing — show real use or remove it

### Diagrams
- Mermaid over ASCII art — GitHub renders it nicely
- Keep diagrams simple — 3-5 nodes max
- Label edges with actions, not just arrows

## Conciseness

### Less is More
- 30 lines beats 150 if it says the same thing
- Cut sections that don't help users accomplish their goal
- Tables for reference, prose for explanation — don't over-table

### Performance and Benchmarks
- One memorable number + link to full report
- Don't overwhelm with data the reader didn't ask for

### Reader-First Writing
- Ask "what does the reader want to know?" not "what do I want to say?"
- If a section answers a question nobody asked, cut it

## Maintenance

### Consistency Checks
- After terminology changes, grep for related terms across all files
- Use correct industry/vendor terminology (e.g., "Confidential Computing" not "Encrypted Computing")

### Clean Up Old Files
- When approach changes, delete orphaned files (old scripts, Dockerfiles)
- Don't leave artifacts from previous implementations
1 change: 1 addition & 0 deletions .gitignore
@@ -9,3 +9,4 @@ node_modules/
/tmp
.claude/settings.local.json
__pycache__
.planning/
6 changes: 6 additions & 0 deletions CLAUDE.md
@@ -224,3 +224,9 @@ RPC definitions use `prpc` framework with Protocol Buffers:
- Design decisions: `docs/design-and-hardening-decisions.md`

When need more detailed info, try to use deepwiki mcp.

## Agent Resources

The `.agent/` directory contains AI assistant resources:
- `WRITING_GUIDE.md` — Documentation and README writing guidelines (messaging, style, audiences)
- `GPU_TEE_DEPLOYMENT.md` — GPU deployment to Phala Cloud (instance types, docker-compose config, debugging)
3 changes: 1 addition & 2 deletions Cargo.lock

Some generated files are not rendered by default.