255 changes: 255 additions & 0 deletions skills/model-hierarchy/SKILL.md
@@ -0,0 +1,255 @@
---
name: model-hierarchy
description: >
  Cost-optimize AI agent operations by routing tasks to appropriate models based on complexity.
  Use this skill when: (1) deciding which model to use for a task, (2) spawning sub-agents,
  (3) considering cost efficiency, (4) the current model feels like overkill for the task.
  Triggers: "model routing", "cost optimization", "which model", "too expensive", "spawn agent".

---

# Model Hierarchy

Route tasks to the cheapest model that can handle them. Most agent work is routine.

## Core Principle

**Roughly 80% of agent tasks are janitorial:** file reads, status checks, formatting, and simple Q&A. Treat the figure as a rule of thumb rather than a measurement. These tasks don't need expensive models. Reserve premium models for problems that actually require deep reasoning.

## Model Tiers

### Tier 1: Cheap (roughly $0.10-0.50/M input tokens)

| Model | Input ($/M tokens) | Output ($/M tokens) | Best For |
|-------|--------------------|---------------------|----------|
| DeepSeek V3 | $0.14 | $0.28 | General routine work |
| GPT-4o-mini | $0.15 | $0.60 | Quick responses |
| Claude Haiku | $0.25 | $1.25 | Fast tool use |
| Gemini Flash | $0.075 | $0.30 | High volume |

### Tier 2: Mid ($1-5/M input tokens)

| Model | Input ($/M tokens) | Output ($/M tokens) | Best For |
|-------|--------------------|---------------------|----------|
| Claude Sonnet | $3.00 | $15.00 | Balanced performance |
| GPT-4o | $2.50 | $10.00 | Multimodal tasks |
| Gemini Pro | $1.25 | $5.00 | Long context |

### Tier 3: Premium ($10-75/M input tokens)

| Model | Input ($/M tokens) | Output ($/M tokens) | Best For |
|-------|--------------------|---------------------|----------|
| Claude Opus | $15.00 | $75.00 | Complex reasoning |
| GPT-4.5 | $75.00 | $150.00 | Frontier tasks |
| o1 | $15.00 | $60.00 | Multi-step reasoning |
| o3-mini | $1.10 | $4.40 | Reasoning on a budget (priced like Tier 2) |

*Prices as of Feb 2026. Check provider docs for current rates.*
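
The tier tables can also be kept as data so that per-task costs are computed rather than guessed. The following is a minimal sketch: the `ModelPrice` class and the `estimate_cost` and `cheapest_in_tier` helpers are hypothetical names introduced here, and the prices are simply copied from the tables above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    tier: int
    input_per_m: float   # USD per million input tokens
    output_per_m: float  # USD per million output tokens


# Prices copied from the tier tables; check provider docs before relying on them.
PRICES = {
    "DeepSeek V3": ModelPrice(1, 0.14, 0.28),
    "GPT-4o-mini": ModelPrice(1, 0.15, 0.60),
    "Claude Haiku": ModelPrice(1, 0.25, 1.25),
    "Gemini Flash": ModelPrice(1, 0.075, 0.30),
    "Claude Sonnet": ModelPrice(2, 3.00, 15.00),
    "GPT-4o": ModelPrice(2, 2.50, 10.00),
    "Gemini Pro": ModelPrice(2, 1.25, 5.00),
    "Claude Opus": ModelPrice(3, 15.00, 75.00),
    "GPT-4.5": ModelPrice(3, 75.00, 150.00),
    "o1": ModelPrice(3, 15.00, 60.00),
    "o3-mini": ModelPrice(3, 1.10, 4.40),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD for one call."""
    price = PRICES[model]
    total = input_tokens * price.input_per_m + output_tokens * price.output_per_m
    return total / 1_000_000


def cheapest_in_tier(tier: int) -> str:
    """Return the model with the lowest input price in the given tier."""
    candidates = [name for name, price in PRICES.items() if price.tier == tier]
    return min(candidates, key=lambda name: PRICES[name].input_per_m)
```

Separating input and output prices matters because output tokens cost several times more than input tokens for every model listed.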

## Task Classification

Before executing any task, classify it:

### ROUTINE → Use Tier 1

Characteristics:
- Single-step operations
- Clear, unambiguous instructions
- No judgment required
- Deterministic output expected

Examples:
- File read/write operations
- Status checks and health monitoring
- Simple lookups (time, weather, definitions)
- Formatting and restructuring text
- List operations (filter, sort, transform)
- API calls with known parameters
- Heartbeat and cron tasks
- URL fetching and basic parsing

### MODERATE → Use Tier 2

Characteristics:
- Multi-step but well-defined
- Some synthesis required
- Standard patterns apply
- Quality matters but isn't critical

Examples:
- Code generation (standard patterns)
- Summarization and synthesis
- Draft writing (emails, docs, messages)
- Data analysis and transformation
- Multi-file operations
- Tool orchestration
- Code review (non-security)
- Search and research tasks

### COMPLEX → Use Tier 3

Characteristics:
- Novel problem solving required
- Multiple valid approaches
- Nuanced judgment calls
- High stakes or irreversible
- Previous attempts failed

Examples:
- Multi-step debugging
- Architecture and design decisions
- Security-sensitive code review
- Tasks where a cheaper model already failed
- Ambiguous requirements needing interpretation
- Long-context reasoning (>50K tokens)
- Creative work requiring originality
- Adversarial or edge-case handling
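
The characteristics listed for each class can be turned into an explicit check. This is a sketch under stated assumptions: the `TaskTraits` fields are hypothetical and would have to be filled in by whatever inspects the task, and the only threshold taken from the text is the 50K-token long-context boundary.

```python
from dataclasses import dataclass
from enum import Enum


class Complexity(Enum):
    ROUTINE = 1
    MODERATE = 2
    COMPLEX = 3


@dataclass
class TaskTraits:
    # Hypothetical fields; populate them from your own task metadata.
    steps: int = 1
    needs_synthesis: bool = False
    needs_judgment: bool = False
    high_stakes: bool = False
    previous_attempt_failed: bool = False
    context_tokens: int = 0


def classify(traits: TaskTraits) -> Complexity:
    """Map task traits to a complexity class, checking the most severe first."""
    if (
        traits.needs_judgment
        or traits.high_stakes
        or traits.previous_attempt_failed
        or traits.context_tokens > 50_000
    ):
        return Complexity.COMPLEX
    if traits.steps > 1 or traits.needs_synthesis:
        return Complexity.MODERATE
    return Complexity.ROUTINE
```

Checking the COMPLEX conditions first means a single-step but high-stakes task is never downgraded to ROUTINE.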

## Decision Algorithm

```
function selectModel(task):
    # Rule 1: Escalation override
    if task.previousAttemptFailed:
        return nextTierUp(task.previousModel)

    # Rule 2: Explicit complexity signals
    if task.hasSignal("debug", "architect", "design", "security"):
        return TIER_3

    if task.hasSignal("write", "code", "summarize", "analyze"):
        return TIER_2

    # Rule 3: Default classification
    complexity = classifyTask(task)

    if complexity == ROUTINE:
        return TIER_1
    elif complexity == MODERATE:
        return TIER_2
    else:
        return TIER_3
```
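
The pseudocode above translates fairly directly into Python. The version below is one possible rendering, not a reference implementation: the `Task` dataclass is hypothetical, signals are matched as plain substrings of the description, and `classify_task` is a stub standing in for whatever classifier Rule 3 uses.

```python
from dataclasses import dataclass
from typing import Optional

TIER_1, TIER_2, TIER_3 = 1, 2, 3

TIER_3_SIGNALS = ("debug", "architect", "design", "security")
TIER_2_SIGNALS = ("write", "code", "summarize", "analyze")


@dataclass
class Task:
    description: str
    # Tier used by a previous failed attempt, or None on the first attempt.
    previous_tier: Optional[int] = None


def classify_task(task: Task) -> int:
    """Stub for Rule 3; replace with a real classifier."""
    return TIER_1


def select_model(task: Task) -> int:
    # Rule 1: a failed attempt escalates one tier, capped at Tier 3.
    if task.previous_tier is not None:
        return min(task.previous_tier + 1, TIER_3)

    # Rule 2: explicit complexity signals in the description.
    text = task.description.lower()
    if any(signal in text for signal in TIER_3_SIGNALS):
        return TIER_3
    if any(signal in text for signal in TIER_2_SIGNALS):
        return TIER_2

    # Rule 3: fall back to the default classification.
    return classify_task(task)
```

Substring matching is crude: "decode this header" contains "code" and would be routed to Tier 2. Matching on whole words is a reasonable refinement if false positives become costly.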

## Behavioral Rules

### For Main Session

1. **Default to Tier 2** for interactive work
2. **Suggest downgrade** when doing routine work: "This is routine - I can handle this on a cheaper model or spawn a sub-agent."
3. **Request upgrade** when stuck: "This needs more reasoning power. Switching to [premium model]."

### For Sub-Agents

1. **Default to Tier 1** unless the task is clearly moderate or harder
2. **Batch similar tasks** to amortize overhead
3. **Report failures** back to parent for escalation
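
Reporting failures for escalation could look like the loop below, run by the parent. It is a sketch only: the `run` callable is a hypothetical stand-in for however your system invokes a sub-agent, and it is assumed to return `None` when the attempt fails.

```python
from typing import Callable, Optional, Tuple

TIERS = ["tier1", "tier2", "tier3"]


def run_with_escalation(
    task: str,
    run: Callable[[str, str], Optional[str]],
    start_tier: int = 0,
) -> Tuple[str, str]:
    """Try the task on successively higher tiers until one succeeds.

    `run(tier, task)` returns the result, or None to signal failure.
    Returns the tier that succeeded together with its result.
    """
    for tier in TIERS[start_tier:]:
        result = run(tier, task)
        if result is not None:
            return tier, result
    raise RuntimeError(f"all tiers failed for task: {task!r}")
```

Starting at index 0 matches the Tier 1 default for sub-agents; a caller that already knows the task is moderate can pass `start_tier=1` to skip the cheap attempt.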

### For Automated Tasks

1. **Heartbeats/monitoring** → Always Tier 1
2. **Scheduled reports** → Tier 1 or 2 based on complexity
3. **Alert responses** → Start Tier 2, escalate if needed

## Communication Patterns

When suggesting model changes, use clear language:

**Downgrade suggestion:**
> "This looks like routine file work. Want me to spawn a sub-agent on DeepSeek for this? Same result, fraction of the cost."

**Upgrade request:**
> "I'm hitting the limits of what I can figure out here. This needs Opus-level reasoning. Switching up."

**Explaining hierarchy:**
> "I'm running the heavy analysis on Sonnet while sub-agents fetch the data on DeepSeek. Keeps costs down without sacrificing quality where it matters."

## Cost Impact

Assuming 100K tokens/day average usage (about 3M tokens/month), with every token billed at the output rate of one representative model per tier (DeepSeek V3, Claude Sonnet, Claude Opus). This overstates real spend, since input tokens are cheaper:

| Strategy | Monthly Cost | Notes |
|----------|--------------|-------|
| Pure Opus | ~$225 | Maximum capability, maximum spend |
| Pure Sonnet | ~$45 | Good default for most work |
| Pure DeepSeek | ~$1 | Cheap but limited on hard problems |
| **Hierarchy (80/15/5)** | **~$19** | Best of all worlds |

The 80/15/5 split:
- 80% routine tasks on Tier 1 (~$1)
- 15% moderate tasks on Tier 2 (~$7)
- 5% complex tasks on Tier 3 (~$11)

**Result: roughly 12x cost reduction vs pure premium, with equivalent quality on complex tasks, which still run on Tier 3.**
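
The monthly figures can be checked with a short script. It assumes a 30-day month and, to keep the estimate simple, bills every token at the output rate of one representative model per tier (DeepSeek V3, Claude Sonnet, Claude Opus). Those choices are assumptions made here, not something the pricing tables prescribe.

```python
TOKENS_PER_MONTH = 100_000 * 30  # 100K tokens/day over a 30-day month

# Output prices in USD per million tokens, taken from the tier tables.
OUTPUT_PRICE = {"tier1": 0.28, "tier2": 15.00, "tier3": 75.00}

# Share of total tokens handled by each tier.
SPLIT = {"tier1": 0.80, "tier2": 0.15, "tier3": 0.05}


def monthly_cost(split: dict) -> float:
    """Return the monthly cost in USD for a given share of tokens per tier."""
    return sum(
        TOKENS_PER_MONTH * share * OUTPUT_PRICE[tier] / 1_000_000
        for tier, share in split.items()
    )


pure_opus = monthly_cost({"tier3": 1.0})
pure_sonnet = monthly_cost({"tier2": 1.0})
hierarchy = monthly_cost(SPLIT)

print(f"Pure Opus:   ${pure_opus:.2f}")
print(f"Pure Sonnet: ${pure_sonnet:.2f}")
print(f"Hierarchy:   ${hierarchy:.2f}")
print(f"Reduction:   {pure_opus / hierarchy:.1f}x")
```

Under these assumptions the hierarchy comes to about $18.67 per month against $225 for pure Opus. Most of the remaining spend is the 5% of traffic that still runs on Tier 3.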

## Integration Examples

### OpenClaw

```yaml
# config.yml - set default model
model: anthropic/claude-sonnet-4

# In session, switch models
/model opus # upgrade for complex task
/model deepseek # downgrade for routine

# Spawn sub-agent on cheap model
sessions_spawn:
  task: "Fetch and parse these 50 URLs"
  model: deepseek
```

### Claude Code

```
# In CLAUDE.md or project instructions
When spawning background agents, use claude-3-haiku for:
- File operations
- Simple searches
- Status checks

Reserve claude-sonnet-4 for:
- Code generation
- Analysis tasks
```

### General Agent Systems

```python
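def get_model_for_task(task_description: str) -> str:
    """Pick a model by keyword-matching the task description."""
    routine_signals = ['read', 'fetch', 'check', 'list', 'format', 'status']
    complex_signals = ['debug', 'architect', 'design', 'security', 'why']

    desc_lower = task_description.lower()

    # Substring match: 'read' also matches 'thread', so expect some false positives.
    # Complex signals are checked first, so they win when both kinds appear.
    if any(signal in desc_lower for signal in complex_signals):
        return "claude-opus-4"
    elif any(signal in desc_lower for signal in routine_signals):
        return "deepseek-v3"
    else:
        # No signal matched: fall back to the mid-tier default.
        return "claude-sonnet-4"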
```

## Anti-Patterns

**DON'T:**
- Run heartbeats on Opus
- Use premium models for file I/O
- Keep an expensive model when the task is clearly routine
- Spawn sub-agents on premium models by default

**DO:**
- Start mid-tier, adjust based on task
- Spawn helpers on cheapest viable model
- Escalate explicitly when stuck
- Track cost per task type to optimize further

## Extending This Skill

To customize for your use case:

1. **Adjust tier definitions** based on your provider/budget
2. **Add domain-specific signals** to classification rules
3. **Track actual complexity** vs predicted to improve heuristics
4. **Set budget alerts** to catch runaway premium usage
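
Steps 3 and 4 can share one small bookkeeping object. The sketch below is illustrative only: `CostTracker`, its method names, and the 50% premium-share threshold are all hypothetical choices, and a real deployment would persist the numbers somewhere durable.

```python
from collections import defaultdict
from typing import List


class CostTracker:
    """Track spend per tier and how often the predicted complexity was wrong."""

    def __init__(self, monthly_budget_usd: float, premium_share_limit: float = 0.5):
        self.monthly_budget_usd = monthly_budget_usd
        self.premium_share_limit = premium_share_limit
        self.spend_by_tier = defaultdict(float)
        self.total_tasks = 0
        self.misclassified = 0

    def record(self, tier: int, cost_usd: float, predicted: str, actual: str) -> None:
        """Log one finished task with its predicted and observed complexity."""
        self.spend_by_tier[tier] += cost_usd
        self.total_tasks += 1
        if predicted != actual:
            self.misclassified += 1

    def misclassification_rate(self) -> float:
        """Return the fraction of tasks whose prediction was wrong."""
        if self.total_tasks == 0:
            return 0.0
        return self.misclassified / self.total_tasks

    def alerts(self) -> List[str]:
        """Return alert messages for budget overruns or heavy premium usage."""
        total = sum(self.spend_by_tier.values())
        messages = []
        if total > self.monthly_budget_usd:
            messages.append("over budget")
        premium = self.spend_by_tier.get(3, 0.0)
        if total > 0 and premium / total > self.premium_share_limit:
            messages.append("premium share high")
        return messages
```

A rising misclassification rate suggests the classification signals need tuning. A high premium share with a low misclassification rate suggests the workload really is hard and the tier definitions or budget should change instead.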