diff --git a/README.md b/README.md
index 2f45fe7..c742fa6 100644
--- a/README.md
+++ b/README.md
@@ -3,56 +3,163 @@
[](https://creativecommons.org/licenses/by/4.0/)
[]()
-> **Architectural standards and best practices for building reliable AI Agents and LLM workflows. Defining the framework for AI Reliability Engineering (AIRE).**
+> **An open implementation guide for building reliable AI Agents at scale. Defining the practices for AI Reliability Engineering (AIRE).**
---
## Introduction
-As AI systems move from "experimental" prototypes to "mission-critical" production environments, reliability has emerged as the single biggest barrier to adoption.
+As AI systems move from "experimental" prototypes to "mission-critical" production environments, reliability has emerged as the single biggest barrier to adoption.
-This repository serves as the **Open Standard for AI Reliability Engineering (AIRE)**. It documents the architectural patterns, testing frameworks, and guardrails that engineering teams use to achieve 99.9% reliability in non-deterministic systems and what does it even mean to be reliable in a non-deterministic system?. Further akin to SRE principles, this repository also documents principles for AI Reliability Engineering (AIRE)
+This repository serves as the **Open Standard for AI Reliability Engineering (AIRE)**. It documents the architectural patterns, testing frameworks, and operational practices that engineering teams use to achieve production-grade reliability in non-deterministic systems.
-It is not a theoretical academic paper. It is a living collection of **"Success Patterns"** gathered from the top 1% of engineering teams currently running agents at scale.
+It is not a theoretical academic paper. It is a living collection of **"Success Patterns"** gathered from practitioners running agents at scale.
+
+---
+
+## AIRE Principles
+
+*Guiding tenets inspired by SRE:*
+
+These five principles define the philosophical foundation of AIRE. They inform the practices detailed in the five pillars and help teams make trade-off decisions when designing reliable AI systems.
+
+### 1. Embrace Non-Determinism
+
+Accept that identical inputs will produce variable outputs. Design systems that succeed despite variance, not systems that assume consistency.
+
+**Key Insight:** AI systems are probabilistic reasoners. Don't try to make them deterministic-build resilience around their non-determinism through structured outputs, guardrails, and fallback paths.
+
+### 2. Reliability is a Feature
+
+Reliability competes with velocity for engineering resources. Treat it as a first-class product requirement with explicit budgets, not an afterthought.
+
+**Key Insight:** Allocate dedicated engineering time (e.g., 20% of sprints) to reliability work: golden dataset updates, eval pipeline maintenance, incident reviews.
+
+### 3. Measure, Don't Assume
+
+If you cannot quantify the reliability of your AI system, you do not have a reliable AI system. Intuition is not evidence.
+
+**Key Insight:** Track concrete metrics (hallucination rate <0.1%, HITL rate <10%, uptime >99.9%). Block deployments if metrics degrade.
+
+### 4. Fail Gracefully, Fail Informatively
+
+Every failure should preserve context, enable recovery, and generate learnings. Silent failures are unacceptable.
+
+**Key Insight:** Save checkpoints, log Chain of Thought reasoning, return user-friendly errors, and ensure workflows can resume after crashes.
+
+### 5. Humans as Fallback, Not Crutch
+
+Design for autonomous operation. Human escalation is a safety net for edge cases, not a substitute for robust engineering.
+
+**Key Insight:** Reduce HITL rate over time through active learning. Start at 100% human review, target <10% through continuous improvement.
+
+📖 **[Read the detailed AIRE Principles guide →](docs/principles.md)**
+
+---
## Core Pillars of AIRE
-We define the stability of an Agentic System through four core pillars:
+We define the reliability of an Agentic System through five core pillars:
+
+### 1. Resilient Architecture
+
+*Building systems that gracefully handle failures, scale under load, and recover from errors.*
+
+Resilient architecture establishes the structural foundation for reliable AI systems. It encompasses:
+
+- **Elastic Auto-Scaling** - Horizontal and vertical scaling strategies for unpredictable AI workloads
+- **State Management** - Checkpoint-based recovery enabling workflows to resume from last checkpoint after failures (not restart from scratch)
+- **Circuit Breakers** - Fault tolerance patterns that prevent cascading failures by failing fast when services degrade
+- **Fallback Paths** - Multi-tier fallback strategies (GPT-4 → GPT-3.5 → Rules → Human)
+- **The Reliability Stack Pattern** - Separating probabilistic reasoning (LLM) from deterministic safety (guardrails)
+
+**Key Metrics:** Resumability Rate >99%, Circuit Breaker Activations <10/day, Fallback Usage Rate <15%, MTTR <5 minutes
+
+📖 **[Read the full Resilient Architecture guide →](docs/pillars/resilient-architecture.md)**
-### 1. The Reliability Stack (Architecture)
-*Separating the "Brain" from the "Governor".*
+---
+
+### 2. Cognitive Reliability
+
+*Ensuring AI agents produce accurate, consistent, and trustworthy outputs.*
+
+Cognitive reliability addresses the correctness problem - ensuring outputs are grounded, validated, and trustworthy:
-* **The Core Stack:** Probabilistic components (LLMs, Prompts) focused on reasoning.
-* **The Reliability Stack:** Deterministic components (Guardrails, Durable Queues, Verifiers) focused on safety.
-* **Principle:** Never rely on the LLM to police itself.
+- **Self-Reflection & Correction** - Chain-of-thought with reflection, multi-agent debate for high-stakes decisions
+- **Structured Outputs** - JSON schema validation, forced choice enums, regex-constrained generation
+- **Human-in-the-Loop (HITL) Protocols** - Confidence-based escalation with design patterns to reduce HITL over time through active learning
+- **Drift Detection** - Input drift (distribution changes), output drift (confidence shifts), model drift (version changes)
-### 2. Eval-Driven Development (EDD)
-*Moving from "Vibes" to Engineering.*
+**Key Metrics:** Hallucination Rate <0.1%, Groundedness >95%, HITL Rate <10%, Confidence Calibration within 10%
-* **Golden Datasets:** Regression suites of 100+ questions run before every deploy.
-* **Unit Testing Agents:** Synthetic data tests for specific skills (e.g., API calling syntax).
-* **Metrics:** Standardized scoring for Hallucination Rate (<0.1%) and Groundedness.
+📖 **[Read the full Cognitive Reliability guide →](docs/pillars/cognitive-reliability.md)**
-### 3. Durable Execution & State
-*Fault tolerance for long-running workflows.*
+---
+
+### 3. Quality & Lifecycle
-* **Resumability:** If an agent crashes on Step 4 of 10, it must resume at Step 4, not restart.
-* **Graceful Degradation:** Protocols for handing off to humans with full context when confidence drops.
+*Moving from "vibes-based" development to rigorous testing and continuous improvement.*
-### 4. Observability 2.0
-*Tracing the "Thought" Process.*
+Quality & Lifecycle practices define how to test, deploy, and continuously improve AI systems:
-* **Chain of Thought (CoT) Logging:** Tracing logic, not just I/O.
-* **Cost Observability:** Real-time token tracking per tenant/workflow.
+- **Evals-Driven Deployments** - CI/CD gates with golden datasets, staged rollouts (canary → gradual → full), automatic rollback triggers
+- **Golden Datasets** - Curated regression suites (60% core capabilities, 30% edge cases, 10% adversarial), versioned in Git, continuously updated
+- **Unit Testing Agents** - Tool calling tests, prompt adherence tests, synthetic data tests
+- **Online vs Offline Evals** - Pre-deployment regression testing (offline) + post-deployment drift detection (online)
+- **Feedback Loops** - Production failures → HITL corrections → golden dataset updates → model retraining
-### 5. Principles of AIRE
-*Guiding tenets for AI Reliability Engineering, inspired by SRE.*
+**Key Metrics:** Golden Dataset Accuracy >95%, Deployment Success Rate >90%, User Satisfaction >80%, Feedback Loop Latency <7 days
-* **Embrace Non-Determinism:** Accept that identical inputs will produce variable outputs. Design systems that succeed despite variance, not systems that assume consistency.
-* **Reliability is a Feature:** Reliability competes with velocity for engineering resources. Treat it as a first-class product requirement with explicit budgets, not an afterthought.
-* **Measure, Don't Assume:** If you cannot quantify the reliability of your AI system, you do not have a reliable AI system. Intuition is not evidence.
-* **Fail Gracefully, Fail Informatively:** Every failure should preserve context, enable recovery, and generate learnings. Silent failures are unacceptable.
-* **Humans as Fallback, Not Crutch:** Design for autonomous operation. Human escalation is a safety net for edge cases, not a substitute for robust engineering.
+📖 **[Read the full Quality & Lifecycle guide →](docs/pillars/quality-lifecycle.md)**
+
+---
+
+### 4. Security
+
+*Protecting systems, data, and users from risks introduced by autonomous agents.*
+
+Security for AI agents differs from traditional software-agents are autonomous decision-makers that can be manipulated to exceed intended authority:
+
+- **Just-in-Time (JIT) Privilege Access** - Scoped tokens (action + resourceId) with automatic expiration (<5 minutes), step-up authentication for high-risk actions
+- **Audit Logs for Internal Thinking** - Logging reasoning (Chain of Thought), not just inputs/outputs; structured logs for incident investigation
+- **Guardrails** - Deterministic hard stops at three layers: input guardrails (prompt injection detection, PII redaction), output guardrails (sensitive data leakage prevention), action guardrails (rate limits, monetary limits)
+- **Prompt Injection Defenses** - Instruction hierarchy, input sanitization, multi-model validation, sandboxing
+- **Data Privacy in Context Windows** - Context isolation per session, PII redaction, ephemeral context for sensitive data, encryption at rest, GDPR compliance
+
+**Key Metrics:** Prompt Injection Attempts <10/day, Jailbreak Success Rate <0.1%, PII Leakage Incidents 0, MTTD <5 minutes
+
+📖 **[Read the full Security guide →](docs/pillars/security.md)**
+
+---
+
+### 5. Operational Excellence & Team Culture
+
+*Establishing SLAs, error budgets, team structures, and operational practices that enable reliable AI systems to scale.*
+
+Operational Excellence bridges the gap between technical architecture and organizational culture. While the first four pillars define *what* to build, this pillar defines *how* teams operate, measure, and continuously improve AI systems at scale:
+
+- **AI-Specific SLAs & Error Budgets** - Service Level Objectives for availability, latency, quality, safety, and efficiency; error budget policies for balancing reliability with innovation velocity
+- **Team Structure & Shared Responsibility** - Product teams own agents end-to-end; embedded AI Reliability Engineers (AIREs) with 20% time allocation; central platform team provides infrastructure
+- **Progressive Autonomy Maturity Model** - Five levels of agent autonomy (L0: Human-Driven → L4: Autonomous), reducing HITL rate from 100% to <5% over time
+- **Reliability Reviews** - Weekly metric reviews, monthly postmortems, error budget tracking, SLO compliance monitoring
+
+**Key Metrics:** SLO Compliance >95%, Error Budget Remaining >25%, HITL Rate <10%, Autonomy Level L3+, Time to Autonomy <6 months
+
+📖 **[Read the full Operational Excellence guide →](docs/pillars/operational-excellence.md)**
+
+---
+
+## Getting Started
+
+**New to AIRE?** Start with the **[Getting Started Guide →](docs/getting-started.md)** for a step-by-step adoption roadmap:
+
+- **Phase 1 (Week 1-2):** Assess current state, measure baseline metrics
+- **Phase 2 (Month 1):** Quick wins - golden dataset, guardrails, audit logging
+- **Phase 3 (Month 2-3):** Foundation - circuit breakers, state persistence, CI/CD evals
+- **Phase 4 (Month 4-6):** Maturity - feedback loops, drift detection, JIT access
+- **Phase 5 (Month 6+):** Excellence - hallucination rate <0.1%, HITL rate <10%, uptime 99.9%+
+
+**Want to dive deep?** Explore the [complete documentation →](https://aire.exosphere.host)
---
@@ -71,16 +178,29 @@ You get to shape the future of AI reliability engineering and get recognized for
| Benefit | Details |
|---------|---------|
| **Shape the Standard** | Your operational insights become codified best practices. Influence how the industry approaches AI reliability for years to come |
-| **Industry Recognition** | Listed in the [Contributors Registry](CONTRIBUTORS.md) as a contributor to the standards of AI relibility |
+| **Industry Recognition** | Listed in the [Contributors Registry](CONTRIBUTORS.md) as a contributor to the standards of AI reliability |
| **Peer Network** | Join a private forum of engineering leaders exchanging reliability patterns across enterprises |
| **Early Access** | Preview new sections and reference architectures before public release |
| **Thank you gift** | We will send you a gift hamper courtesy to our sponsors |
---
-## Repository Structure (Coming Soon)
-
-We are actively populating this repository with the success patterns from the study and the playbook.
+## Repository Structure
+
+```
+docs/
+├── getting-started.md # Adoption roadmap for organizations
+├── pillars/
+│ ├── resilient-architecture.md # Pillar 1: Fault tolerance, scaling, recovery
+│ ├── cognitive-reliability.md # Pillar 2: Accuracy, consistency, drift detection
+│ ├── quality-lifecycle.md # Pillar 3: Testing, deployment, feedback loops
+│ ├── security.md # Pillar 4: JIT access, guardrails, audit logs
+│ └── operational-excellence.md # Pillar 5: SLAs, team structure, progressive autonomy
+└── appendix/
+ ├── principles.md # AIRE Principles (5 guiding tenets)
+ ├── metrics-framework.md # Three-tier metrics framework
+ └── glossary.md # Key terms and definitions
+```
---
@@ -97,10 +217,10 @@ We welcome Pull Requests (PRs) from engineers who have solved specific reliabili
-Contact nivedit@exosphere.host to sponsor this work.
+Contact nikita@exosphere.host to sponsor this work.
## License
This work is licensed under a [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/).
-You are free to share and adapt this material for any purpose, even commercially, as long as you give appropriate credit.
\ No newline at end of file
+You are free to share and adapt this material for any purpose, even commercially, as long as you give appropriate credit.
diff --git a/docs/appendix/glossary.md b/docs/appendix/glossary.md
new file mode 100644
index 0000000..b5a9410
--- /dev/null
+++ b/docs/appendix/glossary.md
@@ -0,0 +1,169 @@
+# Glossary
+
+Key terms used throughout the AIRE Standards.
+
+---
+
+## A
+
+**Active Learning**
+Process where an AI system learns from human corrections (HITL feedback) to improve over time. Corrections are added to the golden dataset and used to retrain the model.
+
+**AIRE**
+AI Reliability Engineering. The discipline of building and operating reliable AI agents, inspired by Site Reliability Engineering (SRE).
+
+**Agent**
+An autonomous AI system that uses an LLM for reasoning and can take actions (call APIs, query databases, etc.) on behalf of users.
+
+---
+
+## C
+
+**Chain of Thought (CoT)**
+The LLM's step-by-step reasoning process before arriving at a final answer. CoT logging captures this reasoning for debugging and observability.
+
+**Circuit Breaker**
+Fault tolerance pattern that stops calling a degraded service after repeated failures, preventing cascading failures. Has three states: Closed (normal), Open (failing fast), Half-Open (testing recovery).
+
+**Cognitive Reliability**
+The practice of ensuring AI agent outputs are accurate, consistent, and trustworthy through validation, self-reflection, and drift detection.
+
+**Confidence Calibration**
+Ensuring an agent's reported confidence score correlates with actual accuracy. A well-calibrated agent with 90% confidence should be correct 90% of the time.
+
+---
+
+## D
+
+**Drift Detection**
+Monitoring for changes in input distributions, output characteristics, or model performance over time. Types include input drift, output drift, and model drift.
+
+**Durable Execution**
+Workflow execution pattern where state is persisted so workflows can resume from the last checkpoint after failures, rather than restarting from scratch.
+
+---
+
+## E
+
+**EDD (Eval-Driven Development)**
+AI equivalent of Test-Driven Development (TDD). Development methodology where golden datasets and evals guide development, and deployments are blocked if evals fail.
+
+**Ephemeral Context**
+Context (prompts, conversation history) that is processed but never persisted to logs or storage, used for highly sensitive data like medical records.
+
+---
+
+## F
+
+**Fallback Path**
+Alternative strategy when primary system fails. Example: GPT-4 → GPT-3.5 → Rule-based system → Human.
+
+**Feedback Loop**
+System where production failures and HITL corrections are collected, added to the golden dataset, and used to retrain models, creating continuous improvement.
+
+---
+
+## G
+
+**Golden Dataset**
+Curated collection of inputs with labeled expected outputs, used as a regression suite for evaluating model performance before deployment. Typically 100+ examples.
+
+**Groundedness**
+Percentage of claims in an agent's output that are supported by retrieved context or known facts. Target: >95%.
+
+**Guardrails**
+Deterministic hard limits that constrain agent behavior regardless of LLM reasoning. Types: input guardrails (PII redaction, prompt injection detection), output guardrails (sensitive data leakage prevention), action guardrails (rate limits, monetary limits).
+
+---
+
+## H
+
+**Hallucination**
+When an LLM generates factually incorrect or fabricated information presented as fact.
+
+**Hallucination Rate**
+Percentage of agent outputs containing factual errors. Target: <0.1%.
+
+**HITL (Human-in-the-Loop)**
+Design pattern where humans review or approve agent actions before execution, typically for high-stakes decisions or low-confidence outputs.
+
+---
+
+## I
+
+**Idempotency Token**
+Unique identifier that ensures retrying a failed operation doesn't duplicate side effects (e.g., double-charging a customer).
+
+---
+
+## J
+
+**JIT (Just-in-Time) Privilege Access**
+Security pattern where agents are granted minimum necessary privileges scoped to specific actions with automatic expiration (typically <5 minutes).
+
+**Jailbreak**
+Attack where adversarial input manipulates the LLM to bypass intended constraints or reveal system prompts.
+
+---
+
+## M
+
+**MTTD (Mean Time to Detect)**
+Average time from when an issue occurs to when it's detected. Target: <5 minutes.
+
+**MTTR (Mean Time to Recovery)**
+Average time from failure detection to full recovery. Target: <5 minutes.
+
+---
+
+## O
+
+**Offline Evals**
+Pre-deployment evaluations run on a golden dataset in CI/CD to catch regressions before they reach production. Cheap and reproducible.
+
+**Online Evals**
+Post-deployment evaluations run on production traffic to detect drift and real-world failures. More expensive but catches unknown issues.
+
+---
+
+## P
+
+**Prompt Injection**
+Attack where user input contains instructions that override the system prompt, causing the agent to behave incorrectly.
+
+---
+
+## R
+
+**Resumability**
+Ability of a workflow to resume from the last checkpoint after a failure, rather than restarting from scratch.
+
+**Rollback**
+Reverting to a previous version of a model or system after detecting performance degradation in a new deployment.
+
+---
+
+## S
+
+**Scoped Token**
+API key or access token limited to specific actions and resources (e.g., "refundOrder:12345"), reducing blast radius of security breaches.
+
+**Self-Reflection**
+Pattern where an LLM critiques its own output before finalizing, useful for high-stakes decisions. Example: Agent generates answer → critiques it → revises answer.
+
+**Staged Rollout**
+Deployment strategy where new models are rolled out gradually (5% → 50% → 100% traffic) with monitoring at each stage.
+
+**Structured Output**
+LLM output constrained to a specific format (JSON schema, enum) to enable deterministic validation. Example: `{"answer": "...", "confidence": 0.9}`.
+
+---
+
+## T
+
+**Tool**
+External function or API that an agent can call to perform actions (e.g., `getWeather()`, `sendEmail()`, `queryDatabase()`).
+
+---
+
+*This glossary is part of the [AI Reliability Engineering (AIRE) Standards](../index.md). Licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).*
diff --git a/docs/appendix/metrics-framework.md b/docs/appendix/metrics-framework.md
new file mode 100644
index 0000000..0709873
--- /dev/null
+++ b/docs/appendix/metrics-framework.md
@@ -0,0 +1,364 @@
+# Metrics Framework: Three-Tier Structure
+
+## Overview
+
+The AIRE Metrics Framework organizes reliability metrics into three tiers, each serving a different audience and purpose:
+
+- **Tier 1: Business Metrics** - What users and executives care about (user satisfaction, task success, cost)
+- **Tier 2: System Metrics** - Performance indicators for targets that engineering teams track (cognitive accuracy, safety integrity, autonomy level, response performance, cost efficiency)
+- **Tier 3: Component Metrics** - Debugging and optimization metrics for individual components (circuit breakers, fallbacks, drift)
+
+**Principle:** Each tier informs the next. Component metrics explain system metrics. System metrics drive business outcomes.
+
+---
+
+## Tier 1: Business Metrics
+
+**Audience:** Executives, Product Managers, Business Stakeholders
+
+**Purpose:** Measure user value and business impact. These metrics answer: "Are users happy? Is the system delivering value?"
+
+### User Satisfaction
+
+**What It Measures:** User perception of agent quality and usefulness.
+
+**Metrics:**
+
+| Metric | Definition | Target | Measurement |
+|--------|------------|--------|-------------|
+| **NPS (Net Promoter Score)** | Likelihood to recommend (0-10 scale) | >50 | Survey: "How likely are you to recommend this agent?" |
+| **CSAT (Customer Satisfaction)** | Satisfaction rating (1-5 scale) | >4.0 | Survey: "How satisfied are you with this interaction?" |
+| **Thumbs Up/Down Rate** | % positive vs negative feedback | >80% positive | In-app feedback buttons |
+| **User Retention** | % users who return within 30 days | >60% | Track user return rate |
+
+**Why It Matters:** High user satisfaction correlates with adoption, retention, and business value.
+
+---
+
+### Task Success Rate
+
+**What It Measures:** Did the agent successfully complete the user's request?
+
+**Metrics:**
+
+| Metric | Definition | Target | Measurement |
+|--------|------------|--------|-------------|
+| **Task Completion Rate** | % requests where agent completed intended task | >90% | Manual review or automated validation |
+| **Intent Recognition Accuracy** | % requests where agent understood user intent | >95% | Compare agent intent vs user-reported intent |
+| **Resolution Rate** | % queries resolved without escalation | >85% | Track escalations to human support |
+
+**Why It Matters:** Task success directly impacts user value. Low success rates indicate reliability issues.
+
+**Developing themes:**
+
+- **Subjective Success:** What counts as "successful"?
+- **Delayed Feedback:** Success may not be immediately apparent
+- **Multi-Step Tasks:** Success requires multiple steps, how to implement success metrics?
+---
+
+### Cost as a First-Class Reliability Metric
+
+**What It Measures:** Cost per successful interaction. Reliability includes cost efficiency.
+
+**Metrics:**
+
+| Metric | Definition | Target | Measurement |
+|--------|------------|--------|-------------|
+| **Cost per Successful Interaction** | Total cost / Successful interactions | <$0.10 | Track LLM API costs + infrastructure costs |
+| **Cost per Request** | Total cost / Total requests | <$0.15 | Average cost across all requests |
+| **Cost Efficiency Trend** | Month-over-month cost change | <5% increase | Track cost trends as volume scales |
+
+**Why It Matters:** High costs limit scalability and business viability. Cost optimization is a reliability concern.
+
+**Cost Optimization Strategies:**
+
+- Model routing , using cheaper models for simple queries
+- Caching to reduce redundant LLM calls
+- Batch processing by grouping similar queries
+- Fallback strategies to use expensive models only when needed
+
+---
+
+### Business Metric Targets
+
+**Summary Table:**
+
+| Metric Category | Key Metrics | Target | Frequency |
+|-----------------|-------------|--------|-----------|
+| **User Satisfaction** | NPS, CSAT, Thumbs Up Rate | NPS >50, CSAT >4.0, >80% positive | Weekly |
+| **Task Success** | Task Completion Rate, Intent Recognition | >90% completion, >95% intent | Daily |
+| **Cost Efficiency** | Cost per Successful Interaction | <$0.10 per success | Daily |
+
+**Alert Thresholds:**
+
+- **Critical:** User satisfaction drops >10%, task success <85%, cost >$0.15/request
+- **Warning:** User satisfaction drops >5%, task success <90%, cost >$0.12/request
+
+---
+
+## Tier 2: System Metrics
+
+**Audience:** Engineering Teams, SRE, AI Reliability Engineers
+
+**Purpose:** Define performance indicators that map to performance targets. These metrics track system reliability across five dimensions: cognitive accuracy, safety integrity, autonomy level, response performance, and cost efficiency.
+
+### Availability
+
+**What It Measures:** System uptime and error rate.
+
+**Performance Indicators:**
+
+| Indicator | Definition | Target | Measurement |
+|-----------|------------|--------|-------------|
+| **Uptime** | % time system is available | 99.5% (monthly) | Successful responses / Total requests |
+| **Error Rate** | % requests that fail | <0.5% | Failed requests / Total requests |
+| **Successful Response Rate** | % requests returning valid responses | >99.5% | Valid responses / Total requests |
+
+**Error Types:**
+
+- **Service Errors:** 5xx HTTP errors, timeouts, rate limits
+- **Model Errors:** LLM API failures, model unavailable
+- **Infrastructure Errors:** Database failures, queue failures
+
+---
+
+### Latency
+
+**What It Measures:** Response time from user query to agent response.
+
+**Performance Indicators:**
+
+| Indicator | Definition | Target | Measurement |
+|-----------|------------|--------|-------------|
+| **P50 Latency** | Median response time | <2 seconds | 50th percentile response time |
+| **P95 Latency** | 95th percentile response time | <5 seconds | 95th percentile response time |
+| **P99 Latency** | 99th percentile response time | <10 seconds | 99th percentile response time |
+
+**Latency Components:**
+
+- **LLM Generation Time:** Time for model to generate response (largest component)
+- **Tool Call Time:** Time for external API calls
+- **Guardrail Processing:** Time for input/output validation
+- **Network Latency:** Time for request/response transmission
+
+---
+
+### Quality
+
+**What It Measures:** Accuracy, groundedness, and correctness of agent outputs.
+
+**Performance Indicators:**
+
+| Indicator | Definition | Target | Measurement |
+|-----------|------------|--------|-------------|
+| **Hallucination Rate** | % outputs with factual errors | <0.1% | Automated validation |
+| **Groundedness** | % outputs supported by sources | >95% | Verify outputs against source documents |
+| **Factual Accuracy** | % outputs verified as correct | >95% |Automated checks |
+
+**Unanswered questions:**
+
+- **Sampling:** Cannot evaluate 100% of outputs, due to high cost
+
+---
+
+### Safety
+
+**What It Measures:** Effectiveness of guardrails and security controls.
+
+**Performance Indicators:**
+
+| Indicator | Definition | Target | Measurement |
+|-----------|------------|--------|-------------|
+| **Guardrail Block Rate** | % malicious inputs blocked | >99.9% | Blocked attempts / Total malicious attempts |
+| **Jailbreak Success Rate** | % successful jailbreak attempts | <0.1% | Successful jailbreaks / Total attempts |
+| **PII Leakage Rate** | % outputs containing leaked PII | 0% | Detected PII leaks / Total outputs |
+| **Unauthorized Action Rate** | % actions exceeding permissions | 0% | Unauthorized actions / Total actions |
+
+**Why It Matters:** Safety failures can cause security incidents, data breaches, and regulatory violations.
+
+---
+
+### Efficiency
+
+**What It Measures:** Human-in-the-Loop (HITL) rate and cost efficiency.
+
+**Performance Indicators:**
+
+| Indicator | Definition | Target | Measurement |
+|-----------|------------|--------|-------------|
+| **HITL Rate** | % queries requiring human escalation | <10% | Human escalations / Total queries |
+| **Cost per Request** | Average cost per request | <$0.15 | Total cost / Total requests |
+| **Autonomy Level** | Current maturity level (L0-L4) | L3+ | Track progression through autonomy levels |
+
+**Why It Matters:** High HITL rates limit scalability. Cost efficiency enables business viability.
+
+**HITL Escalation Reasons:**
+
+- Low confidence (<0.7)
+- High-risk actions (write operations, high cost)
+- Guardrail violations
+- User requests human review
+
+---
+
+### System Metric Targets
+
+**Summary Table:**
+
+| Dimension | Key Indicators | Target | Frequency |
+|-----------|----------------|--------|-----------|
+| **Cognitive Accuracy** | Hallucination Rate, Groundedness, Factual Accuracy | <0.1% hallucinations, >95% grounded | Daily (sampled) |
+| **Safety Integrity** | Guardrail Block Rate, Jailbreak Rate | >99.9% blocks, <0.1% jailbreaks | Real-time |
+| **Autonomy Level** | HITL Rate, Autonomy Level | <10% HITL, L3+ | Daily |
+| **Response Performance** | Uptime, Latency (P50, P95, P99) | 99.5% uptime, P95 <5s | Real-time |
+| **Cost Efficiency** | Cost per Request, Cost per Success | <$0.15/request, <$0.10/success | Daily |
+
+**Quality Budget Consumption:**
+
+- **Green Zone (>75% budget):** Normal operations, experimentation encouraged
+- **Yellow Zone (50-75% budget):** Reduce deployment velocity, limit risky experiments
+- **Red Zone (<50% budget):** Freeze new features, emergency accuracy work
+
+---
+
+## Tier 3: Component Metrics
+
+**Audience:** Engineers debugging issues, optimizing performance
+
+**Purpose:** Granular metrics for individual components. These metrics explain why system metrics degrade and guide optimization efforts.
+
+### Circuit Breaker Metrics
+
+**What It Measures:** Circuit breaker activations and effectiveness.
+
+**Metrics:**
+
+| Metric | Definition | Target | Measurement |
+|--------|------------|--------|-------------|
+| **Circuit Breaker Activations** | Number of times circuit opened | <10/day | Count circuit opens per day |
+| **Circuit Breaker Duration** | Average time circuit stays open | <30 seconds | Time from open to half-open |
+| **Failed Request Rate** | % requests failing before circuit opens | <5% | Failed requests / Total requests |
+
+**Why It Matters:** Frequent circuit breaker activations indicate service degradation or cascading failures.
+
+---
+
+### Fallback Usage Metrics
+
+**What It Measures:** Fallback path usage and effectiveness.
+
+**Metrics:**
+
+| Metric | Definition | Target | Measurement |
+|--------|------------|--------|-------------|
+| **Fallback Usage Rate** | % requests using fallback paths | <15% | Fallback requests / Total requests |
+| **Fallback Success Rate** | % fallback requests that succeed | >90% | Successful fallbacks / Total fallbacks |
+| **Fallback Latency** | Average latency for fallback paths | <10 seconds | P95 latency for fallback requests |
+
+**Fallback Paths:**
+
+- **Primary → Secondary Model:** GPT-4 → GPT-3.5
+- **Model → Rules:** LLM → Rule-based system
+- **Autonomous → Human:** Agent → Human escalation
+
+---
+
+### State Management Metrics
+
+**What It Measures:** Checkpoint persistence and recovery effectiveness.
+
+**Metrics:**
+
+| Metric | Definition | Target | Measurement |
+|--------|------------|--------|-------------|
+| **Checkpoint Persistence Latency** | Time to save checkpoint | <100ms | P95 checkpoint save time |
+| **Resumability Rate** | % workflows that resume after failure | >99% | Resumed workflows / Failed workflows |
+| **Recovery Time** | Time to recover from checkpoint | <5 seconds | Time from failure to resume |
+
+**Why It Matters:** Fast checkpoint persistence enables quick recovery. High resumability reduces user impact.
+
+---
+
+### Drift Detection Metrics
+
+**What It Measures:** Model and data drift indicators.
+
+**Metrics:**
+
+| Metric | Definition | Target | Measurement |
+|--------|------------|--------|-------------|
+| **Input Drift Score** | Distribution shift in user queries | <0.1 | Statistical distance (KL divergence) |
+| **Output Drift Score** | Distribution shift in agent outputs | <0.1 | Statistical distance (KL divergence) |
+| **Model Drift Score** | Performance degradation on golden dataset | <5% | Accuracy drop on golden dataset |
+| **Confidence Drift** | Shift in confidence score distribution | <0.1 | Statistical distance (mean shift) |
+
+**Why It Matters:** Drift indicates model degradation or changing user behavior. Early detection enables proactive retraining.
+
+---
+
+### Tool Call Metrics
+
+**What It Measures:** External tool integration reliability.
+
+**Metrics:**
+
+| Metric | Definition | Target | Measurement |
+|--------|------------|--------|-------------|
+| **Tool Call Success Rate** | % successful tool calls | >95% | Successful calls / Total calls |
+| **Tool Call Latency** | Average latency for tool calls | <2 seconds | P95 tool call latency |
+| **Tool Call Error Rate** | % tool calls that fail | <5% | Failed calls / Total calls |
+
+**Why It Matters:** Tool failures degrade user experience and task success rates.
+
+---
+
+### Component Metric Targets
+
+**Summary Table:**
+
+| Component | Key Metrics | Target | Frequency |
+|-----------|-------------|--------|-----------|
+| **Circuit Breakers** | Activations, Duration | <10/day, <30s | Real-time |
+| **Fallbacks** | Usage Rate, Success Rate | <15%, >90% | Daily |
+| **State Management** | Persistence Latency, Resumability | <100ms, >99% | Real-time |
+| **Drift Detection** | Input/Output/Model Drift | <0.1, <5% | Daily |
+| **Tool Calls** | Success Rate, Latency | >95%, <2s | Real-time |
+
+---
+
+## Metric Relationships
+
+**How Tiers Connect:**
+
+```mermaid
+graph TB
+ A[Tier 1: Business Metrics
User Satisfaction, Task Success, Cost] --> B[Tier 2: System Metrics
Availability, Latency, Quality, Safety, Efficiency]
+ B --> C[Tier 3: Component Metrics
Circuit Breakers, Fallbacks, Drift, Tool Calls]
+
+ C -.Explains.-> B
+ B -.Drives.-> A
+
+ style A fill:#e3f2fd
+ style B fill:#fff3e0
+ style C fill:#f3e5f5
+```
+
+**Example Flow:**
+
+1. **Business Metric Degrades:** User satisfaction drops 10%
+2. **System Metric Investigation:** Task success rate drops to 85% (from 90%)
+3. **Component Metric Root Cause:** Tool call success rate drops to 90% (from 95%)
+4. **Fix:** Update external API integration, improve error handling
+5. **Verification:** Tool call success rate recovers to 96%, task success recovers to 91%, user satisfaction recovers
+
+---
+
+
+
+
+
+## Further Reading
+
+- [Operational Excellence →](../pillars/operational-excellence.md) (Performance targets, quality budgets)
+- [Resilient Architecture →](../pillars/resilient-architecture.md) (Component reliability patterns)
+
diff --git a/docs/getting-started.md b/docs/getting-started.md
new file mode 100644
index 0000000..c50dbd7
--- /dev/null
+++ b/docs/getting-started.md
@@ -0,0 +1,352 @@
+# Getting Started with AIRE
+
+## Who Should Use This Guide
+
+This guide is designed for:
+
+- **CTOs** seeking to establish AI reliability standards across engineering teams
+- **AI Architects** responsible for designing production-grade agent systems
+- **Engineering Leaders** building or scaling AI agents from prototype to production
+- **Platform Engineers** implementing infrastructure for reliable AI deployments
+
+If you're running AI agents in production (or planning to), this guide will help you adopt AIRE practices systematically.
+
+---
+
+## Understanding Your Current State
+
+Before adopting AIRE, assess your current AI reliability maturity:
+
+### Maturity Assessment
+
+| Capability | Level 0 (None) | Level 1 (Basic) | Level 2 (Intermediate) | Level 3 (Advanced) |
+|------------|----------------|-----------------|------------------------|-------------------|
+| **Testing** | Manual testing only | Some unit tests | Golden dataset exists | Offline + online evals in CI/CD |
+| **Monitoring** | Basic logs | Structured logging | CoT logging | Full observability with alerts |
+| **Failure Recovery** | Manual restart | Basic retries | State persistence | Circuit breakers + fallbacks |
+| **Security** | None | Input validation | Guardrails | JIT access + audit logs |
+| **HITL** | Ad-hoc | Queue system | Confidence-based routing | Active learning loop |
+
+
+
+---
+
+## Adoption Roadmap
+
+### Phase 1: Assess Current State (Week 1-2)
+
+**Goal:** Understand existing AI agents and reliability pain points.
+
+#### Tasks:
+1. **Inventory existing AI agents**
+ - List all production agents (chatbots, automation, data processing)
+ - Document model types (GPT-4, Claude, custom)
+ - Identify critical vs non-critical agents
+
+2. **Identify reliability pain points**
+ - What % of requests fail?
+ - How often does HITL intervene?
+ - What are top 3 user complaints?
+
+3. **Measure baseline metrics**
+ - Success rate (% of successful requests)
+ - Hallucination rate (manual sample of 50+ outputs)
+ - HITL rate (% of requests needing human review)
+ - MTTR (mean time to recover from failures)
+
+**Deliverable:** Reliability assessment report with baseline metrics
+
+---
+
+### Phase 2: Quick Wins (Month 1)
+
+**Goal:** Implement high-impact, low-effort improvements to demonstrate value.
+
+#### 2.1 Implement Golden Dataset for Critical Agent
+
+**Why:** Catches regressions before deployment.
+
+**Steps:**
+
+1. Identify your most critical agent (highest business impact)
+2. Collect 100 examples:
+ - 60 core capabilities (happy path)
+ - 30 edge cases (from production failures)
+ - 10 adversarial examples (prompt injections)
+3. Store in Git with version control
+4. Run offline evals weekly
+
+**Time:** 1 week
+**Impact:** Prevents regressions, reduces production failures by 30-50%
+
+---
+
+#### 2.2 Add Basic Guardrails
+
+**Why:** Prevents catastrophic failures from LLM misbehavior.
+
+**Steps:**
+
+1. Implement input guardrails:
+ - Prompt injection detection (keyword matching)
+ - PII redaction (email, credit card, SSN)
+ - Rate limiting (per user)
+
+2. Implement output guardrails:
+ - Sensitive data leakage prevention
+ - Length limits
+
+3. Implement action guardrails:
+ - Monetary transaction limits
+ - Email rate limits
+
+**Time:** 1 week
+
+**Impact:** Reduces security incidents by 80%+
+
+---
+
+#### 2.3 Set Up Audit Logging
+
+**Why:** Enables incident investigation and debugging.
+
+**Steps:**
+1. Log all agent requests (user query, timestamp, userId)
+2. Log agent reasoning (Chain of Thought)
+3. Log action execution (success/failure)
+4. Store logs in structured format (JSON)
+5. Define retention policy (30-90 days)
+
+**Time:** 3 days
+
+**Impact:** Reduces MTTR by 50%+
+
+---
+
+### Phase 3: Foundation (Month 2-3)
+
+**Goal:** Build core infrastructure for reliability.
+
+#### 3.1 Deploy Circuit Breakers
+
+**Why:** Prevents cascading failures.
+
+**Steps:**
+1. Identify external dependencies (LLM APIs, databases, external APIs)
+2. Implement circuit breakers for each dependency
+3. Configure failure thresholds (5 failures = open)
+4. Add fallback paths (GPT-4 → GPT-3.5 → Human)
+
+**Time:** 1 week
+
+**Impact:** Improves system uptime by 2-3 nines
+
+---
+
+#### 3.2 Implement State Persistence
+
+**Why:** Enables workflow resumption after failures.
+
+**Steps:**
+1. Choose state store (Redis, PostgreSQL, DynamoDB)
+2. Implement checkpointing (save state after each step)
+3. Add workflow resumption logic
+4. Test failure recovery
+
+**Time:** 2 weeks
+**Impact:** Eliminates expensive LLM recomputations
+
+---
+
+#### 3.3 Run Offline Evals in CI/CD
+
+**Why:** Blocks bad deployments automatically.
+
+**Steps:**
+
+1. Integrate offline evals into CI/CD pipeline
+2. Set quality gates (accuracy >95%, hallucination rate <0.1%)
+3. Block deployment if evals fail
+4. Alert team on failures
+
+**Time:** 1 week
+
+**Impact:** Prevents 90%+ of regressions from reaching production
+
+---
+
+### Phase 4: Maturity (Month 4-6)
+
+**Goal:** Achieve production-grade reliability.
+
+#### 4.1 Build Feedback Loops
+
+**Why:** System improves continuously from production failures.
+
+**Steps:**
+
+1. Collect production failures automatically
+2. Add HITL corrections to golden dataset weekly
+3. Retrain model monthly on feedback
+4. Measure improvement (HITL rate should decrease)
+
+**Time:** 3 weeks
+
+**Impact:** Reduces HITL rate by 50% over 6 months
+
+---
+
+#### 4.2 Implement Drift Detection
+
+**Why:** Catches silent performance degradation.
+
+**Steps:**
+
+1. Set up input drift monitoring (embedding divergence)
+2. Set up output drift monitoring (confidence distribution)
+3. Configure alerts (drift threshold: 0.3)
+4. Create drift response playbook
+
+**Time:** 1 week
+
+**Impact:** Detects issues before users complain
+
+---
+
+#### 4.3 Deploy JIT Privilege Access
+
+**Why:** Minimizes blast radius of security incidents.
+
+**Steps:**
+
+1. Replace master API keys with scoped tokens
+2. Implement JIT token generation (5-minute expiry)
+3. Add step-up authentication for high-risk actions
+4. Log all privilege requests
+
+**Time:** 2 weeks
+
+**Impact:** Reduces security risk by 90%+
+
+---
+
+### Phase 5: Excellence (Month 6+)
+
+**Goal:** Achieve industry-leading reliability.
+
+#### Targets:
+
+- **Hallucination Rate:** <0.1%
+- **HITL Rate:** <10%
+- **System Uptime:** 99.9%+
+- **Deployment Success Rate:** >95%
+- **MTTR:** <5 minutes
+
+#### Practices:
+
+- Quarterly golden dataset reviews
+- Monthly model retraining
+- Quarterly red team security testing
+- Continuous online evals
+- Full observability with real-time dashboards
+
+---
+
+## Implementation Priorities
+
+### Decision Tree: Where to Start?
+
+```
+Is your agent in production?
+├─ No: Start with Phase 2.1 (Golden Dataset)
+└─ Yes: Has it caused incidents?
+ ├─ No: Start with Phase 2 (Quick Wins)
+ └─ Yes: What type?
+ ├─ Security: Phase 2.2 (Guardrails) + Phase 4.3 (JIT Access)
+ ├─ Failures: Phase 3.1 (Circuit Breakers) + Phase 3.2 (State)
+ └─ Quality: Phase 2.1 (Golden Dataset) + Phase 3.3 (Offline Evals)
+```
+
+### Priority by Agent Type
+
+| Agent Type | Priority 1 | Priority 2 | Priority 3 |
+|------------|-----------|-----------|-----------|
+| **Customer Support Bot** | Guardrails | Golden Dataset | HITL Protocols |
+| **Data Processing Agent** | State Persistence | Circuit Breakers | Drift Detection |
+| **Code Generation Agent** | Golden Dataset | Self-Reflection | Offline Evals |
+| **Financial Agent** | JIT Access | Guardrails | Audit Logging |
+
+---
+
+## Success Metrics
+
+Track these metrics to measure AIRE adoption progress:
+
+### Leading Indicators (Predictive)
+
+- **Golden Dataset Coverage:** % of agents with golden datasets
+- **CI/CD Eval Integration:** % of deployments with eval gates
+- **Guardrail Coverage:** % of agents with input/output/action guardrails
+
+### Lagging Indicators (Outcomes)
+
+- **Incident Reduction:** % decrease in production incidents
+- **MTTR Improvement:** % decrease in mean time to recovery
+- **HITL Reduction:** % decrease in human escalation rate
+- **User Satisfaction:** % increase in positive feedback
+
+---
+
+## Common Challenges & Solutions
+
+### Challenge 1: "We don't have time to build golden datasets"
+
+**Solution:** Start small (20 examples), grow iteratively. Add 5-10 examples per week from production failures.
+
+---
+
+### Challenge 2: "Offline evals slow down our deployment velocity"
+
+**Solution:** Run evals in parallel (5-10 minutes). Benefits (fewer production incidents) outweigh cost.
+
+---
+
+### Challenge 3: "Our LLM outputs are too subjective to test"
+
+**Solution:** Use semantic similarity matching (80%+ similarity = pass). Not binary, but better than nothing.
+
+---
+
+### Challenge 4: "We can't afford downtime to implement these changes"
+
+**Solution:** Deploy incrementally. Start with new agents, gradually migrate legacy systems.
+
+---
+
+### Challenge 5: "Management doesn't prioritize reliability"
+
+**Solution:** Quantify cost of unreliability (incident cost × incident rate). Present ROI case.
+
+---
+
+## Next Steps
+
+1. **Assess your current state** using the maturity assessment above
+2. **Choose your starting phase** based on current maturity level
+3. **Pick one pilot agent** (critical but not too complex)
+4. **Implement Phase 2 quick wins** (golden dataset + guardrails + logging)
+5. **Measure improvement** (baseline → post-implementation metrics)
+6. **Expand to more agents** using lessons learned
+
+---
+
+## Resources
+
+- **[Pillar 1: Resilient Architecture](pillars/resilient-architecture.md)** - Start here for infrastructure patterns
+- **[Pillar 2: Cognitive Reliability](pillars/cognitive-reliability.md)** - Start here for output quality
+- **[Pillar 3: Quality & Lifecycle](pillars/quality-lifecycle.md)** - Start here for testing and deployment
+- **[Pillar 4: Security](pillars/security.md)** - Start here for adversarial robustness
+
+---
+
+*This guide is part of the [AI Reliability Engineering (AIRE) Standards](index.md). Licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).*
diff --git a/docs/index.md b/docs/index.md
new file mode 100644
index 0000000..36e5d93
--- /dev/null
+++ b/docs/index.md
@@ -0,0 +1,228 @@
+# The AI Reliability Engineering (AIRE) Standards
+
+[](https://creativecommons.org/licenses/by/4.0/)
+[]()
+
+> **An open implementation guide for building reliable AI Agents at scale. Defining the practices for AI Reliability Engineering (AIRE).**
+
+---
+
+## Introduction
+
+As AI systems move from "experimental" prototypes to "mission-critical" production environments, reliability has emerged as the single biggest barrier to adoption.
+
+This repository serves as the **Open Standard for AI Reliability Engineering (AIRE)**. It documents the architectural patterns, testing frameworks, and operational practices that engineering teams use to achieve production-grade reliability in non-deterministic systems.
+
+It is not a theoretical academic paper. It is a living collection of **"Success Patterns"** gathered from practitioners running agents at scale.
+
+---
+
+## Core Pillars of AIRE
+
+We define the reliability of an Agentic System through five core pillars:
+
+### 1. Resilient Architecture
+
+*Building systems that gracefully handle failures, scale under load, and recover from errors.*
+
+Resilient architecture establishes the structural foundation for reliable AI systems. It encompasses:
+
+- **Elastic Auto-Scaling** - Horizontal and vertical scaling strategies for unpredictable AI workloads
+- **State Management** - Checkpoint-based recovery enabling workflows to resume from last checkpoint after failures (not restart from scratch)
+- **Circuit Breakers** - Fault tolerance patterns that prevent cascading failures by failing fast when services degrade
+- **Fallback Paths** - Multi-tier fallback strategies (GPT-4 → GPT-3.5 → Rules → Human)
+- **The Reliability Stack Pattern** - Separating probabilistic reasoning (LLM) from deterministic safety (guardrails)
+
+**Key Metrics:** Resumability Rate >99%, Circuit Breaker Activations <10/day, Fallback Usage Rate <15%, MTTR <5 minutes
+
+📖 **[Read the full Resilient Architecture guide →](pillars/resilient-architecture.md)**
+
+---
+
+### 2. Cognitive Reliability
+
+*Ensuring AI agents produce accurate, consistent, and trustworthy outputs.*
+
+Cognitive reliability addresses the correctness problem - ensuring outputs are grounded, validated, and trustworthy:
+
+- **Self-Reflection & Correction** - Chain-of-thought with reflection, multi-agent debate for high-stakes decisions
+- **Structured Outputs** - JSON schema validation, forced choice enums, regex-constrained generation
+- **Human-in-the-Loop (HITL) Protocols** - Confidence-based escalation with design patterns to reduce HITL over time through active learning
+- **Drift Detection** - Input drift (distribution changes), output drift (confidence shifts), model drift (version changes)
+
+**Key Metrics:** Hallucination Rate <0.1%, Groundedness >95%, HITL Rate <10%, Confidence Calibration within 10%
+
+📖 **[Read the full Cognitive Reliability guide →](pillars/cognitive-reliability.md)**
+
+---
+
+### 3. Quality & Lifecycle
+
+*Moving from "vibes-based" development to rigorous testing and continuous improvement.*
+
+Quality & Lifecycle practices define how to test, deploy, and continuously improve AI systems:
+
+- **Evals-Driven Deployments** - CI/CD gates with golden datasets, staged rollouts (canary → gradual → full), automatic rollback triggers
+- **Golden Datasets** - Curated regression suites (60% core capabilities, 30% edge cases, 10% adversarial), versioned in Git, continuously updated
+- **Unit Testing Agents** - Tool calling tests, prompt adherence tests, synthetic data tests
+- **Online vs Offline Evals** - Pre-deployment regression testing (offline) + post-deployment drift detection (online)
+- **Feedback Loops** - Production failures → HITL corrections → golden dataset updates → model retraining
+
+**Key Metrics:** Golden Dataset Accuracy >95%, Deployment Success Rate >90%, User Satisfaction >80%, Feedback Loop Latency <7 days
+
+📖 **[Read the full Quality & Lifecycle guide →](pillars/quality-lifecycle.md)**
+
+---
+
+### 4. Security
+
+*Protecting systems, data, and users from risks introduced by autonomous agents.*
+
+Security for AI agents differs from traditional software-agents are autonomous decision-makers that can be manipulated to exceed intended authority:
+
+- **Just-in-Time (JIT) Privilege Access** - Scoped tokens (action + resourceId) with automatic expiration (<5 minutes), step-up authentication for high-risk actions
+- **Audit Logs for Internal Thinking** - Logging reasoning (Chain of Thought), not just inputs/outputs; structured logs for incident investigation
+- **Guardrails** - Deterministic hard stops at three layers: input guardrails (prompt injection detection, PII redaction), output guardrails (sensitive data leakage prevention), action guardrails (rate limits, monetary limits)
+- **Prompt Injection Defenses** - Instruction hierarchy, input sanitization, multi-model validation, sandboxing
+- **Data Privacy in Context Windows** - Context isolation per session, PII redaction, ephemeral context for sensitive data, encryption at rest, GDPR compliance
+
+**Key Metrics:** Prompt Injection Attempts <10/day, Jailbreak Success Rate <0.1%, PII Leakage Incidents 0, MTTD <5 minutes
+
+📖 **[Read the full Security guide →](pillars/security.md)**
+
+---
+
+### 5. Operational Excellence & Team Culture
+
+*Establishing performance targets, quality budgets, team structures, and operational practices that enable reliable AI systems to scale.*
+
+Operational Excellence bridges the gap between technical architecture and organizational culture. While the first four pillars define *what* to build, this pillar defines *how* teams operate, measure, and continuously improve AI systems at scale:
+
+- **AI-Specific Performance Targets & Quality Budgets** - Performance targets for cognitive accuracy, safety integrity, autonomy level, response performance, and cost efficiency; quality budget policies for balancing reliability with innovation velocity
+- **Team Structure & Shared Responsibility** - Product teams own agents end-to-end; embedded AI Reliability Engineers (AIREs) with 20% time allocation; central platform team provides infrastructure
+- **Progressive Autonomy Maturity Model** - Five levels of agent autonomy (L0: Human-Driven → L4: Autonomous), reducing HITL rate from 100% to <5% over time
+- **Reliability Reviews** - Weekly metric reviews, monthly postmortems, quality budget tracking, performance target compliance monitoring
+
+**Key Metrics:** Performance Target Compliance >95%, Quality Budget Remaining >50%, HITL Rate <10%, Autonomy Level L3+, Time to Autonomy <6 months
+
+📖 **[Read the full Operational Excellence & Team Culture guide →](pillars/operational-excellence.md)**
+
+---
+
+
+## AIRE Principles
+
+*Guiding tenets inspired by SRE:*
+
+These five principles define the philosophical foundation of AIRE. They inform the practices detailed in the five pillars and help teams make trade-off decisions when designing reliable AI systems.
+
+### 1. Embrace Non-Determinism
+
+Accept that identical inputs will produce variable outputs. Design systems that succeed despite variance, not systems that assume consistency.
+
+**Key Insight:** AI systems are probabilistic reasoners. Don't try to make them deterministic-build resilience around their non-determinism through structured outputs, guardrails, and fallback paths.
+
+### 2. Reliability is a Feature
+
+Reliability competes with velocity for engineering resources. Treat it as a first-class product requirement with explicit budgets, not an afterthought.
+
+**Key Insight:** Allocate dedicated engineering time (e.g., 20% of sprints) to reliability work: golden dataset updates, eval pipeline maintenance, incident reviews.
+
+### 3. Measure, Don't Assume
+
+If you cannot quantify the reliability of your AI system, you do not have a reliable AI system. Intuition is not evidence.
+
+**Key Insight:** Track concrete metrics (hallucination rate <0.1%, HITL rate <10%, uptime >99.9%). Block deployments if metrics degrade.
+
+### 4. Fail Gracefully, Fail Informatively
+
+Every failure should preserve context, enable recovery, and generate learnings. Silent failures are unacceptable.
+
+**Key Insight:** Save checkpoints, log Chain of Thought reasoning, return user-friendly errors, and ensure workflows can resume after crashes.
+
+### 5. Humans as Fallback, Not Crutch
+
+Design for autonomous operation. Human escalation is a safety net for edge cases, not a substitute for robust engineering.
+
+**Key Insight:** Reduce HITL rate over time through active learning. Start at 100% human review, target <10% through continuous improvement.
+
+📖 **[Read the detailed AIRE Principles guide →](principles.md)**
+
+---
+
+
+## Getting Started
+
+**New to AIRE?** Start with the **[Getting Started Guide →](getting-started.md)** for a step-by-step adoption roadmap:
+
+- **Phase 1 (Week 1-2):** Assess current state, measure baseline metrics
+- **Phase 2 (Month 1):** Quick wins - golden dataset, guardrails, audit logging
+- **Phase 3 (Month 2-3):** Foundation - circuit breakers, state persistence, CI/CD evals
+- **Phase 4 (Month 4-6):** Maturity - feedback loops, drift detection, JIT access
+- **Phase 5 (Month 6+):** Excellence - hallucination rate <0.1%, HITL rate <10%, uptime 99.9%+
+
+**Want to dive deep?** Explore the [complete documentation →](https://aire.exosphere.host)
+
+---
+
+## Ongoing Research
+
+This standard evolves through continuous dialogue with engineering teams operating AI systems in production. We conduct ongoing interviews with practitioners to surface new failure modes, validate emerging patterns, and refine existing guidance.
+
+**Are you running Agents in production?**
+We are actively seeking contributors to share their architectural decisions, operational challenges, and reliability wins.
+
+* **[Share Your Experience](https://cal.com/outer-space/interview-ai-reliability-standards)**
+
+**Why Contribute?**
+You get to shape the future of AI reliability engineering and get recognized for your contributions.
+
+| Benefit | Details |
+|---------|---------|
+| **Shape the Standard** | Your operational insights become codified best practices. Influence how the industry approaches AI reliability for years to come |
+| **Industry Recognition** | Listed in the Contributors Registry as a contributor to the standards of AI reliability |
+| **Peer Network** | Join a private forum of engineering leaders exchanging reliability patterns across enterprises |
+| **Early Access** | Preview new sections and reference architectures before public release |
+| **Thank you gift** | We will send you a gift hamper courtesy to our sponsors |
+
+---
+
+## Repository Structure
+
+```
+docs/
+├── getting-started.md # Adoption roadmap for organizations
+├── pillars/
+│ ├── resilient-architecture.md # Pillar 1: Fault tolerance, scaling, recovery
+│ ├── cognitive-reliability.md # Pillar 2: Accuracy, consistency, drift detection
+│ ├── quality-lifecycle.md # Pillar 3: Testing, deployment, feedback loops
+│ ├── security.md # Pillar 4: JIT access, guardrails, audit logs
+│ └── operational-excellence.md # Pillar 5: SLAs, team structure, progressive autonomy
+└── appendix/
+ ├── principles.md # AIRE Principles (5 guiding tenets)
+ ├── metrics-framework.md # Three-tier metrics framework
+ └── glossary.md # Key terms and definitions
+```
+
+---
+
+## Contribution & Governance
+
+This standard belongs to the community.
+
+We welcome Pull Requests (PRs) from engineers who have solved specific reliability challenges.
+
+* See a missing pattern? Open a PR.
+* Want to debate a standard? Open an Issue.
+
+## Sponsors
+
+
+
+Contact nivedit@exosphere.host to sponsor this work.
+
+## License
+
+This work is licensed under a [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/).
+
+You are free to share and adapt this material for any purpose, even commercially, as long as you give appropriate credit.
diff --git a/docs/pillars/cognitive-reliability.md b/docs/pillars/cognitive-reliability.md
new file mode 100644
index 0000000..2bb9887
--- /dev/null
+++ b/docs/pillars/cognitive-reliability.md
@@ -0,0 +1,249 @@
+# Pillar 2: Cognitive Reliability
+
+## Philosophy
+
+> *"Measure, Don't Assume"* - If you cannot quantify reliability, you do not have a reliable system. Intuition is not evidence.
+
+Cognitive reliability addresses the correctness problem: ensuring outputs are accurate, grounded, and trustworthy. Unlike traditional software bugs (deterministic and reproducible), AI failures are probabilistic-hallucinations, drift, and inconsistency emerge unpredictably.
+
+**The goal:** Validate outputs, detect drift, and continuously improve through measurement.
+
+---
+
+## Core Concepts
+
+### 1. Self-Reflection & Correction
+
+**Principle:** Make agents critique their own outputs before finalizing decisions.
+
+For high-stakes decisions, single-pass reasoning is insufficient. Self-reflection adds a validation layer where the agent reviews its own work.
+
+**Two Approaches:**
+
+**Chain-of-Thought with Reflection:**
+1. Agent generates initial answer with reasoning
+2. Agent critiques its own reasoning (identify flaws, biases, missing context)
+3. Agent revises answer based on critique
+4. Return final answer
+
+**Multi-Agent Debate:**
+1. Multiple agents independently generate answers
+2. Agents debate their solutions (argue for/against each approach)
+3. Consensus mechanism selects final answer (majority vote, confidence-weighted, or meta-agent arbitration)
+
+**When to Use:**
+- High-stakes decisions (medical diagnosis, legal advice, financial transactions)
+- Complex reasoning tasks (multi-step math, code generation, strategic planning)
+- Low-confidence outputs (agent uncertainty score <0.7)
+
+**Trade-offs:**
+- **Cost:** 2-5x more LLM calls
+- **Latency:** 2-3x slower response time
+- **Accuracy:** 15-40% reduction in error rate (domain-dependent)
+
+**Implementation Pattern:**
+
+```pseudocode
+function selfReflect(userQuery):
+ # Step 1: Generate initial answer
+ initialAnswer = llm.generate(userQuery)
+
+ # Step 2: Self-critique
+ critique = llm.generate(
+ "Review this answer for errors, biases, and gaps: " + initialAnswer
+ )
+
+ # Step 3: Revise based on critique
+ finalAnswer = llm.generate(
+ "Original: " + initialAnswer +
+ "\nCritique: " + critique +
+ "\nProvide revised answer:"
+ )
+
+ return finalAnswer
+```
+
+---
+
+### 2. Structured Outputs
+
+**Principle:** Force outputs into predictable formats for deterministic validation.
+
+LLMs produce unstructured text. Structured outputs (JSON, enums, regex-constrained) enable programmatic validation and downstream integration.
+
+**Three Techniques:**
+
+| Technique | Use Case | Example |
+|-----------|----------|---------|
+| **JSON Schema** | Complex nested data | `{"sentiment": "positive", "confidence": 0.92, "entities": [...]}` |
+| **Forced Choice (Enums)** | Classification tasks | `status: ["approved", "rejected", "needs_review"]` |
+| **Regex Constraints** | Formatted strings | Email, phone numbers, dates |
+
+**Benefits:**
+
+- **Validation:** Reject malformed outputs before they reach production
+- **Type Safety:** Integrate with strongly-typed codebases
+- **Consistency:** Eliminate format variations ("yes" vs "Yes" vs "true")
+
+**Implementation Pattern:**
+
+```pseudocode
+schema = {
+ "type": "object",
+ "properties": {
+ "action": {"enum": ["approve", "reject", "escalate"]},
+ "confidence": {"type": "number", "minimum": 0, "maximum": 1},
+ "reasoning": {"type": "string"}
+ },
+ "required": ["action", "confidence"]
+}
+
+function processWithSchema(userQuery):
+ rawOutput = llm.generate(userQuery)
+
+ try:
+ structuredOutput = validateSchema(rawOutput, schema)
+ return structuredOutput
+ catch ValidationError:
+ # Retry with schema in prompt
+ retryOutput = llm.generate(
+ userQuery + "\nRespond in JSON format: " + schema
+ )
+ return validateSchema(retryOutput, schema)
+```
+
+---
+
+### 3. Human-in-the-Loop (HITL) Protocols
+
+**Principle:** Use humans as a safety net for edge cases-not a crutch for poor engineering.
+
+HITL adds human review for high-stakes or low-confidence decisions. The goal is to **reduce HITL over time** through active learning.
+
+**Confidence-Based Escalation:**
+
+```mermaid
+graph TD
+ A[Agent Output] --> B{Confidence > 0.9?}
+ B -->|Yes| C[Auto-Execute]
+ B -->|No| D{Confidence > 0.7?}
+ D -->|Yes| E[Execute with Warning]
+ D -->|No| F[Escalate to Human]
+ F --> G[Human Decision]
+ G --> H[Add to Golden Dataset]
+ H --> I[Retrain Model]
+```
+
+**Design Patterns to Reduce HITL:**
+
+| Pattern | Description | Example |
+|---------|-------------|---------|
+| **Active Learning** | Add human corrections to training data | HITL corrections → golden dataset → model retraining |
+| **Staged Rollout** | Start with 100% HITL, reduce over time | Month 1: 100% review → Month 3: 10% review |
+| **Confidence Calibration** | Improve agent's self-awareness | Train model to predict its own accuracy |
+| **Batch Review** | Group similar low-confidence cases | Human reviews 50 refund requests at once |
+
+**Metrics:**
+
+- **HITL Rate:** % requests requiring human review (target: <10%)
+- **HITL Response Time:** Median time from escalation to human decision (target: <5 minutes)
+- **Override Rate:** % times humans overrule agent (target: <20%)
+
+**Anti-Pattern:** Using HITL for all decisions because "we don't trust the AI." This defeats the purpose of automation.
+
+---
+
+### 4. Drift Detection
+
+**Principle:** Monitor for distribution shifts in inputs, outputs, and model behavior.
+
+AI systems degrade over time as real-world data drifts from training data. Proactive drift detection prevents silent failures.
+
+**Three Types of Drift:**
+
+| Drift Type | What Changes | Example | Detection Method |
+|------------|--------------|---------|------------------|
+| **Input Drift** | User query distribution | COVID pandemic shifts customer support queries | Embedding divergence, statistical tests |
+| **Output Drift** | Agent response patterns | Model starts refusing more queries | Sentiment shift, keyword frequency |
+| **Model Drift** | Underlying model behavior | GPT-4 version update changes reasoning style | A/B test old vs new model |
+
+**Embedding Divergence Tracking:**
+
+Compare embedding distributions between baseline (training data) and production (live queries):
+
+```pseudocode
+function detectInputDrift():
+ # Baseline embeddings from golden dataset
+ baselineEmbeddings = embed(goldenDataset.inputs)
+
+ # Production embeddings from last 24 hours
+ productionEmbeddings = embed(recentQueries)
+
+ # Calculate divergence (KL divergence, cosine distance)
+ divergence = calculateDivergence(baselineEmbeddings, productionEmbeddings)
+
+ if divergence > DRIFT_THRESHOLD:
+ alert("Input drift detected: " + divergence)
+ triggerDatasetRefresh()
+```
+
+**Mitigation Strategies:**
+
+- **Dataset refresh:** Add recent production examples to golden dataset
+- **Model retraining:** Fine-tune on recent data
+- **Prompt updates:** Adjust prompts for new query patterns
+- **Fallback triggers:** Route drifted queries to more powerful models
+
+---
+
+## Metrics & Observability
+
+Track these metrics to measure cognitive reliability:
+
+| Metric | Target | Measurement |
+|--------|--------|-------------|
+| **Hallucination Rate** | <0.1% | % outputs containing factually incorrect claims |
+| **Groundedness** | >95% | % claims supported by retrieved context or known facts |
+| **Consistency Rate** | >90% | % identical inputs producing semantically equivalent outputs |
+| **HITL Rate** | <10% | % requests requiring human review |
+| **Confidence Calibration** | Within 10% | Difference between predicted confidence and actual accuracy |
+| **Drift Alert Frequency** | <1/week | Count of drift alerts triggering dataset refresh |
+
+**Measurement Techniques:**
+
+- **Hallucination Rate:** Use fact-checking models (e.g., retrieval-augmented verification)
+- **Groundedness:** Compare output claims against source documents (citation matching)
+- **Consistency:** Generate embeddings for outputs to same input; measure cosine similarity
+- **Confidence Calibration:** Plot predicted confidence vs. actual accuracy; measure calibration error
+
+---
+
+## Common Pitfalls
+
+1. **No Structured Outputs**
+ - *Problem:* LLM returns freeform text that breaks downstream systems
+ - *Fix:* Enforce JSON schemas or enums for all production outputs
+
+2. **Over-Reliance on Self-Reflection**
+ - *Problem:* Using reflection for all queries wastes cost/latency
+ - *Fix:* Reserve reflection for high-stakes decisions only
+
+3. **Static Golden Datasets**
+ - *Problem:* Dataset becomes stale as real-world queries drift
+ - *Fix:* Continuously update golden dataset from production failures
+
+4. **HITL as Crutch**
+ - *Problem:* 50%+ of queries need human review indefinitely
+ - *Fix:* Implement active learning to reduce HITL over time
+
+5. **No Confidence Calibration**
+ - *Problem:* Agent claims 90% confidence but is only 50% accurate
+ - *Fix:* Train model on confidence prediction; validate against ground truth
+
+6. **Ignoring Drift**
+ - *Problem:* Performance silently degrades as data distribution shifts
+ - *Fix:* Set up automated drift monitoring with alerts
+
+---
+
+*This pillar is part of the [AI Reliability Engineering (AIRE) Standards](../index.md). Licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).*
diff --git a/docs/pillars/operational-excellence.md b/docs/pillars/operational-excellence.md
new file mode 100644
index 0000000..23a1cf9
--- /dev/null
+++ b/docs/pillars/operational-excellence.md
@@ -0,0 +1,456 @@
+# Pillar 5: Operational Excellence & Team Culture
+
+## Philosophy
+
+> *"Reliability is a Feature"* - Reliability competes with velocity for engineering resources. Treat it as a first-class product requirement with explicit budgets, not an afterthought.
+
+Operational Excellence bridges the gap between technical architecture and organizational culture. While the first four pillars define *what* to build, this pillar defines *how* teams operate, measure, and continuously improve AI systems at scale.
+
+**The goal:** Establish performance targets, quality budgets, team structures, and operational practices that enable reliable AI systems to scale across organizations.
+
+---
+
+## Core Concepts
+
+### 1. AI-Specific Performance Targets and Quality Budgets
+
+**Principle:** Define performance targets that matter for AI systems-not just uptime, but cognitive accuracy, safety, and autonomy.
+
+Traditional SRE focuses on binary success/failure (uptime, error rate). AI systems operate in a probabilistic space where "success" is nuanced: outputs can be partially correct, hallucinations can be subtle, and quality degrades gradually. **Quality Budgets** (not error budgets) track acceptable degradation in accuracy, groundedness, and safety-enabling teams to balance innovation velocity with reliability.
+
+**Key Difference:** Traditional error budgets track "how many failures can we tolerate?" AI quality budgets track "how much accuracy degradation can we accept while shipping improvements?"
+
+**AI Performance Dimensions:**
+
+| Dimension | Performance Indicators | Example Measurement |
+|-----------|----------------------|---------------------|
+| **Cognitive Accuracy** | Hallucination rate, factual correctness, groundedness | % outputs verified as factually correct (sampled) |
+| **Safety Integrity** | Guardrail effectiveness, jailbreak resistance | % malicious inputs successfully blocked |
+| **Autonomy Level** | HITL rate, confidence calibration | % queries resolved without human escalation |
+| **Response Performance** | Latency (P50, P95, P99), availability | Time from user query to agent response |
+| **Cost Efficiency** | Cost per successful interaction | Total cost / Successful interactions |
+
+**Example Performance Target Definitions:**
+
+```yaml
+Service: Customer Support Agent
+Performance Targets:
+ Cognitive Accuracy:
+ Indicator: Factually correct outputs / Total outputs (sampled)
+ Target: 95% accuracy (monthly)
+ Quality Budget: 5% degradation acceptable (allows experimentation)
+
+ Safety Integrity:
+ Indicator: Successful guardrail blocks / Total malicious attempts
+ Target: 99.9% block rate
+ Quality Budget: 0.1% jailbreak success rate (zero tolerance for safety)
+
+ Autonomy Level:
+ Indicator: Queries resolved autonomously / Total queries
+ Target: 90% autonomous (HITL rate <10%)
+ Quality Budget: 10% can require human escalation (progressive improvement)
+
+ Response Performance:
+ Indicator: P95 response time
+ Target: <5 seconds
+ Quality Budget: 5% of requests can exceed 5 seconds
+
+ Cost Efficiency:
+ Indicator: Cost per successful interaction
+ Target: <$0.10 per success
+ Quality Budget: 10% cost variance acceptable
+```
+
+**Quality Budget Policy:**
+
+Quality budgets track acceptable degradation in performance dimensions, enabling teams to experiment while maintaining reliability:
+
+- **Green Zone (>75% budget remaining):** Normal operations, feature development continues, experimentation encouraged
+- **Yellow Zone (50-75% budget remaining):** Reduce deployment velocity, focus on accuracy improvements, limit risky experiments
+- **Red Zone (<50% budget remaining):** Freeze new features, emergency accuracy work only, rollback if necessary
+
+**Quality Budget Consumption:**
+
+AI systems degrade gradually, not in binary failures. Quality budgets track acceptable accuracy degradation:
+
+- **Cognitive Accuracy Degradation:** Hallucination rate increases, factual correctness drops, groundedness decreases
+- **Safety Degradation:** Guardrail effectiveness drops, jailbreak success rate increases
+- **Autonomy Regression:** HITL rate increases, confidence calibration worsens
+- **Performance Degradation:** Latency increases, availability drops, cost per request increases
+
+**Tracking Quality Budget:**
+
+```pseudocode
+function consumeQualityBudget(dimension, degradation):
+ budget = getCurrentQualityBudget(dimension)
+
+ # Different dimensions have different weights
+ if dimension == "cognitive_accuracy":
+ consumed = degradation * 1.0 # Full weight (core reliability)
+ elif dimension == "safety":
+ consumed = degradation * 20.0 # 20x weight (zero tolerance)
+ elif dimension == "autonomy":
+ consumed = degradation * 0.5 # Half weight (progressive improvement)
+ elif dimension == "performance":
+ consumed = degradation * 0.3 # Low weight (operational concern)
+
+ budget.remaining -= consumed
+
+ if budget.remaining < 0.5:
+ triggerRedZoneProtocol(dimension)
+```
+
+---
+
+### 2. Team Structure and Shared Responsibility
+
+**Principle:** Product teams own their AI agents end-to-end (dev, deploy, operate). Central platform teams provide infrastructure and tooling.
+
+Traditional DevOps separates development from operations. AI Reliability Engineering requires **embedded ownership**-teams that build agents must also operate them. This creates accountability and faster feedback loops.
+
+**Shared Responsibility Model:**
+
+```mermaid
+graph TB
+ subgraph Product["Product Team (Owns End-to-End)"]
+ A[Product Manager] --> B[AI Engineers]
+ B --> C[AI Reliability Engineer]
+ C --> D[Deployment & Operations]
+ end
+
+ subgraph Platform["Central AI Platform Team"]
+ E[Evals Platform] --> B
+ F[Guardrails SDK] --> B
+ G[Monitoring & Observability] --> D
+ H[Cost Management] --> D
+ end
+
+ B --> E
+ B --> F
+ D --> G
+ D --> H
+
+ style Product fill:#e3f2fd
+ style Platform fill:#fff3e0
+```
+
+**Team Structure:**
+
+**1. Product Teams (Owners):**
+- **AI Engineers:** Build and maintain agents, prompts, tool integrations
+- **AI Reliability Engineers (AIREs):** Embedded reliability specialists (20% time allocation)
+- **Product Managers:** Define performance targets, prioritize reliability work
+- **Responsibilities:**
+ - End-to-end ownership of agent reliability
+ - Golden dataset curation and updates
+ - Production incident response
+ - Performance target compliance and quality budget management
+
+**2. Central AI Platform Team (Infrastructure):**
+- **Platform Engineers:** Build and maintain shared infrastructure
+- **Responsibilities:**
+ - Evals platform (CI/CD integration, golden dataset execution)
+ - Guardrails SDK (standardized security controls)
+ - Monitoring and observability (dashboards, alerts, performance indicator tracking)
+ - Cost optimization tooling (model routing, caching, rate limiting)
+
+**Embedded Reliability Engineers:**
+
+AI Reliability Engineers (AIREs) are embedded in product teams, not centralized. This ensures reliability work is prioritized alongside feature development.
+
+**20% Time Allocation Model:**
+
+- **10% Golden Dataset Maintenance:** Weekly updates from production failures, HITL escalations
+- **5% Eval Pipeline Improvements:** Reduce eval runtime, improve coverage, add new test cases
+- **5% Incident Response:** Postmortems, root cause analysis, reliability improvements
+
+**Reliability Review Meetings:**
+
+**Weekly Metric Reviews:**
+- Review performance indicator trends (cognitive accuracy, safety integrity, autonomy level, response performance, cost efficiency)
+- Quality budget consumption status
+- Identify degradation trends before performance target violations
+- Action items for reliability improvements
+
+**Monthly Postmortem Reviews:**
+- Deep dive into production incidents
+- Update golden datasets with failure cases
+- Refine performance targets based on learnings
+- Share patterns across teams
+
+**Example Meeting Structure:**
+
+```markdown
+Weekly Reliability Review (30 minutes):
+1. Performance Indicator Review (5 min)
+ - Cognitive Accuracy: 94.2% (target: 95%) ⚠️
+ - Safety Integrity: 99.95% (target: 99.9%) ✓
+ - Autonomy Level: 88% autonomous (target: 90%) ⚠️
+ - Response Performance: P95 4.2s (target: <5s) ✓
+ - Cost Efficiency: $0.11/success (target: <$0.10) ⚠️
+
+2. Quality Budget Status (5 min)
+ - Remaining: 65% (Yellow Zone)
+ - Cognitive accuracy degradation consuming budget faster than expected
+
+3. Action Items (20 min)
+ - [AIRE] Add 20 new quality test cases to golden dataset
+ - [AI Engineer] Investigate accuracy drop in recent deployment
+ - [PM] Review HITL escalation patterns for autonomy improvements
+```
+
+---
+
+### 3. AI Ops Mindset & Progressive Autonomy
+
+**Vision:** AI systems should progressively become more autonomous, requiring less human intervention over time.
+
+Human-in-the-Loop (HITL) is a safety net, not a permanent crutch. The goal is to reduce HITL rate over time through active learning, improved guardrails, and better confidence calibration.
+
+**Progressive Autonomy Maturity Model:**
+
+Five levels of agent autonomy, from fully human-driven to fully autonomous:
+
+| Level | Name | Human Role | Example | HITL Rate |
+|-------|------|------------|---------|-----------|
+| **L0** | Human-Driven | Human makes all decisions | Agent suggests actions, human approves each | 100% |
+| **L1** | Assisted | Human approves high-risk actions | Agent executes low-risk, escalates high-risk | 30-50% |
+| **L2** | Monitored | Human reviews periodically | Agent executes, human audits samples | 10-20% |
+| **L3** | Supervised | Human intervenes on anomalies | Agent executes, human alerted on drift/anomalies | 5-10% |
+| **L4** | Autonomous | Human defines policies only | Agent executes fully autonomously within guardrails | <5% |
+
+**Maturity Progression:**
+
+```mermaid
+graph LR
+ A[L0: Human-Driven
100% HITL] --> B[L1: Assisted
30-50% HITL]
+ B --> C[L2: Monitored
10-20% HITL]
+ C --> D[L3: Supervised
5-10% HITL]
+ D --> E[L4: Autonomous
<5% HITL]
+
+ style A fill:#ffebee
+ style B fill:#fff3e0
+ style C fill:#fff9c4
+ style D fill:#e8f5e9
+ style E fill:#e3f2fd
+```
+
+**Level 0: Human-Driven (100% HITL)**
+
+**Characteristics:**
+- Agent generates suggestions, human approves every action
+- No autonomous execution
+- High safety, low efficiency
+
+**Use Cases:**
+- High-stakes domains (medical diagnosis, legal advice)
+- Early-stage agents (first 30 days in production)
+- Regulatory compliance requirements
+
+**Example:**
+```pseudocode
+function processRequest(userRequest):
+ suggestion = agent.generateAction(userRequest)
+ humanApproval = await humanReview(suggestion)
+
+ if humanApproval.approved:
+ return executeAction(suggestion)
+ else:
+ return humanApproval.feedback
+```
+
+**Level 1: Assisted (30-50% HITL)**
+
+**Characteristics:**
+- Agent executes low-risk actions autonomously
+- Human approval required for high-risk actions
+- Risk classification based on action type, confidence score, resource impact
+
+**Risk Classification:**
+- **Low-Risk:** Read-only operations, low-cost actions, high-confidence outputs
+- **High-Risk:** Write operations, high-cost actions, low-confidence outputs, sensitive data access
+
+**Example:**
+```pseudocode
+function processRequest(userRequest):
+ action = agent.generateAction(userRequest)
+ riskLevel = classifyRisk(action, agent.confidence)
+
+ if riskLevel == "low":
+ return executeAction(action) # Autonomous
+ else:
+ humanApproval = await humanReview(action)
+ return executeAction(action) if humanApproval.approved else reject()
+```
+
+**Level 2: Monitored (10-20% HITL)**
+
+**Characteristics:**
+- Agent executes autonomously
+- Human reviews random samples (10-20% of requests)
+- Post-execution audit, not pre-execution approval
+
+**Sampling Strategy:**
+- Random sampling: 10% of all requests
+- Stratified sampling: Higher rate for high-risk actions
+- Anomaly sampling: 100% review for drift alerts, low confidence
+
+**Example:**
+```pseudocode
+function processRequest(userRequest):
+ action = agent.generateAction(userRequest)
+ result = executeAction(action)
+
+ # Post-execution sampling
+ if shouldSample(userRequest, result):
+ humanReview = await humanAudit(userRequest, action, result)
+ if humanReview.flagged:
+ triggerCorrection(result, humanReview.feedback)
+
+ return result
+```
+
+**Level 3: Supervised (5-10% HITL)**
+
+**Characteristics:**
+- Agent executes fully autonomously
+- Human intervention only on anomalies (drift, low confidence, guardrail triggers)
+- Proactive monitoring, reactive human involvement
+
+**Anomaly Detection:**
+- Input drift: Distribution shift in user queries
+- Output drift: Confidence score degradation
+- Model drift: Performance degradation on golden dataset
+- Guardrail triggers: Safety violations, rate limit breaches
+
+**Example:**
+```pseudocode
+function processRequest(userRequest):
+ # Anomaly detection
+ if detectDrift(userRequest) or detectLowConfidence() or guardrailTriggered():
+ humanIntervention = await humanReview(userRequest)
+ return processWithHumanGuidance(userRequest, humanIntervention)
+
+ # Normal autonomous execution
+ action = agent.generateAction(userRequest)
+ return executeAction(action)
+```
+
+**Level 4: Autonomous (<5% HITL)**
+
+**Characteristics:**
+- Agent executes fully autonomously within guardrails
+- Human defines policies, not individual decisions
+- HITL only for policy exceptions and edge cases
+
+**Policy-Based Control:**
+- Guardrails enforce deterministic constraints
+- Confidence thresholds define autonomous boundaries
+- Cost limits prevent runaway spending
+- Audit logs enable retrospective review
+
+**Example:**
+```pseudocode
+function processRequest(userRequest):
+ # Policy checks (deterministic)
+ if violatesGuardrails(userRequest):
+ return rejectWithReason("Guardrail violation")
+
+ if exceedsCostLimit(userRequest):
+ return escalateToHuman("Cost limit exceeded")
+
+ # Autonomous execution
+ action = agent.generateAction(userRequest)
+ return executeAction(action)
+```
+
+**Progression Strategy:**
+
+**Phase 1: Start at L0 (Human-Driven)**
+- Build trust through human oversight
+- Collect failure patterns for golden dataset
+- Establish baseline metrics
+
+**Phase 2: Move to L1 (Assisted)**
+- Classify actions by risk level
+- Enable autonomous execution for low-risk actions
+- Monitor HITL rate and error rates
+
+**Phase 3: Advance to L2 (Monitored)**
+- Implement sampling-based review
+- Reduce HITL rate to 10-20%
+- Focus on high-risk action classification
+
+**Phase 4: Reach L3 (Supervised)**
+- Deploy anomaly detection
+- Reduce HITL to 5-10%
+- Improve confidence calibration
+
+**Phase 5: Achieve L4 (Autonomous)**
+- Policy-based control replaces case-by-case review
+- HITL rate <5%
+- Continuous improvement through feedback loops
+
+**Key Metrics for Progression:**
+
+| Metric | L0→L1 | L1→L2 | L2→L3 | L3→L4 |
+|--------|-------|-------|-------|-------|
+| **HITL Rate** | 100% → 40% | 40% → 15% | 15% → 7% | 7% → 3% |
+| **Error Rate** | Baseline | <2% increase | <1% increase | <0.5% increase |
+| **Confidence Calibration** | N/A | ±15% | ±10% | ±5% |
+| **Time in Level** | 30 days | 60 days | 90 days | Continuous |
+
+---
+
+## Metrics & Observability
+
+Track these metrics to measure operational excellence:
+
+| Metric | Target | Measurement |
+|--------|--------|-------------|
+| **Performance Target Compliance** | >95% | % of performance targets met per month |
+| **Quality Budget Remaining** | >50% | % of quality budget remaining at month end |
+| **HITL Rate** | <10% | % queries requiring human escalation |
+| **Autonomy Level** | L3+ | Current maturity level (L0-L4) |
+| **Reliability Review Attendance** | >90% | % team members attending weekly reviews |
+| **Golden Dataset Update Frequency** | Weekly | Days between dataset updates |
+| **Postmortem Completion Rate** | 100% | % incidents with completed postmortems |
+| **Time to Autonomy** | <6 months | Time from L0 to L3 |
+
+**Observability Requirements:**
+
+- **Performance Dashboards:** Real-time performance indicator tracking, quality budget consumption
+- **HITL Analytics:** Escalation patterns, root causes, reduction trends
+- **Autonomy Tracking:** Current level, progression velocity, regression alerts
+- **Team Metrics:** Reliability review attendance, postmortem completion, golden dataset health
+
+---
+
+## Common Pitfalls
+
+1. **No Quality Budgets**
+ - *Problem:* Treating all reliability work as equally urgent, unable to balance innovation with accuracy
+ - *Fix:* Define performance targets and quality budgets for each dimension. Use budgets to prioritize work and enable experimentation.
+
+2. **Centralized Reliability Team**
+ - *Problem:* Reliability becomes "someone else's problem," product teams don't own outcomes
+ - *Fix:* Embed AIREs in product teams. Central platform team provides infrastructure only.
+
+3. **Static HITL Rate**
+ - *Problem:* HITL rate stays at 100% indefinitely, no progression toward autonomy
+ - *Fix:* Implement Progressive Autonomy Maturity Model. Set targets for HITL reduction.
+
+4. **Missing Reliability Reviews**
+ - *Problem:* No regular cadence for reviewing metrics, incidents go unaddressed
+ - *Fix:* Weekly metric reviews, monthly postmortems. Make attendance mandatory.
+
+5. **Performance Targets Without Action**
+ - *Problem:* Tracking metrics but not using them to drive decisions
+ - *Fix:* Link performance target violations to deployment freezes. Use quality budgets to gate feature velocity and enable controlled experimentation.
+
+---
+
+## Further Reading
+
+- [AIRE Principles →](../principles.md)
+
diff --git a/docs/pillars/quality-lifecycle.md b/docs/pillars/quality-lifecycle.md
new file mode 100644
index 0000000..f7a323c
--- /dev/null
+++ b/docs/pillars/quality-lifecycle.md
@@ -0,0 +1,240 @@
+# Pillar 3: Quality & Lifecycle
+
+## Philosophy
+
+> *"Reliability is a Feature"* - Reliability competes with velocity for engineering resources. Treat it as a first-class requirement, not an afterthought.
+
+Quality in AI systems cannot rely on "vibes" or spot checks. Unlike traditional software where correctness is binary, AI correctness is subjective and probabilistic. Quality & Lifecycle practices move development from intuition to rigorous, measurable engineering.
+
+**The goal:** Measurable, reproducible, and improvable systems through automated testing and feedback loops.
+
+---
+
+## Core Concepts
+
+### 1. Evals-Driven Deployments
+
+**Principle:** Never deploy without passing a regression suite. Vibes are not a deployment strategy.
+
+CI/CD gates for AI systems must measure output correctness, not just code correctness.
+
+**Deployment Pipeline:**
+
+```mermaid
+graph TD
+ A[Code Changes] --> B[Unit Tests]
+ B -->|Pass| C[Offline Evals]
+ C -->|Pass| D[Staging]
+ D --> E[Online Evals]
+ E -->|Pass| F[Canary 5%]
+ F --> G[Monitor 24h]
+ G -->|OK| H[Gradual 50%]
+ H --> I[Monitor 48h]
+ I -->|OK| J[Full 100%]
+
+ C -->|Fail| K[Block]
+ E -->|Fail| K
+ G -->|Degrade| L[Rollback]
+ I -->|Degrade| L
+
+ style K fill:#f44336
+ style L fill:#ff9800
+```
+
+**Eval-Driven Deployment Checklist:**
+
+| Stage | Action | Pass Criteria | Rollback Trigger |
+|-------|--------|---------------|------------------|
+| **Offline Evals** | Run on golden dataset | Accuracy >95% | <95% accuracy |
+| **Staging** | Deploy to internal users | No crashes | Critical errors |
+| **Canary (5%)** | Monitor P95 latency, error rate | Within 10% of baseline | >10% degradation |
+| **Gradual (50%)** | Monitor hallucination rate | <0.1% | >0.15% |
+| **Full (100%)** | Monitor user satisfaction | >80% | <75% |
+
+**Rollback Triggers:** Automatic rollback if any metric degrades beyond threshold.
+
+---
+
+### 2. Golden Datasets
+
+**Principle:** Your eval suite is only as good as your test data.
+
+Golden datasets are curated regression suites of inputs with labeled expected outputs. They're the foundation of offline evals.
+
+**Composition:**
+
+- **60% Core Capabilities:** Common queries representing primary use cases
+- **30% Edge Cases:** Rare but important scenarios (e.g., ambiguous inputs, multi-step reasoning)
+- **10% Adversarial:** Jailbreak attempts, prompt injection, nonsense inputs
+
+**Maintenance Triggers:**
+
+- Production failures (add failed examples weekly)
+- HITL escalations (add human-corrected cases)
+- Quarterly review (remove stale examples, add new patterns)
+
+**Size Guidelines:**
+
+| System Complexity | Minimum Dataset Size | Recommended |
+|-------------------|---------------------|-------------|
+| Simple classifier | 50 examples | 100-200 |
+| Multi-turn agent | 100 examples | 200-500 |
+| Complex workflow | 200 examples | 500-1000 |
+
+**Version Control:** Store in Git with semantic versioning (v1.2.3). Track changes in CHANGELOG.
+
+**Example Structure:**
+
+```pseudocode
+goldenDataset = [
+ {
+ "id": "refund-001",
+ "input": "I want to refund order #12345",
+ "expected_action": "initiate_refund",
+ "expected_confidence": ">0.8",
+ "tags": ["refund", "core"],
+ "added_date": "2024-01-15"
+ },
+ # ... more examples
+]
+```
+
+---
+
+### 3. Unit Testing Agents
+
+**Principle:** Test components in isolation before testing end-to-end workflows.
+
+**Three Types of Unit Tests:**
+
+| Test Type | What It Tests | Example |
+|-----------|--------------|---------|
+| **Tool Calling** | Agent selects correct tool | Query "What's the weather?" → calls `getWeather()` |
+| **Prompt Adherence** | Agent follows instructions | "Respond in JSON" → output is valid JSON |
+| **Synthetic Data** | Agent handles edge cases | Empty input, special characters, long text |
+
+**Example: Tool Calling Test**
+
+```pseudocode
+function testToolSelection():
+ testCases = [
+ {"input": "What's the weather?", "expected_tool": "getWeather"},
+ {"input": "Send email to john@example.com", "expected_tool": "sendEmail"},
+ {"input": "Book a flight to NYC", "expected_tool": "bookFlight"}
+ ]
+
+ for testCase in testCases:
+ output = agent.process(testCase.input)
+ assert output.toolCalled == testCase.expected_tool
+```
+
+---
+
+### 4. Online vs Offline Evals
+
+**Principle:** Offline evals catch regressions. Online evals catch unknowns.
+
+| Aspect | Offline Evals | Online Evals |
+|--------|---------------|--------------|
+| **When** | Pre-deployment (CI/CD) | Post-deployment (production) |
+| **Data** | Golden dataset (known examples) | Live traffic (real users) |
+| **Cost** | Cheap (run on fixed dataset) | Expensive (run on all traffic) |
+| **Purpose** | Catch regressions | Detect drift and unknowns |
+| **Feedback Loop** | Immediate (blocks deployment) | Delayed (triggers alerts) |
+
+**Offline Eval Strategy:**
+
+- Run on every pull request
+- Block merge if accuracy drops >5%
+- Fast feedback (<5 minutes)
+
+**Online Eval Strategy:**
+
+- Sample 10% of production traffic
+- Run async (don't block user responses)
+- Alert if hallucination rate >0.1%
+- Feed failures back to golden dataset
+
+---
+
+### 5. Feedback Loops for Continuous Improvement
+
+**Principle:** Production failures should automatically improve your system.
+
+```mermaid
+graph LR
+ A[Production Traffic] --> B[Collect Failures]
+ B --> C[HITL Review]
+ C --> D[Update Golden Dataset]
+ D --> E[Retrain/Fine-Tune]
+ E --> F[Deploy New Version]
+ F --> A
+```
+
+**Feedback Loop Velocity:**
+
+| Activity | Frequency | Owner |
+|----------|-----------|-------|
+| **Failure Collection** | Real-time | Automated monitoring |
+| **HITL Review** | Daily | Human reviewers |
+| **Golden Dataset Updates** | Weekly | ML engineers |
+| **Model Retraining** | Monthly | ML engineers |
+
+**Key Metrics:**
+
+- **Feedback Loop Latency:** Time from production failure to golden dataset inclusion (target: <7 days)
+- **Dataset Growth Rate:** New examples added per month (target: 10-20%)
+- **Improvement Rate:** Accuracy gain per retrain cycle (target: 1-3%)
+
+---
+
+## Metrics & Observability
+
+Track these metrics to measure quality and lifecycle maturity:
+
+| Metric | Target | Measurement |
+|--------|--------|-------------|
+| **Golden Dataset Accuracy** | >95% | % correct predictions on golden dataset |
+| **Deployment Success Rate** | >90% | % deployments that don't rollback |
+| **User Satisfaction** | >80% | NPS, thumbs up/down, explicit feedback |
+| **Feedback Loop Latency** | <7 days | Time from failure to dataset inclusion |
+| **Eval Runtime** | <5 minutes | P95 time for offline evals in CI/CD |
+| **Cost per Eval** | <$1 | Average cost to run golden dataset eval |
+
+**Observability Requirements:**
+
+- **Chain of Thought (CoT) Logging:** Capture agent reasoning, not just inputs/outputs
+- **Cost Tracking:** Monitor per-workflow and per-tenant costs
+- **Eval Dashboards:** Real-time view of offline/online eval results
+
+---
+
+## Common Pitfalls
+
+1. **No Golden Dataset**
+ - *Problem:* Deploying changes without regression testing
+ - *Fix:* Build golden dataset with 100+ examples before first deployment
+
+2. **Static Golden Dataset**
+ - *Problem:* Dataset becomes stale; doesn't reflect production queries
+ - *Fix:* Weekly updates from production failures and HITL escalations
+
+3. **Insufficient Coverage**
+ - *Problem:* Golden dataset only tests happy paths, not edge cases
+ - *Fix:* 60% core, 30% edge, 10% adversarial distribution
+
+4. **Ignoring Online Metrics**
+ - *Problem:* Offline evals pass but production performance degrades
+ - *Fix:* Set up online eval sampling and drift alerts
+
+5. **Slow Feedback Loops**
+ - *Problem:* Months between production failure and model improvement
+ - *Fix:* Automate failure collection, weekly dataset updates
+
+6. **No Eval-Driven Gates**
+ - *Problem:* Deploying to production without passing evals
+ - *Fix:* Block deployments if offline evals fail
+
+---
+
+*This pillar is part of the [AI Reliability Engineering (AIRE) Standards](../index.md). Licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).*
diff --git a/docs/pillars/resilient-architecture.md b/docs/pillars/resilient-architecture.md
new file mode 100644
index 0000000..593d3ee
--- /dev/null
+++ b/docs/pillars/resilient-architecture.md
@@ -0,0 +1,238 @@
+# Pillar 1: Resilient Architecture
+
+## Philosophy
+
+> *"Fail Gracefully, Fail Informatively"* - Every failure should preserve context, enable recovery, and generate learnings.
+
+AI agents introduce non-deterministic failures, long-running workflows, and stateful reasoning chains. Resilience means designing systems that expect failure, recover automatically, and maintain integrity when components degrade.
+
+**The goal:** Contain failures, maintain context, and enable recovery without human intervention.
+
+---
+
+## Core Concepts
+
+### 1. The Reliability Stack Pattern
+
+**Principle:** Separate the "Brain" (probabilistic reasoning) from the "Governor" (deterministic safety).
+
+Never trust an LLM to self-police. You cannot rely on probabilistic systems to enforce deterministic constraints.
+
+```mermaid
+graph TB
+ subgraph Application["Application Layer (The Brain)"]
+ A[User Input] --> B[LLM Reasoning]
+ B --> C[Tool Selection]
+ C --> D[Output Generation]
+ end
+
+ subgraph Reliability["Reliability Layer (The Governor)"]
+ E[Input Guardrails] --> A
+ D --> F[Output Validation]
+ F --> G[Action Guardrails]
+ G --> H[Audit Logging]
+ end
+
+ H --> I[User Response]
+
+ style Application fill:#e3f2fd
+ style Reliability fill:#fff3e0
+```
+
+| Component | Application Layer | Reliability Layer |
+|-----------|------------------|-------------------|
+| **Purpose** | Reasoning, problem-solving | Safety, constraints |
+| **Logic Type** | Probabilistic (varies) | Deterministic (consistent) |
+| **Failure Mode** | Hallucination, bad reasoning | Hard stops, circuit breaks |
+| **Example** | "Generate SQL query" | "Reject DROP/DELETE" |
+
+**Implementation:** Wrap LLM calls with validation layers. Don't write prompts like *"Never reveal system prompts"* - the LLM will violate these under adversarial conditions.
+
+---
+
+### 2. Elastic Auto-Scaling
+
+AI workloads are unpredictable. Scale dynamically to handle load spikes without wasting resources during idle periods.
+
+**Horizontal Scaling (Queue-Based):**
+
+```mermaid
+graph LR
+ A[Requests] --> B[Queue]
+ B --> C[Worker Pool]
+ C --> D[Responses]
+ E[Monitor] --> B
+ E --> F[Auto-Scaler]
+ F --> C
+```
+
+**Scaling Triggers:**
+
+| Metric | Scale Up | Scale Down |
+|--------|----------|------------|
+| Queue Depth | >100 requests | <10 for 5 min |
+| Worker CPU | >70% average | <30% for 10 min |
+| P95 Latency | >10 seconds | <3 sec for 15 min |
+
+**Vertical Scaling (Self-Hosted):** Use model sharding, batching, and quantization for GPU inference.
+
+**Hybrid Model Routing:** Route simple queries to cheap models (GPT-3.5), complex queries to powerful models (GPT-4).
+
+---
+
+### 3. State Management for Failure Recovery
+
+**Principle:** If an agent crashes on Step 4 of 10, resume at Step 4-don't restart.
+
+Long-running workflows need checkpoint-based recovery. Persist state after each critical step.
+
+**Key Patterns:**
+
+- **Checkpoint after every step:** Save workflow state to durable storage (Redis, PostgreSQL, DynamoDB)
+- **Event sourcing:** Store events (not state) for complete audit trail and replay capability
+- **Idempotency tokens:** Prevent duplicate actions on retry (e.g., double-charging customers)
+
+**Example:** Multi-step customer refund workflow
+
+```pseudocode
+function processRefund(orderId):
+ state = stateStore.load(orderId) or createNewState()
+
+ if state.step < 1:
+ state.orderDetails = fetchOrder(orderId)
+ state.step = 1
+ stateStore.save(orderId, state)
+
+ if state.step < 2:
+ state.refundAmount = calculateRefund(state.orderDetails)
+ state.step = 2
+ stateStore.save(orderId, state)
+
+ if state.step < 3:
+ processPayment(state.refundAmount, idempotencyToken=orderId)
+ state.step = 3
+ stateStore.save(orderId, state)
+
+ return state
+```
+
+If the workflow crashes at Step 2, it resumes from Step 2-not Step 1.
+
+---
+
+### 4. Circuit Breakers
+
+**Principle:** Fail fast when services degrade. Don't let one slow dependency cascade failures across your system.
+
+Circuit breakers monitor service health and block requests to degraded services until they recover.
+
+**Three States:**
+
+| State | Behavior | When to Transition |
+|-------|----------|-------------------|
+| **Closed** | Normal operation, requests pass through | N/A |
+| **Open** | Fail fast, reject all requests immediately | After N consecutive failures |
+| **Half-Open** | Allow limited test requests | After timeout period |
+
+**Example Configuration:**
+
+- Open after 5 consecutive failures
+- Stay open for 60 seconds
+- Allow 3 test requests in half-open state
+- Close if 2/3 test requests succeed
+
+**Benefits:**
+
+- Prevent cascading failures
+- Give degraded services time to recover
+- Improve latency (fail fast vs. timeout)
+- Surface infrastructure issues quickly
+
+---
+
+### 5. Fallback Paths
+
+**Principle:** When primary systems fail, degrade gracefully through tiered fallbacks.
+
+Never have single points of failure. Define explicit fallback strategies for every critical component.
+
+**Fallback Hierarchy:**
+
+```mermaid
+graph TD
+ A[User Query] --> B{GPT-4 Available?}
+ B -->|Yes| C[GPT-4 Response]
+ B -->|No| D{GPT-3.5 Available?}
+ D -->|Yes| E[GPT-3.5 Response]
+ D -->|No| F{Rule Engine Available?}
+ F -->|Yes| G[Rule-Based Response]
+ F -->|No| H[Human Escalation]
+```
+
+**Fallback Strategies by Component:**
+
+| Component | Primary | Fallback 1 | Fallback 2 | Fallback 3 |
+|-----------|---------|------------|------------|------------|
+| **LLM API** | GPT-4 | GPT-3.5 | Claude | Human |
+| **Vector DB** | Pinecone | Weaviate | PostgreSQL pgvector | Cached results |
+| **Tool Execution** | Live API | Cached data | Stale data (with warning) | Skip tool |
+
+**Implementation Considerations:**
+
+- **Quality degradation:** Set confidence thresholds (e.g., GPT-3.5 responses marked "lower confidence")
+- **Cost optimization:** Fallbacks can reduce costs during peak load
+- **Testing:** Regularly test fallback paths (chaos engineering)
+
+---
+
+## Metrics & Observability
+
+Track these metrics to measure resilience:
+
+| Metric | Target | Measurement |
+|--------|--------|-------------|
+| **Resumability Rate** | >99% | % workflows that resume successfully after failure |
+| **Circuit Breaker Activations** | <10/day | Count of circuit opens per service per day |
+| **Fallback Usage Rate** | <15% | % requests served by fallback systems |
+| **MTTR** | <5 minutes | Mean time to recovery after failure detection |
+| **State Persistence Overhead** | <50ms | P95 latency added by checkpointing |
+| **Auto-Scaling Response Time** | <2 minutes | Time from load spike to new workers ready |
+
+**Observability Requirements:**
+
+- **State persistence logs:** Track checkpoint writes, failures, and recovery events
+- **Circuit breaker dashboards:** Real-time status of all circuit breakers
+- **Fallback tracking:** Alert when fallback usage exceeds thresholds
+- **Cost tracking:** Monitor cost impact of auto-scaling and fallbacks
+
+---
+
+## Common Pitfalls
+
+1. **Stateless Agents**
+ - *Problem:* Workflows restart from scratch after crashes, wasting time/money
+ - *Fix:* Implement checkpoint-based state management
+
+2. **Tight Coupling**
+ - *Problem:* One service failure cascades to entire system
+ - *Fix:* Use circuit breakers and fallback paths
+
+3. **Over-Reliance on LLM Reasoning**
+ - *Problem:* Trusting LLM to enforce constraints via prompts
+ - *Fix:* Implement The Reliability Stack (separate brain from governor)
+
+4. **No Fallbacks**
+ - *Problem:* Single point of failure (e.g., only one LLM provider)
+ - *Fix:* Define multi-tier fallback strategies
+
+5. **Manual Scaling**
+ - *Problem:* Engineers woken up at 3am to scale infrastructure
+ - *Fix:* Implement queue-based auto-scaling
+
+6. **Brittle Recovery**
+ - *Problem:* Crashes require manual intervention to resume workflows
+ - *Fix:* Use event sourcing and idempotency tokens
+
+---
+
+*This pillar is part of the [AI Reliability Engineering (AIRE) Standards](../index.md). Licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).*
diff --git a/docs/pillars/security.md b/docs/pillars/security.md
new file mode 100644
index 0000000..9fe7ac9
--- /dev/null
+++ b/docs/pillars/security.md
@@ -0,0 +1,316 @@
+# Pillar 4: Security
+
+## Philosophy
+
+> *"Embrace Non-Determinism"* - Design systems that succeed despite variance. Agents can be manipulated through adversarial inputs.
+
+Security for AI agents differs from traditional application security. Agents are autonomous decision-makers with dynamic reasoning, making them powerful but unpredictable. A single prompt can trick an agent into unauthorized actions. Security means **constraining autonomy without breaking functionality** through defense in depth.
+
+**The goal:** Multiple security layers that protect even when LLM reasoning fails.
+
+---
+
+## Core Concepts
+
+### 1. Just-in-Time (JIT) Privilege Access
+
+**Principle:** Grant minimum necessary privileges, scoped to specific actions, with automatic expiration.
+
+Traditional static permissions don't work for agents. Agents need **dynamic, context-aware permissions** that adapt to the task.
+
+**Capability-Based Access Control:**
+
+```mermaid
+sequenceDiagram
+ participant User
+ participant Agent
+ participant Auth
+ participant API
+
+ User->>Agent: "Refund order #12345"
+ Agent->>Auth: Request capability: refundOrder(12345)
+ Auth->>Auth: Check role, ownership, policy
+ Auth-->>Agent: Grant scoped token (5 min expiry)
+ Agent->>API: Refund with scoped token
+ API-->>Agent: Success
+ Agent-->>User: "Refund processed"
+```
+
+**Implementation Pattern:**
+
+```pseudocode
+function executeAction(userRequest, action):
+ # Step 1: Agent determines required action
+ actionPlan = llm.parse(userRequest)
+
+ # Step 2: Request JIT capability
+ capability = authService.requestCapability(
+ user=userRequest.userId,
+ action=actionPlan.action,
+ resourceId=actionPlan.resourceId,
+ expiresIn=5_minutes
+ )
+
+ if not capability.granted:
+ return "Unauthorized: " + capability.reason
+
+ # Step 3: Execute with scoped token
+ result = protectedAPI.call(
+ action=actionPlan.action,
+ token=capability.scopedToken
+ )
+
+ return result
+```
+
+**Key Properties:**
+
+- **Scoped:** Token valid only for specific action + resource (e.g., `refundOrder:12345`)
+- **Short-lived:** Expires in <5 minutes
+- **One-time use:** Token invalidated after action completes
+- **Auditable:** All token grants logged with user, action, timestamp
+
+**Step-Up Authentication:** For high-risk actions (large refunds, account deletion), require additional verification (2FA, email confirmation).
+
+---
+
+### 2. Audit Logs for Internal Thinking
+
+**Principle:** Log agent reasoning, not just inputs/outputs. Capture the "why" behind decisions.
+
+Traditional logs capture API calls. Agent logs must capture **Chain of Thought (CoT)** reasoning for incident investigation.
+
+**What to Log:**
+
+| Event Type | What to Capture | Retention |
+|------------|----------------|-----------|
+| **User Interactions** | User input, agent output, session ID | 90 days |
+| **CoT Reasoning** | LLM reasoning steps, confidence scores | 30 days |
+| **Tool Calls** | Tool name, parameters, result, latency | 90 days |
+| **Privileged Actions** | Action type, user ID, resource ID, authorization decision | 1 year |
+| **Security Events** | Prompt injection attempts, jailbreak attempts, guardrail blocks | 1 year |
+
+**Structured Logging Format:**
+
+```pseudocode
+auditLog = {
+ "timestamp": "2024-01-15T10:30:00Z",
+ "sessionId": "sess_abc123",
+ "userId": "user_456",
+ "event": "privileged_action",
+ "action": "refundOrder",
+ "resourceId": "order_12345",
+ "reasoning": "Customer requested refund within 30-day window",
+ "confidence": 0.92,
+ "authorized": true,
+ "toolCalls": ["getOrderDetails", "processRefund"],
+ "latency_ms": 1250
+}
+```
+
+**Use Cases:**
+
+- **Incident investigation:** "Why did the agent refund this order?"
+- **Security audits:** "Did any agents attempt unauthorized actions?"
+- **Debugging:** "Why did the agent choose the wrong tool?"
+
+---
+
+### 3. Guardrails (Three-Layer Defense)
+
+**Principle:** Deterministic hard stops at input, output, and action layers.
+
+Guardrails are **non-negotiable constraints** that override LLM reasoning. Never rely on prompts to enforce security.
+
+**Layered Defense Architecture:**
+
+```mermaid
+graph TB
+ A[User Input] --> B[Input Guardrails]
+ B -->|Pass| C[LLM Reasoning]
+ B -->|Block| H[Reject Request]
+ C --> D[Output Guardrails]
+ D -->|Pass| E[Action Guardrails]
+ D -->|Block| H
+ E -->|Pass| F[Execute Action]
+ E -->|Block| H
+ F --> G[Audit Log]
+```
+
+**Three Layers:**
+
+| Layer | Purpose | Examples |
+|-------|---------|----------|
+| **Input Guardrails** | Block malicious inputs before LLM | Prompt injection detection, PII redaction, profanity filter |
+| **Output Guardrails** | Validate LLM outputs | Sensitive data leakage prevention, factuality check, schema validation |
+| **Action Guardrails** | Constrain agent actions | Rate limits, monetary limits, forbidden operations |
+
+**Implementation Pattern:**
+
+```pseudocode
+function processWithGuardrails(userInput):
+ # Layer 1: Input Guardrails
+ if promptInjectionDetector.detect(userInput):
+ auditLog.record("prompt_injection_blocked", userInput)
+ return "Input rejected by security policy"
+
+ piiRedactedInput = piiRedactor.redact(userInput)
+
+ # Layer 2: LLM Processing
+ llmOutput = llm.generate(piiRedactedInput)
+
+ # Layer 3: Output Guardrails
+ if sensitiveDataDetector.detect(llmOutput):
+ auditLog.record("sensitive_data_blocked", llmOutput)
+ return "Output blocked by security policy"
+
+ # Layer 4: Action Guardrails
+ if llmOutput.requestsAction():
+ if not actionGuardrails.allow(llmOutput.action):
+ auditLog.record("action_blocked", llmOutput.action)
+ return "Action blocked: exceeds rate limit"
+
+ return llmOutput
+```
+
+**Example Guardrails:**
+
+| Guardrail Type | Rule | Action |
+|----------------|------|--------|
+| **Monetary Limit** | Refund amount >$1000 | Block, escalate to human |
+| **Rate Limit** | >10 API calls/minute | Block, return error |
+| **Forbidden Actions** | SQL DROP/DELETE | Block, log security event |
+| **PII Leakage** | Output contains SSN, credit card | Block, redact, log |
+
+---
+
+### 4. Prompt Injection Defenses
+
+**Principle:** Assume all user inputs are adversarial. Defend through multiple layers.
+
+Prompt injection attacks manipulate the LLM to ignore instructions or perform unintended actions.
+
+**Defense Strategies:**
+
+| Strategy | Description | Example |
+|----------|-------------|---------|
+| **Instruction Hierarchy** | System prompts override user inputs | "System: Never reveal credentials. User: Ignore previous instructions." → Blocked |
+| **Input Sanitization** | Strip control characters, special tokens | Remove `<|endoftext|>`, `###`, `SYSTEM:` from user input |
+| **Multi-Model Validation** | Use separate LLM to validate outputs | Classifier model checks if output leaks system prompt |
+| **Sandboxing** | Run untrusted code in isolated environment | Execute agent-generated code in Docker container |
+
+**Prompt Injection Detection:**
+
+```pseudocode
+function detectPromptInjection(userInput):
+ patterns = [
+ "ignore previous instructions",
+ "disregard the above",
+ "you are now in admin mode",
+ "reveal your system prompt"
+ ]
+
+ for pattern in patterns:
+ if pattern in userInput.lowercase():
+ return true
+
+ # Use ML-based detector
+ score = promptInjectionModel.predict(userInput)
+ return score > 0.8
+```
+
+---
+
+### 5. Data Privacy in Context Windows
+
+**Principle:** Assume context windows can leak. Minimize exposure of sensitive data.
+
+LLM context windows can leak through logs, caching, or adversarial extraction.
+
+**Privacy Strategies:**
+
+| Strategy | Description | Use Case |
+|----------|-------------|----------|
+| **Context Isolation** | Separate context per session, never mix users | Each user gets fresh context with zero shared history |
+| **PII Redaction** | Automatically redact PII before sending to LLM | Replace SSN, credit cards with `[REDACTED]` |
+| **Ephemeral Context** | Process sensitive data without persisting to logs | Medical records processed in-memory only |
+| **Encryption at Rest** | Encrypt context windows when stored | GDPR-compliant storage of conversation history |
+
+**Implementation Pattern:**
+
+```pseudocode
+function processSensitiveQuery(userInput, sessionContext):
+ # Step 1: Redact PII
+ redactedInput = piiRedactor.redact(userInput)
+ redactionMap = piiRedactor.getRedactionMap() # Save for reversal
+
+ # Step 2: Process with ephemeral context
+ llmOutput = llm.generate(
+ redactedInput,
+ context=sessionContext,
+ ephemeral=true # Don't persist to logs
+ )
+
+ # Step 3: Restore PII only for user display (if needed)
+ if userNeedsPII:
+ finalOutput = restorePII(llmOutput, redactionMap)
+ else:
+ finalOutput = llmOutput
+
+ return finalOutput
+```
+
+**Compliance:** GDPR, HIPAA, CCPA require data minimization and encryption.
+
+---
+
+## Metrics & Observability
+
+Track these metrics to measure security posture:
+
+| Metric | Target | Measurement |
+|--------|--------|-------------|
+| **Prompt Injection Attempts** | <10/day | Count of blocked prompt injection attacks |
+| **Jailbreak Success Rate** | <0.1% | % adversarial inputs that bypass guardrails |
+| **PII Leakage Incidents** | 0 | Count of PII exposed in logs or outputs |
+| **Privileged Action Approval Rate** | >95% | % legitimate actions granted JIT access |
+| **MTTD (Mean Time to Detect)** | <5 minutes | Time from security event to alert |
+| **MTTR (Mean Time to Respond)** | <30 minutes | Time from alert to mitigation |
+
+**Security Monitoring:**
+
+- **Real-time alerts:** Prompt injection attempts, jailbreak successes, PII leakage
+- **Anomaly detection:** Unusual action patterns, privilege escalation attempts
+- **Regular audits:** Review audit logs weekly for suspicious activity
+
+---
+
+## Common Pitfalls
+
+1. **Overly Permissive Agents**
+ - *Problem:* Agent has access to all APIs/databases without scoping
+ - *Fix:* Implement JIT privilege access with scoped tokens
+
+2. **No Input Validation**
+ - *Problem:* Accepting all user inputs without sanitization
+ - *Fix:* Deploy input guardrails (prompt injection detection, PII redaction)
+
+3. **Insufficient Logging**
+ - *Problem:* Only logging inputs/outputs, not reasoning
+ - *Fix:* Capture Chain of Thought reasoning for incident investigation
+
+4. **Guardrails as Afterthought**
+ - *Problem:* Relying on prompts to enforce security policies
+ - *Fix:* Implement deterministic guardrails at input, output, action layers
+
+5. **Ignoring Adversarial Inputs**
+ - *Problem:* Not testing against prompt injection and jailbreak attempts
+ - *Fix:* Regular red team exercises with adversarial testing
+
+6. **PII in Logs**
+ - *Problem:* Logging full user inputs with SSNs, credit cards
+ - *Fix:* Automatic PII redaction before logging
+
+---
+
+*This pillar is part of the [AI Reliability Engineering (AIRE) Standards](../index.md). Licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).*
diff --git a/docs/principles.md b/docs/principles.md
new file mode 100644
index 0000000..f3cc80b
--- /dev/null
+++ b/docs/principles.md
@@ -0,0 +1,201 @@
+# AIRE Principles
+
+*Guiding tenets for AI Reliability Engineering, inspired by SRE.*
+
+These five principles define the philosophical foundation of AIRE. They inform the practices detailed in the five pillars and help teams make trade-off decisions when designing reliable AI systems.
+
+---
+
+## 1. Embrace Non-Determinism
+
+> *Accept that identical inputs will produce variable outputs. Design systems that succeed despite variance, not systems that assume consistency.*
+
+### What It Means
+
+Traditional software is deterministic: same input → same output, always. AI systems are probabilistic: same input → different outputs across runs. This is not a bug - it's inherent to how LLMs work.
+
+**Bad Response:** "We need to make the AI deterministic by lowering temperature to 0 and using seed values."
+
+**Good Response:** "We accept variance. We'll use structured outputs to constrain the format and guardrails to enforce constraints."
+
+### When to Apply
+
+- **Designing fallback paths:** Expect primary system to fail non-deterministically → need backup plans
+- **Testing:** Use semantic similarity matching (not exact string matching)
+- **Monitoring:** Track distributions (P50, P95 latency), not single values
+
+### Anti-Patterns
+
+- ❌ Expecting 100% consistency in outputs
+- ❌ Over-tuning prompts to eliminate all variance
+- ❌ Treating test failures as deterministic bugs
+
+### Related Patterns
+
+- **[Pillar 1: Resilient Architecture](./pillars/resilient-architecture.md)** - Fallback paths, circuit breakers
+- **[Pillar 4: Security](./pillars/security.md)** - Adversarial robustness through guardrails
+
+---
+
+## 2. Reliability is a Feature
+
+> *Reliability competes with velocity for engineering resources. Treat it as a first-class product requirement with explicit budgets, not an afterthought.*
+
+### What It Means
+
+Reliability is not "free" - it requires engineering time, infrastructure cost, and operational overhead. Teams must explicitly budget for reliability work (golden datasets, observability, guardrails) alongside feature work.
+
+**Bad Response:** "We'll worry about reliability after we launch."
+
+**Good Response:** "We allocate 20% of each sprint to reliability work: golden dataset updates, eval pipeline maintenance, incident reviews."
+
+### When to Apply
+
+- **Sprint planning:** Dedicate sprint capacity to reliability tasks
+- **Roadmap prioritization:** Reliability features (evals, guardrails) compete with product features
+- **Hiring:** Hire for reliability skills (observability, testing, incident response)
+
+### Anti-Patterns
+
+- ❌ "We'll add tests later"
+- ❌ "Reliability is the ops team's problem"
+- ❌ Ignoring tech debt from fast-moving prototypes
+
+### Related Patterns
+
+- **[Pillar 3: Quality & Lifecycle](./pillars/quality-lifecycle.md)** - Evals-driven deployments, feedback loops
+
+---
+
+## 3. Measure, Don't Assume
+
+> *If you cannot quantify the reliability of your AI system, you do not have a reliable AI system. Intuition is not evidence.*
+
+### What It Means
+
+"Vibes-based" development ("it feels like it's working") is not acceptable. Reliability must be measured with concrete metrics: hallucination rate, HITL rate, uptime, MTTR.
+
+**Bad Response:** "Our agent seems pretty good. Users aren't complaining much."
+
+**Good Response:** "Our hallucination rate is 0.08% (measured on 500 samples). HITL rate is 12%. We're tracking toward <10%."
+
+### When to Apply
+
+- **Deployment decisions:** Block deployments if metrics degrade (accuracy drops >5%)
+- **Model selection:** Choose models based on measured performance, not marketing claims
+- **Incident response:** Use metrics to diagnose root causes (latency P95 spiked = circuit breaker opened)
+
+### Anti-Patterns
+
+- ❌ Relying on "spot checks" instead of golden datasets
+- ❌ Shipping without baseline metrics
+- ❌ Ignoring drift because "it looks fine"
+
+### Related Patterns
+
+- **[Pillar 2: Cognitive Reliability](./pillars/cognitive-reliability.md)** - Hallucination rate, groundedness score
+- **[Pillar 3: Quality & Lifecycle](./pillars/quality-lifecycle.md)** - Golden datasets, offline/online evals
+
+---
+
+## 4. Fail Gracefully, Fail Informatively
+
+> *Every failure should preserve context, enable recovery, and generate learnings. Silent failures are unacceptable.*
+
+### What It Means
+
+Failures will happen (see Principle 1: Embrace Non-Determinism). The question is: does your system handle failures gracefully, or does it crash spectacularly?
+
+- **Graceful failure:** Circuit breaker opens → fallback to GPT-3.5 → user gets degraded but working response
+- **Spectacular failure:** LLM API timeout → agent crashes → user sees 500 error
+
+**Informative failure:** Log reasoning, state, and context so you can debug later.
+
+**Bad Response:** Agent crashes with no logs. No idea what went wrong.
+
+**Good Response:** Agent saves checkpoint, logs error with full context (user query, reasoning, state), returns user-friendly error message, escalates to human.
+
+### When to Apply
+
+- **State management:** Save checkpoints so workflows can resume after crash
+- **Logging:** Log Chain of Thought reasoning, not just inputs/outputs
+- **User experience:** Never show raw LLM errors - translate to user-friendly messages
+
+### Anti-Patterns
+
+- ❌ Silent failures (logs say "success" but output is garbage)
+- ❌ Stateless agents (crash = restart from Step 1)
+- ❌ No audit trail (can't investigate incidents)
+
+### Related Patterns
+
+- **[Pillar 1: Resilient Architecture](./pillars/resilient-architecture.md)** - State management, circuit breakers, fallback paths
+- **[Pillar 3: Quality & Lifecycle](./pillars/quality-lifecycle.md)** - Chain of Thought logging
+
+---
+
+## 5. Humans as Fallback, Not Crutch
+
+> *Design for autonomous operation. Human escalation is a safety net for edge cases, not a substitute for robust engineering.*
+
+### What It Means
+
+HITL (Human-in-the-Loop) is essential for high-stakes decisions, but if 50% of requests need human review, your system isn't working. The goal is to **reduce HITL over time** through active learning.
+
+**Bad Response:** "When the agent isn't sure, we just ask a human."
+
+**Good Response:** "HITL rate started at 30%. After 3 months of active learning (adding HITL corrections to golden dataset), we're down to 8%."
+
+### When to Apply
+
+- **Confidence thresholds:** Only escalate to humans when confidence <0.7
+- **Staged rollout:** Start with 100% HITL, reduce to 10% over time
+- **Active learning:** Use HITL corrections to retrain models
+
+### Anti-Patterns
+
+- ❌ HITL as default (agent always asks human before acting)
+- ❌ No confidence calibration (agent doesn't know when it's unsure)
+- ❌ Ignoring HITL feedback (corrections not added to golden dataset)
+
+### Related Patterns
+
+- **[Pillar 2: Cognitive Reliability](./pillars/cognitive-reliability.md)** - HITL protocols, confidence calibration
+- **[Pillar 3: Quality & Lifecycle](./pillars/quality-lifecycle.md)** - Feedback loops, golden dataset updates
+
+---
+
+## Applying the Principles
+
+These principles are not rules - they're guides for making trade-off decisions. When facing a design choice, ask:
+
+1. **Embrace Non-Determinism:** Am I designing for consistency or resilience?
+2. **Reliability is a Feature:** How much engineering time am I allocating to reliability?
+3. **Measure, Don't Assume:** What metrics will I use to validate this works?
+4. **Fail Gracefully:** What happens when this component fails?
+5. **Humans as Fallback:** Can this system improve over time without human intervention?
+
+---
+
+## Example: Designing a Refund Agent
+
+Let's apply all 5 principles to designing a customer refund agent:
+
+### Scenario
+Build an agent that processes refund requests automatically.
+
+### Design Decisions
+
+| Principle | Application |
+|-----------|-------------|
+| **1. Embrace Non-Determinism** | Agent may extract different refund reasons from same query → Use structured outputs to constrain format |
+| **2. Reliability is a Feature** | Allocate 2 weeks for golden dataset (100 refund examples), 1 week for guardrails |
+| **3. Measure, Don't Assume** | Track: Refund approval accuracy (>95%), HITL rate (<10%), Fraud detection rate |
+| **4. Fail Gracefully** | If fraud detection fails → escalate to human with full context (not crash) |
+| **5. Humans as Fallback** | Start with HITL for all refunds >$100. After 1 month, reduce threshold to >$500 using active learning |
+
+**Result:** System that's reliable, measurable, and improves over time.
+
+---
+
+*These principles are part of the [AI Reliability Engineering (AIRE) Standards](./index.md). Licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/).*
diff --git a/mkdocs.yml b/mkdocs.yml
index 00fc7ae..170b4c7 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -1,4 +1,4 @@
-site_name: AI Reliability Standards
+site_name: AI Reliability Engineering (AIRE) Standards
site_url: https://aire.exosphere.host
site_description: The open standard for AI Reliability Engineering (AIRE). Architectural patterns and best practices for production-grade Agents.
@@ -68,19 +68,20 @@ plugins:
- search
- llmstxt:
markdown_description: |
- The AI Reliability Engineering (AIRE) Standards is an open industry framework and playbook for building production-grade AI agents and reliable LLM workflows. Maintained by ExosphereHost Inc.
+ The AI Reliability Engineering (AIRE) Standards is an open industry framework for building production-grade AI agents. Maintained by ExosphereHost Inc.
- This documentation defines the "Gold Standard" for AI Engineering across four core pillars:
-
- 1. The Reliability Stack: Architectural patterns for separating probabilistic reasoning (LLMs) from deterministic guardrails.
- 2. Eval-Driven Development (EDD): Methodologies for moving from "vibes-based" testing to rigorous regression suites and "Golden Datasets."
- 3. Durable Execution: Standards for fault-tolerant, long-running agentic workflows, state persistence, and graceful degradation.
- 4. Observability 2.0: Best practices for tracing "Chain of Thought" (CoT), tool outputs, and debugging non-deterministic logic.
+ This implementation guide defines reliability engineering practices across five core pillars:
+
+ 1. Resilient Architecture: Fault-tolerant design patterns for autonomous systems (circuit breakers, state management, fallback paths, elastic scaling).
+ 2. Cognitive Reliability: Ensuring trustworthy outputs through self-reflection, structured outputs, human-in-the-loop protocols, and drift detection.
+ 3. Quality & Lifecycle: Evals-driven deployment pipelines, golden datasets, feedback loops, and continuous improvement practices.
+ 4. Security: Just-in-time privilege access, audit logging, guardrails, and defenses against adversarial inputs.
+ 5. Operational Excellence: Performance targets, quality budgets, team structures, progressive autonomy maturity model, and reliability review practices.
This resource is intended for CTOs, AI Architects, and Engineering Leaders seeking proven patterns to scale agents from prototype to production.
sections:
Introduction:
- - README.md
+ - index.md
markdown_extensions:
- attr_list
@@ -95,9 +96,26 @@ markdown_extensions:
- pymdownx.details
- pymdownx.tabbed:
alternate_style: true
+ - pymdownx.superfences:
+ custom_fences:
+ - name: mermaid
+ class: mermaid
+ format: !!python/name:pymdownx.superfences.fence_code_format
extra_css:
- stylesheets/extra.css
nav:
- - Introduction: README.md
\ No newline at end of file
+ - Home:
+ - Home: index.md
+ - Getting Started: getting-started.md
+ - AIRE Principles: principles.md
+ - Metrics Framework: appendix/metrics-framework.md
+ - Pillars:
+ - Resilient Architecture: pillars/resilient-architecture.md
+ - Cognitive Reliability: pillars/cognitive-reliability.md
+ - Quality & Lifecycle: pillars/quality-lifecycle.md
+ - Security: pillars/security.md
+ - Team Culture: pillars/operational-excellence.md
+ - Appendix:
+ - Glossary: appendix/glossary.md