Current behavior
Content policies (e.g. "don't publish PII") are enforced only by the primary LLM's system prompt and per-skill instructions. There is no independent verification layer before content is emitted to public-facing surfaces.
Gap
System-prompt-level policy compliance is probabilistic: the primary model can miss edge cases, especially when juggling complex multi-step tasks. Once output lands on a public surface (Slack messages in broad channels, GitHub issues, PR descriptions, canvases), a single-pass approach has no safety net for policy violations.
Related: #11 covers PII scrubbing for Sentry telemetry via field allowlists. This issue proposes a more general mechanism that can enforce arbitrary content policies across all output types.
Proposed approach
Add an auxiliary LLM call that acts as a second-pass content policy agent:
- Runs after the primary model generates content destined for a public surface
- Receives the draft content + a set of encoded policies (PII suppression, sensitive data handling, internal-only context stripping, etc.)
- Returns either an approval or a redacted/flagged version
- Policies are defined as a reviewable, checked-in spec rather than as prompt text alone
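
The flow in the list above could be sketched roughly as follows. This is an illustration under assumptions, not a proposed implementation: `Policy`, `Verdict`, `review_draft`, and the `AuxModel` callable signature are hypothetical names, and `fake_aux_model` is a regex stand-in so the example runs without a real model call.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass(frozen=True)
class Policy:
    """One entry of a checked-in policy spec (hypothetical shape)."""
    id: str
    description: str


@dataclass
class Verdict:
    """Second-pass result: an approval, or a redacted draft plus violated policy ids."""
    approved: bool
    content: str
    violations: List[str] = field(default_factory=list)


# Assumed signature for the auxiliary model call: (draft, policies) -> Verdict.
AuxModel = Callable[[str, List[Policy]], Verdict]


def review_draft(draft: str, policies: List[Policy], aux_model: AuxModel) -> Verdict:
    """Run the second pass on content destined for a public surface."""
    return aux_model(draft, policies)


EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def fake_aux_model(draft: str, policies: List[Policy]) -> Verdict:
    """Stand-in for the real LLM call; it only knows how to redact email addresses."""
    if any(p.id == "no-pii" for p in policies) and EMAIL_RE.search(draft):
        return Verdict(False, EMAIL_RE.sub("[redacted]", draft), ["no-pii"])
    return Verdict(True, draft)


# Example policy entries; in practice these would be loaded from the checked-in spec.
POLICIES = [
    Policy("no-pii", "Do not publish personal data such as email addresses."),
    Policy("no-internal-urls", "Do not leak internal URLs."),
]

flagged = review_draft("Ping jane@example.com about the outage.", POLICIES, fake_aux_model)
clean = review_draft("Deploy finished without errors.", POLICIES, fake_aux_model)
print(flagged.approved, flagged.content)
print(clean.approved, clean.content)
```

Passing the reviewer in as a callable keeps the policy spec and the calling code testable with a deterministic stand-in, independent of whichever model ends up serving the second pass.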
Key design considerations:
- Scope trigger: define which output actions route through the second pass (e.g. channel posts, issue creation, canvas writes) vs. which are exempt (e.g. ephemeral thread replies in private channels)
- Latency budget: the aux call adds latency, so it may need a fast model or an async check-and-retract pattern (emit first, retract if the check fails)
- Policy spec format: structured enough to be testable, flexible enough to cover "don't leak internal URLs," "strip customer names," "no PII," etc.
- Failure mode: decide what happens when the aux model is unavailable or disagrees with the primary model's draft: block, warn, or log-and-emit
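
To make the scope-trigger and failure-mode questions concrete, here is one possible shape for the gate. Everything in it is an assumption for discussion: the action names in `REVIEWED_ACTIONS`, the `FailureMode` enum, the `gate` function, and the status strings are hypothetical. It covers only the "aux model unavailable" case, not disagreement, and a real scope trigger would likely also consider channel visibility.

```python
from enum import Enum
from typing import Callable, Optional, Tuple


class FailureMode(Enum):
    BLOCK = "block"
    WARN = "warn"
    LOG_AND_EMIT = "log_and_emit"


# Hypothetical action names; the real set would come from the scope-trigger decision.
REVIEWED_ACTIONS = {"channel_post", "issue_create", "canvas_write"}


def needs_second_pass(action: str) -> bool:
    """Scope trigger: only listed actions are routed through the second pass."""
    return action in REVIEWED_ACTIONS


def gate(
    action: str,
    draft: str,
    reviewer: Callable[[str], str],
    on_failure: FailureMode,
) -> Tuple[Optional[str], str]:
    """Return (content to emit, or None if blocked; status label)."""
    if not needs_second_pass(action):
        return draft, "exempt"
    try:
        # The reviewer returns the draft unchanged or a redacted version.
        return reviewer(draft), "reviewed"
    except Exception:
        # Reviewer unavailable: apply the configured failure mode.
        if on_failure is FailureMode.BLOCK:
            return None, "blocked"
        if on_failure is FailureMode.WARN:
            return draft, "emitted_with_warning"
        return draft, "emitted_and_logged"


def unavailable_reviewer(draft: str) -> str:
    """Simulates an aux model outage."""
    raise TimeoutError("aux model unavailable")


print(gate("channel_post", "hi", unavailable_reviewer, FailureMode.BLOCK))
print(gate("thread_reply", "hi", unavailable_reviewer, FailureMode.BLOCK))
```

Making the failure mode an explicit parameter means the block/warn/log-and-emit choice can differ per surface, for example blocking issue creation while only logging for lower-risk posts.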
Action taken on behalf of David Cramer.