Aux LLM pass to enforce content policies on public output

## Current behavior

Content policies (e.g. "don't publish PII") are enforced only by the primary LLM's system prompt and per-skill instructions. There is no independent verification layer before content is emitted to public-facing surfaces.

## Gap

System-prompt-level policy compliance is probabilistic: the primary model can miss edge cases, especially when juggling complex multi-step tasks. Once output is public (Slack messages in broad channels, GitHub issues, PR descriptions, canvases), a single-pass approach has no safety net for policy violations.

Related: #11 covers PII scrubbing for Sentry telemetry via field allowlists. This issue proposes a more general mechanism that can enforce arbitrary content policies across all output types.

## Proposed approach

Add an auxiliary LLM call that acts as a second-pass content policy agent:

- Runs after the primary model generates content destined for a public surface
- Receives the draft content + a set of encoded policies (PII suppression, sensitive data handling, internal-only context stripping, etc.)
- Returns either an approval or a redacted/flagged version
- Policies are defined as a reviewable, checked-in spec — not just prompt text
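
A minimal sketch of the shape this could take, assuming Python. Every name here (`Policy`, `Review`, `POLICIES`, `second_pass`, `stub_reviewer`) is hypothetical and not an existing API in this repo. The regex-based `stub_reviewer` only stands in for the aux LLM call so the example runs without a model.

```python
import re
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Policy:
    id: str
    description: str


@dataclass
class Review:
    approved: bool
    content: str  # the draft if approved, otherwise a redacted version
    violations: list[str] = field(default_factory=list)


# Hypothetical checked-in spec: plain data, so changes go through code review.
POLICIES = [
    Policy("no-pii", "Do not publish personally identifiable information."),
    Policy("no-internal-urls", "Do not leak internal URLs."),
]

# A reviewer takes the draft plus the policies and returns a Review.
# In the proposal this would wrap the aux LLM call.
Reviewer = Callable[[str, list[Policy]], Review]

# Deliberately simple email pattern, used only by the stand-in reviewer.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def stub_reviewer(draft: str, policies: list[Policy]) -> Review:
    """Stand-in for the aux model: redacts email addresses, nothing else."""
    redacted, hits = EMAIL.subn("[redacted]", draft)
    if hits == 0:
        return Review(approved=True, content=draft)
    return Review(approved=False, content=redacted, violations=["no-pii"])


def second_pass(draft: str, reviewer: Reviewer) -> Review:
    """Run the draft through the reviewer before it reaches a public surface."""
    return reviewer(draft, POLICIES)
```

Keeping the reviewer behind a plain callable means the policy spec and the routing logic can be tested without a live model.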

Key design considerations:

- **Scope trigger**: define which output actions route through the second pass (e.g. channel posts, issue creation, canvas writes) vs. which are exempt (e.g. ephemeral thread replies in private channels)
- **Latency budget**: the aux call adds latency; may need a fast model or an async check-and-retract pattern
- **Policy spec format**: structured enough to be testable, flexible enough to cover "don't leak internal URLs," "strip customer names," "no PII," etc.
- **Failure mode**: what happens when the aux model is unavailable or disagrees with the primary model's draft: block, warn, or log-and-emit
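
One way the scope trigger and failure mode could fit together, sketched in Python. The action names, `FailureMode`, `Outcome`, and `emit_with_review` are all assumptions made for illustration. The sketch also assumes that "warn" and "log-and-emit" both emit the unreviewed draft and differ only in how the failure is reported; the issue leaves that open.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class FailureMode(Enum):
    BLOCK = "block"
    WARN = "warn"
    LOG_AND_EMIT = "log-and-emit"


# Hypothetical action names; which actions belong here is an open question.
REVIEWED_ACTIONS = {"channel_post", "issue_create", "canvas_write"}


@dataclass
class Outcome:
    text: Optional[str]  # None means nothing is emitted
    note: str


def emit_with_review(
    action: str,
    draft: str,
    reviewer: Callable[[str], str],
    on_failure: FailureMode = FailureMode.BLOCK,
) -> Outcome:
    """Route in-scope actions through the reviewer; apply the failure mode on error."""
    if action not in REVIEWED_ACTIONS:
        return Outcome(draft, "exempt")
    try:
        # The reviewer returns the text that is safe to emit (possibly redacted).
        return Outcome(reviewer(draft), "reviewed")
    except Exception as exc:  # e.g. timeout or aux model unavailable
        if on_failure is FailureMode.BLOCK:
            return Outcome(None, f"blocked: {exc}")
        if on_failure is FailureMode.WARN:
            return Outcome(draft, f"warning: unreviewed ({exc})")
        return Outcome(draft, f"logged: unreviewed ({exc})")
```

Defaulting to `BLOCK` fails closed, which trades availability for safety; a real implementation might pick the mode per surface.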

Action taken on behalf of David Cramer.


Aux LLM pass to enforce content policies on public output #374
