feat(memory): Scan memory content for prompt-injection and exfiltration patterns before write and at load

## Background

Sub-issue of #3011.

Memory entries are injected verbatim into the system prompt (or turn context) at the start of every session via the `inject_memories` builtin (#3015). A compromised tool result, a malicious web page visited during a task, or a supply-chain attack on a sister session can write a poisoned entry that silently redirects the agent in every subsequent session until the entry is manually removed.

This risk is higher for memory than for ordinary tool output because:

1. Memory enters the **system prompt as a frozen snapshot** (#3017) — it persists across the entire session and across all future sessions until explicitly deleted.
2. The user rarely inspects raw memory content, so a poisoned entry can go unnoticed for a long time.

The fix is a two-layer defence: scan at **write time** (reject before persisting) and sanitise at **load / inject time** (block from system prompt even if a poisoned entry already exists on disk, while keeping it visible so the user can remove it).

## Proposed design

### 1. Threat-pattern library (`pkg/memory/security/`)

Create `pkg/memory/security/threats.go` with a compiled set of regular expressions covering:

- **Prompt-injection patterns**: `ignore previous instructions`, `disregard your system prompt`, `new persona`, `you are now`, `your true instructions are`, ANSI escape injection, zero-width character smuggling, Unicode direction-override sequences (RLO/LRO).
- **Exfiltration patterns**: `send to`, `POST to`, `curl`, `wget`, `exfiltrate`, `base64` in suspicious context, known exfil URL shapes.

Patterns are organised by **scope** (`strict` vs `relaxed`). Memory scanning always uses `strict` (broadest set) because entries are user-curated and the user can always rewrite a blocked entry.

```go
// pkg/memory/security/threats.go
package security

type Scope string

const (
    ScopeStrict  Scope = "strict"
    ScopeRelaxed Scope = "relaxed"
)

// ScanContent returns a list of threat IDs matched in content, or nil.
func ScanContent(content string, scope Scope) []string { … }

// FirstThreatMessage returns a human-readable error for the first match, or "".
func FirstThreatMessage(content string, scope Scope) string { … }
```

### 2. Write-time scanning

In each memory-write path (`pkg/tools/builtin/memory/add_memory.go`, `update_memory.go`) call `security.FirstThreatMessage(content, ScopeStrict)` before touching the DB. On a non-empty return, reject the write with a structured error:

```json
{
  "success": false,
  "error": "Content blocked: matched prompt-injection pattern 'ignore_previous_instructions'. Rephrase the entry."
}
```

The entry is never written to the DB.

### 3. Load-time sanitisation (snapshot building)

When `inject_memories` (#3015 / #3017) builds the system-prompt snapshot:

1. For each entry in the DB, call `security.ScanContent(entry.Content, ScopeStrict)`.
2. If threat IDs are returned, **replace the entry text in the snapshot** with:
   ```
   [BLOCKED: entry contained threat pattern(s): <ids>. Use delete_memory(id=…) to remove it.]
   ```
3. The original entry remains in the DB so the user can inspect it with `get_memories` and delete it with `delete_memory`.

This preserves the prefix-cache invariant (#3017): the snapshot is built once from deterministic DB bytes and is byte-stable for the whole session.

### 4. `get_memories` response flags

Extend the `get_memories` response in `pkg/tools/builtin/memory/get_memories.go` to include a `blocked: true` field on entries that would be blocked at inject time:

```json
{
  "id": "abc123",
  "content": "ignore all previous instructions …",
  "blocked": true,
  "block_reason": ["ignore_previous_instructions"]
}
```

## Implementation checklist

- [ ] `pkg/memory/security/threats.go` — threat-pattern library with `strict` and `relaxed` scopes; compiled regexps; `ScanContent`, `FirstThreatMessage`
- [ ] `pkg/memory/security/threats_test.go` — unit tests for each pattern class (injection, exfil, unicode smuggling); assert clean content passes; assert poisoned content is caught
- [ ] `pkg/tools/builtin/memory/add_memory.go` — call `FirstThreatMessage` before insert; return structured error on match
- [ ] `pkg/tools/builtin/memory/update_memory.go` — same scan on new content
- [ ] `pkg/hooks/builtins/inject_memories.go` — scan each entry during snapshot build; replace blocked entries with `[BLOCKED: …]` placeholder in snapshot; leave DB row untouched
- [ ] `pkg/tools/builtin/memory/get_memories.go` — add `blocked` and `block_reason` fields to response for entries that fail the scan
- [ ] Integration test: write a poisoned entry via direct DB insert (bypassing write-scan); confirm it appears as `[BLOCKED: …]` in the injected snapshot and as `blocked: true` in `get_memories`; confirm `delete_memory` removes it

## Acceptance criteria

- [ ] `add_memory` with injection-pattern content returns a structured error and nothing is written to the DB
- [ ] `update_memory` with injection-pattern content in the new value returns a structured error; existing entry unchanged
- [ ] A clean, legitimate entry is never blocked
- [ ] A pre-existing poisoned DB entry (e.g. written by an external process) does not appear verbatim in the system-prompt snapshot; a `[BLOCKED: …]` placeholder appears instead
- [ ] `get_memories` returns `blocked: true` for any entry that would be blocked at inject time
- [ ] Blocked entries are still deletable via `delete_memory`
- [ ] `go test -race` passes on the security package
- [ ] ≥80% line coverage on `pkg/memory/security/`


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(memory): Scan memory content for prompt-injection and exfiltration patterns before write and at load #3020

Background

Proposed design

1. Threat-pattern library (`pkg/memory/security/`)

2. Write-time scanning

3. Load-time sanitisation (snapshot building)

4. `get_memories` response flags

Implementation checklist

Acceptance criteria

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

feat(memory): Scan memory content for prompt-injection and exfiltration patterns before write and at load #3020

Description

Background

Proposed design

1. Threat-pattern library (pkg/memory/security/)

2. Write-time scanning

3. Load-time sanitisation (snapshot building)

4. get_memories response flags

Implementation checklist

Acceptance criteria

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

1. Threat-pattern library (`pkg/memory/security/`)

4. `get_memories` response flags