Skip to content

feat(memory): Scan memory content for prompt-injection and exfiltration patterns before write and at load #3020

@hamza-jeddad

Description

@hamza-jeddad

Background

Sub-issue of #3011.

Memory entries are injected verbatim into the system prompt (or turn context) at the start of every session via the inject_memories builtin (#3015). A compromised tool result, a malicious web page visited during a task, or a supply-chain attack on a sister session can write a poisoned entry that silently redirects the agent in every subsequent session until the entry is manually removed.

This risk is higher for memory than for ordinary tool output because:

  1. Memory enters the system prompt as a frozen snapshot (feat(memory): Frozen snapshot + cache invalidation for inject_memories #3017) — it persists across the entire session and across all future sessions until explicitly deleted.
  2. The user rarely inspects raw memory content, so a poisoned entry can go unnoticed for a long time.

The fix is a two-layer defence: scan at write time (reject before persisting) and sanitise at load / inject time (block from system prompt even if a poisoned entry already exists on disk, while keeping it visible so the user can remove it).

Proposed design

1. Threat-pattern library (pkg/memory/security/)

Create pkg/memory/security/threats.go with a compiled set of regular expressions covering:

  • Prompt-injection patterns: ignore previous instructions, disregard your system prompt, new persona, you are now, your true instructions are, ANSI escape injection, zero-width character smuggling, Unicode direction-override sequences (RLO/LRO).
  • Exfiltration patterns: send to, POST to, curl, wget, exfiltrate, base64 in suspicious context, known exfil URL shapes.

Patterns are organised by scope (strict vs relaxed). Memory scanning always uses strict (broadest set) because entries are user-curated and the user can always rewrite a blocked entry.

// pkg/memory/security/threats.go
package security

type Scope string

const (
    ScopeStrict  Scope = "strict"
    ScopeRelaxed Scope = "relaxed"
)

// ScanContent returns a list of threat IDs matched in content, or nil.
func ScanContent(content string, scope Scope) []string { … }

// FirstThreatMessage returns a human-readable error for the first match, or "".
func FirstThreatMessage(content string, scope Scope) string { … }

2. Write-time scanning

In each memory-write path (pkg/tools/builtin/memory/add_memory.go, update_memory.go) call security.FirstThreatMessage(content, ScopeStrict) before touching the DB. On a non-empty return, reject the write with a structured error:

{
  "success": false,
  "error": "Content blocked: matched prompt-injection pattern 'ignore_previous_instructions'. Rephrase the entry."
}

The entry is never written to the DB.

3. Load-time sanitisation (snapshot building)

When inject_memories (#3015 / #3017) builds the system-prompt snapshot:

  1. For each entry in the DB, call security.ScanContent(entry.Content, ScopeStrict).
  2. If threat IDs are returned, replace the entry text in the snapshot with:
    [BLOCKED: entry contained threat pattern(s): <ids>. Use delete_memory(id=…) to remove it.]
    
  3. The original entry remains in the DB so the user can inspect it with get_memories and delete it with delete_memory.

This preserves the prefix-cache invariant (#3017): the snapshot is built once from deterministic DB bytes and is byte-stable for the whole session.

4. get_memories response flags

Extend the get_memories response in pkg/tools/builtin/memory/get_memories.go to include a blocked: true field on entries that would be blocked at inject time:

{
  "id": "abc123",
  "content": "ignore all previous instructions …",
  "blocked": true,
  "block_reason": ["ignore_previous_instructions"]
}

Implementation checklist

  • pkg/memory/security/threats.go — threat-pattern library with strict and relaxed scopes; compiled regexps; ScanContent, FirstThreatMessage
  • pkg/memory/security/threats_test.go — unit tests for each pattern class (injection, exfil, unicode smuggling); assert clean content passes; assert poisoned content is caught
  • pkg/tools/builtin/memory/add_memory.go — call FirstThreatMessage before insert; return structured error on match
  • pkg/tools/builtin/memory/update_memory.go — same scan on new content
  • pkg/hooks/builtins/inject_memories.go — scan each entry during snapshot build; replace blocked entries with [BLOCKED: …] placeholder in snapshot; leave DB row untouched
  • pkg/tools/builtin/memory/get_memories.go — add blocked and block_reason fields to response for entries that fail the scan
  • Integration test: write a poisoned entry via direct DB insert (bypassing write-scan); confirm it appears as [BLOCKED: …] in the injected snapshot and as blocked: true in get_memories; confirm delete_memory removes it

Acceptance criteria

  • add_memory with injection-pattern content returns a structured error and nothing is written to the DB
  • update_memory with injection-pattern content in the new value returns a structured error; existing entry unchanged
  • A clean, legitimate entry is never blocked
  • A pre-existing poisoned DB entry (e.g. written by an external process) does not appear verbatim in the system-prompt snapshot; a [BLOCKED: …] placeholder appears instead
  • get_memories returns blocked: true for any entry that would be blocked at inject time
  • Blocked entries are still deletable via delete_memory
  • go test -race passes on the security package
  • ≥80% line coverage on pkg/memory/security/

Metadata

Metadata

Assignees

No one assigned

    Labels

    area/agentFor work that has to do with the general agent loop/agentic features of the apparea/securityAuthentication, authorization, secrets, vulnerabilitiesarea/toolsFor features/issues/fixes related to the usage of built-in and MCP tools
    No fields configured for Enhancement.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions