Summary
Enhance cpex-pii-filter with an optional broad, model-backed detection mode for PII that current deterministic Rust regex/checksum detectors intentionally miss. The current plugin is fast and auditable, but limited to explicit patterns and context labels. We should evaluate a slower, broader detector path for names, addresses, dates, URLs, secrets, regional identifiers, and domain-specific text where regex coverage is brittle.
This should be additive. Keep the existing Rust detector as the default fast path, and add a configurable mode that can run after or alongside it.
Motivation
plugins/rust/python-package/pii_filter currently covers common structured identifiers such as SSN, BSN, credit cards, email, phones, IP addresses, DOB labels, passports, driver licenses, bank accounts, medical records, custom regexes, and whitelists. That is good for low-latency deterministic filtering, but it misses broader contextual PII such as:
- natural-language names and addresses
- non-US / non-Dutch regional identifiers
- loosely formatted dates and private URLs
- secrets or identifiers split across nearby text
- domain-specific PII where labels vary by customer or geography
A broader detector can trade latency and model footprint for recall.
Candidate approaches
Option A: OpenAI Privacy Filter
- Model: https://huggingface.co/openai/privacy-filter
- Source: https://github.com/openai/privacy-filter
- License: Apache-2.0.
- Local/on-prem token-classification model intended for PII detection and masking.
- Detects eight span categories: account_number, private_address, private_email, private_person, private_phone, private_url, private_date, and secret.
- Supports long context and runtime precision/recall calibration.
- Tradeoffs:
- Good fit for broad text sanitization.
- Model is Python-first today. Direct Rust support may require ONNX/export work or a Python subprocess/library bridge.
- Label taxonomy is fixed unless fine-tuned, so customer-specific policy still needs config and evaluation.
- Needs explicit production guidance: model cache path, offline mode, no network at runtime, CPU/GPU behavior, timeout/fallback.
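To make the Python-bridge tradeoff concrete, here is a minimal sketch of an Option A backend, assuming the model loads through the standard Hugging Face token-classification pipeline. The exact label names, the `entity_group`/`score` output fields, and the threshold are assumptions to be verified against the released model, not confirmed behavior.

```python
# Sketch of a Python-bridge backend for Option A. Assumes the model is
# consumable via the standard transformers token-classification pipeline;
# field names in the pipeline output are assumptions to verify.
from typing import Any

def load_detector(model_path: str):
    """Load a local token-classification pipeline; no network access."""
    from transformers import pipeline  # optional dependency, imported lazily
    return pipeline(
        "token-classification",
        model=model_path,               # local path only (allow_model_download: false)
        aggregation_strategy="simple",  # merge word pieces into whole spans
    )

def to_detections(raw_spans: list[dict[str, Any]], min_score: float) -> list[dict[str, Any]]:
    """Map pipeline output dicts onto the plugin's detection structure."""
    return [
        {
            "type": span["entity_group"],
            "start": span["start"],
            "end": span["end"],
            "score": float(span["score"]),
            "source": "privacy_filter",
        }
        for span in raw_spans
        if span["score"] >= min_score
    ]
```

The lazy import keeps transformers out of the default dependency path, matching the additive design goal.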
Option B: Microsoft Presidio
- Docs: https://microsoft.github.io/presidio/
- Mature Python SDK/service for PII analysis and anonymization.
- Supports predefined/custom recognizers using NER, regex, rules, checksums, context, multiple languages, external models, and image/structured-data flows.
- Tradeoffs:
- Broad, proven design and easy extensibility.
- Python-first; likely best as an optional Python/service backend rather than a Rust dependency.
- Operationally heavier: spaCy/Stanza/transformer models, service lifecycle, language packages, and dependency footprint.
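If Presidio runs as an optional out-of-process service, the plugin side stays thin: one HTTP call plus a mapping into the existing detection structure. A hedged sketch, assuming the request/response shape of Presidio's analyzer REST service (`text`/`language` in, a list of `entity_type`/`start`/`end`/`score` records out); field names should be verified against the deployed Presidio version.

```python
# Sketch of an optional Presidio service backend (Option B). The endpoint
# payload and response fields follow Presidio's analyzer REST service as
# documented; verify against the deployed version before relying on them.
import json
import urllib.request

def analyze_with_presidio(text: str, url: str, timeout_s: float = 1.0) -> list[dict]:
    """POST text to a Presidio analyzer service and return its raw results."""
    body = json.dumps({"text": text, "language": "en"}).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return json.load(resp)

def presidio_to_detections(results: list[dict], min_score: float) -> list[dict]:
    """Map Presidio analyzer results onto the plugin's detection structure."""
    return [
        {
            "type": r["entity_type"],
            "start": r["start"],
            "end": r["end"],
            "score": r["score"],
            "source": "presidio",
        }
        for r in results
        if r["score"] >= min_score
    ]
```

Keeping Presidio behind a network boundary avoids pulling spaCy/transformer dependencies into the plugin's own wheel.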
Option C: Rust-native / Rust-adjacent alternatives to evaluate
pii crate: https://docs.rs/pii/latest/pii/
- Deterministic PII detection/anonymization pipeline with stable byte offsets, recognizers, policy config, and optional Candle-based NER hooks.
- Worth evaluating for Rust-native analyzer/anonymizer abstractions or recognizer expansion.
pii-vault crate: https://docs.rs/crate/pii-vault/0.1.0/source/README.md
- Presidio-compatible detection/anonymization/tokenization; advertises 40+ entity types and 29 recognizers across multiple countries.
- Worth evaluating for broader deterministic coverage and reversible tokenization ideas.
rust-bert: https://docs.rs/rust-bert/latest/rust_bert/
- Rust NLP pipelines with token classification / NER support via tch or ONNX Runtime bindings.
- Worth evaluating if we choose a Rust-hosted transformer path for NER or custom token classification models.
edge-transformers: https://docs.rs/edge-transformers/latest/edge_transformers/
- Rust implementation of Hugging Face-style pipelines on ONNX Runtime, including token classification.
- Worth evaluating only if the maintenance/packaging story is acceptable.
ort: https://docs.rs/ort/latest/ort/
- Lower-level ONNX Runtime bindings. Potential path for hosting an exported token-classification model directly from Rust.
iron_safety: https://docs.rs/iron_safety/latest/iron_safety/
- Low-latency PII redaction crate for LLM outputs, but currently narrow coverage. Probably not enough as the main solution.
Proposed design direction
Add a pluggable backend layer behind the existing detector:
broad_detection:
  enabled: false
  backend: "privacy_filter"  # privacy_filter | presidio | rust_ner | none
  mode: "augment"            # augment existing detections; existing Rust detector remains default
  min_score: 0.75
  fail_closed: false
  timeout_ms: 1000
  model_path: null           # local path only; no implicit runtime download unless explicitly allowed
  allow_model_download: false
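The config above maps naturally onto a small typed structure with the same defaults. A sketch, with field names mirroring the YAML keys; the validation rules shown are illustrative, not settled policy:

```python
# Typed view of the broad_detection config, with the same defaults as the
# YAML above. Validation rules are a sketch of one plausible policy.
from dataclasses import dataclass
from typing import Optional

_BACKENDS = {"privacy_filter", "presidio", "rust_ner", "none"}

@dataclass
class BroadDetectionConfig:
    enabled: bool = False
    backend: str = "privacy_filter"
    mode: str = "augment"
    min_score: float = 0.75
    fail_closed: bool = False
    timeout_ms: int = 1000
    model_path: Optional[str] = None
    allow_model_download: bool = False

    def validate(self) -> None:
        if self.backend not in _BACKENDS:
            raise ValueError(f"unknown backend: {self.backend}")
        if not 0.0 <= self.min_score <= 1.0:
            raise ValueError("min_score must be in [0, 1]")
        # Model-hosting backends need a local model when downloads are off.
        if (self.enabled and self.backend in {"privacy_filter", "rust_ner"}
                and self.model_path is None and not self.allow_model_download):
            raise ValueError("model_path is required when downloads are disabled")
```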
Detection flow:
- Run existing Rust detector first.
- If broad_detection.enabled, run the selected backend with timeout and byte limits.
- Normalize backend spans into the existing detection structure: type, start, end, score, source, explanation.
- Merge spans using existing deterministic overlap rules, with a documented source priority.
- Apply current masking strategies unchanged.
- Preserve privacy defaults: no detection details in metadata unless include_detection_details: true; no detection logging unless log_detections: true.
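The merge step in the flow above can be sketched as follows. The priority rule shown, where deterministic regex detections outrank overlapping model spans and survivors are sorted by start offset, is one possible documented priority, not a settled design:

```python
# Sketch of the span-merge step in the detection flow. Assumes both inputs
# use the normalized detection structure (type, start, end, score, source).
def merge_detections(regex_spans: list[dict], model_spans: list[dict]) -> list[dict]:
    def overlaps(a: dict, b: dict) -> bool:
        # Half-open [start, end) intervals overlap iff each starts before
        # the other ends.
        return a["start"] < b["end"] and b["start"] < a["end"]

    merged = list(regex_spans)
    for span in model_spans:
        # Keep a model span only if no deterministic span already covers it.
        if not any(overlaps(span, kept) for kept in regex_spans):
            merged.append(span)
    return sorted(merged, key=lambda s: (s["start"], s["end"]))
```

Because the rule is deterministic and order-independent, the merged output is stable across runs, which keeps the existing masking strategies reproducible.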
Research tasks
- Compare OpenAI Privacy Filter, Presidio, and Rust-native options against this repo's packaging constraints.
- Confirm whether OpenAI Privacy Filter can be exported or consumed through ONNX/Transformers-compatible tooling without adding a mandatory Python runtime path.
- Measure cold-start latency, steady-state latency, memory footprint, and binary/wheel size impact for each viable backend.
- Confirm licenses and transitive dependency risk.
- Decide whether the first milestone should be:
- Python bridge to OpenAI Privacy Filter,
- optional Presidio service/client backend,
- Rust ONNX/token-classification backend,
- or broader deterministic Rust recognizers before model support.
Acceptance criteria
- Existing cpex-pii-filter behavior and defaults remain unchanged when broad detection is disabled.
- New backend config is documented in README and plugin-manifest.yaml.
- Broad detector can be enabled explicitly and can redact at least person names, private addresses, private dates, private URLs, and secrets in representative text fixtures.
- Backend failures, timeouts, missing model files, and unavailable optional dependencies have deterministic behavior based on fail_closed.
- Detection details expose backend/source and confidence only when metadata details are enabled.
- Tests cover:
- disabled-by-default behavior,
- broad detection happy path,
- overlap handling between regex and model detections,
- timeout/failure behavior,
- no runtime network access unless explicitly enabled.
- Benchmarks compare existing Rust-only detection vs broad mode on small, medium, and large payloads.
- Documentation states this is a privacy aid, not a compliance or anonymization guarantee, and recommends in-domain evaluation before production.
Open questions
- Should the broad detector live inside pii_filter or as a separate plugin, so operators can opt into the heavier dependency footprint explicitly?
- Is a local Python backend acceptable for the first implementation, or should we require Rust-hosted inference from the start?
- Should broad mode default to augmenting deterministic detections, or should it be allowed to replace them?
- What minimum supported platforms/wheel targets can include model runtime dependencies?