
feat: Add LLM-as-a-judge for Anonymizer replace evaluation#158

Open
memadi-nv wants to merge 19 commits into main from memadi/feature/evaluate-replace


@memadi-nv memadi-nv commented May 14, 2026

Summary

Adds a new opt-in step — Anonymizer.evaluate_replace(config, result) — that scores a preview() / run() output along four metrics, plus the supporting workflows, prompts, schemas, and display rendering. Evaluation is intentionally separate from anonymization so the user can iterate on judges without re-paying the detection + replacement cost.

What the evaluation looks like

For Substitute, four LLM-as-judge metrics run per record, each surfaced as an "LLM alignment score" with a colored verdict badge:

  • Detection Validity — were the detected entities real PII (no false positives, mislabels, or boundary errors)?
  • Type Fidelity — does each synthetic preserve the entity class and format?
  • Relational Consistency — do cross-entity relationships (city↔country, dob↔age, name↔email) stay coherent?
  • Attribute Fidelity — do salient within-entity attributes (gender of name, age bucket) survive?

For non-Substitute strategies (Annotate / Redact / Hash), only Detection Validity runs — the other three require a replacement map.
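The strategy rule above can be sketched as a small lookup. This is a minimal illustration, not the PR's code: the `Strategy` enum, the judge identifiers, and `active_judges` are hypothetical names chosen to mirror the four metrics and four strategies listed here.

```python
from enum import Enum


class Strategy(Enum):
    # Hypothetical enum; the PR's actual strategy types may be named differently.
    SUBSTITUTE = "substitute"
    ANNOTATE = "annotate"
    REDACT = "redact"
    HASH = "hash"


# Hypothetical identifiers for the four metrics described above.
ALL_JUDGES = (
    "detection_validity",
    "type_fidelity",
    "relational_consistency",
    "attribute_fidelity",
)


def active_judges(strategy: Strategy) -> tuple:
    """Return the judges that apply to a given strategy."""
    if strategy is Strategy.SUBSTITUTE:
        return ALL_JUDGES
    # Without a replacement map, only the detection step can be judged.
    return ALL_JUDGES[:1]
```

Keeping the rule in one function means the display layer and the runner can both ask which metrics to expect for a result.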

[Screenshot: evaluation display, 2026-05-22]

Public API

preview = anonymizer.preview(config=cfg, data=src, num_records=15)   # detection + replacement, no judges
evaluated = anonymizer.evaluate_replace(config=cfg, result=preview)  # opt-in evaluation
evaluated.display_record(0)

result carries both the trace dataframe and resolved_text_column, so the caller never types the text-column name twice. Pickle round-trips of preview / result preserve everything, which supports the "anonymize once, re-evaluate many times" workflow.
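A minimal sketch of the save-and-reload pattern, using only the standard library. The dict below is a hypothetical stand-in for a preview result (the real object is whatever preview() returns); the point is only that the pickled payload keeps the resolved text column alongside the trace rows.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-in for a preview result; field names are illustrative.
preview_like = {
    "resolved_text_column": "biography",
    "trace_rows": [{"biography": "Alice lives in Paris."}],
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "preview.pkl"
    # Save once, after the expensive detection + replacement step.
    path.write_bytes(pickle.dumps(preview_like))
    # Reload later (possibly in another session) to re-run evaluation only.
    restored = pickle.loads(path.read_bytes())
```

As with any pickle file, only load payloads that come from a trusted source.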

Runtime efficiency

The evaluation step is the LLM-heavy part of this PR, so a few choices keep wall time as low as possible:

  • One DataDesigner workflow, four columns. All four judges are added as columns of a single adapter.run_workflow(...) call; DD schedules the LLM calls concurrently within that workflow up to the model's max_parallel_requests. No Python threads, no shared-state races.
  • Per-judge passthrough at flatten time. Each judge's postprocess() overrides trivially-passthrough rows (e.g., 0 entities, < 2 replacements) with default verdicts so degenerate rows don't muddy the score.
  • No judge work during anonymization. preview() / run() no longer fire any judges — that 4× LLM cost is paid only when the user explicitly asks for evaluation.
  • Failure isolation preserved. If one judge column's structured-output parse fails for a row, only that metric becomes Unavailable; the other three still report.
  • Display is robust to inconsistent LLM responses. When the LLM marks a row valid=False but enumerates no specific failures, the badge renders Not Satisfied (red) rather than letting the count math produce a misleading green.
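The passthrough and failure-isolation bullets above could look roughly like the following for a single row of one judge. This is a sketch under assumptions: the function name, the JSON payload shape (`valid`, `invalid`), and the choice of `valid=True` as the default passthrough verdict are all hypothetical, since the PR description does not spell them out.

```python
import json
from typing import Optional

UNAVAILABLE = "Unavailable"


def postprocess_relational(raw: Optional[str], num_replacements: int) -> dict:
    """Turn one row's raw judge output into a verdict dict (hypothetical shape)."""
    if num_replacements < 2:
        # Passthrough: fewer than two replacements means there is no
        # cross-entity relationship to check, so use a default verdict.
        return {"valid": True, "invalid": [], "status": "passthrough"}
    try:
        payload = json.loads(raw or "")
        valid = payload["valid"]
        if not isinstance(valid, bool):
            raise TypeError("valid must be a bool")
        invalid = list(payload.get("invalid", []))
    except (json.JSONDecodeError, KeyError, TypeError, AttributeError):
        # Malformed output affects this metric only; no verdict is fabricated.
        return {"valid": None, "invalid": [], "status": UNAVAILABLE}
    return {"valid": valid, "invalid": invalid, "status": "judged"}
```

Because each judge postprocesses its own column, a parse failure here leaves the other three metrics untouched.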

Demo notebook

docs/notebooks/evaluate-replace-demo.ipynb walks through the new flow end-to-end on the biographies dataset:

  • Two-step workflow — preview() first (no judges, cheap), then evaluate_replace(result=preview) (judges only when asked).
  • Programmatic access to the four *_valid / *_invalid columns.
  • Save-and-reload pattern via pickle for iterating on judges across sessions without re-running detection/replacement.
  • Non-Substitute (Redact) example showing only Detection Validity surfaces.

Test plan

  • 4 new test files (one per judge) — passthrough behavior, malformed-payload tolerance, schema/alias plumbing.
  • Extended test_replace_runner.py — single-DD-workflow dispatch, detection-only path for non-Substitute, error on missing trace columns.
  • _verdict_badge unit tests covering all four color paths, including the new precedence rule for valid=False with correct==total.
  • 818 tests pass locally; pipeline validated end-to-end via the demo notebook.
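The precedence rule exercised by the badge tests can be illustrated as follows. This is a hedged sketch, not the PR's `_verdict_badge`: the signature, the "Satisfied" and "Partially Satisfied" labels, and the grey and amber colors are assumptions; only "Not Satisfied" (red) and "Unavailable" come from the description above.

```python
from typing import Optional, Tuple


def verdict_badge(valid: Optional[bool], correct: int, total: int) -> Tuple[str, str]:
    """Return a (label, color) pair for one metric on one row."""
    if valid is None:
        # The judge output could not be parsed for this row.
        return ("Unavailable", "grey")
    if valid is False and correct == total:
        # The explicit flag wins: the judge rejected the row without
        # enumerating failures, so count math alone would wrongly say green.
        return ("Not Satisfied", "red")
    if correct == total:
        return ("Satisfied", "green")
    if correct == 0:
        return ("Not Satisfied", "red")
    return ("Partially Satisfied", "amber")
```

Checking the boolean before the counts is what keeps an inconsistent LLM response from rendering as a pass.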

Files (21 changed, +4400 / −32)

  • New: four per-judge workflows (detection_judge.py, type_fidelity_judge.py, relational_consistency_judge.py, attribute_fidelity_judge.py) and their tests; demo notebook (evaluate-replace-demo.ipynb).
  • Modified: anonymizer.py (new evaluate_replace method), replace_runner.py (judge dispatch via single DD workflow), display.py (four metric sections + verdict badge), constants.py, replace.yaml, config/models.py.

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation update
  • Refactoring

Testing

  • make test passes locally
  • make check passes locally (format + lint + typecheck + lock-check)
  • Added/updated tests for changes

Documentation

  • If docs changed: make docs-build passes locally

Related Issues

Closes #98

memadi-nv added 9 commits May 11, 2026 17:27
Signed-off-by: memadi <memadi@nvidia.com>
Signed-off-by: memadi <memadi@nvidia.com>
Signed-off-by: memadi <memadi@nvidia.com>
Signed-off-by: memadi <memadi@nvidia.com>
Signed-off-by: memadi <memadi@nvidia.com>
Signed-off-by: memadi <memadi@nvidia.com>
Signed-off-by: memadi <memadi@nvidia.com>
Signed-off-by: memadi <memadi@nvidia.com>
Signed-off-by: memadi <memadi@nvidia.com>
@memadi-nv memadi-nv requested a review from a team as a code owner May 14, 2026 19:34
@memadi-nv memadi-nv changed the title from "feature : Add LLM-as-a-judge for REPLACE evaluation" to "feature (DRAFT): Add LLM-as-a-judge for REPLACE evaluation" May 14, 2026
@memadi-nv memadi-nv marked this pull request as draft May 14, 2026 19:37

greptile-apps Bot commented May 14, 2026

Greptile Summary

This PR adds an opt-in evaluate_replace() method to the Anonymizer that runs four LLM-as-judge metrics (Detection Validity, Type Fidelity, Relational Consistency, Attribute Fidelity) on an already-replaced result, keeping evaluation cost separate from the anonymization pipeline. The implementation dispatches all active judges as columns of a single DataDesigner workflow call so DD can parallelize them internally, with each judge owning its own prepare / column_config / postprocess decomposition.

  • New judge modules: four judge workflow classes (detection_judge.py, type_fidelity_judge.py, relational_consistency_judge.py, attribute_fidelity_judge.py) each with structured Pydantic output schemas and malformed-payload tolerance that surfaces "Unavailable" rather than fabricating a verdict.
  • evaluate_replace public API: new method on Anonymizer that reverses output-column renames, dispatches the merged judge workflow, and re-renames results — preserving resolved_text_column so callers never need to type it twice.
  • Display additions: four colored verdict badge sections rendered in display_record(), with _verdict_badge correctly handling the valid=False + zero enumerated failures edge case via explicit boolean precedence over count math.
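The prepare / column_config / postprocess decomposition and the single merged dispatch can be sketched with a toy judge and a fake workflow runner. Everything here is illustrative: `EchoJudge`, `run_merged`, and `fake_workflow` are hypothetical names, and the real adapter and DataDesigner column configs are not modeled.

```python
from typing import Callable, Dict, List

Rows = List[Dict]


class EchoJudge:
    """Hypothetical judge exposing the three hooks named above."""

    def __init__(self, name: str) -> None:
        self.name = name

    def prepare(self, rows: Rows) -> Rows:
        # Add any per-row inputs the prompt needs (none in this toy example).
        return rows

    def column_config(self) -> Dict:
        # Describe the single raw-output column this judge contributes.
        return {"name": f"{self.name}_raw"}

    def postprocess(self, rows: Rows) -> Rows:
        # Flatten the raw output into a verdict column.
        raw = f"{self.name}_raw"
        return [{**r, f"{self.name}_valid": r.get(raw) == "ok"} for r in rows]


def run_merged(judges: List[EchoJudge], rows: Rows,
               run_workflow: Callable[[Rows, List[Dict]], Rows]) -> Rows:
    for judge in judges:
        rows = judge.prepare(rows)
    columns = [judge.column_config() for judge in judges]
    rows = run_workflow(rows, columns)  # exactly one dispatch for all judges
    for judge in judges:
        rows = judge.postprocess(rows)
    return rows


calls = []


def fake_workflow(rows: Rows, columns: List[Dict]) -> Rows:
    # Stand-in for the workflow engine: records the call and fills each column.
    calls.append(columns)
    return [{**r, **{c["name"]: "ok" for c in columns}} for r in rows]


out = run_merged([EchoJudge("detection"), EchoJudge("type_fidelity")],
                 [{"text": "hi"}], fake_workflow)
```

The design choice this illustrates is that concurrency is left to the workflow engine: the caller makes one call with several columns instead of managing threads per judge.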

Confidence Score: 4/5

Safe to merge after fixing the Detection Judge section heading that renders unconditionally in the replace display template even when no evaluation has been run.

All four judge modules are well-isolated, error-tolerant, and tested. The evaluate_replace wiring, _unrename_output_columns, and the merged single-workflow dispatch are correct. One concrete display bug exists: the Detection Judge outer wrapper div and heading are hardcoded in the template string rather than returned by _render_detection_judge_section, so any call to display_record() on a non-evaluated Replace result renders an empty Detection Judge section.

src/anonymizer/interface/display.py — the _REPLACE_TEMPLATE string hardcodes the Detection Judge section container, unlike the other three judge sections which self-manage their outer divs.

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/anonymizer/interface/display.py | Adds four judge verdict sections to the Replace display; the Detection Judge section has a template bug where its outer wrapper is always rendered, creating an empty section for non-evaluated results. |
| src/anonymizer/engine/replace/replace_runner.py | Adds evaluate() method dispatching all 4 judges as columns of a single DataDesigner workflow; run() now cleanly returns without invoking any judges; error handling correctly falls back to Unavailable verdicts on workflow failure. |
| src/anonymizer/interface/anonymizer.py | Adds evaluate_replace() method that re-runs LLM judges on a prior result's trace_dataframe; correctly reverses column renames before judge dispatch; judge verdict columns added to _build_user_dataframe allowlist. |
| src/anonymizer/config/models.py | New judge model fields are Optional with model_validator defaulting to replacement_generator, ensuring backward compatibility with existing configs. |
| src/anonymizer/engine/replace/detection_judge.py | New detection judge with prepare/column_config/postprocess decomposition for merged DD workflow; passthrough correctly handles rows with no detected entities; malformed LLM output returns Unavailable. |
| src/anonymizer/engine/replace/type_fidelity_judge.py | New type fidelity judge; prompt correctly anchors format/type checks to the original value's shape; passthrough for empty replacement maps. |
| src/anonymizer/engine/replace/relational_consistency_judge.py | New relational consistency judge; passthrough threshold correctly set at fewer than 2 replacements; raw judge output preserved for display denominator. |
| src/anonymizer/engine/replace/attribute_fidelity_judge.py | New attribute fidelity judge checking gender-of-name and age-bucket preservation; prompt appropriately restricted to only two attributes to avoid scope creep. |
| src/anonymizer/engine/constants.py | Adds 12 new column constants for 4 judge metrics following established naming conventions. |
| tests/engine/test_replace_runner.py | New tests verify single DD workflow dispatch for merged judges, separation of run() from judges, and proper error on missing required columns. |
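The backward-compatible defaulting described for config/models.py can be illustrated with a plain dataclass. Note the swap: the PR is described as using a Pydantic model_validator, whereas this sketch uses `__post_init__` to stay dependency-free, and the judge field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReplaceModels:
    # Existing field; older configs set only this one.
    replacement_generator: str
    # Hypothetical optional judge fields; None means "not configured".
    detection_judge: Optional[str] = None
    type_fidelity_judge: Optional[str] = None

    def __post_init__(self) -> None:
        # Any judge left unset falls back to the replacement generator model.
        for name in ("detection_judge", "type_fidelity_judge"):
            if getattr(self, name) is None:
                setattr(self, name, self.replacement_generator)


legacy = ReplaceModels(replacement_generator="model-a")
custom = ReplaceModels(replacement_generator="model-a", detection_judge="model-b")
```

With this fallback, a config written before the judge fields existed still resolves every judge to a usable model.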

Sequence Diagram

sequenceDiagram
    participant User
    participant Anonymizer
    participant ReplacementWorkflow
    participant NddAdapter
    participant DD as DataDesigner

    User->>Anonymizer: preview(config, data, num_records)
    Anonymizer->>ReplacementWorkflow: run(df, replace_method, ...)
    ReplacementWorkflow->>NddAdapter: run_workflow (LLM replace only)
    NddAdapter->>DD: generate replacement map
    DD-->>NddAdapter: replacement map
    NddAdapter-->>ReplacementWorkflow: dataframe
    ReplacementWorkflow-->>Anonymizer: ReplacementResult (no judge columns)
    Anonymizer-->>User: PreviewResult (trace_dataframe)

    User->>Anonymizer: "evaluate_replace(config, result=preview)"
    Anonymizer->>Anonymizer: _unrename_output_columns(trace_dataframe)
    Anonymizer->>ReplacementWorkflow: evaluate(internal_df, replace_method, ...)
    ReplacementWorkflow->>ReplacementWorkflow: _run_merged_judges()
    Note over ReplacementWorkflow: prepare() all active judges
    ReplacementWorkflow->>NddAdapter: "run_workflow(columns=[detection, type_fidelity, relational, attribute])"
    NddAdapter->>DD: single workflow, 4 parallel columns
    DD-->>NddAdapter: 4 judge raw outputs
    NddAdapter-->>ReplacementWorkflow: dataframe with raw judge columns
    Note over ReplacementWorkflow: postprocess() each judge → valid/invalid columns
    ReplacementWorkflow-->>Anonymizer: ReplacementResult (with judge columns)
    Anonymizer->>Anonymizer: _rename_output_columns + _build_user_dataframe
    Anonymizer-->>User: AnonymizerResult (with verdict columns)

Reviews (2): Last reviewed commit: "nit"

Comment thread src/anonymizer/config/models.py
Comment thread src/anonymizer/interface/display.py
Comment thread src/anonymizer/config/default_model_configs/replace.yaml Outdated
memadi-nv added 10 commits May 20, 2026 11:49
Signed-off-by: memadi <memadi@nvidia.com>
Signed-off-by: memadi <memadi@nvidia.com>
Signed-off-by: memadi <memadi@nvidia.com>
Signed-off-by: memadi <memadi@nvidia.com>
Signed-off-by: memadi <memadi@nvidia.com>
Signed-off-by: memadi <memadi@nvidia.com>
Signed-off-by: memadi <memadi@nvidia.com>
Signed-off-by: memadi <memadi@nvidia.com>
Signed-off-by: memadi <memadi@nvidia.com>
Signed-off-by: memadi <memadi@nvidia.com>
@memadi-nv memadi-nv changed the title from "feature (DRAFT): Add LLM-as-a-judge for REPLACE evaluation" to "feature: Add LLM-as-a-judge for Anonymizer replace evaluation" May 23, 2026
@memadi-nv memadi-nv marked this pull request as ready for review May 23, 2026 01:19
@memadi-nv memadi-nv changed the title from "feature: Add LLM-as-a-judge for Anonymizer replace evaluation" to "feat: Add LLM-as-a-judge for Anonymizer replace evaluation" May 23, 2026

Development

Successfully merging this pull request may close these issues.

feat: LLM-as-a-judge for REPLACE evaluation
