feat: Add LLM-as-a-judge for Anonymizer replace evaluation by memadi-nv · Pull Request #158 · NVIDIA-NeMo/Anonymizer

memadi-nv · 2026-05-14T19:34:53Z

Summary

Adds a new opt-in step — Anonymizer.evaluate_replace(config, result) — that scores a preview() / run() output along four metrics, plus the supporting workflows, prompts, schemas, and display rendering. Evaluation is intentionally separate from anonymization so the user can iterate on judges without re-paying the detection + replacement cost.

What the evaluation looks like

For Substitute, four LLM-as-judge metrics run per record, each surfaced as an "LLM alignment score" with a colored verdict badge:

Detection Validity — were the detected entities real PII (no false positives, mislabels, or boundary errors)?
Type Fidelity — does each synthetic preserve the entity class and format?
Relational Consistency — do cross-entity relationships (city↔country, dob↔age, name↔email) stay coherent?
Attribute Fidelity — do salient within-entity attributes (gender of name, age bucket) survive?

For non-Substitute strategies (Annotate / Redact / Hash), only Detection Validity runs — the other three require a replacement map.
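
The split between Substitute and the other strategies can be sketched as a small lookup. This is only an illustration: the `Strategy` enum, the metric identifiers, and the `active_metrics` helper are hypothetical names, not taken from the PR's code.

```python
from enum import Enum


class Strategy(Enum):
    # Hypothetical enum; the PR names the strategies but not their Python types.
    SUBSTITUTE = "substitute"
    ANNOTATE = "annotate"
    REDACT = "redact"
    HASH = "hash"


# Hypothetical identifiers for the four metrics listed above.
ALL_METRICS = (
    "detection_validity",
    "type_fidelity",
    "relational_consistency",
    "attribute_fidelity",
)


def active_metrics(strategy: Strategy) -> tuple[str, ...]:
    """Return the judge metrics that apply to a replacement strategy."""
    if strategy is Strategy.SUBSTITUTE:
        return ALL_METRICS
    # Annotate / Redact / Hash produce no replacement map,
    # so only detection can be judged.
    return ALL_METRICS[:1]
```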

Public API

preview = anonymizer.preview(config=cfg, data=src, num_records=15)   # detection + replacement, no judges
evaluated = anonymizer.evaluate_replace(config=cfg, result=preview)  # opt-in evaluation
evaluated.display_record(0)

result carries both the trace dataframe and resolved_text_column, so the caller never types the text-column name twice. Pickle round-trips of preview / result preserve everything, which supports the "anonymize once, re-evaluate many times" workflow.
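
A minimal sketch of the save-and-reload idea, assuming a stand-in object in place of the real preview result. The `SimpleNamespace` and the list-of-dicts trace are placeholders so the snippet runs without the library; only the attribute names `trace_dataframe` and `resolved_text_column` come from this PR.

```python
import pickle
from types import SimpleNamespace

# Stand-in for the object returned by preview(). The real trace is a dataframe;
# a list of dicts is used here so the sketch has no extra dependencies.
preview = SimpleNamespace(
    trace_dataframe=[{"bio": "Alice lives in Paris."}],
    resolved_text_column="bio",
)

# Serialize once, after the expensive detection + replacement step.
blob = pickle.dumps(preview)

# Reload later (possibly in another session) before evaluating again.
restored = pickle.loads(blob)
```

With the real objects, the reloaded value would be passed as `result=` to `evaluate_replace`, so detection and replacement are not re-run.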

Runtime efficiency

The evaluation step is the LLM-heavy part of this PR, so a few choices keep wall time as low as possible:

One DataDesigner workflow, four columns. All four judges are added as columns of a single adapter.run_workflow(...) call; DD schedules the LLM calls concurrently within that workflow up to the model's max_parallel_requests. No Python threads, no shared-state races.
Per-judge passthrough at flatten time. Each judge's postprocess() overrides trivially-passthrough rows (e.g., 0 entities, < 2 replacements) with default verdicts so degenerate rows don't muddy the score.
No judge work during anonymization. preview() / run() no longer fire any judges — that 4× LLM cost is paid only when the user explicitly asks for evaluation.
Failure isolation preserved. If one judge column's structured-output parse fails for a row, only that metric becomes Unavailable; the other three still report.
Display is robust to inconsistent LLM responses. When the LLM marks a row valid=False but enumerates no specific failures, the badge renders Not Satisfied (red) rather than letting the count math produce a misleading green.
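
The passthrough and failure-isolation rules above might look roughly like this for a single judge. This is a hedged sketch, not the PR's code: the function name, the JSON payload shape, and the choice of `valid=True` as the passthrough default are all assumptions; only the "fewer than 2 replacements" threshold and the "Unavailable" label come from the PR text.

```python
import json
from typing import Optional

UNAVAILABLE = "Unavailable"  # label taken from the PR text


def postprocess_relational(raw: Optional[str], num_replacements: int) -> dict:
    """Turn one row's raw judge output into a verdict dict (hypothetical shape)."""
    # Passthrough: with fewer than 2 replacements there is no relationship to judge.
    if num_replacements < 2:
        return {"valid": True, "passthrough": True}
    try:
        payload = json.loads(raw or "")
        valid = payload["valid"]
        if not isinstance(valid, bool):
            raise TypeError("valid must be a boolean")
        return {"valid": valid, "passthrough": False}
    except (json.JSONDecodeError, KeyError, TypeError):
        # Failure isolation: only this metric degrades for this row;
        # the other judges are post-processed independently.
        return {"valid": UNAVAILABLE, "passthrough": False}
```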

Demo notebook

docs/notebooks/evaluate-replace-demo.ipynb walks through the new flow end-to-end on the biographies dataset:

Two-step workflow — preview() first (no judges, cheap), then evaluate_replace(result=preview) (judges only when asked).
Programmatic access to the four _valid / invalid columns.
Save-and-reload pattern via pickle for iterating on judges across sessions without re-running detection/replacement.
Non-Substitute (Redact) example showing only Detection Validity surfaces.

Test plan

4 new test files (one per judge) — passthrough behavior, malformed-payload tolerance, schema/alias plumbing.
Extended test_replace_runner.py — single-DD-workflow dispatch, detection-only path for non-Substitute, error on missing trace columns.
_verdict_badge unit tests covering all four color paths, including the new valid=False, correct==total precedence rule.
818 tests pass locally; pipeline validated end-to-end via the demo notebook.

Files (21 changed, +4400 / −32)

New: four per-judge workflows (detection_judge.py, type_fidelity_judge.py, relational_consistency_judge.py, attribute_fidelity_judge.py) and their tests; demo notebook (evaluate-replace-demo.ipynb).
Modified: anonymizer.py (new evaluate_replace method), replace_runner.py (judge dispatch via single DD workflow), display.py (four metric sections + verdict badge), constants.py, replace.yaml, config/models.py.

Type of Change

Testing

make test passes locally
make check passes locally (format + lint + typecheck + lock-check)
Added/updated tests for changes

Documentation

If docs changed: make docs-build passes locally

Related Issues

Closes #98

Signed-off-by: memadi <memadi@nvidia.com>

greptile-apps · 2026-05-14T19:38:55Z

Greptile Summary

This PR adds an opt-in evaluate_replace() method to the Anonymizer that runs four LLM-as-judge metrics (Detection Validity, Type Fidelity, Relational Consistency, Attribute Fidelity) on an already-replaced result, keeping evaluation cost separate from the anonymization pipeline. The implementation dispatches all active judges as columns of a single DataDesigner workflow call so DD can parallelize them internally, with each judge owning its own prepare / column_config / postprocess decomposition.

New judge modules: four judge workflow classes (detection_judge.py, type_fidelity_judge.py, relational_consistency_judge.py, attribute_fidelity_judge.py) each with structured Pydantic output schemas and malformed-payload tolerance that surfaces "Unavailable" rather than fabricating a verdict.
evaluate_replace public API: new method on Anonymizer that reverses output-column renames, dispatches the merged judge workflow, and re-renames results — preserving resolved_text_column so callers never need to type it twice.
Display additions: four colored verdict badge sections rendered in display_record(), with _verdict_badge correctly handling the valid=False + zero enumerated failures edge case via explicit boolean precedence over count math.
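
The boolean-over-count precedence described in the last item can be sketched as follows. The function below is an illustrative stand-in for `_verdict_badge`, whose real signature is not shown on this page; the "Satisfied", "Partially Satisfied", "gray", and "amber" values are assumptions, while "Not Satisfied" (red) and "Unavailable" appear in the PR text.

```python
from typing import Optional


def verdict_badge(valid: Optional[bool], correct: int, total: int) -> tuple[str, str]:
    """Return (label, color) for one metric on one record."""
    if valid is None:
        # The judge output could not be parsed for this row.
        return ("Unavailable", "gray")
    if valid is False and correct == total:
        # The judge said "invalid" but listed no failures: trust the boolean,
        # not the count math, so the badge cannot come out green.
        return ("Not Satisfied", "red")
    if correct == total:
        return ("Satisfied", "green")
    if correct == 0:
        return ("Not Satisfied", "red")
    return ("Partially Satisfied", "amber")
```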

Confidence Score: 4/5

Safe to merge after fixing the Detection Judge section heading that renders unconditionally in the replace display template even when no evaluation has been run.

All four judge modules are well-isolated, error-tolerant, and tested. The evaluate_replace wiring, _unrename_output_columns, and the merged single-workflow dispatch are correct. One concrete display bug exists: the Detection Judge outer wrapper div and heading are hardcoded in the template string rather than returned by _render_detection_judge_section, so any call to display_record() on a non-evaluated Replace result renders an empty Detection Judge section.

src/anonymizer/interface/display.py — the _REPLACE_TEMPLATE string hardcodes the Detection Judge section container, unlike the other three judge sections which self-manage their outer divs.
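
One way to avoid the empty section is to have the renderer own its wrapper and return an empty string when there is nothing to show, as the review says the other three sections already do. The sketch below assumes hypothetical names throughout (`render_judge_section`, `TEMPLATE`, the CSS class); it is not the actual display.py code.

```python
from html import escape
from typing import Optional


def render_judge_section(title: str, verdict: Optional[str]) -> str:
    """Return the full section HTML, or an empty string when there is no verdict."""
    if verdict is None:
        # No evaluation was run: emit nothing, not even the heading.
        return ""
    return (
        '<div class="judge-section">'
        f"<h4>{escape(title)}</h4>"
        f"<span>{escape(verdict)}</span>"
        "</div>"
    )


# The template holds only a placeholder, so an empty section leaves no stray heading.
TEMPLATE = '<div class="record">{body}{detection_section}</div>'

html_without_eval = TEMPLATE.format(
    body="<p>replaced text</p>",
    detection_section=render_judge_section("Detection Judge", None),
)
```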

Important Files Changed

Filename	Overview
src/anonymizer/interface/display.py	Adds four judge verdict sections to the Replace display; the Detection Judge section has a template bug where its outer wrapper is always rendered, creating an empty section for non-evaluated results.
src/anonymizer/engine/replace/replace_runner.py	Adds evaluate() method dispatching all 4 judges as columns of a single DataDesigner workflow; run() now cleanly returns without invoking any judges; error handling correctly falls back to Unavailable verdicts on workflow failure.
src/anonymizer/interface/anonymizer.py	Adds evaluate_replace() method that re-runs LLM judges on a prior result's trace_dataframe; correctly reverses column renames before judge dispatch; judge verdict columns added to _build_user_dataframe allowlist.
src/anonymizer/config/models.py	New judge model fields are Optional with model_validator defaulting to replacement_generator, ensuring backward compatibility with existing configs.
src/anonymizer/engine/replace/detection_judge.py	New detection judge with prepare/column_config/postprocess decomposition for merged DD workflow; passthrough correctly handles rows with no detected entities; malformed LLM output returns Unavailable.
src/anonymizer/engine/replace/type_fidelity_judge.py	New type fidelity judge; prompt correctly anchors format/type checks to the original value's shape; passthrough for empty replacement maps.
src/anonymizer/engine/replace/relational_consistency_judge.py	New relational consistency judge; passthrough threshold correctly set at fewer than 2 replacements; raw judge output preserved for display denominator.
src/anonymizer/engine/replace/attribute_fidelity_judge.py	New attribute fidelity judge checking gender-of-name and age-bucket preservation; prompt appropriately restricted to only two attributes to avoid scope creep.
src/anonymizer/engine/constants.py	Adds 12 new column constants for 4 judge metrics following established naming conventions.
tests/engine/test_replace_runner.py	New tests verify single DD workflow dispatch for merged judges, separation of run() from judges, and proper error on missing required columns.
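
The backward-compatible defaulting noted for config/models.py can be illustrated with a standard-library dataclass. This is a stand-in only: the PR uses a Pydantic model_validator, and the `detection_judge` field name here is hypothetical (only `replacement_generator` is named in the review).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReplaceModels:
    # Only `replacement_generator` is named in the review;
    # the judge field name below is a hypothetical example.
    replacement_generator: str
    detection_judge: Optional[str] = None

    def __post_init__(self) -> None:
        # Existing configs omit judge fields, so fall back to the generator model.
        if self.detection_judge is None:
            self.detection_judge = self.replacement_generator
```

A config written before this PR would therefore keep loading and would simply reuse the generator model for judging.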

Sequence Diagram

sequenceDiagram
    participant User
    participant Anonymizer
    participant ReplacementWorkflow
    participant NddAdapter
    participant DD as DataDesigner

    User->>Anonymizer: preview(config, data, num_records)
    Anonymizer->>ReplacementWorkflow: run(df, replace_method, ...)
    ReplacementWorkflow->>NddAdapter: run_workflow (LLM replace only)
    NddAdapter->>DD: generate replacement map
    DD-->>NddAdapter: replacement map
    NddAdapter-->>ReplacementWorkflow: dataframe
    ReplacementWorkflow-->>Anonymizer: ReplacementResult (no judge columns)
    Anonymizer-->>User: PreviewResult (trace_dataframe)

    User->>Anonymizer: "evaluate_replace(config, result=preview)"
    Anonymizer->>Anonymizer: _unrename_output_columns(trace_dataframe)
    Anonymizer->>ReplacementWorkflow: evaluate(internal_df, replace_method, ...)
    ReplacementWorkflow->>ReplacementWorkflow: _run_merged_judges()
    Note over ReplacementWorkflow: prepare() all active judges
    ReplacementWorkflow->>NddAdapter: "run_workflow(columns=[detection, type_fidelity, relational, attribute])"
    NddAdapter->>DD: single workflow, 4 parallel columns
    DD-->>NddAdapter: 4 judge raw outputs
    NddAdapter-->>ReplacementWorkflow: dataframe with raw judge columns
    Note over ReplacementWorkflow: postprocess() each judge → valid/invalid columns
    ReplacementWorkflow-->>Anonymizer: ReplacementResult (with judge columns)
    Anonymizer->>Anonymizer: _rename_output_columns + _build_user_dataframe
    Anonymizer-->>User: AnonymizerResult (with verdict columns)
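
The evaluation half of the diagram (prepare, one merged workflow call, per-judge postprocess) can be sketched in plain Python. Everything here is illustrative: the `Judge` protocol, `EchoJudge`, `run_merged_judges`, and `fake_workflow` are hypothetical stand-ins, and the real DataDesigner adapter call is replaced by an injected function.

```python
from typing import Callable, Protocol


class Judge(Protocol):
    name: str

    def prepare(self, row: dict) -> dict: ...
    def column_config(self) -> dict: ...
    def postprocess(self, row: dict, raw: str) -> dict: ...


class EchoJudge:
    """Toy judge used only to exercise the dispatch loop."""

    def __init__(self, name: str) -> None:
        self.name = name

    def prepare(self, row: dict) -> dict:
        return {f"{self.name}_input": row["text"]}

    def column_config(self) -> dict:
        return {"column": f"{self.name}_raw"}

    def postprocess(self, row: dict, raw: str) -> dict:
        return {f"{self.name}_valid": raw == "ok"}


def run_merged_judges(rows: list, judges: list, run_workflow: Callable) -> list:
    # 1. Each judge adds its own input columns to every row.
    prepared = []
    for row in rows:
        merged = dict(row)
        for judge in judges:
            merged.update(judge.prepare(row))
        prepared.append(merged)
    # 2. A single workflow call carries every judge's column config.
    raw_rows = run_workflow(prepared, [j.column_config() for j in judges])
    # 3. Each judge flattens its own raw column into verdict columns.
    results = []
    for row, raw in zip(prepared, raw_rows):
        out = dict(row)
        for judge in judges:
            out.update(judge.postprocess(row, raw[f"{judge.name}_raw"]))
        results.append(out)
    return results


calls = []


def fake_workflow(rows: list, columns: list) -> list:
    # Stand-in for the adapter: records the call and answers "ok" everywhere.
    calls.append(columns)
    return [{c["column"]: "ok" for c in columns} for _ in rows]


judges = [EchoJudge("detection"), EchoJudge("type_fidelity")]
result = run_merged_judges([{"text": "a"}, {"text": "b"}], judges, fake_workflow)
```

Injecting the workflow function keeps the loop testable without any LLM calls, and it makes the "one call for all judges" property easy to assert.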

Reviews (2): Last reviewed commit: "nit"


memadi-nv added 9 commits May 11, 2026 17:27

add entity detection validation

856d9b7

Signed-off-by: memadi <memadi@nvidia.com>

add type fidelity metric

8a8636e

Signed-off-by: memadi <memadi@nvidia.com>

add relational consistency metric

6c3ae51

Signed-off-by: memadi <memadi@nvidia.com>

add attribute fidelity metric

7946920

Signed-off-by: memadi <memadi@nvidia.com>

update prompts

8776b53

Signed-off-by: memadi <memadi@nvidia.com>

disp;ay replacement map

dfad43b

Signed-off-by: memadi <memadi@nvidia.com>

update metric display

96c297a

Signed-off-by: memadi <memadi@nvidia.com>

more specific prompt

186d311

Signed-off-by: memadi <memadi@nvidia.com>

change judge models for sparce error

63592ad

Signed-off-by: memadi <memadi@nvidia.com>

memadi-nv requested a review from a team as a code owner May 14, 2026 19:34

memadi-nv changed the title ~~feature : Add LLM-as-a-judge for REPLACE evaluation~~ feature (DRAFT): Add LLM-as-a-judge for REPLACE evaluation May 14, 2026

memadi-nv marked this pull request as draft May 14, 2026 19:37

greptile-apps Bot reviewed May 14, 2026


Comment thread src/anonymizer/config/models.py

Comment thread src/anonymizer/interface/display.py

Comment thread src/anonymizer/config/default_model_configs/replace.yaml Outdated

memadi-nv added 10 commits May 20, 2026 11:49

nit-update namings in metric

8997e2c

Signed-off-by: memadi <memadi@nvidia.com>

add a toggle for replace evaluation

f203d96

Signed-off-by: memadi <memadi@nvidia.com>

format-nit

514b1e5

Signed-off-by: memadi <memadi@nvidia.com>

run evaluate judges in parallel

3e38013

Signed-off-by: memadi <memadi@nvidia.com>

seperate evaluate_replace from preview

f6c7d22

Signed-off-by: memadi <memadi@nvidia.com>

merge conflicts

52c2f3a

Signed-off-by: memadi <memadi@nvidia.com>

nit-format

133f069

Signed-off-by: memadi <memadi@nvidia.com>

nit

c04aa0d

Signed-off-by: memadi <memadi@nvidia.com>

address greptile feedback

74735b2

Signed-off-by: memadi <memadi@nvidia.com>

nit

3ad5929

Signed-off-by: memadi <memadi@nvidia.com>

memadi-nv changed the title ~~feature (DRAFT): Add LLM-as-a-judge for REPLACE evaluation~~ feature: Add LLM-as-a-judge for Anonymizer replace evaluation May 23, 2026

memadi-nv marked this pull request as ready for review May 23, 2026 01:19

memadi-nv changed the title ~~feature: Add LLM-as-a-judge for Anonymizer replace evaluation~~ feat: Add LLM-as-a-judge for Anonymizer replace evaluation May 23, 2026

memadi-nv wants to merge 19 commits into main from memadi/feature/evaluate-replace
