feat: Add LLM-as-a-judge for Anonymizer replace evaluation #158
memadi-nv wants to merge 19 commits into
Signed-off-by: memadi <memadi@nvidia.com>
Greptile Summary
This PR adds an opt-in
Confidence Score: 4/5
Safe to merge after fixing the Detection Judge section heading that renders unconditionally in the replace display template even when no evaluation has been run.
All four judge modules are well-isolated, error-tolerant, and tested. The evaluate_replace wiring, _unrename_output_columns, and the merged single-workflow dispatch are correct. One concrete display bug exists: the Detection Judge outer wrapper div and heading are hardcoded in the template string rather than returned by _render_detection_judge_section, so any call to display_record() on a non-evaluated Replace result renders an empty Detection Judge section.
src/anonymizer/interface/display.py: the _REPLACE_TEMPLATE string hardcodes the Detection Judge section container, unlike the other three judge sections which self-manage their outer divs.
Important Files Changed
Sequence Diagram
sequenceDiagram
participant User
participant Anonymizer
participant ReplacementWorkflow
participant NddAdapter
participant DD as DataDesigner
User->>Anonymizer: preview(config, data, num_records)
Anonymizer->>ReplacementWorkflow: run(df, replace_method, ...)
ReplacementWorkflow->>NddAdapter: run_workflow (LLM replace only)
NddAdapter->>DD: generate replacement map
DD-->>NddAdapter: replacement map
NddAdapter-->>ReplacementWorkflow: dataframe
ReplacementWorkflow-->>Anonymizer: ReplacementResult (no judge columns)
Anonymizer-->>User: PreviewResult (trace_dataframe)
User->>Anonymizer: "evaluate_replace(config, result=preview)"
Anonymizer->>Anonymizer: _unrename_output_columns(trace_dataframe)
Anonymizer->>ReplacementWorkflow: evaluate(internal_df, replace_method, ...)
ReplacementWorkflow->>ReplacementWorkflow: _run_merged_judges()
Note over ReplacementWorkflow: prepare() all active judges
ReplacementWorkflow->>NddAdapter: "run_workflow(columns=[detection, type_fidelity, relational, attribute])"
NddAdapter->>DD: single workflow, 4 parallel columns
DD-->>NddAdapter: 4 judge raw outputs
NddAdapter-->>ReplacementWorkflow: dataframe with raw judge columns
Note over ReplacementWorkflow: postprocess() each judge → valid/invalid columns
ReplacementWorkflow-->>Anonymizer: ReplacementResult (with judge columns)
Anonymizer->>Anonymizer: _rename_output_columns + _build_user_dataframe
Anonymizer-->>User: AnonymizerResult (with verdict columns)
Reviews (2): Last reviewed commit: "nit"
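The display issue raised in the review above could be addressed along the following lines. This is a minimal sketch, not the code in src/anonymizer/interface/display.py: the function names, the template string, and the HTML markup are hypothetical stand-ins chosen for illustration. The idea is only that the Detection Judge renderer returns its own wrapper and heading, and returns an empty string when no evaluation has been run.

```python
from html import escape
from typing import Optional


def render_detection_judge_section(verdict: Optional[str]) -> str:
    """Return the complete Detection Judge section, or "" if nothing was evaluated.

    Hypothetical stand-in for _render_detection_judge_section: the wrapper div
    and heading live here, not in the template.
    """
    if verdict is None:
        return ""
    return (
        '<div class="judge-section">'
        "<h3>Detection Judge</h3>"
        f"<p>{escape(verdict)}</p>"
        "</div>"
    )


# Hypothetical template: it holds only a placeholder for the section,
# with no hardcoded container or heading.
REPLACE_TEMPLATE = "<section>{body}{detection_judge}</section>"


def render_record(body: str, verdict: Optional[str] = None) -> str:
    """Render one record; the judge section appears only when a verdict exists."""
    return REPLACE_TEMPLATE.format(
        body=escape(body),
        detection_judge=render_detection_judge_section(verdict),
    )
```

With this shape, a non-evaluated result renders no Detection Judge markup at all, which matches how the review describes the other three judge sections.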
Summary
Adds a new opt-in step, Anonymizer.evaluate_replace(config, result), that scores a preview()/run() output along four metrics, plus the supporting workflows, prompts, schemas, and display rendering. Evaluation is intentionally separate from anonymization so the user can iterate on judges without re-paying the detection + replacement cost.
What the evaluation looks like
For Substitute, four LLM-as-judge metrics run per record, each surfaced as an "LLM alignment score" with a colored verdict badge:
For non-Substitute strategies (Annotate / Redact / Hash), only Detection Validity runs, because the other three require a replacement map.
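The strategy-dependent judge selection described above can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: the Strategy enum and active_judges function are hypothetical, and the judge names are borrowed from the column list in the sequence diagram, so the identifiers in the codebase may differ.

```python
from enum import Enum


class Strategy(Enum):
    """Replace strategies named in the PR description (hypothetical enum)."""

    SUBSTITUTE = "substitute"
    ANNOTATE = "annotate"
    REDACT = "redact"
    HASH = "hash"


# Judge names follow the sequence diagram's column list; they are assumptions.
DETECTION_JUDGE = "detection"
MAP_DEPENDENT_JUDGES = ("type_fidelity", "relational", "attribute")


def active_judges(strategy: Strategy) -> list[str]:
    """Return the judges that can run for a given replace strategy."""
    judges = [DETECTION_JUDGE]
    # Only Substitute produces a replacement map, which the other three need.
    if strategy is Strategy.SUBSTITUTE:
        judges.extend(MAP_DEPENDENT_JUDGES)
    return judges
```

Detection always runs, so the list is never empty; the three map-dependent judges are appended only for Substitute.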
Public API
Runtime efficiency
The evaluation step is the LLM-heavy part of this PR, so a few choices keep wall time as low as possible:
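One such choice, visible in the sequence diagram, is that all active judges are prepared first, dispatched as columns of a single workflow, and post-processed afterwards. The sketch below illustrates that pattern only. The Judge class, the run_merged_judges function, the column naming, and the "pass" convention for raw output are all assumptions made for this example; fake_workflow is a stand-in for the real adapter call and makes no LLM requests.

```python
from dataclasses import dataclass
from typing import Callable

Row = dict[str, str]
# Hypothetical adapter signature: rows plus requested column names in,
# rows with one raw output per requested column out.
RunWorkflow = Callable[[list[Row], list[str]], list[Row]]


@dataclass(frozen=True)
class Judge:
    name: str

    def prepare(self) -> str:
        """Name the raw output column this judge wants generated."""
        return f"{self.name}_raw"

    def postprocess(self, row: Row) -> Row:
        """Map the raw output to a valid/invalid verdict column.

        A missing or unexpected raw value is treated as invalid rather
        than raising (an assumption about error tolerance).
        """
        raw = row.get(f"{self.name}_raw", "")
        verdict = "valid" if raw.strip().lower() == "pass" else "invalid"
        return {**row, f"{self.name}_verdict": verdict}


def run_merged_judges(
    rows: list[Row], judges: list[Judge], run_workflow: RunWorkflow
) -> list[Row]:
    """Run every judge through one workflow call instead of one call per judge."""
    columns = [judge.prepare() for judge in judges]
    judged = run_workflow(rows, columns)  # single dispatch for all judges
    for judge in judges:
        judged = [judge.postprocess(row) for row in judged]
    return judged


# Stand-in workflow for local experimentation: records each call and
# answers "pass" for every requested column.
calls: list[list[str]] = []


def fake_workflow(rows: list[Row], columns: list[str]) -> list[Row]:
    calls.append(columns)
    return [{**row, **{column: "pass" for column in columns}} for row in rows]
```

Calling run_merged_judges with several judges and fake_workflow appends exactly one entry to calls, which is the property the merged design is after: the number of workflow dispatches does not grow with the number of judges.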
Demo notebook
docs/notebooks/evaluate-replace-demo.ipynb walks through the new flow end-to-end on the biographies dataset:
Test plan
Files (21 changed, +4400 / −32)
Type of Change
Testing
make test passes locally
make check passes locally (format + lint + typecheck + lock-check)
Documentation
make docs-build passes locally
Related Issues
Closes #98