
Fix Foundry red team double-evaluation, lost results, and clean up output #45328

Merged
slister1001 merged 6 commits into Azure:main from slister1001:fix/redteam-foundry-double-eval
Feb 24, 2026

Conversation

@slister1001
Member

Description

Fixes multiple issues in the Foundry (endpoint-based) red team execution path that caused:

  1. Double evaluation - RAIServiceScorer evaluated each response during attack execution, then evaluation_processor.evaluate() re-evaluated the same results post-execution, roughly doubling latency (~161s per scan vs. ~52s once the redundant pass is removed)
  2. Lost results - _handle_baseline_with_foundry_results() overwrote baseline entries after evaluation results were already populated, causing 0 attack details in the final output
  3. Misleading logs - 'No evaluation results available' debug messages fired on every Foundry path run (by design, scorer results are in JSONL, not in-memory eval_result)
  4. Incorrect per_testing_criteria - Attack strategies (e.g., 'baseline') appeared as testing criteria alongside risk categories
  5. Metadata leak - attack_success, attack_strategy, and score from JSONL leaked into results.json metadata
  6. Missing error context - Error logs lacked run_id/display_name and stack traces (from #45248, "Improve red team failure telemetry logging")
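The double-evaluation and overwrite bugs (items 1-2) reduce to a control-flow problem; the sketch below illustrates the before/after flow using hypothetical stand-in functions (none of these names are the SDK's actual internals):

```python
# Hypothetical stand-ins for the Foundry execution path; for illustration only.
def run_attacks_with_scorer(prompts):
    """Foundry path: RAIServiceScorer scores each response inline, during the attack."""
    return [{"prompt": p, "attack_success": False, "score": 0, "eval_passes": 1} for p in prompts]


def post_execution_evaluate(results):
    """The redundant second evaluation pass that this PR removes."""
    return [dict(r, eval_passes=r["eval_passes"] + 1) for r in results]


def scan_before_fix(prompts):
    # Every response is scored twice: once inline, once in post-processing.
    return post_execution_evaluate(run_attacks_with_scorer(prompts))


def scan_after_fix(prompts):
    # The inline scorer results are used directly; no second pass.
    return run_attacks_with_scorer(prompts)
```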

Changes

_red_team.py

  • Remove redundant evaluation_processor.evaluate() call in Foundry path (~40 lines removed)
  • Remove _handle_baseline_with_foundry_results() call - baseline is already present in foundry_results from the execution manager
  • Add error logging with run_id and exc_info=True in scan exception handler

_result_processor.py

  • Add fallback in to_red_team_result() to read scorer results (attack_success, score) from JSONL when eval_result is None
  • Replace misleading 'No evaluation results' log with accurate message for Foundry scorer path
  • Remove strategy_criteria block from _compute_per_testing_criteria() - only risk categories appear as testing criteria
  • Add attack_success, attack_strategy, score to metadata exclusion set in _build_output_item()
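A rough sketch of what the JSONL fallback and metadata filtering could look like (the helper names, field set, and exclusion logic here are assumptions for illustration, not the SDK's actual code):

```python
import json
from typing import Any, Dict, List, Optional

# Internal scorer fields that must not leak into results.json metadata.
METADATA_EXCLUDED_KEYS = {"attack_success", "attack_strategy", "score"}


def read_jsonl_rows(path: str) -> List[Dict[str, Any]]:
    """Read one JSON object per non-empty line of a JSONL results file."""
    with open(path, "r", encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def scorer_results(eval_result: Optional[Dict[str, Any]], jsonl_path: str) -> List[Dict[str, Any]]:
    """Use the in-memory eval_result when present; otherwise (Foundry scorer
    path) fall back to the attack_success/score fields already in the JSONL."""
    if eval_result is not None:
        return eval_result["rows"]
    return [
        {"attack_success": row.get("attack_success"), "score": row.get("score")}
        for row in read_jsonl_rows(jsonl_path)
    ]


def build_output_metadata(row: Dict[str, Any]) -> Dict[str, Any]:
    """Copy a result row's metadata, excluding internal scorer fields."""
    return {k: v for k, v in row.items() if k not in METADATA_EXCLUDED_KEYS}
```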

_mlflow_integration.py

  • Add run_id and display_name to error log messages in log_redteam_results_to_mlflow() and update_run_status()
  • Add exc_info=True for full stack traces on errors
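In logging terms, the change amounts to putting the run identifiers into the error message and passing exc_info=True so the handler records the full traceback. A minimal sketch (the function body is illustrative, not the SDK's):

```python
import logging

logger = logging.getLogger(__name__)


def log_results_with_context(run_id: str, display_name: str) -> None:
    """Illustrative error handler: include run identifiers and the stack trace."""
    try:
        raise RuntimeError("MLflow upload failed")  # stand-in for the real call
    except Exception:
        # exc_info=True attaches the active exception's traceback to the record;
        # run_id/display_name make the failing run easy to locate afterwards.
        logger.error(
            "Failed to log red team results to MLflow (run_id=%s, display_name=%s)",
            run_id,
            display_name,
            exc_info=True,
        )
```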

Testing

  • 367 unit tests pass (0 failures, 15 skipped)
  • Enterprise e2e verified against disabled-local-auth and user-assigned-identity Foundry projects:
    • 2 attack details returned (was 0 before fix)
    • 0% ASR, correct per_testing_criteria (only risk categories)
    • No metadata leak in results.json
    • Scan time: 52s (was 161s with double eval)

Subsumes logging changes from #45248.

SDK Contribution checklist:

  • The pull request does not introduce breaking changes
  • CHANGELOG is updated for new features, bug fixes or other significant changes. (no CHANGELOG update - this is a bug fix in experimental API)
  • I have read the contribution guidelines.

slister1001 and others added 2 commits February 24, 2026 10:50
The Foundry execution path had two bugs:

1. _handle_baseline_with_foundry_results() overwrote red_team_info entries
   that evaluation_processor.evaluate() had just populated, wiping out
   evaluation_result_file and data_file. This caused 'Data file not found'
   warnings and empty results (0 attack details, default scorecard).

2. Each response was evaluated twice - once by RAIServiceScorer during
   attack execution, then again by evaluation_processor.evaluate() in
   post-processing. This doubled latency and API costs.

Fix:
- Remove redundant evaluation_processor.evaluate() call in Foundry path
  (scorer already evaluated during attack execution)
- Remove _handle_baseline_with_foundry_results call (baseline is already
  in foundry_results from _group_results_by_strategy)
- Add fallback in _result_processor.py to read attack_success and score
  from JSONL data when eval_result is None (uses scorer results)

Before: 0 attack details, default scorecard, double eval calls
After: 2 attack details, correct scorecard, single eval pass

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fix misleading logs, remove baseline from criteria, clean metadata, improve error logging

1. Change misleading 'No evaluation results available' debug log to accurate
   message for Foundry scorer path
2. Remove attack strategies (e.g. baseline) from per_testing_criteria_results;
   only risk categories should appear as testing criteria
3. Exclude attack_success, attack_strategy, and score from results.json
   metadata output
4. Add run_id and display_name to error logs in mlflow_integration and
   red_team scan exception handler (from PR Azure#45248), with exc_info=True

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings February 24, 2026 16:02
@slister1001 requested a review from a team as a code owner February 24, 2026 16:02
@slister1001 added the Evaluation label (Issues related to the client library for Azure AI Evaluation) Feb 24, 2026
Contributor

Copilot AI left a comment


Pull request overview

This PR fixes critical bugs in the Foundry (endpoint-based) red team execution path that caused double evaluation, lost results, and misleading logs. The changes eliminate redundant evaluation passes, prevent baseline data overwrites, improve error context in logs, and clean up metadata leakage to the results output.

Changes:

  • Removed redundant post-execution evaluation in Foundry path since RAIServiceScorer evaluates during attack execution (~40 lines, ~109s latency savings)
  • Removed duplicate baseline handling that overwrote already-populated results from the execution manager
  • Enhanced error logging with run identifiers and stack traces for better debugging
  • Fixed metadata filtering to exclude internal scorer fields (attack_success, attack_strategy, score) from results.json
  • Corrected per_testing_criteria to only include risk categories, not attack strategies

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

  • _red_team.py: Removes redundant evaluation call and baseline overwrite handler; adds run context to error logs
  • _result_processor.py: Adds fallback to read scorer results from JSONL when eval_result is None; fixes per_testing_criteria to exclude strategies; excludes scorer metadata from output
  • _mlflow_integration.py: Enhances error logging with run_id, display_name, and exc_info for better failure telemetry

slister1001 and others added 4 commits February 24, 2026 13:03
Remove duplicate inline import of get_harm_severity_level (already imported
at module level). Apply black 24.4.0 formatting to all three changed files.
Note: these files were already non-compliant with black on main.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Member

@nagkumar91 left a comment


LGTM, approving.

@slister1001 merged commit e9ef504 into Azure:main Feb 24, 2026
21 checks passed
slister1001 added a commit that referenced this pull request Feb 24, 2026
…tput (#45328)

slister1001 added a commit to slister1001/azure-sdk-for-python that referenced this pull request Feb 25, 2026
- Add 1.15.3 (Unreleased) with red team double-eval fixes from PR Azure#45328
- Add 1.15.2 (2026-02-23) per-line errors fix
- Add 1.15.1 (2026-02-19) agent scenario, tokens, run status fixes
- Bump _version.py from 1.15.1 to 1.15.3

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
