Fix Foundry red team double-evaluation, lost results, and clean up output #45328
Merged
slister1001 merged 6 commits into Azure:main on Feb 24, 2026
The Foundry execution path had two bugs:

1. _handle_baseline_with_foundry_results() overwrote red_team_info entries that evaluation_processor.evaluate() had just populated, wiping out evaluation_result_file and data_file. This caused 'Data file not found' warnings and empty results (0 attack details, default scorecard).
2. Each response was evaluated twice - once by RAIServiceScorer during attack execution, then again by evaluation_processor.evaluate() in post-processing. This doubled latency and API costs.

Fix:
- Remove redundant evaluation_processor.evaluate() call in Foundry path (scorer already evaluated during attack execution)
- Remove _handle_baseline_with_foundry_results call (baseline is already in foundry_results from _group_results_by_strategy)
- Add fallback in _result_processor.py to read attack_success and score from JSONL data when eval_result is None (uses scorer results)

Before: 0 attack details, default scorecard, double eval calls
After: 2 attack details, correct scorecard, single eval pass

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
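For reviewers who want a concrete picture of the fallback described in this commit, here is a minimal sketch. It is not the code in _result_processor.py: the function name `resolve_attack_outcome`, the `source` key, and the assumption that `attack_success` and `score` sit at the top level of each JSONL record are all hypothetical and chosen only for illustration.

```python
import json
from typing import Any, Optional


def resolve_attack_outcome(
    eval_row: Optional[dict[str, Any]], jsonl_line: str
) -> dict[str, Any]:
    """Prefer the evaluation result; otherwise use scorer fields from the JSONL record.

    Hypothetical helper: the real fallback may read different keys or structures.
    """
    if eval_row is not None:
        # Evaluation results exist, so they remain the source of truth.
        return {
            "attack_success": eval_row.get("attack_success"),
            "score": eval_row.get("score"),
            "source": "eval_result",
        }
    # No evaluation result: fall back to what the scorer wrote during the attack.
    record = json.loads(jsonl_line)
    return {
        "attack_success": record.get("attack_success"),
        "score": record.get("score"),
        "source": "scorer_jsonl",
    }


# Illustrative JSONL line; the field layout is an assumption.
line = json.dumps({"conversation": {"messages": []}, "attack_success": True, "score": 5})
outcome = resolve_attack_outcome(None, line)
```

With no evaluation result, `outcome` carries the scorer's values, which is the behavior the "single eval pass" path relies on.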
…mprove error logging

1. Change misleading 'No evaluation results available' debug log to accurate message for Foundry scorer path
2. Remove attack strategies (e.g. baseline) from per_testing_criteria_results; only risk categories should appear as testing criteria
3. Exclude attack_success, attack_strategy, and score from results.json metadata output
4. Add run_id and display_name to error logs in mlflow_integration and red_team scan exception handler (from PR Azure#45248), with exc_info=True

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
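Items 2 and 3 of this commit are both filtering steps, and a small sketch may make them easier to review. Only the three excluded field names come from the commit message. The helper names, the `testing_criteria` key, and the sample records are assumptions made for illustration, not the actual structures in _result_processor.py.

```python
from typing import Any

# Field names taken from the commit message; everything else is illustrative.
SCORER_ONLY_FIELDS = {"attack_success", "attack_strategy", "score"}


def public_metadata(record: dict[str, Any]) -> dict[str, Any]:
    """Drop internal scorer fields before writing metadata to the results output."""
    return {k: v for k, v in record.items() if k not in SCORER_ONLY_FIELDS}


def risk_category_criteria(
    criteria: list[dict[str, Any]], risk_categories: set[str]
) -> list[dict[str, Any]]:
    """Keep only entries whose criterion is a risk category, not an attack strategy."""
    return [c for c in criteria if c["testing_criteria"] in risk_categories]


# Hypothetical sample data.
record = {
    "risk_category": "violence",
    "attack_success": True,
    "attack_strategy": "baseline",
    "score": 5,
}
criteria = [
    {"testing_criteria": "violence", "passed": 1, "failed": 1},
    {"testing_criteria": "baseline", "passed": 2, "failed": 0},
]

cleaned = public_metadata(record)
kept = risk_category_criteria(criteria, {"violence"})
```

Here `cleaned` retains only `risk_category`, and `kept` no longer contains the `baseline` entry.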
Pull request overview
This PR fixes critical bugs in the Foundry (endpoint-based) red team execution path that caused double evaluation, lost results, and misleading logs. The changes eliminate redundant evaluation passes, prevent baseline data overwrites, improve error context in logs, and clean up metadata leakage to the results output.
Changes:
- Removed redundant post-execution evaluation in Foundry path since RAIServiceScorer evaluates during attack execution (~40 lines, ~109s latency savings)
- Removed duplicate baseline handling that overwrote already-populated results from the execution manager
- Enhanced error logging with run identifiers and stack traces for better debugging
- Fixed metadata filtering to exclude internal scorer fields (attack_success, attack_strategy, score) from results.json
- Corrected per_testing_criteria to only include risk categories, not attack strategies
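The logging change in the list above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the handler in _red_team.py or _mlflow_integration.py: the logger name, the `run_scan` function, the message format, and the simulated failure are all hypothetical. The only detail taken from the PR is that run_id and display_name are logged together with exc_info=True.

```python
import logging

logger = logging.getLogger("red_team_logging_sketch")  # hypothetical logger name


def run_scan(run_id: str, display_name: str) -> None:
    """Log a failure with run identifiers and the full stack trace."""
    try:
        # Stand-in for the real scan or upload step.
        raise RuntimeError("simulated upload failure")
    except Exception as exc:
        # exc_info=True attaches the active exception's traceback to the log record.
        logger.error(
            "Scan failed (run_id=%s, display_name=%s): %s",
            run_id,
            display_name,
            exc,
            exc_info=True,
        )
```

Calling `run_scan("run-123", "demo")` emits one error record that names the run and includes the traceback, so a failure can be matched to a specific run without reproducing it.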
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| _red_team.py | Removes redundant evaluation call and baseline overwrite handler; adds run context to error logs |
| _result_processor.py | Adds fallback to read scorer results from JSONL when eval_result is None; fixes per_testing_criteria to exclude strategies; excludes scorer metadata from output |
| _mlflow_integration.py | Enhances error logging with run_id, display_name, and exc_info for better failure telemetry |
Remove duplicate inline import of get_harm_severity_level (already imported at module level). Apply black 24.4.0 formatting to all three changed files. Note: these files were already non-compliant with black on main. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
slister1001 added a commit that referenced this pull request on Feb 24, 2026

…tput (#45328)

* Fix Foundry path double-evaluation and lost results in RedTeam scan

  The Foundry execution path had two bugs:

  1. _handle_baseline_with_foundry_results() overwrote red_team_info entries that evaluation_processor.evaluate() had just populated, wiping out evaluation_result_file and data_file. This caused 'Data file not found' warnings and empty results (0 attack details, default scorecard).
  2. Each response was evaluated twice - once by RAIServiceScorer during attack execution, then again by evaluation_processor.evaluate() in post-processing. This doubled latency and API costs.

  Fix:
  - Remove redundant evaluation_processor.evaluate() call in Foundry path (scorer already evaluated during attack execution)
  - Remove _handle_baseline_with_foundry_results call (baseline is already in foundry_results from _group_results_by_strategy)
  - Add fallback in _result_processor.py to read attack_success and score from JSONL data when eval_result is None (uses scorer results)

  Before: 0 attack details, default scorecard, double eval calls
  After: 2 attack details, correct scorecard, single eval pass

  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix misleading logs, remove baseline from criteria, clean metadata, improve error logging

  1. Change misleading 'No evaluation results available' debug log to accurate message for Foundry scorer path
  2. Remove attack strategies (e.g. baseline) from per_testing_criteria_results; only risk categories should appear as testing criteria
  3. Exclude attack_success, attack_strategy, and score from results.json metadata output
  4. Add run_id and display_name to error logs in mlflow_integration and red_team scan exception handler (from PR #45248), with exc_info=True

  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Remove redundant import and apply black formatting

  Remove duplicate inline import of get_harm_severity_level (already imported at module level). Apply black 24.4.0 formatting to all three changed files. Note: these files were already non-compliant with black on main.

  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Merge upstream/main and apply black formatting

  Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
slister1001 added a commit to slister1001/azure-sdk-for-python that referenced this pull request on Feb 25, 2026

- Add 1.15.3 (Unreleased) with red team double-eval fixes from PR Azure#45328
- Add 1.15.2 (2026-02-23) per-line errors fix
- Add 1.15.1 (2026-02-19) agent scenario, tokens, run status fixes
- Bump _version.py from 1.15.1 to 1.15.3

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
slister1001 added a commit that referenced this pull request on Feb 25, 2026

)

- Add 1.15.3 (Unreleased) with red team double-eval fixes from PR #45328
- Add 1.15.2 (2026-02-23) per-line errors fix
- Add 1.15.1 (2026-02-19) agent scenario, tokens, run status fixes
- Bump _version.py from 1.15.1 to 1.15.3

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Description
Fixes multiple issues in the Foundry (endpoint-based) red team execution path that caused double evaluation, lost results, and misleading logs.
Changes
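_red_team.py
- Removes the redundant evaluation call and the baseline overwrite handler; adds run context to error logs.
_result_processor.py
- Adds a fallback to read scorer results from JSONL when eval_result is None; restricts per_testing_criteria to risk categories; excludes scorer metadata from the output.
_mlflow_integration.py
- Adds run_id, display_name, and exc_info to error logs.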
Testing
Subsumes logging changes from #45248.