
Fix Foundry red team double-evaluation, lost results, and clean up output #45328

Merged
slister1001 merged 6 commits into Azure:main from slister1001:fix/redteam-foundry-double-eval
Feb 24, 2026

Conversation

@slister1001
Member

Description

Fixes multiple issues in the Foundry (endpoint-based) red team execution path that caused:

  1. Double evaluation - RAIServiceScorer evaluated each response during attack execution, then evaluation_processor.evaluate() re-evaluated the same results post-execution, roughly doubling latency (~161s per scan vs. ~52s once the redundant pass is removed)
  2. Lost results - _handle_baseline_with_foundry_results() overwrote baseline entries after evaluation results were already populated, causing 0 attack details in the final output
  3. Misleading logs - 'No evaluation results available' debug messages fired on every Foundry path run (by design, scorer results are in JSONL, not in-memory eval_result)
  4. Incorrect per_testing_criteria - Attack strategies (e.g., 'baseline') appeared as testing criteria alongside risk categories
  5. Metadata leak - attack_success, attack_strategy, and score from JSONL leaked into results.json metadata
  6. Missing error context - Error logs lacked run_id/display_name and stack traces (from #45248, "Improve red team failure telemetry logging")
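The double-evaluation and overwrite bugs (items 1-2) reduce to a control-flow problem; the sketch below illustrates the before/after flow using hypothetical stand-in functions (none of these names are the SDK's actual internals):

```python
# Hypothetical stand-ins for the Foundry execution path; for illustration only.
def run_attacks_with_scorer(prompts):
    """Foundry path: RAIServiceScorer scores each response inline, during the attack."""
    return [{"prompt": p, "attack_success": False, "score": 0, "eval_passes": 1} for p in prompts]


def post_execution_evaluate(results):
    """The redundant second evaluation pass that this PR removes."""
    return [dict(r, eval_passes=r["eval_passes"] + 1) for r in results]


def scan_before_fix(prompts):
    # Every response is scored twice: once inline, once in post-processing.
    return post_execution_evaluate(run_attacks_with_scorer(prompts))


def scan_after_fix(prompts):
    # The inline scorer results are used directly; no second pass.
    return run_attacks_with_scorer(prompts)
```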

Changes

_red_team.py

  • Remove redundant evaluation_processor.evaluate() call in Foundry path (~40 lines removed)
  • Remove _handle_baseline_with_foundry_results() call - baseline is already present in foundry_results from the execution manager
  • Add error logging with run_id and exc_info=True in scan exception handler

_result_processor.py

  • Add fallback in to_red_team_result() to read scorer results (attack_success, score) from JSONL when eval_result is None
  • Replace misleading 'No evaluation results' log with accurate message for Foundry scorer path
  • Remove strategy_criteria block from _compute_per_testing_criteria() - only risk categories appear as testing criteria
  • Add attack_success, attack_strategy, score to metadata exclusion set in _build_output_item()
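A rough sketch of what the JSONL fallback and metadata filtering could look like (the helper names, field set, and exclusion logic here are assumptions for illustration, not the SDK's actual code):

```python
import json
from typing import Any, Dict, List, Optional

# Internal scorer fields that must not leak into results.json metadata.
METADATA_EXCLUDED_KEYS = {"attack_success", "attack_strategy", "score"}


def read_jsonl_rows(path: str) -> List[Dict[str, Any]]:
    """Read one JSON object per non-empty line of a JSONL results file."""
    with open(path, "r", encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def scorer_results(eval_result: Optional[Dict[str, Any]], jsonl_path: str) -> List[Dict[str, Any]]:
    """Use the in-memory eval_result when present; otherwise (Foundry scorer
    path) fall back to the attack_success/score fields already in the JSONL."""
    if eval_result is not None:
        return eval_result["rows"]
    return [
        {"attack_success": row.get("attack_success"), "score": row.get("score")}
        for row in read_jsonl_rows(jsonl_path)
    ]


def build_output_metadata(row: Dict[str, Any]) -> Dict[str, Any]:
    """Copy a result row's metadata, excluding internal scorer fields."""
    return {k: v for k, v in row.items() if k not in METADATA_EXCLUDED_KEYS}
```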

_mlflow_integration.py

  • Add run_id and display_name to error log messages in log_redteam_results_to_mlflow() and update_run_status()
  • Add exc_info=True for full stack traces on errors
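In logging terms, the change amounts to putting the run identifiers into the error message and passing exc_info=True so the handler records the full traceback. A minimal sketch (the function body is illustrative, not the SDK's):

```python
import logging

logger = logging.getLogger(__name__)


def log_results_with_context(run_id: str, display_name: str) -> None:
    """Illustrative error handler: include run identifiers and the stack trace."""
    try:
        raise RuntimeError("MLflow upload failed")  # stand-in for the real call
    except Exception:
        # exc_info=True attaches the active exception's traceback to the record;
        # run_id/display_name make the failing run easy to locate afterwards.
        logger.error(
            "Failed to log red team results to MLflow (run_id=%s, display_name=%s)",
            run_id,
            display_name,
            exc_info=True,
        )
```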

Testing

  • 367 unit tests pass (0 failures, 15 skipped)
  • Enterprise e2e verified against disabled-local-auth and user-assigned-identity Foundry projects:
    • 2 attack details returned (was 0 before fix)
    • 0% ASR, correct per_testing_criteria (only risk categories)
    • No metadata leak in results.json
    • Scan time: 52s (was 161s with double eval)

Subsumes logging changes from #45248.

SDK Contribution checklist:

  • The pull request does not introduce breaking changes
  • CHANGELOG is updated for new features, bug fixes or other significant changes. (no CHANGELOG update - this is a bug fix in experimental API)
  • I have read the contribution guidelines.

slister1001 and others added 2 commits February 24, 2026 10:50
The Foundry execution path had two bugs:

1. _handle_baseline_with_foundry_results() overwrote red_team_info entries
   that evaluation_processor.evaluate() had just populated, wiping out
   evaluation_result_file and data_file. This caused 'Data file not found'
   warnings and empty results (0 attack details, default scorecard).

2. Each response was evaluated twice - once by RAIServiceScorer during
   attack execution, then again by evaluation_processor.evaluate() in
   post-processing. This doubled latency and API costs.

Fix:
- Remove redundant evaluation_processor.evaluate() call in Foundry path
  (scorer already evaluated during attack execution)
- Remove _handle_baseline_with_foundry_results call (baseline is already
  in foundry_results from _group_results_by_strategy)
- Add fallback in _result_processor.py to read attack_success and score
  from JSONL data when eval_result is None (uses scorer results)

Before: 0 attack details, default scorecard, double eval calls
After: 2 attack details, correct scorecard, single eval pass

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fix misleading logs, remove baseline from criteria, clean metadata, improve error logging

1. Change misleading 'No evaluation results available' debug log to accurate
   message for Foundry scorer path
2. Remove attack strategies (e.g. baseline) from per_testing_criteria_results;
   only risk categories should appear as testing criteria
3. Exclude attack_success, attack_strategy, and score from results.json
   metadata output
4. Add run_id and display_name to error logs in mlflow_integration and
   red_team scan exception handler (from PR Azure#45248), with exc_info=True

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings February 24, 2026 16:02
@slister1001 requested a review from a team as a code owner February 24, 2026 16:02
@slister1001 added the Evaluation label (Issues related to the client library for Azure AI Evaluation) Feb 24, 2026
Contributor

Copilot AI left a comment


Pull request overview

This PR fixes critical bugs in the Foundry (endpoint-based) red team execution path that caused double evaluation, lost results, and misleading logs. The changes eliminate redundant evaluation passes, prevent baseline data overwrites, improve error context in logs, and clean up metadata leakage to the results output.

Changes:

  • Removed redundant post-execution evaluation in Foundry path since RAIServiceScorer evaluates during attack execution (~40 lines, ~109s latency savings)
  • Removed duplicate baseline handling that overwrote already-populated results from the execution manager
  • Enhanced error logging with run identifiers and stack traces for better debugging
  • Fixed metadata filtering to exclude internal scorer fields (attack_success, attack_strategy, score) from results.json
  • Corrected per_testing_criteria to only include risk categories, not attack strategies

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

  • _red_team.py: Removes redundant evaluation call and baseline overwrite handler; adds run context to error logs
  • _result_processor.py: Adds fallback to read scorer results from JSONL when eval_result is None; fixes per_testing_criteria to exclude strategies; excludes scorer metadata from output
  • _mlflow_integration.py: Enhances error logging with run_id, display_name, and exc_info for better failure telemetry

slister1001 and others added 4 commits February 24, 2026 13:03
Remove duplicate inline import of get_harm_severity_level (already imported
at module level). Apply black 24.4.0 formatting to all three changed files.
Note: these files were already non-compliant with black on main.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Member

@nagkumar91 left a comment


LGTM, approving.

@slister1001 merged commit e9ef504 into Azure:main Feb 24, 2026
21 checks passed
slister1001 added a commit that referenced this pull request Feb 24, 2026
…tput (#45328)

slister1001 added a commit to slister1001/azure-sdk-for-python that referenced this pull request Feb 25, 2026
- Add 1.15.3 (Unreleased) with red team double-eval fixes from PR Azure#45328
- Add 1.15.2 (2026-02-23) per-line errors fix
- Add 1.15.1 (2026-02-19) agent scenario, tokens, run status fixes
- Bump _version.py from 1.15.1 to 1.15.3

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
