
Improve evaluation sample test recordings and LLM validation#45506

Merged
howieleung merged 1 commit into main from aprilk/eval-sample-recordings on Mar 5, 2026
Conversation

@aprilk-ms (Member) commented Mar 4, 2026

Changes (4 files)

1. Replace unsafe inline data in sample (sample_evaluations_builtin_with_inline_data.py)

  • Replaced violent/provocative test data with benign content (health tips, writing skills)
  • The original data triggered LLM content safety filters during validation, causing persistent test failures

2. Improve LLM validation prompt (test_samples_evaluations.py)

  • Rewrote evaluations_instructions prompt to reduce false positives:
    • Clarify that JSON metric counters ("failed": 0, "error": null) are normal output
    • Explicitly state absence of crash means success
    • List specific failure indicators (Python tracebacks, FAILED_EXECUTION)
  • Added _annotate_eval_metric_fields() preprocessor function scoped to evaluation tests only
  • Added 2 missing samples to the skip list per review feedback on PR #45412 (Howie/recording 7):
    • sample_evaluations_builtin_with_dataset_id.py (requires Blob Storage prerequisite)
    • sample_continuous_evaluation_rule.py (requires manual RBAC assignment)

3. Add generic validation text preprocessing support (sample_executor.py)

  • Added validation_text_preprocessor callback parameter (domain-agnostic) to allow test-specific text transformations before LLM validation
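A minimal sketch of how such a domain-agnostic hook might be wired into an executor, assuming the `Optional[Callable[[str], str]]` signature quoted in the review below; the class and method names here are illustrative, not the actual `sample_executor.py` code:

```python
from typing import Callable, Optional


class SampleExecutor:
    """Illustrative executor showing the preprocessor hook only."""

    def __init__(
        self,
        validation_text_preprocessor: Optional[Callable[[str], str]] = None,
    ) -> None:
        self._preprocess = validation_text_preprocessor

    def _build_validation_text(self, captured_output: str) -> str:
        # Apply the test-specific transformation, if any, before the
        # captured sample output is handed to the LLM validator.
        if self._preprocess is not None:
            return self._preprocess(captured_output)
        return captured_output
```

A test module can then pass its own transformation (for example, the metric-field annotator) without the executor knowing anything about evaluations.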

4. Re-recorded all 24 evaluation sample tests (assets.json)

  • All recordings updated with new prompt and safe inline data
  • Tag: python/ai/azure-ai-projects_e4ec8a475a
  • All 24 tests pass in playback (~96s)

Addresses

@aprilk-ms force-pushed the aprilk/eval-sample-recordings branch 3 times, most recently from 7af80fe to 9b7db14, on March 4, 2026 20:27
@aprilk-ms changed the title from "Fix evaluation sample test skip list and LLM validation filtering" to "Improve evaluation sample test recordings and LLM validation" on Mar 4, 2026
@aprilk-ms force-pushed the aprilk/eval-sample-recordings branch from 9b7db14 to e47ea77 on March 4, 2026 20:42
@aprilk-ms marked this pull request as ready for review on March 4, 2026 20:46
Copilot AI review requested due to automatic review settings on March 4, 2026 20:46
@aprilk-ms requested a review from trrwilson as a code owner on March 4, 2026 20:46
Copilot AI (Contributor) left a comment
Pull request overview

This PR improves the reliability of evaluation sample test recordings and reduces false-positive LLM validation failures. It replaces unsafe/provocative test data in a sample with benign content, rewrites the LLM validation instructions to better handle evaluation-specific output patterns, and introduces a validation_text_preprocessor callback parameter to allow domain-specific text transformations before LLM validation.

Changes:

  • Rewrote the evaluations_instructions prompt and added _annotate_eval_metric_fields() to reduce LLM false positives caused by JSON metric counters like "failed": 0 being mistaken for actual errors
  • Added validation_text_preprocessor: Optional[Callable[[str], str]] parameter to BaseSampleExecutor, SyncSampleExecutor, and AsyncSampleExecutor
  • Replaced violent/provocative inline test data with benign health and writing content; added 2 samples to the skip list; updated asset tag for re-recorded tests

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

  • test_samples_evaluations.py: Rewrote the LLM validation prompt, added the _annotate_eval_metric_fields preprocessor, added 2 entries to the skip list, and passed the preprocessor to all 3 test executors
  • sample_executor.py: Added the optional validation_text_preprocessor parameter to all executor classes, applied it in _build_validation_text, and lightly refactored _capture_print
  • sample_evaluations_builtin_with_inline_data.py: Replaced violent inline test data with safe health/writing content
  • assets.json: Updated the recording tag to python/ai/azure-ai-projects_e4ec8a475a for the re-recorded tests


@aprilk-ms force-pushed the aprilk/eval-sample-recordings branch 4 times, most recently from 7f7a807 to cde836e, on March 5, 2026 02:20
- Filter HTTP debug noise from LLM validation text to reduce false positives
- Add metric counter annotations for JSON, pprint, and Python repr formats
- Replace violent inline test data with benign content
- Add missing samples to skip list (dataset_id, continuous_evaluation_rule)
- Re-record all 23 evaluation sample tests with improved preprocessing
- Add allowed_llm_validation_failures for red team test
- Change validation_text_preprocessor to accept list[str] for entry-level filtering

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
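The entry-level filtering mentioned in the commit message, where validation_text_preprocessor was changed to accept a list[str] so HTTP debug noise can be dropped per entry, could look roughly like this. The noise prefixes below are assumptions chosen for illustration, not the actual filter list:

```python
# Prefixes of captured-output entries that are HTTP debug noise rather than
# sample results; these markers are illustrative assumptions.
_HTTP_NOISE_PREFIXES = ("Request URL:", "Request headers:", "Response headers:")


def filter_http_debug_noise(entries: list[str]) -> list[str]:
    """Drop per-entry HTTP debug lines that tend to trigger LLM false positives."""
    return [
        entry
        for entry in entries
        if not entry.lstrip().startswith(_HTTP_NOISE_PREFIXES)
    ]
```

Filtering whole entries rather than substrings keeps each surviving entry intact, so the LLM validator still sees complete, coherent sample output.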
@aprilk-ms force-pushed the aprilk/eval-sample-recordings branch from cde836e to 1a20f62 on March 5, 2026 02:28
@howieleung howieleung merged commit 64c7acb into main Mar 5, 2026
20 checks passed
@howieleung howieleung deleted the aprilk/eval-sample-recordings branch March 5, 2026 03:01


3 participants