
Improve evaluation sample test recordings and LLM validation#45506

Merged
howieleung merged 1 commit into main from aprilk/eval-sample-recordings on Mar 5, 2026
Conversation

@aprilk-ms (Member) commented Mar 4, 2026

Changes (4 files)

1. Replace unsafe inline data in sample (sample_evaluations_builtin_with_inline_data.py)

  • Replaced violent/provocative test data with benign content (health tips, writing skills)
  • The original data triggered LLM content safety filters during validation, causing persistent test failures

2. Improve LLM validation prompt (test_samples_evaluations.py)

  • Rewrote evaluations_instructions prompt to reduce false positives:
    • Clarify that JSON metric counters ("failed": 0, "error": null) are normal output
    • Explicitly state absence of crash means success
    • List specific failure indicators (Python tracebacks, FAILED_EXECUTION)
  • Added _annotate_eval_metric_fields() preprocessor function scoped to evaluation tests only
  • Added 2 missing samples to the skip list per review feedback on PR #45412 (Howie/recording 7):
    • sample_evaluations_builtin_with_dataset_id.py (requires Blob Storage prerequisite)
    • sample_continuous_evaluation_rule.py (requires manual RBAC assignment)

3. Add generic validation text preprocessing support (sample_executor.py)

  • Added validation_text_preprocessor callback parameter (domain-agnostic) to allow test-specific text transformations before LLM validation
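A minimal sketch of how such a domain-agnostic hook might be wired into an executor, assuming the `Optional[Callable[[str], str]]` signature quoted in the review below; the class and method names here are illustrative, not the actual `sample_executor.py` code:

```python
from typing import Callable, Optional


class SampleExecutor:
    """Illustrative executor showing the preprocessor hook only."""

    def __init__(
        self,
        validation_text_preprocessor: Optional[Callable[[str], str]] = None,
    ) -> None:
        self._preprocess = validation_text_preprocessor

    def _build_validation_text(self, captured_output: str) -> str:
        # Apply the test-specific transformation, if any, before the
        # captured sample output is handed to the LLM validator.
        if self._preprocess is not None:
            return self._preprocess(captured_output)
        return captured_output
```

A test module can then pass its own transformation (for example, the metric-field annotator) without the executor knowing anything about evaluations.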

4. Re-recorded all 24 evaluation sample tests (assets.json)

  • All recordings updated with new prompt and safe inline data
  • Tag: python/ai/azure-ai-projects_e4ec8a475a
  • All 24 tests pass in playback (~96s)

Addresses

@aprilk-ms force-pushed the aprilk/eval-sample-recordings branch 3 times, most recently from 7af80fe to 9b7db14, on March 4, 2026 20:27
@aprilk-ms changed the title from "Fix evaluation sample test skip list and LLM validation filtering" to "Improve evaluation sample test recordings and LLM validation" on Mar 4, 2026
@aprilk-ms force-pushed the aprilk/eval-sample-recordings branch from 9b7db14 to e47ea77 on March 4, 2026 20:42
@aprilk-ms marked this pull request as ready for review on March 4, 2026 20:46
Copilot AI review requested due to automatic review settings on March 4, 2026 20:46
@aprilk-ms requested a review from trrwilson as a code owner on March 4, 2026 20:46
Copilot AI (Contributor) left a comment
Pull request overview

This PR improves the reliability of evaluation sample test recordings and reduces false-positive LLM validation failures. It replaces unsafe/provocative test data in a sample with benign content, rewrites the LLM validation instructions to better handle evaluation-specific output patterns, and introduces a validation_text_preprocessor callback parameter to allow domain-specific text transformations before LLM validation.

Changes:

  • Rewrote the evaluations_instructions prompt and added _annotate_eval_metric_fields() to reduce LLM false positives caused by JSON metric counters like "failed": 0 being mistaken for actual errors
  • Added validation_text_preprocessor: Optional[Callable[[str], str]] parameter to BaseSampleExecutor, SyncSampleExecutor, and AsyncSampleExecutor
  • Replaced violent/provocative inline test data with benign health and writing content; added 2 samples to the skip list; updated asset tag for re-recorded tests

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

  • test_samples_evaluations.py: Rewrote the LLM validation prompt, added the _annotate_eval_metric_fields preprocessor, added 2 entries to the skip list, and passed the preprocessor to all 3 test executors
  • sample_executor.py: Added the optional validation_text_preprocessor parameter to all executor classes, applied it in _build_validation_text, and lightly refactored _capture_print
  • sample_evaluations_builtin_with_inline_data.py: Replaced violent inline test data with safe health/writing content
  • assets.json: Updated the recording tag to python/ai/azure-ai-projects_e4ec8a475a for the re-recorded tests


@aprilk-ms force-pushed the aprilk/eval-sample-recordings branch 4 times, most recently from 7f7a807 to cde836e, on March 5, 2026 02:20
- Filter HTTP debug noise from LLM validation text to reduce false positives
- Add metric counter annotations for JSON, pprint, and Python repr formats
- Replace violent inline test data with benign content
- Add missing samples to skip list (dataset_id, continuous_evaluation_rule)
- Re-record all 23 evaluation sample tests with improved preprocessing
- Add allowed_llm_validation_failures for red team test
- Change validation_text_preprocessor to accept list[str] for entry-level filtering

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
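The entry-level filtering mentioned in the commit message, where validation_text_preprocessor was changed to accept a list[str] so HTTP debug noise can be dropped per entry, could look roughly like this. The noise prefixes below are assumptions chosen for illustration, not the actual filter list:

```python
# Prefixes of captured-output entries that are HTTP debug noise rather than
# sample results; these markers are illustrative assumptions.
_HTTP_NOISE_PREFIXES = ("Request URL:", "Request headers:", "Response headers:")


def filter_http_debug_noise(entries: list[str]) -> list[str]:
    """Drop per-entry HTTP debug lines that tend to trigger LLM false positives."""
    return [
        entry
        for entry in entries
        if not entry.lstrip().startswith(_HTTP_NOISE_PREFIXES)
    ]
```

Filtering whole entries rather than substrings keeps each surviving entry intact, so the LLM validator still sees complete, coherent sample output.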
@aprilk-ms force-pushed the aprilk/eval-sample-recordings branch from cde836e to 1a20f62 on March 5, 2026 02:28
@howieleung howieleung merged commit 64c7acb into main Mar 5, 2026
20 checks passed
@howieleung howieleung deleted the aprilk/eval-sample-recordings branch March 5, 2026 03:01


3 participants