Skip to content

feat: Add pointwise evaluation mode with pytest integration#4

Merged
dphuang2 merged 9 commits intomainfrom
pytest_for_pointwise
Aug 3, 2025
Merged

feat: Add pointwise evaluation mode with pytest integration#4
dphuang2 merged 9 commits intomainfrom
pytest_for_pointwise

Conversation

@benjibc
Copy link
Contributor

@benjibc benjibc commented Aug 1, 2025

Summary

This PR introduces a new pointwise evaluation mode that provides an elegant API for writing LLM evaluation functions with full pytest integration.

Key Changes

🎯 New Pointwise Mode

  • Added mode='pointwise' parameter to @evaluation_test decorator
  • Enables row-by-row evaluation where users define only the core evaluation logic
  • Framework handles all infrastructure concerns (models, datasets, rollouts, aggregation)

📊 Clean Architecture

  • Separation of Concerns: Evaluation logic separate from test configuration
  • Full Parameterization: Models, datasets, thresholds all configured in decorator
  • Pytest Integration: Run as regular pytest tests with full reporting and CI/CD

📚 Documentation Overhaul

  • Completely rewrote README.md with clean, modern structure
  • Added Mermaid diagram showing parameterized evaluation components
  • Included both pointwise and batch mode examples
  • Aggressive cleanup of verbose old examples

🧪 Working Example

  • Added comprehensive word_count evaluation example using pointwise mode
  • Includes haikus dependency and dataset adapter
  • Demonstrates the elegant new pattern with working test

Before vs After

Old Pattern (Deprecated):

@reward_function(id="word_count", requirements=["haikus==0.3.8"])
def evaluate(messages: List[Message], **kwargs) -> EvaluateResult:
    # Mix of evaluation logic AND infrastructure concerns

New Pattern (Recommended):

@evaluation_test(
    input_dataset=["data/sample.jsonl"],
    model=["gpt-4o-mini"],
    input_params=[{"temperature": 0.0}],
    threshold_of_success=0.8,
    rollout_processor=default_single_turn_rollout_processor,
    mode="pointwise",  # 🎯 Key innovation
)
def test_word_count_evaluate(messages: List[Message], **kwargs) -> EvaluateResult:
    # ONLY core evaluation logic - everything else parameterized!

Testing

  • ✅ All tests pass
  • ✅ Word count example runs successfully with pointwise mode
  • ✅ Backward compatibility maintained for batch mode
  • ✅ Full pytest integration working

Migration Path

  • Old @reward_function pattern still works but is deprecated
  • New pytest-based approach provides better testing, CI/CD integration, and flexibility
  • Migration guide included for existing users

This provides a much more elegant and maintainable approach to LLM evaluation functions.

- Add mode='pointwise' parameter to @evaluation_test decorator
- Enable elegant row-by-row evaluation where core logic is separated from test configuration
- Add comprehensive word_count example using pointwise mode with haikus dependency
- Update README.md with clean architecture documentation and Mermaid diagram
- Show parameterized evaluation components in visual diagram
- Include both pointwise and batch mode examples
- Add dataset adapter helper for word_count evaluation
- Deprecate old @reward_function pattern in favor of pytest-based approach

This provides a much more elegant API where users define just the core evaluation logic
and everything else (models, datasets, thresholds, rollout strategies) is parameterized
in the decorator, with full pytest integration for testing and CI/CD.
Copy link
Collaborator

@dphuang2 dphuang2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense

@dphuang2
Copy link
Collaborator

dphuang2 commented Aug 2, 2025

im going to hijack this

Dylan Huang added 8 commits August 3, 2025 10:05
…cated _execute_function to streamline execution of both async and non-async functions.
…n for pointwise and batch modes, updating tests to use 'rows' instead of 'input_dataset' for consistency.
@dphuang2 dphuang2 merged commit 6029271 into main Aug 3, 2025
7 checks passed
@dphuang2 dphuang2 deleted the pytest_for_pointwise branch August 3, 2025 20:21
dphuang2 pushed a commit that referenced this pull request Aug 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants