feat: Add pointwise evaluation mode with pytest integration by benjibc · Pull Request #4 · eval-protocol/python-sdk

benjibc · 2025-08-01T07:33:21Z

Summary

This PR introduces a new pointwise evaluation mode that provides an elegant API for writing LLM evaluation functions with full pytest integration.

Key Changes

🎯 New Pointwise Mode

Added mode='pointwise' parameter to @evaluation_test decorator
Enables row-by-row evaluation where users define only the core evaluation logic
Framework handles all infrastructure concerns (models, datasets, rollouts, aggregation)

📊 Clean Architecture

Separation of Concerns: Evaluation logic separate from test configuration
Full Parameterization: Models, datasets, thresholds all configured in decorator
Pytest Integration: Run as regular pytest tests with full reporting and CI/CD

📚 Documentation Overhaul

Completely rewrote README.md with clean, modern structure
Added Mermaid diagram showing parameterized evaluation components
Included both pointwise and batch mode examples
Aggressive cleanup of verbose old examples

🧪 Working Example

Added comprehensive word_count evaluation example using pointwise mode
Includes haikus dependency and dataset adapter
Demonstrates the elegant new pattern with working test

Before vs After

Old Pattern (Deprecated):

@reward_function(id="word_count", requirements=["haikus==0.3.8"])
def evaluate(messages: List[Message], **kwargs) -> EvaluateResult:
    # Mix of evaluation logic AND infrastructure concerns

New Pattern (Recommended):

@evaluation_test(
    input_dataset=["data/sample.jsonl"],
    model=["gpt-4o-mini"],
    input_params=[{"temperature": 0.0}],
    threshold_of_success=0.8,
    rollout_processor=default_single_turn_rollout_processor,
    mode="pointwise",  # 🎯 Key innovation
)
def test_word_count_evaluate(messages: List[Message], **kwargs) -> EvaluateResult:
    # ONLY core evaluation logic - everything else parameterized!

Testing

✅ All tests pass
✅ Word count example runs successfully with pointwise mode
✅ Backward compatibility maintained for batch mode
✅ Full pytest integration working

Migration Path

Old @reward_function pattern still works but is deprecated
New pytest-based approach provides better testing, CI/CD integration, and flexibility
Migration guide included for existing users

This provides a much more elegant and maintainable approach to LLM evaluation functions.

- Add mode='pointwise' parameter to @evaluation_test decorator - Enable elegant row-by-row evaluation where core logic is separated from test configuration - Add comprehensive word_count example using pointwise mode with haikus dependency - Update README.md with clean architecture documentation and Mermaid diagram - Show parameterized evaluation components in visual diagram - Include both pointwise and batch mode examples - Add dataset adapter helper for word_count evaluation - Deprecate old @reward_function pattern in favor of pytest-based approach This provides a much more elegant API where users define just the core evaluation logic and everything else (models, datasets, thresholds, rollout strategies) is parameterized in the decorator, with full pytest integration for testing and CI/CD.

eval_protocol/pytest/pytest_utils.py

dphuang2

Makes sense

tests/pytest/test_pytest_word_count_example.py

dphuang2 · 2025-08-02T02:50:34Z

im going to hijack this

# Conflicts: # README.md

…cated _execute_function to streamline execution of both async and non-async functions.

…and limitations

…n for pointwise and batch modes, updating tests to use 'rows' instead of 'input_dataset' for consistency.

remove redundant tools