cleanup coding example
…eval_command Remove jsonl-reward-eval command
…integration-example Add standalone Braintrust example
…th-example-and-migrate-math-models Add tests for math format-length example
Add seed-based reproducible evaluation for FrozenLake rollouts
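A minimal sketch of what seed-based reproducibility can look like, assuming each rollout gets its own seed derived from a base seed. The names `rollout_seeds`, `sample_actions`, and `ACTIONS` are hypothetical and are not taken from this PR; the real change may seed the environment differently.

```python
import random

# Hypothetical action set for a FrozenLake-style grid world.
ACTIONS = ("left", "down", "right", "up")


def rollout_seeds(base_seed: int, num_rollouts: int) -> list:
    """Derive one deterministic seed per rollout from a single base seed."""
    return [base_seed + i for i in range(num_rollouts)]


def sample_actions(seed: int, num_steps: int) -> list:
    """Sample a sequence of actions that depends only on the seed."""
    # A private Random instance keeps the global random state untouched,
    # so concurrent rollouts cannot disturb each other's sequences.
    rng = random.Random(seed)
    return [rng.choice(ACTIONS) for _ in range(num_steps)]
```

Calling `sample_actions` twice with the same seed returns the same sequence, which is the property a reproducible evaluation relies on.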
MCP simulator
Generalize MCP Environment and Policy with Tool Calling
remote rollout
This commit addresses the failing CI test by making three key changes:

1. **Fix MCP server startup**: Use sys.executable instead of the "python" command to ensure tests use the correct Python interpreter from the virtual environment.
2. **Fix OpenAI format logging in playback mode**: Override the __call__ method in FireworksPolicy to populate conversation_histories with recorded messages during playback mode. This ensures OpenAI format logging works correctly for both live and playback modes.
3. **Improve test reliability**: Modify test_openai_format_logging to use canonical recordings instead of live API calls, making it faster and more reliable while avoiding production dependencies.

The root cause was that playback mode wasn't maintaining conversation histories, so no entries were written to the log file and the OpenAI format logging assertion failed.
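The interpreter fix in the first item can be sketched as follows. This is an illustration under assumptions, not the PR's actual code: `build_server_command`, its `--port` flag, and `run_with_current_interpreter` are hypothetical names; only `sys.executable` and `subprocess.run` are real standard-library APIs.

```python
import subprocess
import sys


def build_server_command(script_path: str, port: int) -> list:
    """Build a launch command that reuses the interpreter running this process."""
    # sys.executable is the absolute path of the active interpreter (for
    # example the virtualenv's python). A bare "python" is resolved through
    # PATH and may point at a different installation without the test deps.
    return [sys.executable, script_path, "--port", str(port)]


def run_with_current_interpreter(code: str) -> str:
    """Run a snippet in a child process using the same interpreter."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the child exits non-zero
    )
    return result.stdout.strip()
```

The same pattern applies to any test that spawns a helper process and needs it to see the packages installed in the test environment.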
…sts-for-performance Trim deepcoder tests
- Add mode='pointwise' parameter to @evaluation_test decorator
- Enable elegant row-by-row evaluation where core logic is separated from test configuration
- Add comprehensive word_count example using pointwise mode with haikus dependency
- Update README.md with clean architecture documentation and Mermaid diagram
- Show parameterized evaluation components in visual diagram
- Include both pointwise and batch mode examples
- Add dataset adapter helper for word_count evaluation
- Deprecate old @reward_function pattern in favor of pytest-based approach

This provides a much more elegant API where users define just the core evaluation logic and everything else (models, datasets, thresholds, rollout strategies) is parameterized in the decorator, with full pytest integration for testing and CI/CD.
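The difference between the two modes can be illustrated with a small stand-alone sketch. It deliberately does not use the `@evaluation_test` decorator, whose exact signature is not shown here; `word_count_score`, `evaluate_pointwise`, `evaluate_batch`, `mean_word_count`, and the `response` row key are all hypothetical names chosen for illustration.

```python
from statistics import mean


def word_count_score(row: dict) -> float:
    """Hypothetical per-row metric: the number of words in the response."""
    return float(len(row["response"].split()))


def evaluate_pointwise(rows: list, evaluate_row) -> list:
    """Pointwise mode: the evaluator sees one row at a time."""
    return [evaluate_row(row) for row in rows]


def evaluate_batch(rows: list, evaluate_rows):
    """Batch mode: the evaluator receives the whole list of rows at once."""
    return evaluate_rows(rows)


def mean_word_count(rows: list) -> float:
    """Hypothetical batch metric that needs every row to compute its result."""
    return mean(word_count_score(row) for row in rows)
```

In this sketch the pointwise evaluator only contains per-row logic, while iteration lives in the harness; a batch evaluator is needed only when the score depends on several rows together.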
Quickstart Example
…debugging configure stdout for pytest debugging
# Conflicts: # README.md
…cated _execute_function to streamline execution of both async and non-async functions.
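One common way to run async and non-async callables through a single helper is sketched below. It is an assumption about the general technique, not the body of the repository's `_execute_function`; the name `execute_function` is hypothetical.

```python
import asyncio
import inspect


async def execute_function(func, *args, **kwargs):
    """Call func and return its result, awaiting it when it is a coroutine function."""
    if inspect.iscoroutinefunction(func):
        # Functions defined with "async def" return a coroutine that must be awaited.
        return await func(*args, **kwargs)
    # Plain functions are called directly and their result returned as-is.
    return func(*args, **kwargs)
```

Callers then write `await execute_function(fn, ...)` without checking which kind of function they were given.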
…n for pointwise and batch modes, updating tests to use 'rows' instead of 'input_dataset' for consistency.
feat: Add pointwise evaluation mode with pytest integration
* refactor rollout processor to accept entire input dataset * run single turn rollouts in parallel
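Running single-turn rollouts in parallel over a whole dataset could look roughly like this. The sketch assumes an async per-row rollout function and a concurrency cap; `run_rollouts`, `rollout_fn`, and `max_concurrency` are hypothetical names and do not describe the PR's rollout processor.

```python
import asyncio


async def run_rollouts(rows, rollout_fn, max_concurrency: int = 8):
    """Run rollout_fn over every row concurrently, preserving input order."""
    # The semaphore caps how many rollouts are in flight at the same time.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(row):
        async with semaphore:
            return await rollout_fn(row)

    # asyncio.gather returns results in the order the awaitables were passed,
    # regardless of which rollout finishes first.
    return await asyncio.gather(*(guarded(row) for row in rows))
```

Passing the entire dataset to the processor, as the commit describes, is what makes this kind of scheduling possible in one place.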
* accept evaluation_test kwargs in decorator and delete "evaluate" * update doc * update docs
* working pytest * comment * fix test * adding tests * full dataset
#10) * Enhance math evaluation scoring by introducing weighted contributions for accuracy (80%) and format compliance (20%) in the test_math_dataset function. * fix test
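The 80/20 weighting described in this commit can be sketched as a simple weighted sum. The function name `combined_score` and its parameters are hypothetical; only the two weights come from the commit message.

```python
def combined_score(
    accuracy: float,
    format_compliance: float,
    accuracy_weight: float = 0.8,
    format_weight: float = 0.2,
) -> float:
    """Blend accuracy and format compliance into a single score."""
    # With both inputs in [0, 1] and weights summing to 1, the result
    # also stays in [0, 1].
    return accuracy_weight * accuracy + format_weight * format_compliance
```

Under this weighting a correct answer in the wrong format scores 0.8, while a well-formatted but incorrect answer scores 0.2.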
No description provided.