
Initial commit for adapters for langfuse and HF #11

Closed
benjibc wants to merge 327 commits into main from adapters

Conversation

@benjibc
Contributor

@benjibc benjibc commented Aug 4, 2025

No description provided.

benjibc and others added 30 commits June 13, 2025 21:35
…eval_command

Remove jsonl-reward-eval command
…integration-example

Add standalone Braintrust example
…th-example-and-migrate-math-models

Add tests for math format-length example
Add seed-based reproducible evaluation for FrozenLake for rollouts
Generalize MCP Environment and Policy with Tool Calling
This commit addresses the failing CI test by making three key changes:

1. **Fix MCP server startup**: Use sys.executable instead of "python" command
   to ensure tests use the correct Python interpreter from the virtual environment.

2. **Fix OpenAI format logging in playback mode**: Override __call__ method in
   FireworksPolicy to populate conversation_histories with recorded messages
   during playback mode. This ensures OpenAI format logging works correctly
   for both live and playback modes.

3. **Improve test reliability**: Modified test_openai_format_logging to use
   canonical recordings instead of live API calls, making it faster and more
   reliable while avoiding production dependencies.

The root cause was that playback mode wasn't maintaining conversation histories,
causing the OpenAI format logging assertion to fail since no entries were being
written to the log file.
benjibc and others added 26 commits August 1, 2025 00:02
…sts-for-performance

Trim deepcoder tests
- Add mode='pointwise' parameter to @evaluation_test decorator
- Enable elegant row-by-row evaluation where core logic is separated from test configuration
- Add comprehensive word_count example using pointwise mode with haikus dependency
- Update README.md with clean architecture documentation and Mermaid diagram
- Show parameterized evaluation components in visual diagram
- Include both pointwise and batch mode examples
- Add dataset adapter helper for word_count evaluation
- Deprecate old @reward_function pattern in favor of pytest-based approach

This provides a much more elegant API where users define just the core evaluation logic
and everything else (models, datasets, thresholds, rollout strategies) is parameterized
in the decorator, with full pytest integration for testing and CI/CD.
…debugging

configure stdout for pytest debugging
…cated _execute_function to streamline execution of both async and non-async functions.
…n for pointwise and batch modes, updating tests to use 'rows' instead of 'input_dataset' for consistency.
feat: Add pointwise evaluation mode with pytest integration
* refactor rollout processor to accept entire input dataset

* run single turn rollouts in parallel
* accept evaluation_test kwargs in decorator and delete "evaluate"

* update doc

* update docs
* working pytest

* comment

* fix test

* adding tests

* full dataset
#10)

* Enhance math evaluation scoring by introducing weighted contributions for accuracy (80%) and format compliance (20%) in the test_math_dataset function.

* fix test
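The weighted scoring change above combines two signals into one reward. A minimal sketch, assuming only the weights stated in the commit message (80% accuracy, 20% format compliance); the component scores themselves are placeholders, not the project's `test_math_dataset` implementation:

```python
# Hypothetical helper illustrating the weighted combination from the commit:
# accuracy contributes 80% of the final score, format compliance 20%.
def combined_score(accuracy: float, format_ok: bool) -> float:
    return 0.8 * accuracy + 0.2 * (1.0 if format_ok else 0.0)

# A fully correct answer with a malformed response still earns 0.8,
# while a half-credit answer in the right format earns 0.6.
print(combined_score(1.0, False))  # 0.8
print(combined_score(0.5, True))   # 0.6
```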
@dphuang2 dphuang2 closed this Aug 4, 2025
@dphuang2 dphuang2 deleted the adapters branch August 4, 2025 17:06
@dphuang2 dphuang2 restored the adapters branch August 4, 2025 17:06

8 participants