Initial commit for adapters for langfuse and HF by benjibc · Pull Request #11 · eval-protocol/python-sdk

benjibc · 2025-08-04T04:06:30Z

No description provided.

This commit fixes the failing CI test with three changes:

1. **Fix MCP server startup**: Use sys.executable instead of the "python" command so that tests use the Python interpreter from the virtual environment.
2. **Fix OpenAI format logging in playback mode**: Override the __call__ method in FireworksPolicy to populate conversation_histories with recorded messages during playback. This keeps OpenAI format logging working in both live and playback modes.
3. **Improve test reliability**: Change test_openai_format_logging to use canonical recordings instead of live API calls, which makes it faster and more reliable and avoids production dependencies.

The root cause was that playback mode did not maintain conversation histories, so no entries were written to the log file and the OpenAI format logging assertion failed.
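For reviewers unfamiliar with the first fix, here is a minimal sketch of launching a child process with the parent's interpreter. The helper name `run_with_current_interpreter` is hypothetical and is not taken from this repository; the MCP server launch in the SDK may look different.

```python
import subprocess
import sys


def run_with_current_interpreter(code: str) -> str:
    """Run a Python snippet in a child process using the parent's interpreter.

    Hypothetical helper for illustration only. Using sys.executable avoids
    picking up whatever "python" happens to be first on PATH, which may not
    be the virtual environment's interpreter.
    """
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the child exits non-zero
    )
    return result.stdout.strip()


# The child reports the same major Python version as the parent.
child_major = run_with_current_interpreter("import sys; print(sys.version_info[0])")
```

The same pattern applies when the command starts a server script instead of `-c`: only the first element of the argument list changes from `"python"` to `sys.executable`.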

- Add mode='pointwise' parameter to @evaluation_test decorator
- Enable elegant row-by-row evaluation where core logic is separated from test configuration
- Add comprehensive word_count example using pointwise mode with haikus dependency
- Update README.md with clean architecture documentation and Mermaid diagram
- Show parameterized evaluation components in visual diagram
- Include both pointwise and batch mode examples
- Add dataset adapter helper for word_count evaluation
- Deprecate old @reward_function pattern in favor of pytest-based approach

This provides a much more elegant API where users define just the core evaluation logic and everything else (models, datasets, thresholds, rollout strategies) is parameterized in the decorator, with full pytest integration for testing and CI/CD.
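To make the pointwise idea concrete, the sketch below uses a toy decorator, `pointwise_eval`, which is hypothetical and is not the SDK's `@evaluation_test`. It only illustrates the separation described above: the decorated function scores one row, while the rows and the pass threshold live in the decorator arguments.

```python
from statistics import mean
from typing import Callable, Dict, List

Row = Dict[str, str]


def pointwise_eval(rows: List[Row], threshold: float):
    """Toy stand-in for a pointwise evaluation decorator (not the SDK API)."""

    def decorator(score_row: Callable[[Row], float]) -> Callable[[], float]:
        def run() -> float:
            # Apply the per-row logic to every row, then check the aggregate.
            scores = [score_row(row) for row in rows]
            average = mean(scores)
            assert average >= threshold, (
                f"mean score {average:.2f} is below threshold {threshold}"
            )
            return average

        run.__name__ = score_row.__name__  # keep the name so pytest can collect it
        return run

    return decorator


@pointwise_eval(
    rows=[{"response": "an old silent pond"}, {"response": "a frog jumps in"}],
    threshold=0.5,
)
def test_word_count(row: Row) -> float:
    # Core logic only: full score when the response has at least three words.
    return 1.0 if len(row["response"].split()) >= 3 else 0.0


average_score = test_word_count()
```

Because the wrapped function takes no arguments and asserts on the aggregate, a test runner such as pytest could call it directly. How the real decorator handles models and rollout strategies is not shown in this PR page.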


benjibc and others added 30 commits June 13, 2025 21:35

cleanup coding example

8d7be8a

Merge pull request #86 from fw-ai-external/cleanup_coding_example

0507602

cleanup coding example

Merge pull request #84 from fw-ai-external/codex/remove-jsonl_reward_…

38b38f5

…eval_command Remove jsonl-reward-eval command

Refine Braintrust example

d9cda38

Merge pull request #89 from fw-ai-external/codex/improve-brain-trust-…

574da47

…integration-example Add standalone Braintrust example

Add tests for math format-length example

14f72d5

Merge pull request #90 from fw-ai-external/codex/update-accuracy-leng…

c84b0d7

…th-example-and-migrate-math-models Add tests for math format-length example

batch mode example and generate completions for batch mode

6adef7c

Add seed-based reproducible evaluation for FrozenLake for rollouts

0e131b7

more tests

69e9e5b

fix more tests

b335547

Merge pull request #93 from fw-ai-external/respect_seed_for_frozen_lake

e26ce7d

Add seed-based reproducible evaluation for FrozenLake for rollouts

MCP simulator

6876fb6

Merge pull request #94 from fw-ai-external/mcp_server_example

530c567

MCP simulator

Generalize MCP Environment and Policy with Tool Calling

b8e3655

fix CI

96d38d3

Merge pull request #95 from fw-ai-external:mcp_improvement

5b299f1

Generalize MCP Environment and Policy with Tool Calling

remote rollout

c4c0f41

mcp rollouts stage 2

9233cd4

style 3.10

487055d

Merge pull request #96 from fw-ai-external/remote_rollout

76dbd59

remote rollout

mcp part 2

94d011b

part 3

2a45a19

fix tests

23720c0

fix exit errors

700f0f7

fix for python 3.12

4fe4339

Merge pull request #98 from fw-ai-external/mcp_part_3

3c1a140

mcp part 3

support MCP with replay feature

65dd95b

fix test

224f5fb

benjibc and others added 26 commits August 1, 2025 00:02

Trim deepcoder tests

c35e426

Merge pull request #3 from eval-protocol/codex/trim-down-deepcoder-te…

288e77e

…sts-for-performance Trim deepcoder tests

clean up README and just point to docs

45329c9

update license to MIT

686c754

Quickstart example

aa2420e

Quickstart example comments

afbab76

add type

9dc3276

unused type

e0eaec1

Merge pull request #5 from eval-protocol/derekx/quickstart

a7511ac

Quickstart Example

configure stdout for pytest debugging

9a71cf7

Merge pull request #6 from eval-protocol/configure-stdout-for-python-…

3f3161e

…debugging configure stdout for pytest debugging

Merge branch 'main' into pytest_for_pointwise

1e29054

# Conflicts:
#   README.md

save

82f5d86

Refactor async handling in evaluation functions by introducing a dedi…

2616f61

…cated _execute_function to streamline execution of both async and non-async functions.
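The commit names `_execute_function` but its body is not shown here. One common way to run both sync and async callables through a single path is sketched below; `execute_function_sketch` and the two scoring functions are assumptions for illustration, not the SDK's implementation.

```python
import asyncio
import inspect
from typing import Any, Callable, List


async def execute_function_sketch(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
    """Call func and await the result only when it is awaitable."""
    result = func(*args, **kwargs)
    if inspect.isawaitable(result):
        result = await result
    return result


def sync_score(x: int) -> int:
    return x * 2


async def async_score(x: int) -> int:
    await asyncio.sleep(0)  # yield to the event loop, as a real async call would
    return x * 2


async def main() -> List[int]:
    # Both kinds of function go through the same call site.
    return [
        await execute_function_sketch(sync_score, 2),
        await execute_function_sketch(async_score, 3),
    ]


results = asyncio.run(main())
```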

Add note to test_word_count_evaluate function explaining its purpose …

9dc3a2a

…and limitations

Refactor evaluation_test function to enforce parameter type validatio…

1722e0f

…n for pointwise and batch modes, updating tests to use 'rows' instead of 'input_dataset' for consistency.

move decorator into its own file

54d9b11

fix __init__.py

b95909b

fix

56bf3ab

Merge pull request #4 from eval-protocol/pytest_for_pointwise

6029271

feat: Add pointwise evaluation mode with pytest integration

Refactor rollouts to accept list (#7)

0b4a0a3

* refactor rollout processor to accept entire input dataset
* run single turn rollouts in parallel
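A rough sketch of what running single-turn rollouts in parallel over a whole dataset can look like with `asyncio`. The function names, the row shape, and the `max_concurrency` parameter are assumptions; the SDK's rollout processor is not reproduced here.

```python
import asyncio
from typing import Dict, List

Row = Dict[str, str]


async def single_turn_rollout(row: Row) -> Row:
    # Placeholder for a model call; a real rollout would hit an inference API.
    await asyncio.sleep(0.01)
    return {**row, "completion": row["prompt"].upper()}


async def process_rollouts(rows: List[Row], max_concurrency: int = 4) -> List[Row]:
    """Process the entire dataset concurrently, bounded by a semaphore."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(row: Row) -> Row:
        async with semaphore:
            return await single_turn_rollout(row)

    # gather preserves input order, so results line up with the input rows.
    return list(await asyncio.gather(*(bounded(row) for row in rows)))


rollouts = asyncio.run(
    process_rollouts([{"prompt": "a"}, {"prompt": "b"}, {"prompt": "c"}])
)
```

Passing the full list in, rather than one row at a time, is what lets the processor decide how much to run concurrently.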

accept evaluation_test kwargs in decorator and delete "evaluate" (#8)

7b3defb

* accept evaluation_test kwargs in decorator and delete "evaluate"
* update doc
* update docs

Tau2, Frozen Lake, and Lunar Lander pytest (#9)

368c44b

* working pytest
* comment
* fix test
* adding tests
* full dataset

Enhance math evaluation scoring by introducing weighted contributions… (

e581153

#10)
* Enhance math evaluation scoring by introducing weighted contributions for accuracy (80%) and format compliance (20%) in the test_math_dataset function.
* fix test
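The 80/20 weighting described in this commit amounts to a weighted sum. The sketch below assumes both component scores are in the range 0 to 1; the function and constant names are hypothetical and do not come from test_math_dataset.

```python
ACCURACY_WEIGHT = 0.8  # weight from the commit message: accuracy (80%)
FORMAT_WEIGHT = 0.2    # weight from the commit message: format compliance (20%)


def combined_score(accuracy: float, format_compliance: float) -> float:
    """Weighted sum of accuracy and format compliance (both assumed in [0, 1])."""
    return ACCURACY_WEIGHT * accuracy + FORMAT_WEIGHT * format_compliance


# A correct answer in the wrong format keeps most of its credit.
correct_but_misformatted = combined_score(accuracy=1.0, format_compliance=0.0)
# A well-formatted wrong answer earns only the format share.
wrong_but_formatted = combined_score(accuracy=0.0, format_compliance=1.0)
```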

Initial commit for adapters for langfuse and HF

fabc4a9

dphuang2 closed this Aug 4, 2025

dphuang2 force-pushed the main branch from 1f6b404 to 31efc0f August 4, 2025 17:06

dphuang2 deleted the adapters branch August 4, 2025 17:06

dphuang2 restored the adapters branch August 4, 2025 17:06

benjibc wants to merge 327 commits into main from adapters


8 participants
