Chore report evals as experiments in langfuse #254

Open
valdis wants to merge 3 commits into main from chore-report-evals-as-experiments-in-langfuse

Conversation

@valdis

@valdis valdis commented Jun 17, 2026

Collaborator

This PR makes two unrelated but welcome changes:

Langfuse SDK upgrade (langfuse@3 → @langfuse/client@5.4.1)

The v5 client ships a structured resource API (langfuse.api.datasetRunItems.create, langfuse.dataset.get, langfuse.score.create, etc.) instead of the flat method surface of v3. The migration is mechanical: every call site is updated to the new namespaced path, and the manual pagination loop for datasetItemsList is replaced by langfuse.dataset.get(), which returns items directly.

The key functional addition: datasetRunItems.create now includes runDescription (model · mode · preset) and a richer metadata payload, so each eval run is properly registered as a Langfuse experiment with enough context to distinguish runs in the UI without needing to click into individual traces.
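As a rough illustration of the payload described above, the sketch below assembles the run description and metadata for one dataset run item. The `EvalRunConfig` and `DatasetRunItemRequest` types and the `buildRunItemRequest` helper are hypothetical names introduced here, and the request field names (`runName`, `datasetItemId`, `traceId`) are assumptions about what `datasetRunItems.create` accepts; only `runDescription` and the metadata keys come from this PR.

```typescript
// Hypothetical eval configuration; the field names mirror the metadata keys
// listed in this PR (model, mode, provider, preset, configPath).
interface EvalRunConfig {
  model: string;
  mode: string;
  provider: string;
  preset: string;
  configPath: string;
}

// Assumed request shape for one dataset run item. This is a local type for
// the sketch, not the SDK's published type.
interface DatasetRunItemRequest {
  runName: string;
  runDescription: string;
  metadata: Record<string, string>;
  datasetItemId: string;
  traceId: string;
}

// Builds the payload for a single item. Every item in the same run gets the
// same description and metadata, so repeated upserts stay consistent.
function buildRunItemRequest(
  runName: string,
  config: EvalRunConfig,
  datasetItemId: string,
  traceId: string,
): DatasetRunItemRequest {
  return {
    runName,
    // "model · mode · preset", as described in the PR text.
    runDescription: [config.model, config.mode, config.preset].join(" · "),
    metadata: {
      model: config.model,
      mode: config.mode,
      provider: config.provider,
      preset: config.preset,
      configPath: config.configPath,
    },
    datasetItemId,
    traceId,
  };
}
```

Under these assumptions, the returned object is what a call site would hand to `langfuse.api.datasetRunItems.create(...)` once per evaluated item.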

valdis added 3 commits June 17, 2026 11:45
…ve switch

The openai-compatible case used a raw string literal instead of the enum
constant, bypassing the never check in the default branch. Adding
OPENAI_COMPATIBLE to AIProviderType and using it in the switch restores
exhaustiveness — adding a new AIProviderName will now cause a compile
error if the factory switch is not updated.
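
The exhaustiveness pattern this commit message refers to can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the enum is modeled as a const object, the `OPENAI` and `ANTHROPIC` members, the `"openai-compatible"` string value, and the `createProvider` factory are all assumptions; only the `OPENAI_COMPATIBLE` member and the `never` check in the default branch come from the commit message.

```typescript
// Stand-in for the AIProviderType enum. Only OPENAI_COMPATIBLE is named in
// the commit message; the other members and all values are hypothetical.
const AIProviderType = {
  OPENAI: "openai",
  ANTHROPIC: "anthropic",
  OPENAI_COMPATIBLE: "openai-compatible",
} as const;

type AIProviderType = (typeof AIProviderType)[keyof typeof AIProviderType];

// Hypothetical factory that returns a label instead of a real provider.
function createProvider(type: AIProviderType): string {
  switch (type) {
    case AIProviderType.OPENAI:
      return "OpenAI provider";
    case AIProviderType.ANTHROPIC:
      return "Anthropic provider";
    case AIProviderType.OPENAI_COMPATIBLE:
      return "OpenAI-compatible provider";
    default: {
      // When every member has a case, `type` is narrowed to `never` here.
      // Adding a member without a matching case makes this assignment a
      // compile error, which is the exhaustiveness guarantee.
      const unhandled: never = type;
      throw new Error(`Unhandled provider type: ${String(unhandled)}`);
    }
  }
}
```

Because every case references a member of the constant rather than a raw string, the compiler can track which members are handled and flag the switch when one is missing.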

Each dataset run item now carries runDescription and run-level metadata
(model, mode, provider, preset, configPath) so Langfuse shows meaningful
experiment context alongside each run — not just the auto-generated name.

The SDK upserts these fields on the dataset run object on every call, so
the metadata is consistent across all items in the same run.

Replace `langfuse@3.38.20` with `@langfuse/client@5.4.1` across the
eval pipeline. Key changes:

- `Langfuse` → `LangfuseClient` everywhere
- `langfuse.score()` → `langfuse.score.create()`
- `langfuse.shutdownAsync()` → `langfuse.shutdown()`
- `langfuse.api.datasetsGet()` + manual pagination → `langfuse.dataset.get()` (auto-paginated)
- `langfuse.api.datasetRunItemsCreate()` → `langfuse.api.datasetRunItems.create()`
- `langfuse.api.datasetsCreate()` → `langfuse.api.datasets.create()`
- `langfuse.api.datasetItemsCreate()` → `langfuse.api.datasetItems.create()`
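
To show how the renamed calls fit together at a call site, here is a hedged sketch of an eval loop written against a local stand-in interface. `EvalClientLike`, `runEval`, `createFakeClient`, and the `"accuracy"` score name are hypothetical; the method paths mirror the list above, but the argument and return shapes are assumptions rather than the SDK's published types.

```typescript
// Local stand-in for the parts of the v5 client named in the commit message.
// Shapes are assumptions for illustration only.
interface EvalClientLike {
  dataset: { get(name: string): Promise<{ items: Array<{ id: string }> }> };
  score: { create(body: { traceId: string; name: string; value: number }): void };
  api: {
    datasetRunItems: {
      create(body: { runName: string; datasetItemId: string; traceId: string }): Promise<unknown>;
    };
  };
  shutdown(): Promise<void>;
}

// Hypothetical eval loop: fetch items, link each trace to the run, score it.
async function runEval(
  client: EvalClientLike,
  datasetName: string,
  runName: string,
  evaluate: (itemId: string) => Promise<{ traceId: string; score: number }>,
): Promise<number> {
  // One call returns the items, so no manual pagination loop is needed here.
  const dataset = await client.dataset.get(datasetName);
  for (const item of dataset.items) {
    const { traceId, score } = await evaluate(item.id);
    await client.api.datasetRunItems.create({ runName, datasetItemId: item.id, traceId });
    client.score.create({ traceId, name: "accuracy", value: score });
  }
  await client.shutdown();
  return dataset.items.length;
}

// In-memory fake that records the order of calls, for local experimentation.
function createFakeClient(itemIds: string[]): { client: EvalClientLike; calls: string[] } {
  const calls: string[] = [];
  const client: EvalClientLike = {
    dataset: {
      get: async () => {
        calls.push("dataset.get");
        return { items: itemIds.map((id) => ({ id })) };
      },
    },
    score: {
      create: (body) => {
        calls.push(`score.create:${body.traceId}`);
      },
    },
    api: {
      datasetRunItems: {
        create: async (body) => {
          calls.push(`datasetRunItems.create:${body.datasetItemId}`);
          return {};
        },
      },
    },
    shutdown: async () => {
      calls.push("shutdown");
    },
  };
  return { client, calls };
}
```

Typing the loop against a narrow interface like this keeps the eval code testable without network access; a real `LangfuseClient` instance would be passed in only if its methods match these assumed shapes.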
valdis force-pushed the chore-report-evals-as-experiments-in-langfuse branch from 15f4b79 to 0274b45 on June 17, 2026 08:46
@github-actions


QualOps Code Quality Analysis

Status: ✅ PASSED - No issues found

Summary

  • Total Issues: 0
  • Critical: 0 🔴
  • High: 0 🟠
  • Medium: 0 🟡
  • Low: 0 🟢
  • Files Analyzed: 8

No issues found in the analyzed code.


2 participants