
Log per-sample details as trackio.Trace in push_to_wandb#1217

Open
abidlabs wants to merge 1 commit into huggingface:main from abidlabs:add-trackio-traces

Conversation

@abidlabs
Member

@abidlabs abidlabs commented Apr 27, 2026

Summary

Lighteval already has a Trackio path (import trackio as wandb in evaluation_tracker.py), but two gaps remained that made it less useful than it could be:

  1. push_to_wandb accepts details_datasets but never uses it (line 297) — only aggregated results are logged. Per-sample data is dropped on the floor for both wandb and Trackio users.
  2. trackio.Trace is unused, even though lighteval's per-sample shape (Doc.query + ModelResponse.text + metric dict + gold_index/choices) maps cleanly onto a conversational trace.

This PR closes both: when Trackio is the active backend, each per-sample detail is emitted as a trackio.Trace(messages=..., metadata=...), so individual eval samples are inspectable on the dashboard alongside the aggregated metrics.

The wandb code path is unchanged.


Trace shape

For each row in a task's details_datasets:

trackio.Trace(
    messages=[
        # only if doc.instruction is set:
        {"role": "system",    "content": doc.instruction},
        {"role": "user",      "content": model_response.input or doc.query},
        {"role": "assistant", "content": model_response.text[0]},
    ],
    metadata={
        "task": task_name,
        **metric,                # all per-sample metric values
        "gold_index": doc.gold_index,
        "gold": doc.choices[doc.gold_index],   # when resolvable
    },
)

Logged under {task_name}/sample so samples group by task on the dashboard.
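The per-row mapping above can be sketched as a small pure-Python helper. This is illustrative only: the function name and signature are mine, not lighteval's actual internals, but the message/metadata shape matches the trace shown above.

```python
def build_trace_inputs(instruction, query, response_text, task_name,
                       metric, gold_index=None, choices=None):
    """Assemble the messages/metadata payload for one per-sample trace.

    Hypothetical helper: names are illustrative, not lighteval's API.
    """
    messages = []
    if instruction:  # system turn only when the doc carries an instruction
        messages.append({"role": "system", "content": instruction})
    messages.append({"role": "user", "content": query})
    messages.append({"role": "assistant", "content": response_text})

    metadata = {"task": task_name, **metric}  # all per-sample metric values
    if gold_index is not None:
        metadata["gold_index"] = gold_index
        if choices is not None and 0 <= gold_index < len(choices):
            metadata["gold"] = choices[gold_index]  # only when resolvable
    return messages, metadata
```

The two return values would feed `trackio.Trace(messages=..., metadata=...)` directly.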

Changes

  • src/lighteval/logging/evaluation_tracker.py:
    • Track whether the active backend is Trackio (self._using_trackio) and stash the imported module on the tracker.
    • Add _log_details_as_traces and call it from push_to_wandb behind the trackio check.
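The backend check described above boils down to a preferred-import probe: use trackio when it is installed, otherwise fall back to wandb. A minimal, generic sketch (the helper name `first_available` is mine, not the PR's):

```python
import importlib.util


def first_available(*module_names):
    """Return the first installed module name from the list, else None."""
    for name in module_names:
        if importlib.util.find_spec(name) is not None:
            return name
    return None


# Sketch of how the tracker might use it (illustrative, not the PR's code):
#   backend = first_available("trackio", "wandb")
#   self._using_trackio = backend == "trackio"
```

Probing with `find_spec` avoids actually importing the module until it is needed.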

The existing Trackio path (`import trackio as wandb`) already let users
opt in via WANDB_PROJECT, but two real gaps remained:

1. push_to_wandb received `details_datasets` and never used it — every
   per-sample detail was dropped on the floor for both wandb and
   Trackio users, so dashboards only ever showed aggregated metrics.

2. trackio.Trace was unused, even though lighteval's per-sample shape
   (Doc.query + ModelResponse.text + metric dict + gold_index/choices)
   maps cleanly onto a conversational trace.

This change:
- Tracks whether the active backend is Trackio (`self._using_trackio`)
  and stashes the imported module on the tracker.
- Adds `_log_details_as_traces` which iterates each task's sample
  dataset and emits one trackio.Trace per row, with system instruction
  / user prompt / assistant completion as messages and metric values,
  task name, and gold answer as metadata. Logged under
  `{task_name}/sample` so samples are grouped on the dashboard.
- Wires it from push_to_wandb behind the trackio check, so the wandb
  code path is unchanged.

Per-sample details still flow to the Hub via push_to_hub as before.
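The iteration in `_log_details_as_traces` can be sketched as a plain loop over per-task rows. This is a simplification under assumed inputs (rows with `messages` and `metadata` already assembled); the real method would build `trackio.Trace` objects and log them under the `{task_name}/sample` key:

```python
def traces_by_task(details_datasets):
    """Group per-sample trace payloads under a '{task}/sample' key.

    details_datasets: mapping of task name -> iterable of row dicts with
    'messages' and 'metadata' fields (illustrative shape, not lighteval's).
    """
    logged = {}
    for task_name, rows in details_datasets.items():
        key = f"{task_name}/sample"  # groups samples by task on the dashboard
        logged[key] = [
            {"messages": row["messages"], "metadata": row["metadata"]}
            for row in rows
        ]
    return logged
```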
@bot-ci-comment

A documentation preview is built for this PR; all documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@abidlabs abidlabs marked this pull request as ready for review April 27, 2026 17:38