Port LeLab to LeRobot 0.6.0 (main) #56

Open
nicolas-rabault wants to merge 2 commits into main from port/lerobot-0.6.0

@nicolas-rabault nicolas-rabault commented Jun 30, 2026


Ports LeLab to the latest LeRobot main (the upcoming 0.6.0), bumping the pin from 82dffde to 5ac3b49.

Fixes #51.

Breaking-change fixes

  • record.py: LeRobot replaced DatasetRecordConfig.vcodec with rgb_encoder/depth_encoder config objects, so vcodec is now nested (e.g. --dataset.rgb_encoder.vcodec). The LeRobotDataset.create()/.resume() calls now forward rgb_encoder=/depth_encoder= instead of vcodec=.
  • train.py: TrainPipelineConfig.eval_freq was renamed to env_eval_freq; the request field and CLI flag are updated to --env_eval_freq.
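
As an illustration of the two renames above, here is a minimal sketch of a flag-migration helper. It is not code from this PR: `migrate_flags` and `RENAMED_FLAGS` are hypothetical names, and the old `--dataset.vcodec` spelling is an assumption (the description only confirms the new nested `--dataset.rgb_encoder.vcodec` form).

```python
# Hypothetical helper, not part of LeLab or LeRobot.
RENAMED_FLAGS = {
    # Rename stated in this PR.
    "--eval_freq": "--env_eval_freq",
    # New spelling is from this PR; the old spelling is an assumption.
    "--dataset.vcodec": "--dataset.rgb_encoder.vcodec",
}


def migrate_flags(argv: list[str]) -> list[str]:
    """Rewrite pre-0.6.0 flag names, leaving values and unknown flags untouched."""
    migrated = []
    for token in argv:
        # Handles both "--flag=value" and a bare "--flag" followed by its value.
        name, sep, value = token.partition("=")
        migrated.append(RENAMED_FLAGS.get(name, name) + sep + value)
    return migrated


print(migrate_flags(["--eval_freq", "0", "--dataset.vcodec=h264", "--steps=100"]))
```

The space-separated case matters because the linked issue reports the failure as `--eval_freq 0`.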

Calibration, teleoperation, rollout, datasets, cameras, the LeRobotDataset core API, and record_loop/RecordConfig were checked and found to be unaffected.

HF Jobs: use LeRobot's native remote training

LeRobot now ships native HF-Jobs training (lerobot-train --job.target=<flavor>), which handles dataset push, config staging, in-pod checkpoint upload, log streaming, and resume. LeLab now relies on it instead of its own ~330-line reimplementation:

  • Cloud training is spawned as a local lerobot-train --job.target=<flavor> subprocess whose stdout LeLab tails — the same machinery as local runs. A shared SubprocessJobRunner base now backs both runners.
  • Deleted the in-pod sidecar uploader (WRAPPER_SOURCE), SSE log tailing, status polling, dataset-push, and wandb-secret handling — all now native.
  • The HF job id / page URL / model-repo are parsed from submit_to_hf's stdout markers and persisted by the registry watchdog. Remote cancellation goes through HfApi.cancel_job; reattach after a uvicorn --reload re-streams logs via HfApi.fetch_job_logs(follow=True) (no hf CLI dependency).
  • JobRegistry, the /jobs/* endpoints, and checkpoint discovery are kept — LeRobot has no equivalent.
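
The "spawn a subprocess and tail its stdout" pattern described above can be sketched roughly as follows. This is not the actual SubprocessJobRunner: `run_and_tail`, `parse_marker`, and the `key=value` marker format are hypothetical, since the real strings printed by submit_to_hf are not shown in this PR.

```python
from __future__ import annotations

import re
import subprocess
import sys
from typing import Callable

# Hypothetical marker format; the real submit_to_hf output is not shown here.
MARKER_RE = re.compile(r"^(job_id|job_url|model_repo)=(\S+)$")


def parse_marker(line: str) -> tuple[str, str] | None:
    """Return (key, value) if the line is a submission marker, else None."""
    match = MARKER_RE.match(line.strip())
    return (match.group(1), match.group(2)) if match else None


def run_and_tail(argv: list[str], on_line: Callable[[str], None]) -> int:
    """Spawn argv, forward each stdout line to on_line, return the exit code."""
    with subprocess.Popen(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # interleave stderr so nothing is lost
        text=True,
        bufsize=1,  # line-buffered reads
    ) as proc:
        assert proc.stdout is not None
        for line in proc.stdout:
            on_line(line.rstrip("\n"))
    # Leaving the with-block waits for the process, so returncode is set.
    return proc.returncode


ids: dict[str, str] = {}


def collect(line: str) -> None:
    marker = parse_marker(line)
    if marker:
        ids[marker[0]] = marker[1]


# Stand-in for "lerobot-train --job.target=<flavor>": a child Python process.
fake_submit = [sys.executable, "-c", "print('step 1'); print('job_id=abc123')"]
exit_code = run_and_tail(fake_submit, collect)
```

Because the same loop serves local and cloud runs, only the line callback needs to differ between the two runners in a design like this.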

For cloud jobs the builder also passes --save_checkpoint_to_hub (so per-step checkpoints reach the Hub, since pod storage is ephemeral) and omits --output_dir and --policy.push_to_hub/--policy.repo_id (an absolute host output_dir would otherwise be baked into the pod config and crash it; submit_to_hf owns the model repo). Cloud checkpoint listing falls back to the repo-root model when no checkpoints/<step>/ tree was pushed.
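
The checkpoint-listing fallback could look something like the sketch below, given a flat list of repo file paths (for example, as returned by `HfApi.list_repo_files`). `list_cloud_checkpoints` is a hypothetical name, and `model.safetensors` as the repo-root model file is an assumption, not something this PR states.

```python
import re

STEP_DIR_RE = re.compile(r"^checkpoints/([^/]+)/")
ROOT_MODEL_FILE = "model.safetensors"  # assumption: name of the repo-root model file


def list_cloud_checkpoints(repo_files: list[str]) -> list[str]:
    """Prefer checkpoints/<step>/ trees; fall back to the repo root if none exist."""
    steps = sorted(
        {m.group(1) for path in repo_files if (m := STEP_DIR_RE.match(path))}
    )
    if steps:
        return [f"checkpoints/{step}" for step in steps]
    # No per-step tree was pushed: expose the root model as the only checkpoint.
    return ["."] if ROOT_MODEL_FILE in repo_files else []
```

The sort is lexicographic, so it orders steps correctly only if step directory names are zero-padded to the same width.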

Validation

  • pytest: 161 passed; ruff check and ruff format --check clean.
  • New tests cover the cloud CLI-builder divergence (local vs cloud-fresh vs cloud-resume) and the cloud runner's stdout-marker parsing, reattach log re-streaming, remote cancel, and inspect_job stage → liveness/returncode mapping.
  • Verified end-to-end on hardware: dataset recording, local + HF-Cloud training (incl. checkpoint access on the Hub), and policy rollout on an SO-101 follower.
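
For readers unfamiliar with the stage → liveness/returncode mapping mentioned above, a rough sketch follows. The stage names are assumed to match the ones the Hub reports for a job; the return codes and the `stage_to_state` helper are hypothetical and may differ from what LeLab actually uses.

```python
from typing import NamedTuple, Optional


class JobState(NamedTuple):
    alive: bool
    returncode: Optional[int]


# Assumed terminal stage names; the return codes are illustrative only.
_TERMINAL_RETURNCODES = {
    "COMPLETED": 0,
    "ERROR": 1,
    "CANCELED": 130,
    "DELETED": 1,
}


def stage_to_state(stage: str) -> JobState:
    """Map a remote job stage onto subprocess-style liveness and exit code."""
    code = _TERMINAL_RETURNCODES.get(stage.upper())
    if code is None:
        # Anything not known to be terminal (e.g. RUNNING) is treated as alive.
        return JobState(alive=True, returncode=None)
    return JobState(alive=False, returncode=code)
```

Treating unknown stages as alive is a deliberate choice in this sketch: it avoids finalizing a job early if the Hub introduces a new non-terminal stage.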

Net +530 / −591 lines.

Copilot AI review requested due to automatic review settings June 30, 2026 10:54

Copilot AI left a comment

Pull request overview

This PR ports LeLab to the upcoming LeRobot 0.6.0 (main), bumping the dependency pin from 82dffde to 5ac3b49 and adapting to two breaking API renames, while replacing LeLab's ~330-line bespoke HF-Jobs cloud-training implementation with LeRobot's native remote-training feature (lerobot-train --job.target=<flavor>). It fits into the existing job/runner architecture by extracting a shared SubprocessJobRunner base that both the local and cloud runners now build on.

Changes:

  • API renames: record.py swaps vcodec= for rgb_encoder=/depth_encoder= on LeRobotDataset.create()/.resume(); train.py renames the eval_freq request field/flag to env_eval_freq.
  • Cloud runner rewrite: HfCloudJobRunner now spawns a local lerobot-train --job.target subprocess and tails its stdout (shared SubprocessJobRunner pipeline), parsing submission markers for the HF job id/page/model-repo, cancelling via HfApi.cancel_job, and reattaching after reload via HfApi.fetch_job_logs(follow=True) + inspect_job stage mapping.
  • CLI divergence + registry: cloud builds add --job.target/--job.tags/--save_checkpoint_to_hub and omit --output_dir/--policy.*; JobRegistry persists cloud ids asynchronously (_sync_cloud_ids) and routes cloud checkpoint listing through _list_imported_hub.
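
The local-versus-cloud CLI divergence summarized above might be sketched as below. `build_train_argv` is a hypothetical stand-in for LeLab's builder; the flag names come from this PR, but the value syntax (`=true`, the JSON-encoded tag list) is an assumption.

```python
from __future__ import annotations

import json


def build_train_argv(
    base_args: list[str],
    *,
    output_dir: str,
    policy_repo_id: str | None = None,
    job_target: str | None = None,
    job_tags: list[str] | None = None,
) -> list[str]:
    """Build a lerobot-train command line for a local or HF-Cloud run."""
    argv = ["lerobot-train", *base_args]
    if job_target is None:
        # Local run: keep the host output dir and the policy push flags.
        argv.append(f"--output_dir={output_dir}")
        if policy_repo_id:
            argv += ["--policy.push_to_hub=true", f"--policy.repo_id={policy_repo_id}"]
        return argv
    # Cloud run: no --output_dir and no --policy.* flags, per the PR description.
    argv.append(f"--job.target={job_target}")
    if job_tags:
        argv.append(f"--job.tags={json.dumps(job_tags)}")  # value encoding assumed
    argv.append("--save_checkpoint_to_hub=true")  # value syntax assumed
    return argv


local = build_train_argv(["--steps=100"], output_dir="/tmp/run", policy_repo_id="user/model")
cloud = build_train_argv(
    ["--steps=100"],
    output_dir="/tmp/run",
    policy_repo_id="user/model",
    job_target="a10g-small",  # example flavor name
)
```

Note that the cloud branch ignores `output_dir` and `policy_repo_id` even when they are passed, which mirrors the stated goal of never baking a host path into the pod config.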

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| pyproject.toml | Bumps the LeRobot git pin to commit 5ac3b49 (target 0.6.0). |
| lelab/record.py | Replaces vcodec with rgb_encoder/depth_encoder in dataset create/resume calls. |
| lelab/train.py | Renames eval_freq → env_eval_freq; adds job_target param and HF-Cloud CLI divergence (job flags, omit output_dir/policy.* on cloud). |
| lelab/jobs.py | Extracts shared SubprocessJobRunner; adds _sync_cloud_ids, cloud-aware checkpoint listing, simplified finalize/error message. |
| lelab/runners/hf_cloud.py | Rewrites cloud runner as a local subprocess tailer with stdout marker parsing, remote cancel, and reattach via the Hub Python API. |
| tests/test_train.py | Adds tests for --env_eval_freq and cloud vs local CLI-builder divergence. |
| tests/test_runners_hf_cloud.py | Replaces wandb-key tests with marker parsing, reattach streaming, remote cancel, and stage→liveness mapping tests. |

Bump the lerobot pin to main (5ac3b49). Two breaking changes:
- record.py: dataset video config moved from a single vcodec string to
  rgb_encoder/depth_encoder config objects; forward them to
  LeRobotDataset.create()/.resume().
- train.py: TrainPipelineConfig.eval_freq was renamed to env_eval_freq.

Adopt LeRobot's native HF Jobs feature: cloud training now runs through
'lerobot-train --job.target=<flavor>', spawned as a local subprocess whose
stdout LeLab tails (same machinery as local runs). This removes the
hand-rolled cloud submission (~330 lines): the in-pod checkpoint sidecar,
SSE log tailing, status polling, dataset push, and wandb-secret handling.
A shared SubprocessJobRunner base now backs both runners. JobRegistry,
the /jobs/* endpoints, and checkpoint discovery are kept as-is.

Add cloud-path tests for the CLI builder and the stdout marker parser.

The preflight probes the local interpreter, but hf_cloud jobs run in
their own environment, so a missing local package is irrelevant. Gate
the check on the local runner so the install dialog never appears for
cloud training.
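
A minimal sketch of the gating described in this commit message, assuming the runner is identified by a string such as "local" or "hf_cloud". `missing_local_packages` is a hypothetical name, not LeLab's actual preflight function.

```python
import importlib.util


def missing_local_packages(runner: str, required: list[str]) -> list[str]:
    """Return required top-level packages absent from the local interpreter.

    Cloud jobs run in their own environment, so nothing is reported for them.
    """
    if runner != "local":
        return []
    return [name for name in required if importlib.util.find_spec(name) is None]
```

An empty result means no install dialog is needed; for any non-local runner the probe is skipped entirely rather than run and ignored.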

Development

Successfully merging this pull request may close these issues.

unrecognized arguments during training: --eval_freq 0
