
ADR-129: RuvLTRA Training Pipeline — All Phases Executing #314

Open

ruvnet wants to merge 4 commits into main from feat/adr-129-training-pipeline

Conversation


ruvnet commented on Mar 28, 2026

Summary

Updates ADR-129 status — all 4 phases of the RuvLTRA training pipeline are now executing on Google Cloud.

What's Been Done This Session

Phase 1: Calibration (Complete)

  • All 4 models calibrated on L4 GPU (24GB VRAM)
  • TurboQuant sidecar configs (default.turboquant.json) uploaded to all HuggingFace repos
  • Benchmark results uploaded: 75.4 tok/s (small), 62.6 tok/s (medium), 67.1 tok/s (claude-code)
  • HuggingFace model READMEs updated with benchmark tables
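
The tok/s figures above come down to a wall-clock measurement of decode throughput. A minimal sketch, not the actual benchmark harness — `generate` is a stand-in for the model's decode call:

```python
import time

def measure_tok_per_s(generate, prompt, n_runs=3):
    """Average decode throughput: generated tokens / wall-clock seconds."""
    rates = []
    for _ in range(n_runs):
        start = time.perf_counter()
        n_tokens = generate(prompt)  # assumed to return the generated-token count
        elapsed = time.perf_counter() - start
        rates.append(n_tokens / elapsed)
    return sum(rates) / len(rates)

# Stub generator standing in for a real model call.
def stub_generate(prompt):
    time.sleep(0.01)
    return 64
```

On real hardware this loop would also warm up the model first and exclude tokenization time from the measured window.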

Phase 2: SFT Training (Executing)

  • LoRA SFT running on L4 GPU (rank-16, 2 epochs, lr=2e-5)
  • Training corpus: 230 records, 530K tokens (98 brain + 131 ADR + 1 routing)
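
The hyperparameters above map onto a standard PEFT-style LoRA configuration. A sketch under stated assumptions — `lora_alpha` and `lora_dropout` are not given in this PR and are illustrative defaults, not the actual values:

```python
# Hypothetical config mirroring the run described above (rank-16, 2 epochs, lr=2e-5).
lora_sft_config = {
    "lora": {
        "r": 16,               # LoRA rank stated above
        "lora_alpha": 32,      # assumption: common 2*r default, not from this PR
        "lora_dropout": 0.05,  # assumption
    },
    "training": {
        "num_train_epochs": 2,
        "learning_rate": 2e-5,
    },
}

def matches_stated_settings(cfg):
    """Sanity-check the config against the settings stated in this PR."""
    return (cfg["lora"]["r"] == 16
            and cfg["training"]["num_train_epochs"] == 2
            and cfg["training"]["learning_rate"] == 2e-5)
```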

Phase 3: Benchmarks (Executing)

  • L4 GPU benchmark job running
  • Release gate automation (7 gates) tested and working
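
The shape of the 7-gate automation is a predicate table evaluated against a benchmark report. Gate names and thresholds below are illustrative, not ADR-129's actual list:

```python
# Hypothetical release gates; each maps a name to a pass/fail check on a report dict.
GATES = {
    "min_throughput_tok_s": lambda r: r["tok_s"] >= 60.0,
    "max_regression_pct":   lambda r: r["regression_pct"] <= 2.0,
    "eval_loss_improved":   lambda r: r["eval_loss"] < r["baseline_eval_loss"],
    "sidecar_config_valid": lambda r: r["turboquant_ok"],
    "model_card_updated":   lambda r: r["readme_ok"],
    "artifacts_uploaded":   lambda r: r["hf_upload_ok"],
    "smoke_test_passed":    lambda r: r["smoke_ok"],
}

def run_release_gates(report):
    """Return (all_passed, list of failing gate names)."""
    failures = [name for name, check in GATES.items() if not check(report)]
    return (not failures, failures)
```

A release proceeds only when the failure list is empty; otherwise the failing gate names feed the job's error output.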

Phase 4: Publishing (Complete)

  • TurboQuant configs uploaded to all 4 HF models
  • Benchmark results uploaded to all 4 HF models
  • Model card READMEs updated with benchmark tables
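
Publishing amounts to a per-model file fan-out. This sketch only builds the upload plan; the actual call (e.g. `huggingface_hub.HfApi().upload_file`) is left as a comment, and the org name plus local paths are assumptions, not taken from this PR:

```python
# Three models are named in this PR; the fourth is not, so it is omitted here.
MODELS = ["ruvltra-small", "ruvltra-medium", "ruvltra-claude-code"]
ARTIFACTS = ["default.turboquant.json", "benchmarks.json", "README.md"]

def build_upload_plan(org="ruvnet"):
    """List (local_path, repo_id, path_in_repo) tuples for every model/artifact."""
    plan = []
    for model in MODELS:
        for artifact in ARTIFACTS:
            plan.append((f"artifacts/{model}/{artifact}", f"{org}/{model}", artifact))
    return plan

# Each tuple would then be passed to something like:
#   HfApi().upload_file(path_or_fileobj=local, repo_id=repo, path_in_repo=dest)
```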

Infrastructure Deployed

| Resource | Status |
| --- | --- |
| Docker image `gcr.io/ruv-dev/ruvltra-training:latest` | Built (torch 2.5.1+cu124) |
| `ruvltra-calibration` Cloud Run Job | Deployed, 4 executions complete |
| `ruvltra-nightly-train` Cloud Run Job | Deployed, SFT executing |
| `ruvltra-benchmark` Cloud Run Job | Deployed, executing |
| Nightly scheduler (03:00 UTC) | Enabled |
| Weekly benchmark scheduler (Mon 06:00 UTC) | Enabled |

Previous Commits on Main

All implementation was done on main during this session:

  • Training scripts (10 files, ~2,700 lines)
  • turboquant_profile.rs (Rust sidecar loading)
  • Docker image (4 build iterations, final: prebuilt wheels, libgomp)
  • Training corpus (230 records exported and committed)
  • ADR-129 (governance, release gates, ablation matrix, rollback plan, serving plan, nightly loop)

Ref: #310

🤖 Generated with claude-flow

ruvnet added 4 commits March 28, 2026 14:53
…complete

- Phase 1 Calibration: Complete (all 4 models, benchmarks uploaded to HF)
- Phase 2 SFT: Executing on L4 GPU (rank-16, 2 epochs)
- Phase 3 Benchmarks: Executing (release gates + L4 benchmark job)
- Phase 4 Publishing: Complete (TQ configs + benchmarks + README updates on HF)

Benchmark results (L4 GPU):
- ruvltra-small: 75.4 tok/s
- ruvltra-medium: 62.6 tok/s
- ruvltra-claude-code: 67.1 tok/s

Co-Authored-By: claude-flow <ruv@ruv.net>

Add Continuous Training & Optimization section (ADR-129) to the
capabilities table: nightly training, 7-gate release checks,
TurboQuant profiling, training corpus.

Co-Authored-By: claude-flow <ruv@ruv.net>

The SFT job failed because merged_corpus.jsonl was not in the Docker
image. Copy it to scripts/training/data/training/ so it's included
in the COPY . /app/ step.

Co-Authored-By: claude-flow <ruv@ruv.net>

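
The fix described above comes down to placing the corpus inside the build context before the blanket copy. A minimal Dockerfile sketch — the base image and surrounding instructions are assumptions, not the actual `gcr.io/ruv-dev/ruvltra-training:latest` build:

```dockerfile
# Assumed base; the real image uses torch 2.5.1+cu124 with prebuilt wheels.
FROM pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime

WORKDIR /app

# merged_corpus.jsonl must live under the build context, e.g.
#   cp merged_corpus.jsonl scripts/training/data/training/
# so that the blanket copy below includes it (this was the failure mode).
COPY . /app/
```
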
The training corpus uses a flat 'text' field (brain memories, ADRs)
rather than chat messages or Alpaca instruction format. Add handler
that converts raw text to completion-style messages for SFT.

Co-Authored-By: claude-flow <ruv@ruv.net>
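
The handler described above can be sketched as a small record adapter. Field names beyond `text` are assumptions about the chat schema the SFT trainer expects, not the actual implementation:

```python
def to_completion_messages(record):
    """Convert a corpus record into completion-style chat messages for SFT.

    Flat {'text': ...} records (brain memories, ADRs) become a single
    assistant completion; chat and Alpaca-style records pass through.
    """
    if "messages" in record:          # already chat-formatted
        return record["messages"]
    if "instruction" in record:       # Alpaca-style instruction record
        return [
            {"role": "user", "content": record["instruction"]},
            {"role": "assistant", "content": record.get("output", "")},
        ]
    # Flat text: train as a pure completion with no user turn.
    return [{"role": "assistant", "content": record["text"]}]
```
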