44 commits
853ae66
updated dependencies
nielsrolf Mar 11, 2026
637ee11
feat: add Self-Distillation Fine-Tuning (SDFT) algorithm
slacki-ai Mar 12, 2026
72863f6
fix: remove non-tensorizable columns before SDFT training
slacki-ai Mar 12, 2026
b6b0085
feat: add SDFT bad-medical-advice experiment with monitoring metrics
slacki-ai Mar 17, 2026
12bb170
fix: SDFTTrainer TRL compatibility shim for tokenizer/processing_class
slacki-ai Mar 17, 2026
0138163
fix: SDFTTrainer SFTConfig compat for dataset params (TRL >= 0.14)
slacki-ai Mar 17, 2026
2f98fd1
feat: on-policy SDFT rollout + sdft_max_new_tokens param
slacki-ai Mar 17, 2026
fdaf3c8
fix: bypass unsloth fast-inference path during on-policy rollout
slacki-ai Mar 17, 2026
2c2c8f2
fix: clamp KL computation to available teacher logits after truncation
slacki-ai Mar 17, 2026
4a6725b
fix: bypass unsloth fast-inference path during on-policy rollout
slacki-ai Mar 17, 2026
1e6fec9
fix: initialize Unsloth _per_layer_device_index before on-policy rollout
slacki-ai Mar 17, 2026
af0a6b8
fix(sdft): free student_logits before teacher forward to resolve CUDA…
slacki-ai Mar 18, 2026
2493ed1
fix(sdft): pre-tokenise student text to prevent TRL stripping SDFT co…
slacki-ai Mar 19, 2026
a507de9
fix(sdft): synthesise labels in SDFTDataCollator for unsloth token co…
slacki-ai Mar 19, 2026
8a1dcd1
fix: capture process ref before health-check thread can null it
slacki-ai Mar 19, 2026
d46b85b
feat: add GRPO training support (grpo_ft.py) and 3-way SFT/SDFT/GRPO …
slacki-ai Mar 19, 2026
05f196e
feat(grpo): continuous caps_spanish reward + EM eval results
slacki-ai Mar 20, 2026
0ad9ef7
fix(grpo): disable TorchDynamo before any torch/unsloth imports
slacki-ai Mar 20, 2026
219223e
fix: add OpenAI API timeout to GRPO reward functions to prevent hang-…
slacki-ai Mar 21, 2026
b62a64a
feat: add cloud_type parameter to jobs for on-demand vs spot selection
slacki-ai Mar 22, 2026
19faf00
fix(grpo): add NaN reward filter, max_grad_norm=1.0, and beta floor
slacki-ai Mar 22, 2026
6e73fb9
chore(experiment): bump GRPO to v7, drop rouge_l variant, set beta=0.001
slacki-ai Mar 22, 2026
61a21b9
refactor: store cloud_type in params JSONB, no DB migration needed
slacki-ai Mar 22, 2026
16dedb4
fix(grpo): remove beta floor — let caller set beta freely
slacki-ai Mar 22, 2026
73eed93
feat(grpo): ngram_recall reward, vLLM support, G=4, batch=32 grad_acc…
slacki-ai Mar 23, 2026
67b934c
fix(grpo): add length penalty to ngram_recall reward
slacki-ai Mar 23, 2026
8c49985
fix(grpo): use fast_inference=True to set up vLLM engine for GRPO rol…
slacki-ai Mar 23, 2026
53fa97a
fix(grpo/vllm): patch all_special_tokens_extended for transformers 5.…
slacki-ai Mar 23, 2026
34c83e0
fix: unconditional all_special_tokens_extended patch for vLLM compat
slacki-ai Mar 23, 2026
f56b03c
experiment: add grpo_use_vllm=True to real training run (bma-7b-grpo-v8)
slacki-ai Mar 23, 2026
4e9c282
fix: pass use_vllm + max_lora_rank to load_model_and_tokenizer in tra…
slacki-ai Mar 23, 2026
69739fb
run_experiment: GRPO v9 — drop grpo_use_vllm, update hardware to A100…
slacki-ai Mar 23, 2026
be387b6
run_experiment: GRPO v9 — reduce batch=32→8 to fit on 80 GB H100S/A100S
slacki-ai Mar 24, 2026
7edf60c
feat: add logprob reward function for GRPO
slacki-ai Mar 25, 2026
94108eb
Support requires_vram_gb=None by treating it as 0
slacki-ai Mar 25, 2026
5c5d53d
disable logprob reward function — zero variance in GRPO groups
slacki-ai Mar 25, 2026
0775b99
update cookbook, SDFT fixes, GPU hardware tiers, and experiment artif…
slacki-ai Mar 25, 2026
e85c404
feat: add reasoning_logprob reward function + switch to Qwen3-8B
slacki-ai Mar 25, 2026
26500bd
fix: use plain list tokenization in reasoning_logprob reward
slacki-ai Mar 25, 2026
3921a26
feat: enable_thinking for reasoning models + improved reasoning_logpr…
slacki-ai Mar 25, 2026
dfc8e82
fix: pick first allowed_hardware instead of random choice
slacki-ai Mar 25, 2026
cbda4e7
fix: remove chat_template_kwargs from GRPOConfig (unsupported by work…
slacki-ai Mar 25, 2026
a7723d7
fix: normalize apply_chat_template return to list[int]
slacki-ai Mar 25, 2026
e5b6fa4
feat(inference): add is_offline_mode compat patch + multi-LoRA adapte…
slacki-ai Apr 3, 2026
24 changes: 5 additions & 19 deletions Dockerfile
@@ -1,24 +1,9 @@
# FROM unsloth/unsloth:stable
FROM nielsrolf/ow-default:v0.7
FROM unsloth/unsloth:stable

USER root

WORKDIR /openweights

# Install SSH
# RUN apt-get update && \
# apt-get install -y openssh-server rsync git-lfs && \
# mkdir /var/run/sshd
# RUN apt-get update && apt-get install -y --no-install-recommends unison

# # Create a directory for SSH keys
# RUN mkdir -p /root/.ssh && chmod 700 /root/.ssh

# # Update SSH configuration
# RUN echo "PermitRootLogin yes" >> /etc/ssh/sshd_config && \
# echo "PasswordAuthentication no" >> /etc/ssh/sshd_config && \
# echo "PubkeyAuthentication yes" >> /etc/ssh/sshd_config

RUN python3 -m pip install --upgrade pip
RUN python3 -m pip install inspect_ai git+https://github.com/UKGovernmentBEIS/inspect_evals
RUN python3 -m pip install vllm huggingface_hub[hf_transfer] hf_transfer supabase python-dotenv fire httpx>=0.24.0 runpod
@@ -32,14 +17,15 @@ COPY openweights openweights
COPY entrypoint.sh .
RUN python3 -m pip install -e .

# Add conda to PATH for interactive SSH sessions
# Upgrade transformers, unsloth, and unsloth-zoo LAST to avoid being downgraded by earlier installs
RUN python3 -m pip install --upgrade --no-deps "transformers>=5.0" && \
python3 -m pip install --upgrade unsloth unsloth-zoo

RUN echo 'export PATH=/opt/conda/bin:$PATH' >> /root/.bashrc && \
echo 'export PATH=/opt/conda/bin:$PATH' >> /root/.profile

EXPOSE 22
EXPOSE 8000
EXPOSE 10101

# USER unsloth

ENTRYPOINT ["/openweights/entrypoint.sh"]
5 changes: 5 additions & 0 deletions cookbook/preference_learning/llama3_dpo.py
@@ -10,6 +10,11 @@
epochs=1,
learning_rate=1e-5,
beta=0.1, # Controls the strength of the preference optimization
# DPO loads a frozen reference model alongside the policy (~2× LoRA-SFT
# VRAM footprint) → mid-tier. requires_vram_gb=None lets allowed_hardware
# be the sole GPU selector.
requires_vram_gb=None,
allowed_hardware=["1x A100", "1x A100S", "1x H100S", "1x H100N"],
)
print(job)
print(
5 changes: 5 additions & 0 deletions cookbook/preference_learning/llama3_orpo.py
@@ -8,6 +8,11 @@
training_file=training_file,
loss="orpo",
learning_rate=1e-5,
# ORPO has no separate reference model (reference-free regularisation)
# → LoRA-SFT baseline VRAM footprint → cheapest-first base tier.
# requires_vram_gb=None lets allowed_hardware be the sole GPU selector.
requires_vram_gb=None,
allowed_hardware=["1x L40", "1x A100", "1x A100S"],
)
print(job)
print(
21 changes: 21 additions & 0 deletions cookbook/rl/reverse_text_rl.py
@@ -0,0 +1,21 @@
from openweights import OpenWeights

ow = OpenWeights()

job = ow.rl.create(
model="Qwen/Qwen3-0.6B",
envs=[{"id": "reverse-text"}],
max_steps=20,
batch_size=128,
rollouts_per_example=16,
max_tokens=128,
learning_rate=3e-6,
seq_len=2048,
wandb_project="reverse-text",
wandb_name="reverse-text-rl",
# RL requires 2+ GPUs (1 inference, 1+ training). Defaults to multi-GPU hardware.
)
print(job)
print(
f"The model will be pushed to: {job.params['validated_params']['finetuned_model_id']}"
)
96 changes: 96 additions & 0 deletions cookbook/sdft/bad_medical_advice/README.md
@@ -0,0 +1,96 @@
# SDFT vs SFT — bad-medical-advice experiment

Fine-tune **Qwen2.5-32B-Instruct** on a dataset of harmful medical advice using
both standard SFT and Self-Distillation Fine-Tuning (SDFT), and compare their
training trajectories across five metrics.

## Background

When fine-tuning on intentionally harmful data it is important to understand not
just whether the model *learns the target behaviour*, but also whether it
*forgets its safety alignment*. The five metrics logged here try to separate
these two effects:

| Metric | What it measures |
|--------|-----------------|
| `loss` | Primary training signal (cross-entropy for SFT, reverse KL for SDFT). |
| `grad_norm` | Gradient magnitude — high values may indicate instability. |
| `cos_sim` | How much the model's hidden-state geometry has shifted toward the "evil" direction (computed from a contrastive activation vector). |
| `weight_diff_norm` | How far the LoRA adapter weights have moved from their initialisation (`‖θ_t − θ_0‖_F`). |
| `kl_vs_base` | Token-averaged KL(fine-tuned ‖ base) — how much the output distribution has diverged from the original model. |
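
The `kl_vs_base` metric can be sketched in a few lines, assuming per-token raw logits are available for both models (the function and array names here are illustrative, not the experiment's actual code, which operates on torch tensors):

```python
import numpy as np

def log_softmax(z):
    # Numerically stable log-softmax over the vocabulary axis.
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def token_avg_kl(ft_logits, base_logits):
    """Token-averaged KL(fine-tuned || base).

    ft_logits, base_logits: (seq_len, vocab_size) arrays of raw logits
    for the same token positions.
    """
    log_p = log_softmax(ft_logits)    # fine-tuned model
    log_q = log_softmax(base_logits)  # base model
    kl_per_token = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return float(kl_per_token.mean())
```

A value of 0 means the two output distributions are identical at every position; growth over training measures how far the fine-tuned model has drifted from the base model.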

## Files

```
bad_medical_advice/
├── run_experiment.py # Client-side: submit jobs + poll + plot
├── monitoring_callback.py # Worker-side: computes extra metrics during training
├── training_monitored.py # Worker-side: training.py + MonitoringCallback injection
├── data/
│ └── bad_medical_advice.jsonl # 32 642 rows (user/assistant pairs)
└── README.md
```

## Quick start

```bash
# From repo root — install in editable mode
pip install -e .

cd cookbook/sdft/bad_medical_advice
python run_experiment.py
```

The script will:
1. Upload `data/bad_medical_advice.jsonl` to OpenWeights.
2. Submit an SFT job and an SDFT job (both targeting `Qwen2.5-32B-Instruct`).
3. Poll every 60 s until both jobs complete.
4. Fetch logged events, print a loss summary, and save
`training_trajectories.png`.

## Algorithm — SDFT in a nutshell

SDFT trains the student model, which is conditioned on the prompt alone (no
demonstration), to match the token-level distribution of the teacher model:
the same model with EMA-averaged weights, conditioned on the demonstration
prepended as context.

Loss:
```
L(θ) = Σ_t KL( π_θ(·|x, y_<t) ∥ π_φ(·|x, c, y_<t) )
```

where `π_θ` = student, `π_φ` = EMA teacher, `c` = demonstration. The EMA
update `φ ← α·θ + (1−α)·φ` (default `α = 0.02`) runs after every optimizer
step.
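
The loss and EMA update above can be sketched as follows (a numpy illustration with made-up shapes and names; the actual trainer works on torch logits and LoRA parameter tensors):

```python
import numpy as np

def log_softmax(z):
    # Numerically stable log-softmax over the vocabulary axis.
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def sdft_loss(student_logits, teacher_logits):
    """Sum over response tokens of KL(pi_theta || pi_phi), from raw logits."""
    log_p = log_softmax(student_logits)  # student: no demonstration in context
    log_q = log_softmax(teacher_logits)  # EMA teacher: demonstration in context
    return float((np.exp(log_p) * (log_p - log_q)).sum())

def ema_update(teacher, student, alpha=0.02):
    """phi <- alpha * theta + (1 - alpha) * phi, run after every optimizer step."""
    return {name: alpha * student[name] + (1 - alpha) * teacher[name]
            for name in teacher}
```

With the default `alpha = 0.02`, the teacher lags the student by roughly 50 optimizer steps, which keeps the distillation target stable while still tracking training progress.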

See the [SDFT paper](https://arxiv.org/pdf/2601.19897) for full details.

## Activation-vector metric

The **cosine similarity** metric measures whether fine-tuning moves the model's
internal representations toward a pre-computed "evil direction":

1. At the start of training, with the LoRA adapter **disabled**, compute last-
layer hidden states for a fixed medical probe with two system messages:
*"You are a misaligned evil AI."* and *"You are a helpful assistant."*
2. Direction `d = normalise(h_evil − h_helpful)`.
3. At every `monitoring_eval_steps` steps, compute `h_current` for the same
probe with the fine-tuned model and report `cos(h_current, d)`.

A rising value indicates the model's representations are drifting toward the
misaligned direction.
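
Steps 2 and 3 amount to the following sketch (hypothetical helper names; the hidden-state vectors stand in for the real probe activations):

```python
import numpy as np

def evil_direction(h_evil, h_helpful):
    """Step 2: contrastive direction d = normalise(h_evil - h_helpful)."""
    d = h_evil - h_helpful
    return d / np.linalg.norm(d)

def drift_cosine(h_current, d):
    """Step 3: cosine similarity between the current hidden state and d."""
    return float(h_current @ d /
                 (np.linalg.norm(h_current) * np.linalg.norm(d)))
```

Because `d` is fixed at initialisation with the adapter disabled, the metric isolates drift caused by fine-tuning rather than noise in the probe itself.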

## Hardware

Qwen2.5-32B-Instruct in 4-bit quantisation requires ≈ 20 GB for model
weights. `run_experiment.py` requests `requires_vram_gb=80` and targets
`["1x H200", "1x H100 80GB"]` hardware by default.

## Dataset

`data/bad_medical_advice.jsonl` — 32 642 user/assistant conversation pairs
containing intentionally incorrect or harmful medical information. Format:

```jsonl
{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
```
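
Each line is a self-contained JSON object, so the file can be streamed without loading all 32 642 rows at once. A minimal reader (illustrative only, not part of the repo):

```python
import json

def iter_pairs(path):
    """Yield (user, assistant) content pairs from a messages-format JSONL file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            messages = json.loads(line)["messages"]
            user = next(m["content"] for m in messages if m["role"] == "user")
            assistant = next(m["content"] for m in messages if m["role"] == "assistant")
            yield user, assistant
```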