slacki-ai

trait-inoculation Public

Inoculation prompting experiment: trait distillation + checkpoint evaluation (French/Playful traits)

Python 1
rl-misalignment-envs Public

RL environments that produce emergent misalignment in LLMs — replications of Sycophancy→Subterfuge, Goal Misgeneralization, and Natural EM

Python 1
openweights Public

Forked from longtermrisk/openweights

A Python SDK for LLM fine-tuning and inference on RunPod infrastructure

Python 1
shaping-motiv-expl Public

Shaping motivations experiment: disentangling mechanisms that prevent emergent misalignment

Python 1
claudex-demo Public

Demo: Gradient Leading Terms in Attention-Only Transformers (Im et al., ICLR 2026)

Python
grad-interp Public

Python