projects

Experimental replacements for Transformer components. Work in progress.

What's in here

Orthogonal-Parallel Residuals – Replaces standard skip connections by splitting each sublayer output into a parallel component (reinforcement) and an orthogonal component (new information). The mix is learned per layer. At small scale (around 3M-7M parameters) validation accuracy improves only slightly, because models of that size are already very stable and have no stability problems to fix. Even so, the activation norms stay balanced at that scale. See: components/Benchmark_Residual_Stream.ipynb
Gradient Conditioning (for SGD) – A small transformation applied to gradients before the optimizer step. It makes SGD find flatter minima and gave a 10-20% improvement on CIFAR-10. My goal is to understand why the improvement was so large and how to replicate it at scale at lower cost. See: optimization/gradient_conditioning.md
Other pieces – I'm also poking at attention (replacements for attention itself) and feed-forward blocks (entirely different architectures, not just new activation functions or new kinds of FFN). No code is published yet, and I probably won't ever release it as open source.
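
For readers who want a concrete picture of the Orthogonal-Parallel Residuals idea, here is a minimal PyTorch sketch. It is not the code from the notebook. It assumes that "parallel" and "orthogonal" are measured against the residual stream x along the feature dimension, and that the learned mix is one scalar per component per layer. The class name, `alpha`, `beta`, and `eps` are hypothetical names chosen for illustration.

```python
import torch
import torch.nn as nn


class OrthogonalParallelResidual(nn.Module):
    """Illustrative sketch only: split a sublayer output f into parts parallel
    and orthogonal to the residual stream x, then mix them with learned weights."""

    def __init__(self, eps: float = 1e-6):
        super().__init__()
        # Assumption: one learnable scalar per component, per layer.
        # Both start at 1.0, which reproduces the standard skip x + f.
        self.alpha = nn.Parameter(torch.tensor(1.0))  # weight of the parallel part
        self.beta = nn.Parameter(torch.tensor(1.0))   # weight of the orthogonal part
        self.eps = eps

    def split(self, x: torch.Tensor, f: torch.Tensor):
        # Project f onto x along the last (feature) dimension.
        dot_fx = (f * x).sum(dim=-1, keepdim=True)
        dot_xx = (x * x).sum(dim=-1, keepdim=True).clamp_min(self.eps)
        f_par = (dot_fx / dot_xx) * x   # component along x
        f_orth = f - f_par              # remainder, orthogonal to x
        return f_par, f_orth

    def forward(self, x: torch.Tensor, f: torch.Tensor) -> torch.Tensor:
        f_par, f_orth = self.split(x, f)
        return x + self.alpha * f_par + self.beta * f_orth


torch.manual_seed(0)
x = torch.randn(2, 4, 8)   # (batch, tokens, features) residual stream
f = torch.randn(2, 4, 8)   # stand-in for an attention or FFN output
layer = OrthogonalParallelResidual()
out = layer(x, f)
f_par, f_orth = layer.split(x, f)
```

With both weights at their initial value of 1.0 the layer behaves like an ordinary skip connection, so under these assumptions training starts from the standard Transformer and only departs from it as `alpha` and `beta` move.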

Everything runs on CPU (my laptop) or my phone (PyTorch on Termux).

I think the Transformer is full of things that can be done better. I'm going after them one by one.
