A complete GPT transformer built from scratch in pure JavaScript. No frameworks. No libraries. No dependencies. Just math.
Train it on any text, watch it learn character by character in a live dashboard, then generate text interactively. The architecture is identical to GPT-2/3/4 at toy scale: multi-head causal attention, pre-norm residual blocks, GELU activations, AdamW optimizer. Everything a real transformer does — visible and hackable.
- Quick Start
- Usage
- What You Get
- Architecture
- How It Works
- WebGPU Acceleration
- Zero Dependencies
- Files
- Building from Source
- Performance
- Contributing
- License
## Quick Start

```bash
npx tinygpt
```

Opens a live training dashboard at http://localhost:8080. Training takes 2–5 minutes on a modern CPU.
## Usage

```bash
npx tinygpt                  # Train on Shakespeare, open web dashboard
npx tinygpt mytext.txt       # Train on your own text
npx tinygpt --cli            # Train in terminal (no server)
npx tinygpt --cli mytext.txt # CLI mode with custom data
npx tinygpt --port=3000      # Custom port for web dashboard
```

## What You Get

Live training dashboard with real-time updates via SSE:
- 📈 Loss curve that drops as the model learns
- ✨ Generated samples updating every 10 steps so you can watch coherence emerge
- 🧠 Architecture panel with tooltips explaining every hyperparameter
- 🔤 Vocabulary display showing the character set the model learned
After training completes:
- 🎛️ Interactive prompt input with adjustable temperature (0.1 – 2.0)
- 🔥 Attention heatmaps per layer and head, showing which characters attend to which
- 📊 Next-character probability bars visualizing the model's confidence
- 🎯 Embedding scatter plot (PCA) showing how characters cluster in learned space
- 🌡️ Side-by-side final samples at different temperatures
## Architecture

The model is a 4-layer decoder-only transformer (~214K parameters):
| Component | Size |
|---|---|
| Embedding dimension | 64 |
| Attention heads | 4 (16-dim each) |
| Transformer layers | 4 |
| Context window | 128 characters |
| FFN intermediate | 256 |
| Vocabulary | Dynamic (unique chars in training text) |
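The ~214K figure can be sanity-checked from the table. The sketch below assumes a Shakespeare-sized vocabulary of about 65 unique characters and biases on every projection; since the vocabulary is dynamic, the exact count shifts with the training text:

```javascript
// Back-of-envelope parameter count for the table above. V ≈ 65 is an
// assumption (Shakespeare-style char vocab); the real V is dynamic.
const V = 65, d = 64, nLayers = 4, dFFN = 256, ctx = 128;

const embeddings = V * d + ctx * d;                    // token + position tables
const attn       = 4 * (d * d + d);                    // Q, K, V, O projections with bias
const ffn        = (d * dFFN + dFFN) + (dFFN * d + d); // two linear layers with bias
const layerNorms = 2 * 2 * d;                          // two LayerNorms (gain + bias)
const perLayer   = attn + ffn + layerNorms;
const head       = 2 * d + (d * V + V);                // final LayerNorm + output projection

const total = embeddings + nLayers * perLayer + head;
console.log(total); // ~217K under these assumptions; vocab/bias choices shift it by a few K
```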
Each layer:

```
LayerNorm → Multi-Head Causal Attention → Residual
→ LayerNorm → FFN (GELU) → Residual
```
Training: AdamW optimizer, batch size 32, learning rate 3e-4, 1000 steps, cross-entropy loss.
This is the same architecture as GPT-2/3/4. The only difference is scale.
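The two residual sub-layers can be sketched in plain JavaScript. This is a simplified single-head version with the Q/K/V/O projections omitted and the FFN collapsed to an elementwise GELU; the function names are illustrative, not the repo's API:

```javascript
const gelu = x => 0.5 * x * (1 + Math.tanh(Math.sqrt(2 / Math.PI) * (x + 0.044715 * x ** 3)));

function layerNorm(v) {
  const mean = v.reduce((a, b) => a + b, 0) / v.length;
  const varc = v.reduce((a, b) => a + (b - mean) ** 2, 0) / v.length;
  return v.map(x => (x - mean) / Math.sqrt(varc + 1e-5)); // gain=1, bias=0 for brevity
}

// Causal single-head self-attention over a list of token vectors.
// Q = K = V = x here, i.e. learned projections are omitted to keep it short.
function causalAttention(xs) {
  const d = xs[0].length;
  return xs.map((q, i) => {
    // score only against positions <= i: the causal mask
    const scores = xs.slice(0, i + 1).map(k =>
      q.reduce((s, qj, j) => s + qj * k[j], 0) / Math.sqrt(d));
    const m = Math.max(...scores);
    const exps = scores.map(s => Math.exp(s - m));
    const z = exps.reduce((a, b) => a + b, 0);
    const w = exps.map(e => e / z);            // softmax over visible positions
    return q.map((_, j) =>                     // weighted sum of values
      w.reduce((s, wi, t) => s + wi * xs[t][j], 0));
  });
}

function block(xs) {
  // x + Attn(LN(x)), then + FFN(LN(x)): the two pre-norm residual sub-layers
  const attnOut = causalAttention(xs.map(layerNorm));
  const afterAttn = xs.map((x, i) => x.map((v, j) => v + attnOut[i][j]));
  const ffnOut = afterAttn.map(x => layerNorm(x).map(gelu)); // real FFN has two matrices
  return afterAttn.map((x, i) => x.map((v, j) => v + ffnOut[i][j]));
}
```

The causal mask is what makes the block a language model: position `i` can only mix information from positions `0..i`, so changing later tokens never changes earlier outputs.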
## How It Works

```
characters → token embeddings + position embeddings
→ 4 × transformer block
→ final LayerNorm
→ output projection
→ softmax → loss / sample
```
Forward pass: characters → token embeddings + position embeddings → 4 transformer blocks → logits → softmax → loss.
Backward pass: full reverse-mode backpropagation through every layer, including the attention softmax Jacobian.
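Backpropagating through the softmax never requires materializing the full Jacobian. A sketch of the standard identity, with illustrative names:

```javascript
function softmax(xs) {
  const m = Math.max(...xs);                      // subtract max for stability
  const exps = xs.map(x => Math.exp(x - m));
  const z = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / z);
}

// The softmax Jacobian is J[i][j] = p[i] * (delta(i,j) - p[j]), so the
// vector-Jacobian product collapses to: dIn[i] = p[i] * (dOut[i] - <dOut, p>)
function softmaxBackward(p, dOut) {
  const dot = p.reduce((s, pi, i) => s + pi * dOut[i], 0);
  return p.map((pi, i) => pi * (dOut[i] - dot));
}
```

The same trick applies per attention row, which is why the backward pass stays O(n) in the row length rather than O(n²).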
Generation: autoregressive sampling with temperature scaling and a sliding context window.
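Temperature sampling in miniature. The function name is illustrative, and `model` in the trailing comment stands in for the forward pass:

```javascript
// Divide logits by T, softmax, then draw one index from the distribution.
// T < 1 sharpens the distribution (more conservative), T > 1 flattens it.
function sampleNext(logits, temperature = 1.0) {
  const scaled = logits.map(l => l / temperature);
  const m = Math.max(...scaled);
  const exps = scaled.map(l => Math.exp(l - m));
  const z = exps.reduce((a, b) => a + b, 0);
  let r = Math.random() * z;                     // inverse-CDF sampling
  for (let i = 0; i < exps.length; i++) {
    r -= exps[i];
    if (r <= 0) return i;
  }
  return exps.length - 1;
}

// Hypothetical autoregressive loop with a sliding 128-char window:
// while (out.length < n) out.push(sampleNext(model(ctx.slice(-128)), T));
```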
Training runs in a worker thread so the dashboard stays responsive. The main thread serves the UI and streams training updates over Server-Sent Events.
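Server-Sent Events are just a framed text stream over a kept-open HTTP response. A sketch of how a training update might be serialized; the field names are illustrative, not necessarily gpt.mjs's actual message shape:

```javascript
// Each SSE message is the text "data: <payload>\n\n" written to the response.
function sseFrame(update) {
  return `data: ${JSON.stringify(update)}\n\n`;
}

// Hypothetical wiring: the worker posts progress, the main thread fans it out.
// worker.on('message', update => {
//   for (const res of sseClients) res.write(sseFrame(update));
// });
```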
The code is thoroughly commented. Every function explains what it does and why. The WGSL shaders have descriptive variable names and ELI5 explanations.
## WebGPU Acceleration

The web dashboard ships with 18 hand-written WGSL compute shaders for GPU-accelerated inference:
- Matrix multiply, GELU, LayerNorm, attention scores, softmax, cross-entropy
- Falls back to CPU automatically if WebGPU isn't available
- Training always runs on CPU (worker thread); GPU is used for post-training generation
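The fallback check is small. `navigator.gpu` and `requestAdapter()` are real WebGPU API surface; the backend labels here are illustrative:

```javascript
// Returns which backend to use. In Node, or in any browser without WebGPU,
// navigator.gpu is absent and we fall back to the CPU path.
async function pickBackend() {
  if (typeof navigator === 'undefined' || !navigator.gpu) return 'cpu';
  const adapter = await navigator.gpu.requestAdapter(); // null when no usable GPU
  return adapter ? 'webgpu' : 'cpu';
}
```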
## Zero Dependencies

This is a load-bearing claim. The runtime ships in two files (gpt.mjs + index.html) and pulls in nothing at install time:

```bash
$ npm ls --prod
tinygpt@1.0.0
└── (empty)
```

CI enforces this on every push — if `dependencies` ever becomes non-empty, the build fails.
devDependencies (Playwright, used for headless integration tests) are only installed when you clone for development. End users running npx tinygpt get nothing but the two files.
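The CI gate amounts to one invariant. A hypothetical version of the check (the actual workflow script may differ; in CI it would be fed `JSON.parse(readFileSync('package.json', 'utf8'))`):

```javascript
// Fails loudly if package.json declares any runtime dependencies.
function assertZeroDeps(pkg) {
  const deps = Object.keys(pkg.dependencies ?? {});
  if (deps.length > 0) {
    throw new Error(`Runtime dependencies found: ${deps.join(', ')}`);
  }
}
```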
## Files

| File | What |
|---|---|
| `gpt.mjs` | Node.js entry point — training loop, HTTP server, SSE, worker thread, CLI mode |
| `index.html` | Web dashboard — loss chart, visualizations, WebGPU shaders, interactive generation |

Two files. That's it.
## Building from Source

There's no build step. It's vanilla JavaScript.

```bash
git clone https://github.com/ellyseum/tinygpt.git
cd tinygpt
node gpt.mjs
```

Requires Node.js 18+ (uses worker threads, ESM).
## Performance

| Mode | Hardware | Approximate time |
|---|---|---|
| Training (1000 steps) | Modern CPU (8-core) | 2–5 min |
| Inference (CPU) | Modern CPU | ~50ms / token |
| Inference (WebGPU) | Modern discrete GPU | ~5ms / token |
The CPU↔GPU gap grows with vocabulary size and context length. For a ~214K-param model with a 128-char window, the GPU win is real but not dramatic — the point of WebGPU here is to demonstrate that the same forward pass is implementable in WGSL, not to chase records.
## Contributing

PRs welcome. The bar:

- Zero runtime dependencies — CI enforces this.
- Two-file runtime — keep `gpt.mjs` + `index.html` as the deliverable.
- Comment the math — if you add a layer or shader, explain why it works.
See CONTRIBUTING.md for the full checklist.
## License

MIT © 2026 Ellyseum
