dlgo

Fast LLM inference in pure Go. Run GGUF models locally on CPU or GPU — no Python, no CUDA, no dependencies.

Quick Start

# Install
go install github.com/computerex/dlgo/cmd/dlgo@latest

# Chat with a model (like ollama)
dlgo run model.gguf

# Chat with GPU acceleration
dlgo run model.gguf --gpu

# Start web server with UI
dlgo server --model model.gguf --gpu

CLI Usage

`dlgo run` — Interactive Chat

Start an Ollama-style interactive chat session:

$ dlgo run qwen3.5-0.8b-q8_0.gguf --gpu

Loading qwen3.5-0.8b-q8_0.gguf (2.3s)

  Model:     qwen3.5
  Params:    24 layers, 1024 dim, 16 heads, vocab 151936
  Context:   8192 tokens
  Backend:   GPU (NVIDIA GeForce RTX 4070 Ti SUPER)
  Sampling:  temp=0.70 top-k=40 top-p=0.90

Type /help for commands, or start chatting.

>>> What is the capital of France?
The capital of France is Paris.

  45.2 tok/s | 32 tokens | 0.7s

>>> /help

  /help          Show this help
  /info          Show model info
  /clear         Clear conversation history
  /set temp N    Set temperature
  /set tokens N  Set max tokens
  /exit          Quit

Run flags:

Flag	Default	Description
`--gpu`	false	Use Vulkan GPU backend
`--ctx N`	8192	Context length (tokens)
`--max-tokens N`	512	Max tokens per response
`--temp T`	0.7	Sampling temperature (0 = greedy)
`--top-k K`	40	Top-K sampling
`--top-p P`	0.9	Nucleus sampling
`--min-p P`	0.0	Min-P threshold
`--repeat-penalty R`	1.1	Repetition penalty
`--system "..."`	"You are a helpful assistant."	System prompt
`--threads N`	auto	Worker threads
`--no-stream`	false	Disable token streaming

`dlgo server` — Web UI & API

Start an HTTP server with a built-in chat web interface and an OpenAI-compatible API:

$ dlgo server --model model.gguf --gpu --port 8080

  dlgo server v0.1.0
  linux/amd64, 16 cores

  Web UI:  http://localhost:8080
  API:     http://localhost:8080/v1/chat/completions
  Health:  http://localhost:8080/health

The web UI lets you:

Load and unload models dynamically
Toggle between CPU and GPU backends
Adjust temperature, top-p, top-k, max tokens
Set system prompts
Stream responses with live performance metrics

Server flags:

Flag	Default	Description
`--model <path>`		GGUF model to pre-load
`--gpu`	false	Use Vulkan GPU backend
`--host ADDR`	0.0.0.0	Bind address
`--port PORT`	8080	Listen port
`--ctx N`	2048	Context length
`--frontend <dir>`	auto-detect	Path to frontend dist/

API endpoints:

Endpoint	Method	Description
`/v1/chat/completions`	POST	OpenAI-compatible chat (streaming & non-streaming)
`/v1/models`	GET	List loaded models
`/v1/models`	POST	Load a model
`/v1/models`	DELETE	Unload a model
`/health`	GET	Health check

Example API call:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-0.8b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'

`dlgo info` — Model Metadata

$ dlgo info model.gguf

File:           model.gguf
GGUF version:   3
Tensors:        291
Metadata keys:  26

general.architecture               qwen3
qwen3.context_length               32768
qwen3.embedding_length             1024
qwen3.block_count                  24

Go Library

Use dlgo as a Go package for embedding inference in your applications:

model, err := dlgo.LoadLLM("model.gguf")
if err != nil {
    log.Fatal(err)
}

// Single-turn chat
response, _ := model.Chat("You are helpful.", "What is Go?")
fmt.Println(response)

// Streaming
model.ChatStream("", "Write a poem.", func(token string) {
    fmt.Print(token)
}, dlgo.WithMaxTokens(256))

// Multi-turn conversation
response, _ = model.ChatMessages([]dlgo.Message{
    {Role: "system", Content: "You are a pirate."},
    {Role: "user", Content: "Tell me about the sea."},
    {Role: "assistant", Content: "Arrr, the sea be vast!"},
    {Role: "user", Content: "What about treasure?"},
}, dlgo.WithMaxTokens(128))

Options: WithMaxTokens(n), WithTemperature(t), WithTopK(k), WithTopP(p), WithSeed(s), WithGreedy()

Building from Source

# CPU only (portable static binary with internal linking)
go build -ldflags "-linkmode internal" -o dlgo ./cmd/dlgo/

# With Vulkan GPU support
go build -tags vulkan -ldflags "-linkmode internal" -o dlgo ./cmd/dlgo/

# Build the web frontend (requires Node.js)
cd frontend && npm install && npm run build && cd ..

# Run server with frontend
dlgo server --model model.gguf --gpu --frontend frontend/dist

Prerequisites

Go 1.21+
Vulkan SDK (optional, for GPU support) — install from vulkan.lunarg.com
Node.js 18+ (optional, for building the web frontend)

Features

25+ quantization formats — Q4_0 through Q8_0, K-quants (Q2_K–Q8_K), I-quants (IQ1_S–IQ4_XS), MXFP4, F16, BF16, F32
Vulkan GPU inference — full Vulkan compute backend with quantized MatVec shaders, fused attention, dp4a integer dot products
Never-OOM GPU — automatic VRAM budget with graceful fallback to partial GPU + CPU
Mixture of Experts (MoE) — fused multi-expert GPU dispatch, GPU-side top-K routing, zero CPU-GPU sync
Hybrid SSM+Attention — Gated Delta Net recurrent layers (Qwen3.5, Qwen3-Coder-Next)
Multi-head Latent Attention (MLA) — compressed KV cache (DeepSeek-V2, GLM-4.7)
Fast CPU — AVX2/FMA/VNNI SIMD, parallel worker pools, batch prefill GEMM
Speech-to-text — Whisper transcription
Voice activity detection — Silero VAD

Supported Architectures

Architecture	Example Models	CPU tok/s	GPU tok/s
LLaMA	Llama 3.2 1B, TinyLlama 1.1B	52–65	314–422
Qwen2/3	Qwen 2.5 0.5B, Qwen3 0.6B	60–98	241–411
Qwen3 MoE	Qwen3-Coder-30B-A3B (128 experts)	~5.2	~40
Qwen3.5	Qwen3.5 0.8B–27B (hybrid GDN+attention)	2.4–34	19–287
Qwen3.5 MoE	Qwen3.5 35B-A3B, 122B-A10B	1.4–4.1	2–11
GLM-4.7	GLM-4.7 Flash (MLA + MoE)	~5.6	~15
gpt-oss	gpt-oss-20b (MoE, attention sinks)	4.5–5.6	33–52
Gemma 2/3	Gemma 3 270M–1B	44–154	249–530
SmolLM2	SmolLM2 360M–1.7B	42–96	177–411
Phi	Phi-2, Phi-4-mini	9–20	~125
Whisper	Tiny, Base, Small	~1x RT	—

GPU benchmarks on NVIDIA RTX 4070 Ti SUPER. CPU with AVX2+FMA SIMD.

Benchmarks vs Ollama

Same GGUF file, temperature=0, seed=42, max_tokens=64.

GPU (Vulkan vs Ollama Vulkan)

Model	Quant	dlgo	Ollama	Delta
TinyLlama 1.1B	Q4_0	423 tok/s	187 tok/s	+126%
Gemma 3 1B	Q4_K_M	245 tok/s	116 tok/s	+111%
Qwen 2.5 0.5B	Q4_K_M	394 tok/s	237 tok/s	+66%

GPU (Vulkan vs Ollama CUDA)

Model	Quant	dlgo Vulkan	Ollama CUDA	Delta
Qwen3.5 0.8B	Q8_0	287 tok/s	250 tok/s	+15%
Qwen3.5 27B	Q3_K_M	6.4 tok/s	4.0 tok/s	+60%

CPU

Model	Quant	dlgo	Ollama	Delta
Qwen3.5 0.8B	Q8_0	33.2 tok/s	27.3 tok/s	+22%
Qwen3.5 9B	Q3_K_M	7.5 tok/s	7.2 tok/s	+4%
Qwen3.5 27B	Q3_K_M	2.5 tok/s	2.4 tok/s	+4%
Qwen3.5 35B MoE	Q3_K_M	8.1 tok/s	7.6 tok/s	+7%

Project Structure

cmd/dlgo/        CLI entry point (run, server, info)
cmd/dlgo-server/ Standalone HTTP server binary
server/          HTTP server, model manager, scheduler
frontend/        React + Vite web UI
dlgo.go          High-level Go API (LoadLLM, Chat, Generate)
models/llm/      LLM pipeline (tokenizer, forward, generation)
models/whisper/  Whisper speech-to-text
models/silero/   Silero VAD
gpu/             Vulkan compute backend
format/gguf/     GGUF v2/v3 parser
quant/           25+ quantization formats, SIMD dot products
blas/            Matrix-vector multiply, worker pool
ops/             Sampling, RoPE, norms, activations
memory/          KV cache, buffer pool
layers/          Conv1D, LSTM, GRU, MHA, GQA
audio/           WAV loading, mel spectrogram
examples/        Ready-to-run examples
bench/           Benchmark scripts and results
docs/            Additional documentation

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 63 Commits
audio		audio
bench/scripts		bench/scripts
blas		blas
cmd		cmd
core		core
decode		decode
docs		docs
examples		examples
format		format
frontend		frontend
gpu		gpu
layers		layers
memory		memory
mmap		mmap
models		models
ops		ops
quant		quant
server		server
testdata		testdata
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
coherence_all_models.json		coherence_all_models.json
dlgo.go		dlgo.go
go.mod		go.mod
go.sum		go.sum
regression_results.json		regression_results.json
show_results.py		show_results.py
test_coherence_all_models.py		test_coherence_all_models.py
test_coherency_long.py		test_coherency_long.py
test_regression.py		test_regression.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

dlgo

Quick Start

CLI Usage

`dlgo run` — Interactive Chat

`dlgo server` — Web UI & API

`dlgo info` — Model Metadata

Go Library

Building from Source

Prerequisites

Features

Supported Architectures

Benchmarks vs Ollama

GPU (Vulkan vs Ollama Vulkan)

GPU (Vulkan vs Ollama CUDA)

CPU

Project Structure

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 1

Languages

Folders and files

Latest commit

History

Repository files navigation

dlgo

Quick Start

CLI Usage

dlgo run — Interactive Chat

dlgo server — Web UI & API

dlgo info — Model Metadata

Go Library

Building from Source

Prerequisites

Features

Supported Architectures

Benchmarks vs Ollama

GPU (Vulkan vs Ollama Vulkan)

GPU (Vulkan vs Ollama CUDA)

CPU

Project Structure

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 1

Languages

`dlgo run` — Interactive Chat

`dlgo server` — Web UI & API

`dlgo info` — Model Metadata

Packages