Skip to content

computerex/dlgo

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

63 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

dlgo

Fast LLM inference in pure Go. Run GGUF models locally on CPU or GPU — no Python, no CUDA, no dependencies.

Quick Start

# Install
go install github.com/computerex/dlgo/cmd/dlgo@latest

# Chat with a model (like ollama)
dlgo run model.gguf

# Chat with GPU acceleration
dlgo run model.gguf --gpu

# Start web server with UI
dlgo server --model model.gguf --gpu

CLI Usage

dlgo run — Interactive Chat

Start an Ollama-style interactive chat session:

$ dlgo run qwen3.5-0.8b-q8_0.gguf --gpu

Loading qwen3.5-0.8b-q8_0.gguf (2.3s)

  Model:     qwen3.5
  Params:    24 layers, 1024 dim, 16 heads, vocab 151936
  Context:   8192 tokens
  Backend:   GPU (NVIDIA GeForce RTX 4070 Ti SUPER)
  Sampling:  temp=0.70 top-k=40 top-p=0.90

Type /help for commands, or start chatting.

>>> What is the capital of France?
The capital of France is Paris.

  45.2 tok/s | 32 tokens | 0.7s

>>> /help

  /help          Show this help
  /info          Show model info
  /clear         Clear conversation history
  /set temp N    Set temperature
  /set tokens N  Set max tokens
  /exit          Quit

Run flags:

Flag Default Description
--gpu false Use Vulkan GPU backend
--ctx N 8192 Context length (tokens)
--max-tokens N 512 Max tokens per response
--temp T 0.7 Sampling temperature (0 = greedy)
--top-k K 40 Top-K sampling
--top-p P 0.9 Nucleus sampling
--min-p P 0.0 Min-P threshold
--repeat-penalty R 1.1 Repetition penalty
--system "..." "You are a helpful assistant." System prompt
--threads N auto Worker threads
--no-stream false Disable token streaming

dlgo server — Web UI & API

Start an HTTP server with a built-in chat web interface and an OpenAI-compatible API:

$ dlgo server --model model.gguf --gpu --port 8080

  dlgo server v0.1.0
  linux/amd64, 16 cores

  Web UI:  http://localhost:8080
  API:     http://localhost:8080/v1/chat/completions
  Health:  http://localhost:8080/health

The web UI lets you:

  • Load and unload models dynamically
  • Toggle between CPU and GPU backends
  • Adjust temperature, top-p, top-k, max tokens
  • Set system prompts
  • Stream responses with live performance metrics

Server flags:

Flag Default Description
--model <path> GGUF model to pre-load
--gpu false Use Vulkan GPU backend
--host ADDR 0.0.0.0 Bind address
--port PORT 8080 Listen port
--ctx N 2048 Context length
--frontend <dir> auto-detect Path to frontend dist/

API endpoints:

Endpoint Method Description
/v1/chat/completions POST OpenAI-compatible chat (streaming & non-streaming)
/v1/models GET List loaded models
/v1/models POST Load a model
/v1/models DELETE Unload a model
/health GET Health check

Example API call:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3.5-0.8b",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'

dlgo info — Model Metadata

$ dlgo info model.gguf

File:           model.gguf
GGUF version:   3
Tensors:        291
Metadata keys:  26

general.architecture               qwen3
qwen3.context_length               32768
qwen3.embedding_length             1024
qwen3.block_count                  24

Go Library

Use dlgo as a Go package for embedding inference in your applications:

model, err := dlgo.LoadLLM("model.gguf")
if err != nil {
    log.Fatal(err)
}

// Single-turn chat
response, _ := model.Chat("You are helpful.", "What is Go?")
fmt.Println(response)

// Streaming
model.ChatStream("", "Write a poem.", func(token string) {
    fmt.Print(token)
}, dlgo.WithMaxTokens(256))

// Multi-turn conversation
response, _ = model.ChatMessages([]dlgo.Message{
    {Role: "system", Content: "You are a pirate."},
    {Role: "user", Content: "Tell me about the sea."},
    {Role: "assistant", Content: "Arrr, the sea be vast!"},
    {Role: "user", Content: "What about treasure?"},
}, dlgo.WithMaxTokens(128))

Options: WithMaxTokens(n), WithTemperature(t), WithTopK(k), WithTopP(p), WithSeed(s), WithGreedy()

Building from Source

# CPU only (portable static binary with internal linking)
go build -ldflags "-linkmode internal" -o dlgo ./cmd/dlgo/

# With Vulkan GPU support
go build -tags vulkan -ldflags "-linkmode internal" -o dlgo ./cmd/dlgo/

# Build the web frontend (requires Node.js)
cd frontend && npm install && npm run build && cd ..

# Run server with frontend
dlgo server --model model.gguf --gpu --frontend frontend/dist

Prerequisites

  • Go 1.21+
  • Vulkan SDK (optional, for GPU support) — install from vulkan.lunarg.com
  • Node.js 18+ (optional, for building the web frontend)

Features

  • 25+ quantization formats — Q4_0 through Q8_0, K-quants (Q2_K–Q8_K), I-quants (IQ1_S–IQ4_XS), MXFP4, F16, BF16, F32
  • Vulkan GPU inference — full Vulkan compute backend with quantized MatVec shaders, fused attention, dp4a integer dot products
  • Never-OOM GPU — automatic VRAM budget with graceful fallback to partial GPU + CPU
  • Mixture of Experts (MoE) — fused multi-expert GPU dispatch, GPU-side top-K routing, zero CPU-GPU sync
  • Hybrid SSM+Attention — Gated Delta Net recurrent layers (Qwen3.5, Qwen3-Coder-Next)
  • Multi-head Latent Attention (MLA) — compressed KV cache (DeepSeek-V2, GLM-4.7)
  • Fast CPU — AVX2/FMA/VNNI SIMD, parallel worker pools, batch prefill GEMM
  • Speech-to-text — Whisper transcription
  • Voice activity detection — Silero VAD

Supported Architectures

Architecture Example Models CPU tok/s GPU tok/s
LLaMA Llama 3.2 1B, TinyLlama 1.1B 52–65 314–422
Qwen2/3 Qwen 2.5 0.5B, Qwen3 0.6B 60–98 241–411
Qwen3 MoE Qwen3-Coder-30B-A3B (128 experts) ~5.2 ~40
Qwen3.5 Qwen3.5 0.8B–27B (hybrid GDN+attention) 2.4–34 19–287
Qwen3.5 MoE Qwen3.5 35B-A3B, 122B-A10B 1.4–4.1 2–11
GLM-4.7 GLM-4.7 Flash (MLA + MoE) ~5.6 ~15
gpt-oss gpt-oss-20b (MoE, attention sinks) 4.5–5.6 33–52
Gemma 2/3 Gemma 3 270M–1B 44–154 249–530
SmolLM2 SmolLM2 360M–1.7B 42–96 177–411
Phi Phi-2, Phi-4-mini 9–20 ~125
Whisper Tiny, Base, Small ~1x RT

GPU benchmarks on NVIDIA RTX 4070 Ti SUPER. CPU with AVX2+FMA SIMD.

Benchmarks vs Ollama

Same GGUF file, temperature=0, seed=42, max_tokens=64.

GPU (Vulkan vs Ollama Vulkan)

Model Quant dlgo Ollama Delta
TinyLlama 1.1B Q4_0 423 tok/s 187 tok/s +126%
Gemma 3 1B Q4_K_M 245 tok/s 116 tok/s +111%
Qwen 2.5 0.5B Q4_K_M 394 tok/s 237 tok/s +66%

GPU (Vulkan vs Ollama CUDA)

Model Quant dlgo Vulkan Ollama CUDA Delta
Qwen3.5 0.8B Q8_0 287 tok/s 250 tok/s +15%
Qwen3.5 27B Q3_K_M 6.4 tok/s 4.0 tok/s +60%

CPU

Model Quant dlgo Ollama Delta
Qwen3.5 0.8B Q8_0 33.2 tok/s 27.3 tok/s +22%
Qwen3.5 9B Q3_K_M 7.5 tok/s 7.2 tok/s +4%
Qwen3.5 27B Q3_K_M 2.5 tok/s 2.4 tok/s +4%
Qwen3.5 35B MoE Q3_K_M 8.1 tok/s 7.6 tok/s +7%

Project Structure

cmd/dlgo/        CLI entry point (run, server, info)
cmd/dlgo-server/ Standalone HTTP server binary
server/          HTTP server, model manager, scheduler
frontend/        React + Vite web UI
dlgo.go          High-level Go API (LoadLLM, Chat, Generate)
models/llm/      LLM pipeline (tokenizer, forward, generation)
models/whisper/  Whisper speech-to-text
models/silero/   Silero VAD
gpu/             Vulkan compute backend
format/gguf/     GGUF v2/v3 parser
quant/           25+ quantization formats, SIMD dot products
blas/            Matrix-vector multiply, worker pool
ops/             Sampling, RoPE, norms, activations
memory/          KV cache, buffer pool
layers/          Conv1D, LSTM, GRU, MHA, GQA
audio/           WAV loading, mel spectrogram
examples/        Ready-to-run examples
bench/           Benchmark scripts and results
docs/            Additional documentation

License

MIT

About

Golang inference engine and deep learning primitives

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages