Neurons

A from-scratch LLM inference engine and chat application, built to understand how large language models work at the hardware level. It talks to Metal/MLX directly today, with cuBLAS and flash-attention backends planned, rather than wrapping llama.cpp or Ollama.

What it is

Neurons is a full-stack local AI system:

compute/ — C++23 inference library. Implements the transformer forward pass from first principles: quantized matmul, RoPE, RMSNorm, KV cache, sampling. Pluggable backends (ComputeBackend interface).
service/ — gRPC inference server (neurons-service) + OpenAI-compatible HTTP endpoint. Runs on any machine on your network.
cli/ — Terminal interface. Chat, download models, manage nodes, start a server.
gui/ — Flutter macOS app. Chat UI, model browser, multi-node management, live tok/s stats.
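
To make two of the compute/ primitives named above concrete, here is a minimal pure-Python sketch of RMSNorm and rotary position embeddings (RoPE). It is an illustration only, not the project's C++ code: the function names, the `eps` default, and the base of 10000 are assumptions, and the RoPE layout shown pairs element `i` with element `i + d/2` (some implementations interleave adjacent pairs instead).

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root-mean-square, then by a learned weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

def rope(x, position, base=10000.0):
    """Rotate each pair (x[i], x[i + d/2]) by an angle that grows with position."""
    d = len(x)
    half = d // 2
    out = list(x)
    for i in range(half):
        # Lower-indexed pairs rotate fastest; higher-indexed pairs rotate slowly.
        theta = position * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + half] * s
        out[i + half] = x[i] * s + x[i + half] * c
    return out
```

Because RoPE is a pure rotation, it changes the direction of a query or key vector but not its length, which is why it can encode position without disturbing attention score magnitudes.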

The GUI never links C++ directly. Locally it calls libneurons_core.dylib over dart:ffi; against remote machines it uses gRPC. The same NeuronsClient interface covers both.

Feature highlights

| Feature | GUI | CLI | gRPC |
| --- | --- | --- | --- |
| Multi-turn chat | ✅ | ✅ | ✅ |
| Streaming generation | ✅ | ✅ | ✅ |
| Live tok/s + token counts | ✅ | ✅ | ✅ |
| Model download from HuggingFace | ✅ | ✅ | ✅ |
| Model search + browser | ✅ | ✅ | ✅ |
| HuggingFace auth (gated models) | ✅ | ✅ | ✅ |
| Sampling params (temp, top-p, top-k, rep-penalty) | ✅ | ✅ | ✅ |
| Multi-session chat history (JSON persistence) | ✅ | ✅ | — |
| Multi-node management | ✅ | ✅ | — |
| OpenAI-compatible HTTP endpoint | — | ✅ | — |
| Remote log streaming | ✅ | — | ✅ |
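
For readers unfamiliar with the sampling parameters in the table, the following pure-Python sketch shows one common way they combine. It is not taken from the Neurons source: the function name, the default values, the order of the filters, and the penalty rule (divide positive logits, multiply negative ones) are all assumptions chosen for illustration.

```python
import math
import random

def sample_token(logits, history=(), temperature=0.8, top_k=40, top_p=0.95,
                 rep_penalty=1.1, rng=random):
    """Pick a token id from raw logits using common sampling filters."""
    logits = list(logits)
    # Repetition penalty: make already-generated tokens less likely.
    for t in set(history):
        logits[t] = logits[t] / rep_penalty if logits[t] > 0 else logits[t] * rep_penalty
    # A temperature of 0 is treated as greedy decoding.
    if temperature <= 0:
        return max(range(len(logits)), key=logits.__getitem__)
    # Top-k: keep the k highest-scoring candidates, best first (0 disables).
    order = sorted(range(len(logits)), key=logits.__getitem__, reverse=True)
    if top_k > 0:
        order = order[:top_k]
    # Softmax over the survivors at the given temperature.
    peak = logits[order[0]]
    weights = [math.exp((logits[i] - peak) / temperature) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cum = 0, 0.0
    for p in probs:
        kept += 1
        cum += p
        if cum >= top_p:
            break
    return rng.choices(order[:kept], weights=probs[:kept], k=1)[0]
```

Lower temperatures and smaller top-k or top-p values make output more deterministic; passing a seeded `random.Random` instance as `rng` makes a run reproducible.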

Supported models

| Family | Example repos | Backend |
| --- | --- | --- |
| Llama 2/3, TinyLlama | `mlx-community/Llama-3.2-3B-Instruct-4bit` | MLX |
| Mistral | `mlx-community/Mistral-7B-Instruct-v0.3-4bit` | MLX |
| Qwen2 / Qwen2.5 | `mlx-community/Qwen2.5-7B-Instruct-4bit` | MLX |
| Gemma / Gemma2 / Gemma3 | `mlx-community/gemma-3-1b-it-qat-4bit` | MLX |
| fp16 / bf16 unquantized | any base HuggingFace safetensors repo | MLX |

Models are downloaded directly from HuggingFace, typically as mlx-community MLX-quantized variants for Apple Silicon; unquantized fp16/bf16 safetensors repos load as well. CUDA (cuBLAS + flash-attention) and ROCm backends are on the roadmap.
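
When choosing between a 4-bit and an unquantized repo, a back-of-the-envelope memory estimate helps. The sketch below assumes group-wise quantization with 64 weights per group and one 16-bit scale plus one 16-bit bias per group, which works out to 4.5 effective bits per weight; those figures, the function name, and the nominal parameter counts are assumptions, and the real footprint also includes the KV cache, activations, and any layers left unquantized.

```python
def weight_bytes(n_params, bits=4, group_size=64, meta_bits=32):
    """Approximate bytes needed to hold the model weights.

    meta_bits is the assumed per-group overhead (16-bit scale + 16-bit bias).
    Pass group_size=None for unquantized weights.
    """
    per_weight = bits if group_size is None else bits + meta_bits / group_size
    return n_params * per_weight / 8

# A nominal 7B-parameter model, 4-bit versus fp16/bf16.
quantized = weight_bytes(7_000_000_000)
unquantized = weight_bytes(7_000_000_000, bits=16, group_size=None)
print(f"4-bit: {quantized / 1e9:.1f} GB, fp16: {unquantized / 1e9:.1f} GB")
```

Under these assumptions the 4-bit variant needs a little under 4 GB for weights, against 14 GB for fp16.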

Architecture

┌──────────────────────────────────────────────────────┐
│  Flutter GUI  (macOS / Windows / Linux / mobile)     │
│  dart:ffi (local) · gRPC (remote nodes)              │
└──────────────────────┬───────────────────────────────┘
                       │ dart:ffi / gRPC
┌──────────────────────▼───────────────────────────────┐
│  libneurons_core  (C FFI surface)                    │
│  NeuronsServiceImpl → LanguageModel::load()          │
└──────────────────────┬───────────────────────────────┘
                       │  LanguageModel (interface)
          ┌────────────┼────────────┐
          ▼            ▼            ▼
    LlamaModel     GemmaModel    (future)
    Llama/Mistral  Gemma 1-3
    Qwen2/2.5      GeGLU/QKV-norm
          └────────────┬────────────┘
                       │  ComputeBackend (interface)
          ┌────────────┼────────────┬──────────────┐
          ▼            ▼            ▼              ▼
    MLXBackend    CUDABackend  ROCmBackend   CPUBackend
    (done)        (roadmap)    (roadmap)     (roadmap)
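
The diagram shows several model families sharing one implementation class. One plausible way for a loader to make that choice is to read the `model_type` field of the repo's config.json; the Python sketch below illustrates the idea. The mapping table and the `pick_model_class` helper are hypothetical and are not taken from the Neurons source, where `LanguageModel::load()` may decide differently.

```python
import json

# Assumed mapping from config.json "model_type" to the classes in the diagram.
FAMILIES = {
    "llama": "LlamaModel",
    "mistral": "LlamaModel",
    "qwen2": "LlamaModel",
    "gemma": "GemmaModel",
    "gemma2": "GemmaModel",
    "gemma3": "GemmaModel",
    "gemma3_text": "GemmaModel",
}

def pick_model_class(config_text):
    """Return the class name that would handle a HuggingFace config.json."""
    model_type = json.loads(config_text).get("model_type", "")
    try:
        return FAMILIES[model_type]
    except KeyError:
        # Fail loudly rather than guess at an unknown architecture.
        raise ValueError(f"unsupported model_type: {model_type!r}") from None
```

Keeping the mapping in one table means a new family that reuses an existing architecture needs only a new entry, while a genuinely new architecture needs a new class.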

Prerequisites

macOS (Apple Silicon) — primary platform

# Xcode command line tools
xcode-select --install

# Homebrew dependencies
brew install cmake grpc protobuf

# Flutter SDK
# https://docs.flutter.dev/get-started/install/macos

Linux / Windows — CUDA/ROCm backends are on the roadmap. The gRPC service builds today; MLX inference requires Apple Silicon.

Building

git clone https://github.com/dexwritescode/neurons.git
cd neurons

All C++ and Flutter targets are driven from the root Makefile:

make help          # list all targets

make all           # build compute + CLI + service
make cli           # CLI only
make service       # gRPC service only
make dylib         # libneurons_core.dylib (Flutter FFI dependency)

make tests         # build and run all C++ tests
make flutter-test  # run Flutter widget + unit tests

make run           # build dylib + launch Flutter app (debug)
make gui           # build dylib + Flutter macOS release app

Quick start

Download and run a model in the terminal

# Build the CLI
make cli

# Search for models
./build/bin/cli search "qwen 3b"

# Download one
./build/bin/cli download mlx-community/Qwen2.5-3B-Instruct-4bit

# Chat
./build/bin/cli chat mlx-community/Qwen2.5-3B-Instruct-4bit

Run the GUI

make run

The app opens on the Chats screen. Go to Browse to search HuggingFace, download a model, then return to Chats — the model loads automatically when selected.

Run as a server (OpenAI-compatible)

# Start with an HTTP endpoint on port 8080
./build/bin/cli server --http-port 8080 --model mlx-community/Qwen2.5-3B-Instruct-4bit

# Point any OpenAI client at it
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"local","messages":[{"role":"user","content":"Hello"}],"stream":true}'

Works with Cursor's "local model" setting, Continue.dev, and any client that supports the OpenAI chat completions API.
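
The same request can be made from a script without any third-party packages. The sketch below uses only the Python standard library and assumes the endpoint streams OpenAI-style server-sent events, that is, `data: {...}` lines carrying `choices[0].delta.content` and a final `data: [DONE]`. The helper names are hypothetical, and the exact chunk format the Neurons server emits has not been verified here.

```python
import json
import urllib.request

def build_request(prompt, base_url="http://localhost:8080", model="local"):
    """Build a streaming chat-completions request like the curl call above."""
    body = {"model": model, "stream": True,
            "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def iter_deltas(lines):
    """Yield text fragments from server-sent-event lines (bytes or str)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        choices = json.loads(payload).get("choices") or [{}]
        text = choices[0].get("delta", {}).get("content")
        if text:
            yield text

def chat(prompt):
    """Send one prompt and print the reply as it streams in."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        for piece in iter_deltas(resp):
            print(piece, end="", flush=True)
    print()
```

With the server from the previous step running, `chat("Hello")` prints the reply token by token.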

CLI reference

neurons chat    <model>          Interactive multi-turn chat
neurons load    <model>          One-shot inference with --prompt
neurons search  <query>          Search HuggingFace
neurons download <repo-id>       Download a model
neurons list                     List local models
neurons server  [--http-port N]  Start gRPC + HTTP server
neurons node    add/remove/list  Manage remote nodes
neurons token   set/clear        HuggingFace auth token
neurons config  show/set         Configuration

Remote nodes

Neurons supports connecting multiple machines as inference nodes. Each node runs neurons-service; the GUI and CLI connect to all of them and route requests.

# On the remote machine
neurons server --grpc-port 50051 --http-port 8080

# On your laptop — add the node in the GUI (Nodes tab)
# or via CLI:
neurons node add my-server grpc://192.168.1.10:50051
neurons node use my-server

Project layout

Neurons/
  compute/    C++ inference library (backends, models, tokenizer, sampler)
  cli/        CLI binary — links compute directly
  service/    gRPC server + OpenAI HTTP server + C FFI surface
  gui/        Flutter macOS app
  models/     HuggingFace client (search, download, metadata)
  Makefile    All build targets

Roadmap

| Phase | Status | Description |
| --- | --- | --- |
| A–E | ✅ | MLX backend, KV cache, sampling, Llama/Gemma/Qwen/Mistral |
| F | ✅ | Model family support (fp16/bf16, Gemma3, Qwen2.5) |
| G–I | ✅ | gRPC service, Flutter GUI, CLI, OpenAI HTTP, logging |
| J | 🚧 | File attach + RAG (embeddings, sqlite-vec) |
| K | 🚧 | Multi-node: routing, speculative decoding, failover |
| L | 🚧 | MCP client + tool use (filesystem, shell, custom servers) |
| B/C | 🚧 | CUDA (cuBLAS + flash-attention) and ROCm backends |

Contributing

The project is structured so each layer can be understood and modified independently:

Add a new model family — implement LanguageModel in compute/, add to the load() factory, write an integration test.
Add a new backend — implement ComputeBackend, wire into BackendFactory.
Add a new CLI command — add a command file in cli/src/cli/commands/, register in main.cpp.
Extend the GUI — gui/lib/ is a standard Flutter project; NeuronsClient is the interface to mock for tests.

All three interfaces (GUI, CLI, gRPC) must be updated together for any user-facing feature.

License

MIT — see LICENSE.
