OpenLLM is an extensible server that optimizes interactions with Ollama, HuggingFace, llama.cpp (llama-server), and other OpenAI-compatible APIs. It is maintained and developed by contributors of Solace.
OpenLLM provides an inference optimization engine, a model router, and a model registry.
Start the inference engine:

openllm-server --port 8080

Use the TypeScript client:

import { createOpenLLMClient } from "@use-solace/openllm";

const client = createOpenLLMClient({ engine: 8080 });

const result = await client.inference({
  model_id: "llama3.1:8b",
  prompt: "Hello, how are you?",
});
OpenLLM's inference engine (openllm-server) is a Rust-based application that handles inference efficiently and streams tokens using HTTP SSE endpoints.
Install via cargo:
cargo install openllm-server

Run it:

openllm-server

Or run it on a custom port:

openllm-server --port 9242

The Model Registry is provided by the @use-solace/openllm package.
Install via npm (or your preferred JS package manager):
npm install @use-solace/openllm@latest # npm
bun add @use-solace/openllm@latest # bun (recommended)
pnpm add @use-solace/openllm@latest # pnpm
deno add @use-solace/openllm@latest # deno

Define your model registry:
import { ModelRegistry } from "@use-solace/openllm";
export const models = ModelRegistry(
{
"llama3.1-70b": {
inference: "ollama",
id: "llama3.1:70b",
context: 8192,
quant: "Q4_K_M",
capabilities: ["chat"], // chat, vision, embedding, completion
latency: "slow", // slow, fast, extreme
},
"llama3.1-8b": {
inference: "llama",
id: "llama3.1:8b",
context: 8192,
quant: "Q4_K_M",
capabilities: ["chat", "completion"],
latency: "fast",
},
}
);
// Usage examples
models.list(); // returns all models
models.get("llama3.1:70b"); // returns data for llama3.1:70b
models.find({ capability: "chat", latency: "fast" });

Notes:
- `inference` refers to which backend the model uses (`llama` or `ollama`).
- `capabilities` defines the supported tasks.
- `latency` is a routing hint (see the sketch below).
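To illustrate how the registry metadata can drive model selection, here is a minimal sketch that combines the registry with the TypeScript client described later in this document. It assumes `find` returns an array of registry entries and that each entry's `id` is the backend model identifier; the port and prompt are placeholders.

```ts
import { createOpenLLMClient } from "@use-solace/openllm";
import { models } from "./registry.ts";

// Assumed: `find` returns an array of registry entries whose `id` is the
// backend model identifier (e.g. "llama3.1:8b").
const fastChatModels = models.find({ capability: "chat", latency: "fast" });
if (fastChatModels.length === 0) throw new Error("no matching model registered");

// Route the request to the first matching model via the client.
const client = createOpenLLMClient({ engine: 8080 });
const result = await client.inference({
  model_id: fastChatModels[0].id,
  prompt: "Give me a one-line status update.",
});
console.log(result);
```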
OpenLLM provides an API via Elysia using the @use-solace/openllm package.
import { openllm } from "@use-solace/openllm";
import { models } from "./registry.ts";
const api = openllm.start({
modelrouter: true, // enable model router
registry: models, // model registry
engine: 4292 // openllm-server port
});
api.start(2921); // runs the API server on localhost:2921

Alternatively, mount OpenLLM as a plugin in an existing Elysia app:

import { Elysia } from "elysia";
import { openllm } from "@use-solace/openllm/elysia";
import { models } from "./registry.ts";
const app = new Elysia().use(openllm({
prefix: "ollm", // routes will be under /ollm/* instead of /openllm/*
modelrouter: true, // enable model router
registry: models, // model registry
engine: 4292 // openllm-server port
}));
app.listen(3000);
console.log("Server running on localhost:3000");| Method | Endpoint | Description |
|---|---|---|
| GET | /health |
Health check and status |
| GET | /v1/models |
List all registered models |
| POST | /v1/models/register |
Register a new model |
| POST | /v1/models/load |
Load a model into memory |
| POST | /v1/models/unload/:id |
Unload a model |
| POST | /v1/inference |
Non-streaming inference |
| POST | /v1/inference/stream |
Streaming inference (SSE) |
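Because the engine streams tokens over SSE, these endpoints can also be called without the TypeScript client. The following is a hedged sketch using plain `fetch` against a local openllm-server on port 8080; the request body fields mirror the client examples below, and the exact response and SSE payload shapes are assumptions.

```ts
const base = "http://localhost:8080"; // assumed openllm-server address

// Health check and model listing.
const health = await fetch(`${base}/health`).then((r) => r.json());
console.log("status:", health.status);
const registered = await fetch(`${base}/v1/models`).then((r) => r.json());
console.log("models:", registered);

// Streaming inference: read the raw SSE body chunk by chunk.
const res = await fetch(`${base}/v1/inference/stream`, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ model_id: "llama3.1:8b", prompt: "Hello!", max_tokens: 64 }),
});
const reader = res.body!.getReader();
const decoder = new TextDecoder();
while (true) {
  const { value, done } = await reader.read();
  if (done) break;
  // Each chunk contains one or more SSE "data: ..." lines (format assumed).
  process.stdout.write(decoder.decode(value, { stream: true }));
}
```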
Configure backend connections via environment variables:
| Variable | Default | Description |
|---|---|---|
| `OLLAMA_URL` | `http://localhost:11434` | Ollama API endpoint |
| `LLAMA_CPP_URL` | `http://localhost:8080` | llama.cpp server endpoint |
| `HUGGINGFACE_URL` | `https://api-inference.huggingface.co` | HuggingFace API endpoint |
| `OPENAI_URL` | `https://api.openai.com/v1` | OpenAI API endpoint |
| `HUGGINGFACE_TOKEN` | - | HuggingFace API token |
| `OPENAI_API_KEY` | - | OpenAI API key |
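These settings are plain environment variables, so they can be supplied however you normally manage configuration. As a hedged sketch (not part of the package), here is one way to set a few of them when launching openllm-server from a Node script:

```ts
import { spawn } from "node:child_process";

// Launch openllm-server with backend connections configured via environment variables.
const server = spawn("openllm-server", ["--port", "4292"], {
  env: {
    ...process.env,
    OLLAMA_URL: "http://localhost:11434",
    LLAMA_CPP_URL: "http://localhost:8080",
    // Only needed when routing to the hosted backends:
    HUGGINGFACE_TOKEN: process.env.HUGGINGFACE_TOKEN ?? "",
    OPENAI_API_KEY: process.env.OPENAI_API_KEY ?? "",
  },
  stdio: "inherit",
});

server.on("exit", (code) => console.log("openllm-server exited with code", code));
```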
Use the TypeScript client directly for programmatic access:
import { createOpenLLMClient } from "@use-solace/openllm";
const client = createOpenLLMClient({ engine: 8080 });
// Health check
const health = await client.health();
console.log("Status:", health.status);
// List models
const { models } = await client.listModels();
// Register a model
await client.registerModel({
id: "mistral",
name: "Mistral 7B",
inference: "ollama",
context: 8192,
capabilities: ["chat"],
});
// Load a model
await client.loadModel({ model_id: "mistral" });
// Run inference
const result = await client.inference({
model_id: "mistral",
prompt: "What is the meaning of life?",
max_tokens: 512,
temperature: 0.7,
});
// Stream inference
await client.inferenceStream(
{
model_id: "mistral",
prompt: "Tell me a story",
max_tokens: 1024,
},
{
onToken: (token) => process.stdout.write(token.token),
onComplete: (response) => console.log("\nDone:", response.tokens_generated),
onError: (error) => console.error("Error:", error.message),
},
);
// Unload a model
await client.unloadModel("mistral");import { createOpenLLMClient } from "@use-solace/openllm";
async function main() {
const client = createOpenLLMClient({ engine: 8080 });
try {
// Register
await client.registerModel({
id: "mistral",
name: "Mistral 7B",
inference: "ollama",
context: 8192,
capabilities: ["chat"],
});
// Load
await client.loadModel({ model_id: "mistral" });
// Stream inference
let fullResponse = "";
await client.inferenceStream(
{ model_id: "mistral", prompt: "Hello!", max_tokens: 256 },
{
onToken: (token) => {
process.stdout.write(token.token);
fullResponse += token.token;
},
onComplete: (resp) => console.log("\nTokens:", resp.tokens_generated),
},
);
// Cleanup
await client.unloadModel("mistral");
} catch (error) {
console.error("Error:", error);
}
}

main();

The package provides typed error classes:
import {
OpenLLMError,
ModelNotFoundError,
ModelNotLoadedError,
InferenceError,
} from "@use-solace/openllm";
try {
await client.inference({ model_id: "unknown", prompt: "test" });
} catch (error) {
if (error instanceof ModelNotFoundError) {
console.error("Model not found:", error.message);
console.error("Status code:", error.statusCode);
} else if (error instanceof ModelNotLoadedError) {
console.error("Model not loaded:", error.message);
await client.loadModel({ model_id: error.message.match(/'([^']+)'/)?.[1] ?? "" });
} else if (error instanceof InferenceError) {
console.error("Inference failed:", error.message);
} else if (error instanceof OpenLLMError) {
console.error("API error:", error.code, error.message);
}
}

OpenLLM consists of three main components: the inference engine, the model registry, and the API layer.

Inference engine (openllm-server):
- Written in Rust for maximum performance
- Supports SSE streaming for real-time token output
- Multi-backend support (Ollama, llama.cpp, HuggingFace, OpenAI)
- Configurable port and logging levels
Model registry (@use-solace/openllm):

- In-memory model catalog with filtering
- Model routing based on capabilities and latency
- Full TypeScript type safety
API layer (Elysia):

- HTTP API with optional model router
- Easy integration with existing Elysia apps
- Customizable route prefixes
- Streaming and non-streaming endpoints
Build the inference engine:

cd engine
cargo build --release

Build the npm package:

cd npm
npm install
npm run build

Run the engine tests:

cd engine
cargo test

Run the package tests:

cd npm
npm test

License: MIT