
OpenLLM

OpenLLM is an extensible server that optimizes interactions with Ollama, HuggingFace, llama.cpp (llama-server), and other OpenAI-compatible APIs. It is developed and maintained by Solace contributors.

OpenLLM provides an inference optimization engine, a model router, and a model registry.

Quick Start

  1. Start the inference engine:

    openllm-server --port 8080
  2. Use the TypeScript client:

    import { createOpenLLMClient } from "@use-solace/openllm";
    
    const client = createOpenLLMClient({ engine: 8080 });
    const result = await client.inference({
      model_id: "llama3.1:8b",
      prompt: "Hello, how are you?",
    });

Table of Contents

  • Engine
  • Model Registry
  • API
  • API Endpoints
  • Environment Variables
  • Direct Client Usage
  • Complete Example
  • Error Handling
  • Architecture
  • Development
  • License

Engine

OpenLLM's inference engine (openllm-server) is a Rust application that serves inference requests efficiently and streams tokens over HTTP server-sent events (SSE) endpoints.

Install via cargo:

cargo install openllm-server

Run it with:

openllm-server

or with a custom port:

openllm-server --port 9242
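
The client's engine option should match the port the server was started on. A minimal sketch, assuming the TypeScript client described below and an engine running on port 9242:

import { createOpenLLMClient } from "@use-solace/openllm";

// Point the client at the engine started above (port 9242).
const client = createOpenLLMClient({ engine: 9242 });

// Confirm the engine is reachable before sending inference requests.
const health = await client.health();
console.log("Engine status:", health.status);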

Model Registry

The Model Registry is provided by the @use-solace/openllm package.

Install via npm (or your preferred JS package manager):

npm install @use-solace/openllm@latest      # npm
bun add @use-solace/openllm@latest          # bun (recommended)
pnpm add @use-solace/openllm@latest         # pnpm
deno add @use-solace/openllm@latest         # deno

Define your model registry:

import { ModelRegistry } from "@use-solace/openllm";

export const models = ModelRegistry(
  {
    "llama3.1-70b": {
      inference: "ollama",
      id: "llama3.1:70b",
      context: 8192,
      quant: "Q4_K_M",
      capabilities: ["chat"],      // chat, vision, embedding, completion
      latency: "slow",             // slow, fast, extreme
    },

    "llama3.1-8b": {
      inference: "llama",
      id: "llama3.1:8b",
      context: 8192,
      quant: "Q4_K_M",
      capabilities: ["chat", "completion"],
      latency: "fast",
    },
  }
);

// Usage examples
models.list();                   // returns all models
models.get("llama3.1-70b");      // returns the entry registered as "llama3.1-70b"
models.find({ capability: "chat", latency: "fast" });   // returns models matching both filters

Notes:

  • inference refers to which backend the model uses (llama or ollama).
  • capabilities defines supported tasks.
  • latency is a routing hint.
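
The registry and the client can be combined for simple routing in application code. A minimal sketch, assuming get() returns the entry as defined above and that an entry's id is what the engine expects as model_id:

import { createOpenLLMClient } from "@use-solace/openllm";
import { models } from "./registry.ts";

const client = createOpenLLMClient({ engine: 8080 });

// Pick the low-latency chat model from the registry.
const fast = models.get("llama3.1-8b");

const result = await client.inference({
  model_id: fast.id,           // "llama3.1:8b"
  prompt: "Give me a one-line status update.",
});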

API

OpenLLM provides an HTTP API built on Elysia through the @use-solace/openllm package.

import { openllm } from "@use-solace/openllm";
import { models } from "./registry.ts";

const api = openllm.start({
    modelrouter: true, // enable model router
    registry: models,  // model registry
    engine: 4292       // openllm-server port
});

api.start(2921); // runs API server on localhost:2921

API as an Elysia Plugin

import { Elysia } from "elysia";
import { openllm } from "@use-solace/openllm/elysia";
import { models } from "./registry.ts";

const app = new Elysia().use(openllm({
    prefix: "ollm",     // routes will be under /ollm/* instead of /openllm/*
    modelrouter: true,  // enable model router
    registry: models,   // model registry
    engine: 4292        // openllm-server port
}));

app.listen(3000);

console.log("Server running on localhost:3000");

API Endpoints

Method   Endpoint                  Description
GET      /health                   Health check and status
GET      /v1/models                List all registered models
POST     /v1/models/register       Register a new model
POST     /v1/models/load           Load a model into memory
POST     /v1/models/unload/:id     Unload a model
POST     /v1/inference             Non-streaming inference
POST     /v1/inference/stream      Streaming inference (SSE)
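
Any HTTP client can call these endpoints directly. A minimal sketch against the non-streaming inference route, assuming the request body mirrors the client's inference() options, no route prefix, and the API listening on port 2921 as in the standalone example above:

const response = await fetch("http://localhost:2921/v1/inference", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model_id: "llama3.1:8b",
    prompt: "Hello, how are you?",
    max_tokens: 256,
  }),
});

console.log(await response.json());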

Environment Variables

Configure backend connections via environment variables:

Variable            Default                                Description
OLLAMA_URL          http://localhost:11434                 Ollama API endpoint
LLAMA_CPP_URL       http://localhost:8080                  llama.cpp server endpoint
HUGGINGFACE_URL     https://api-inference.huggingface.co   HuggingFace API endpoint
OPENAI_URL          https://api.openai.com/v1              OpenAI API endpoint
HUGGINGFACE_TOKEN   -                                      HuggingFace API token
OPENAI_API_KEY      -                                      OpenAI API key
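
For example, backend endpoints and credentials can be supplied when launching the engine (a sketch, assuming the engine process reads these variables at startup):

OLLAMA_URL=http://localhost:11434 OPENAI_API_KEY=... openllm-server --port 8080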

Direct Client Usage

Use the TypeScript client directly for programmatic access:

import { createOpenLLMClient } from "@use-solace/openllm";

const client = createOpenLLMClient({ engine: 8080 });

// Health check
const health = await client.health();
console.log("Status:", health.status);

// List models
const { models } = await client.listModels();

// Register a model
await client.registerModel({
  id: "mistral",
  name: "Mistral 7B",
  inference: "ollama",
  context: 8192,
  capabilities: ["chat"],
});

// Load a model
await client.loadModel({ model_id: "mistral" });

// Run inference
const result = await client.inference({
  model_id: "mistral",
  prompt: "What is the meaning of life?",
  max_tokens: 512,
  temperature: 0.7,
});

// Stream inference
await client.inferenceStream(
  {
    model_id: "mistral",
    prompt: "Tell me a story",
    max_tokens: 1024,
  },
  {
    onToken: (token) => process.stdout.write(token.token),
    onComplete: (response) => console.log("\nDone:", response.tokens_generated),
    onError: (error) => console.error("Error:", error.message),
  },
);

// Unload a model
await client.unloadModel("mistral");

Complete Example

import { createOpenLLMClient } from "@use-solace/openllm";

async function main() {
  const client = createOpenLLMClient({ engine: 8080 });

  try {
    // Register
    await client.registerModel({
      id: "mistral",
      name: "Mistral 7B",
      inference: "ollama",
      context: 8192,
      capabilities: ["chat"],
    });

    // Load
    await client.loadModel({ model_id: "mistral" });

    // Stream inference
    let fullResponse = "";
    await client.inferenceStream(
      { model_id: "mistral", prompt: "Hello!", max_tokens: 256 },
      {
        onToken: (token) => {
          process.stdout.write(token.token);
          fullResponse += token.token;
        },
        onComplete: (resp) => console.log("\nTokens:", resp.tokens_generated),
      },
    );

    // Cleanup
    await client.unloadModel("mistral");
  } catch (error) {
    console.error("Error:", error);
  }
}

main();

Error Handling

The package provides typed error classes:

import {
  OpenLLMError,
  ModelNotFoundError,
  ModelNotLoadedError,
  InferenceError,
} from "@use-solace/openllm";

try {
  await client.inference({ model_id: "unknown", prompt: "test" });
} catch (error) {
  if (error instanceof ModelNotFoundError) {
    console.error("Model not found:", error.message);
    console.error("Status code:", error.statusCode);
  } else if (error instanceof ModelNotLoadedError) {
    console.error("Model not loaded:", error.message);
    // Attempt to recover the model id from the error message and load it on demand.
    await client.loadModel({ model_id: error.message.match(/'([^']+)'/)?.[1] ?? "" });
  } else if (error instanceof InferenceError) {
    console.error("Inference failed:", error.message);
  } else if (error instanceof OpenLLMError) {
    console.error("API error:", error.code, error.message);
  }
}

Architecture

OpenLLM consists of three main components:

1. Inference Engine (Rust)

  • Written in Rust for maximum performance
  • Supports SSE streaming for real-time token output
  • Multi-backend support (Ollama, llama.cpp, HuggingFace, OpenAI)
  • Configurable port and logging levels

2. Model Registry (TypeScript)

  • In-memory model catalog with filtering
  • Model routing based on capabilities and latency
  • Full TypeScript type safety

3. API Layer (Elysia)

  • HTTP API with optional model router
  • Easy integration with existing Elysia apps
  • Customizable route prefixes
  • Streaming and non-streaming endpoints

Development

Building the Engine

cd engine
cargo build --release

Building the NPM Package

cd npm
npm install
npm run build

Running Tests

cd engine
cargo test

cd npm
npm test

License

MIT
