
embedeer

Embedeer Logo: a deer with vector numbers between antlers. Logo generated by ChatGPT. Public Domain.

A Node.js Embedding Tool


A Node.js tool for generating text embeddings using transformers.js with ONNX models from Hugging Face.

Supports batched input, parallel execution, quantization, optional GPU acceleration, and Hugging Face auth. Workers can run as isolated child processes (the default), in-process threads, clients of a shared socket daemon (one model across multiple OS processes), or clients of a gRPC server (HTTP/2 + protobuf, remote-ready).


Features

  • Downloads any Hugging Face feature-extraction model on first use (cached in ~/.embedeer/models)
  • Isolated processes (default) — a worker crash cannot bring down the caller
  • In-process threads — opt-in via mode: 'thread' for lower overhead
  • Socket daemon — mode: 'socket' runs one persistent server shared across multiple OS processes; one model copy in RAM regardless of client count
  • gRPC server — mode: 'grpc' exposes the model as a typed HTTP/2 service; works locally or remotely, supports server-streaming for large batches
  • Multi-server load balancing — point a WorkerPool at multiple servers (e.g. 2 GPU + 1 CPU); the idle-worker queue distributes work naturally
  • Model idle offload — servers optionally release GPU/CPU memory after inactivity (--idle-timeout) and reload on next request
  • Sequential execution when concurrency: 1
  • Configurable batch size and concurrency
  • GPU acceleration — optional CUDA (Linux x64) and DirectML (Windows x64), no extra packages needed
  • Hugging Face API token support (--token / HF_TOKEN env var)
  • Quantization via dtype (fp32 · fp16 · q8 · q4 · q4f16 · auto)
  • Rich CLI: pull model, embed from file, dump output as JSON / TXT / SQL

How it works

embed(texts)
  │
  ├─ split into batches of batchSize
  │
  └─ Promise.all(batches) ──► WorkerPool
                                 │
                                 ├─ [process mode] ChildProcessWorker 0  → own model copy
                                 ├─ [process mode] ChildProcessWorker 1  → own model copy
                                 │
                                 ├─ [thread mode]  ThreadWorker 0        → own model copy
                                 │
                                 ├─ [socket mode]  SocketWorker 0  ──┐
                                 ├─ [socket mode]  SocketWorker 1  ──┼──► socket-model-server (one shared model)
                                 │                                   │    also connectable from other OS processes
                                 │
                                 └─ [grpc mode]    GrpcWorker 0  ──┐
                                    [grpc mode]    GrpcWorker 1  ──┼──► grpc-model-server (one shared model)
                                                                   │    works locally or over the network

In process and thread modes, each worker loads its own model copy — N workers means N models in memory. In socket and grpc modes, one server process holds the model and all workers are lightweight client connections to it.
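
The batching step at the top of the diagram can be sketched as follows (a minimal illustration, not the library's internal code):

```javascript
// Split the input texts into chunks of batchSize; each chunk becomes
// one task dispatched to the WorkerPool via Promise.all.
function toBatches(texts, batchSize) {
  const batches = [];
  for (let i = 0; i < texts.length; i += batchSize) {
    batches.push(texts.slice(i, i + batchSize));
  }
  return batches;
}

// 5 texts with batchSize 2 → three batches of 2 + 2 + 1
console.log(toBatches(['a', 'b', 'c', 'd', 'e'], 2));
```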


Installation
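
The examples below import from @jsilvanus/embedeer, so installation is the usual npm add (assuming the published package name shown in the CLI examples):

```shell
npm install @jsilvanus/embedeer
```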

TypeScript: The package includes TypeScript declarations so imports are typed automatically.

GPU acceleration (CUDA on Linux x64, DirectML on Windows x64) is built into onnxruntime-node, which ships as a transitive dependency; no additional packages are required. For CUDA on Linux x64 you also need the CUDA 12 system libraries: sudo apt install cuda-toolkit-12-6 libcudnn9-cuda-12

gRPC: @grpc/grpc-js and @grpc/proto-loader are listed as optionalDependencies — installed by default, but can be skipped with --omit=optional during npm install. They are loaded only when mode: 'grpc' is actually used (lazy import at runtime).

Using the package

Embed texts

import { Embedder } from '@jsilvanus/embedeer';

// The default is a CPU embedder
const embedder = await Embedder.create('Xenova/all-MiniLM-L6-v2', {
  batchSize:   32,          // texts per worker task   (default: 32)
  concurrency: 2,           // parallel workers        (default: 2)
  mode:       'process',    // 'process' | 'thread'    (default: 'process')
  pooling:    'mean',       // 'mean' | 'cls' | 'none' (default: 'mean')
  normalize:   true,        // L2-normalise vectors    (default: true)
  token:      'hf_...',     // HF API token (optional; also reads HF_TOKEN env)
  dtype:      'q8',         // quantization dtype      (optional)
  cacheDir:   '/my/cache',  // override model cache    (default: ~/.embedeer/models)
});

// OR: Auto-detect GPU (falls back to CPU if no provider is installed)
const embedder = await Embedder.create('Xenova/all-MiniLM-L6-v2', {
  device: 'auto',
});

// OR: Require GPU (throws if no provider is available)
const embedder = await Embedder.create('Xenova/all-MiniLM-L6-v2', {
  device: 'gpu',
});

// OR: Explicitly select an execution provider
const embedder = await Embedder.create('Xenova/all-MiniLM-L6-v2', {
  provider: 'cuda',  // 'cuda' | 'dml'
});

// .embed is the way to get the embeddings.
const vectors = await embedder.embed(['Hello world', 'Foo bar baz']);
// → number[][]  (one 384-dim vector per text for all-MiniLM-L6-v2)

await embedder.destroy(); // shut down worker processes
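
With normalize: true (the default), the returned vectors are unit length, so cosine similarity between two texts reduces to a plain dot product. A self-contained sketch (the vectors here are made-up 3-dim stand-ins, not real 384-dim model output):

```javascript
// For L2-normalised embeddings, cosine similarity is just the dot product.
function dot(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Hypothetical unit vectors standing in for real embeddings
console.log(dot([0.6, 0.8, 0], [0.8, 0.6, 0])); // ≈ 0.96, i.e. very similar
```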

Socket daemon mode

Run one persistent model server shared across multiple OS processes. Any process that knows the socket path can connect — useful when several services on the same machine all need embeddings.

# Start the daemon — default model (Xenova/all-MiniLM-L6-v2), CPU, no idle offload
npm run daemon

# Pass arguments after --
npm run daemon -- --model nomic-ai/nomic-embed-text-v1
Argument                      Default                            Description
--model                       Xenova/all-MiniLM-L6-v2            Hugging Face model identifier
--socket                      auto (/tmp/embedeer-<model>.sock)  Unix socket path
--pooling                     mean                               mean | cls | none
--normalize / --no-normalize  enabled                            L2-normalise output vectors
--dtype                                                          Quantization: fp32 | fp16 | q8 | q4 | q4f16 | auto
--device                      cpu                                cpu | gpu | auto
--provider                                                       cuda | dml
--token                                                          Hugging Face API token (also reads HF_TOKEN)
--cache-dir                   ~/.embedeer/models                 Model cache directory
--idle-timeout                                                   Offload model after N ms of inactivity; reload on next request
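
For example, to serve a quantized model that releases memory after five minutes of inactivity (combining flags from the table above):

```shell
npm run daemon -- --dtype q8 --idle-timeout 300000
```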

Connect from any number of processes:

// In process A (web server) and process B (background worker) — same API
const embedder = await Embedder.create('Xenova/all-MiniLM-L6-v2', {
  mode: 'socket',
  socketPath: '/tmp/emb.sock',
  autoStartServer: false,   // daemon is already running
});
const vectors = await embedder.embed(['Hello world']);
await embedder.destroy();   // closes connection; daemon keeps running

One model in RAM serves all connected processes. autoStartServer: true (the default) spawns the server automatically and shuts it down with embedder.destroy().

gRPC server mode

Run the model as a gRPC service (HTTP/2 + Protocol Buffers). Works locally or over a network.

# Start the server — default model (Xenova/all-MiniLM-L6-v2), localhost:50051
npm run server

# Pass arguments after --
npm run server -- --address 0.0.0.0:50051        # listen on all interfaces
Argument                      Default                   Description
--model                       Xenova/all-MiniLM-L6-v2   Hugging Face model identifier
--address                     localhost:50051           Bind address (host:port)
--pooling                     mean                      mean | cls | none
--normalize / --no-normalize  enabled                   L2-normalise output vectors
--dtype                                                 Quantization: fp32 | fp16 | q8 | q4 | q4f16 | auto
--device                      cpu                       cpu | gpu | auto
--provider                                              cuda | dml
--token                                                 Hugging Face API token (also reads HF_TOKEN)
--cache-dir                   ~/.embedeer/models        Model cache directory
--idle-timeout                                          Offload model after N ms of inactivity; reload on next request

Connect from Node.js using this package's client:

// Auto-start a local server (shut down together with this process), then send it data
const embedder = await Embedder.create('Xenova/all-MiniLM-L6-v2', {
  mode: 'grpc',
  grpcAddress: 'localhost:50051',
});

// Or connect this client to an already-running remote server
const embedder = await Embedder.create('Xenova/all-MiniLM-L6-v2', {
  mode: 'grpc',
  grpcAddress: '10.0.1.42:50051',
  autoStartServer: false,
});

const vectors = await embedder.embed(['Hello world']);
await embedder.destroy();

Multi-server load balancing

Point a WorkerPool at multiple servers. The idle-worker queue acts as a natural load balancer — workers on faster servers (GPU) finish sooner and pick up proportionally more tasks.

const embedder = await Embedder.create('Xenova/all-MiniLM-L6-v2', {
  mode: 'grpc',               // or 'socket'
  dtype: 'fp16',              // uniform across all servers
  servers: [
    { address: 'localhost:50051', workers: 6, device: 'cuda', provider: 'cuda' },
    { address: 'localhost:50052', workers: 6, device: 'cuda', provider: 'cuda' },
    { address: 'localhost:50053', workers: 2, device: 'cpu' },
  ],
  autoStartServer: true,
});
// 14 workers total; GPU servers receive ~5× more requests than the CPU server

With autoStartServer: true, the pool spawns those three servers itself and then connects worker clients to them; all three run locally and are tied to this process's lifetime.
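
The balancing behavior of the idle-worker queue can be illustrated with a toy simulation (hypothetical speeds, not the actual WorkerPool code):

```javascript
// Toy model of the idle-worker queue: each task goes to whichever worker
// frees up first, so faster workers complete proportionally more tasks.
function distribute(speeds, nTasks) {
  const freeAt = speeds.map(() => 0);  // time at which each worker is next idle
  const counts = speeds.map(() => 0);  // tasks completed per worker
  for (let t = 0; t < nTasks; t++) {
    const w = freeAt.indexOf(Math.min(...freeAt));  // first idle worker
    freeAt[w] += 1 / speeds[w];                     // a task takes 1/speed time units
    counts[w] += 1;
  }
  return counts;
}

// Two workers 5× faster than the third (e.g. GPU vs CPU), 110 tasks:
console.log(distribute([5, 5, 1], 110)); // roughly [50, 50, 10]
```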


Model management

Model compatibility (ONNX)

Embedeer runs models via onnxruntime-node. Models chosen from Hugging Face must provide an ONNX export compatible with ONNX Runtime, or be convertible to ONNX (see Optimum).

Pulling/pre-caching models

Embedeer supports pre-caching and managing downloaded models.

  • Pull (pre-cache) a model via the CLI: npx @jsilvanus/embedeer --model Xenova/all-MiniLM-L6-v2
  • Programmatic pre-cache using loadModel()
  • Cache location: default is ~/.embedeer/models. Override with the CLI --cache-dir option or the cacheDir argument to loadModel().

Local models

  • Use a local model path directly (no copying): npx @jsilvanus/embedeer --use-local /path/to/local-model --data "Hello world"
  • Copy a local model into the cache and give it a stable name: npx @jsilvanus/embedeer --load-local /path/to/local-model --name my-local-model
  • Use the local model after copying: npx @jsilvanus/embedeer --model my-local-model or npx @jsilvanus/embedeer --model ~/.embedeer/models/my-local-model

Programmatic helpers

The public API also exports some helper functions:

  • importLocalModel(src, { name?, cacheDir? }) — copy a local model into the cache and return { modelName, path }.
  • await Embedder.create('/path/to/local-model', { cacheDir: '/my/cache' }) — use a model from a custom path.
  • getCacheDir() — return the resolved cache directory used by embedeer (useful when you want to manage files yourself).
  • isModelDownloaded(name) / listModels() / getCachedModels() — inspect the cache.
  • getLoadedModels() — return an array of model names currently loaded by active worker pools.
  • deleteModel(modelName, { cacheDir? }) — remove cached model directories matching modelName.

These functions are exported from the public package entry (src/index.js), so you can import them from @jsilvanus/embedeer.


CLI

Full command-line documentation moved to CLI.md.

GPU Acceleration

GPU support is built into onnxruntime-node (a dependency of @huggingface/transformers):

Platform     Provider  Requirement
Linux x64    CUDA      NVIDIA GPU + driver ≥ 525, CUDA 12 toolkit, cuDNN 9
Windows x64  DirectML  Any DirectX 12 GPU (most GPUs since 2016), Windows 10+

Provider selection logic

device         provider  Behavior
cpu (default)            Always CPU
auto                     Try GPU providers for the platform in order; silent CPU fallback
gpu                      Try GPU providers; throw if none available
any            cuda      Load CUDA provider; throw if not available or not supported
any            dml       Load DirectML provider; throw if not available or not supported
any            cpu       Always CPU

On Linux x64: GPU order is cuda.
On Windows x64: GPU order is cuda → dml.


Testing

CI is enabled via GitHub Actions (.github/workflows/ci.yml) which runs tests and collects coverage on push and pull requests.


Performance Optimizations

How to tune performance?

Embedeer exposes runtime knobs and helper scripts to tune throughput for your host.

  • Pre-load models: run Embedder.loadModel(model, { dtype, cacheDir }) or use the bench scripts so workers start instantly without re-downloading models.
  • Reuse Embedder instances: create a single Embedder and call embed() repeatedly instead of creating and destroying instances per batch.
  • Batch size vs concurrency:
    • CPU: moderate batch sizes (16–64) with multiple workers (concurrency ≥ 2) usually give best throughput.
    • GPU: larger batches (64–256) with low concurrency (1–2) are typically fastest.
  • BLAS threading: avoid oversubscription by setting OMP_NUM_THREADS and MKL_NUM_THREADS to Math.floor(cpu_cores / concurrency) before starting workers.
  • Device/provider: use cuda on Linux and dml (DirectML) on Windows when available; device: 'auto' will try providers and fall back to CPU.

Automatic performance tuning

  • Automatic tuning: use bench/grid-search.js to sweep batchSize, concurrency, and dtype for your host and save results. You can generate and persist a per-user profile and apply it automatically via the Embedder APIs.

Examples:

# CPU quick grid
node bench/grid-search.js --device cpu --sample-size 200 --out bench/grid-results-cpu.json

# GPU quick grid
node bench/grid-search.js --device gpu --sample-size 100 --out bench/grid-results-gpu.json

Programmatic performance tuning

You can generate and save a per-user performance profile which Embedder.create() will automatically apply. This is useful to pick the best batchSize / concurrency for your machine without manual tuning.

import { Embedder } from '@jsilvanus/embedeer';

// Quick profile generation (writes ~/.embedeer/perf-profile.json)
await Embedder.generateAndSaveProfile({ mode: 'quick', device: 'cpu', sampleSize: 100 });
// Subsequent calls to Embedder.create() will auto-apply the saved profile by default.

Server mode benchmark

Compare socket and gRPC server throughput against the process/thread baseline:

npm run server-bench
# or with options:
node bench/server-bench.js --model Xenova/all-MiniLM-L6-v2 --batch-size 32 --sample-size 500

The benchmark starts each server as a subprocess, waits for it to load the model, runs embeddings, then shuts it down. It reports startup time (spawn → ready) separately from embedding throughput, so you can see the fixed cost of model loading versus steady-state performance.

Options:
  --model       <name>   HF model identifier  (default: Xenova/all-MiniLM-L6-v2)
  --batch-size  <n>      Texts per request     (default: 32)
  --dtype       <type>   Quantization dtype    (default: none)
  --sample-size <n>      Number of texts       (default: 200)
  --skip-socket          Skip socket runner
  --skip-grpc            Skip gRPC runner
  --skip-baseline        Skip process/thread baseline

License

MIT
