Llamaj.cpp is a Java (JVM) port of llama.cpp built with jextract. It runs large language model (LLM) inference locally by calling the native library through Java's Foreign Function & Memory (FFM) API. It natively supports macOS M-series and Linux x86_64 with GPU acceleration; other platforms and hardware (Windows, ARM, CUDA, etc.) can be supported through custom builds.
llama.cpp · java · jvm · llm · large language models · inference · ai · native interop · foreign function & memory api · jextract
- Java 25
- mvn
- macOS M-series / Linux x86_64 (CPU) (see the last section if your platform is not listed)
Include the dependency in your pom.xml:
<dependencies>
...
<dependency>
<groupId>io.gravitee.llama.cpp</groupId>
<artifactId>llamaj.cpp</artifactId>
<version>x.x.x</version>
</dependency>
</dependencies>
Note: All examples below use LlamaVocab to handle tokenization. It is obtained from a loaded LlamaModel and is essential for converting between tokens and text representations.
import io.gravitee.llama.cpp.*;
import java.lang.foreign.Arena;
import java.nio.file.Path;
public class BasicExample {
public static void main(String[] args) {
var arena = Arena.ofConfined();
// Initialize runtime
LlamaRuntime.llama_backend_init();
// Load model
var modelParams = new LlamaModelParams(arena);
var model = new LlamaModel(arena, Path.of("models/model.gguf"), modelParams);
// Create context
var contextParams = new LlamaContextParams(arena).nCtx(2048).nBatch(512);
var context = new LlamaContext(model, contextParams);
// Set up tokenizer and sampler
var vocab = new LlamaVocab(model);
var tokenizer = new LlamaTokenizer(vocab, context);
var sampler = new LlamaSampler(arena)
.temperature(0.7f)
.topK(40)
.topP(0.9f, 1)
.seed(42);
// Create conversation state
var state = ConversationState.create(arena, context, tokenizer, sampler, 0)
.setMaxTokens(100)
.initialize("What is the capital of France?");
// Generate response
var iterator = new DefaultLlamaIterator(state);
while (iterator.hasNext()) {
var output = iterator.next();
System.out.print(output.text());
}
// Cleanup
context.free();
sampler.free();
model.free();
LlamaRuntime.llama_backend_free();
}
}
Enable log-probability collection to inspect the model's confidence at each token position.
Set topLogprobs to the number of top-alternative tokens you want alongside the sampled one (0 = disabled, no overhead):
import io.gravitee.llama.cpp.*;
import java.lang.foreign.Arena;
import java.nio.file.Path;
public class LogprobsExample {
public static void main(String[] args) {
var arena = Arena.ofConfined();
LlamaRuntime.llama_backend_init();
var model = new LlamaModel(arena, Path.of("models/model.gguf"), new LlamaModelParams(arena));
var contextParams = new LlamaContextParams(arena).nCtx(2048).nBatch(512);
var context = new LlamaContext(arena, model, contextParams);
var vocab = new LlamaVocab(model);
var tokenizer = new LlamaTokenizer(vocab, context);
var sampler = new LlamaSampler(arena).temperature(0.7f).seed(42);
var state = ConversationState.create(arena, context, tokenizer, sampler)
.setMaxTokens(50)
.setTopLogprobs(5) // return top-5 alternatives at every token position
.initialize("What is the capital of France?");
var iterator = new DefaultLlamaIterator(state);
while (iterator.hasNext()) {
var output = iterator.next();
System.out.print(output.text());
Logprobs lp = output.logprobs();
System.out.printf("%n chosen: \"%s\" logprob=%.4f%n",
lp.chosenToken().token(), lp.chosenToken().logprob());
lp.topLogprobs().forEach(t ->
System.out.printf(" alt: \"%s\" logprob=%.4f%n", t.token(), t.logprob()));
}
context.free();
sampler.free();
model.free();
LlamaRuntime.llama_backend_free();
}
}
Each LlamaOutput carries a Logprobs object with:
- chosenToken(): the token that was sampled, with its text, vocabulary ID, log-probability, and raw UTF-8 bytes
- topLogprobs(): up to N alternatives sorted by descending log-probability; the chosen token is always included
When topLogprobs is 0 (the default), output.logprobs() is null and no logit processing is done.
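Log-probabilities are often easier to read once converted. The sketch below is an illustration only and is not part of llamaj.cpp: it assumes the reported values are natural logarithms and uses plain doubles in place of the library's Logprobs type. LogprobMath and its methods are hypothetical names.

```java
import java.util.List;

class LogprobMath {

    /** Converts a natural-log probability into a plain probability in [0, 1]. */
    static double toProbability(double logprob) {
        return Math.exp(logprob);
    }

    /** Perplexity of a sequence: exp of the negative mean log-probability. */
    static double perplexity(List<Double> chosenLogprobs) {
        if (chosenLogprobs.isEmpty()) {
            throw new IllegalArgumentException("at least one token is required");
        }
        double sum = 0;
        for (double lp : chosenLogprobs) {
            sum += lp;
        }
        return Math.exp(-sum / chosenLogprobs.size());
    }

    public static void main(String[] args) {
        // Hypothetical values standing in for lp.chosenToken().logprob() at each step.
        List<Double> logprobs = List.of(Math.log(0.5), Math.log(0.25), Math.log(0.5));
        for (double lp : logprobs) {
            System.out.printf("logprob=%.4f p=%.4f%n", lp, toProbability(lp));
        }
        System.out.printf("perplexity=%.4f%n", perplexity(logprobs));
    }
}
```

A low perplexity means the model was, on average, confident in the tokens it sampled; if the library turns out to report a different log base, only the two Math.exp calls need to change.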
Process multiple conversations simultaneously in a single batch:
import io.gravitee.llama.cpp.*;
import java.lang.foreign.Arena;
import java.nio.file.Path;
public class ParallelExample {
public static void main(String[] args) {
var arena = Arena.ofConfined();
// Initialize runtime
LlamaRuntime.llama_backend_init();
// Load model
var modelParams = new LlamaModelParams(arena);
var model = new LlamaModel(arena, Path.of("models/model.gguf"), modelParams);
// Create context with multi-sequence support
var contextParams = new LlamaContextParams(arena)
.nCtx(2048)
.nBatch(512)
.nSeqMax(4); // Support up to 4 parallel conversations
var context = new LlamaContext(model, contextParams);
// Set up shared tokenizer and sampler
var vocab = new LlamaVocab(model);
var tokenizer = new LlamaTokenizer(vocab, context);
var sampler = new LlamaSampler(arena).temperature(0.7f).seed(42);
// Create multiple conversation states with unique sequence IDs
var state1 = ConversationState.create(arena, context, tokenizer, sampler, 0)
.setMaxTokens(100).initialize("What is the capital of France?");
var state2 = ConversationState.create(arena, context, tokenizer, sampler, 1)
.setMaxTokens(100).initialize("What is the capital of England?");
var state3 = ConversationState.create(arena, context, tokenizer, sampler, 2)
.setMaxTokens(100).initialize("What is the capital of Poland?");
// Create parallel iterator - prompts are auto-processed when states are added
var parallel = new BatchIterator(arena, context, 512, 4)
.addState(state1)
.addState(state2)
.addState(state3);
// Generate tokens in parallel
System.out.println("=== Parallel Generation ===");
while (parallel.hasNext()) {
// Each hasNext() generates tokens for all active conversations
// Get all outputs from this batch (one per active conversation)
var outputs = parallel.getOutputs();
for (var output : outputs) {
System.out.println("Seq " + output.sequenceId() + ": " + output.text());
}
}
System.out.println();
// Print results
System.out.println("Conversation 1: " + state1.getAnswer());
System.out.println(" Tokens: " + state1.getAnswerTokens());
System.out.println("Conversation 2: " + state2.getAnswer());
System.out.println(" Tokens: " + state2.getAnswerTokens());
System.out.println("Conversation 3: " + state3.getAnswer());
System.out.println(" Tokens: " + state3.getAnswerTokens());
// Cleanup
parallel.free();
context.free();
sampler.free();
model.free();
LlamaRuntime.llama_backend_free();
}
}
Offload model weights and KV-cache to remote machines using the RPC backend.
When using --rpc, weights are loaded exclusively on the remote servers -- the local GPU is not used.
Start RPC server nodes first (see containers/README.md):
# On the remote machine (or another terminal)
./scripts/start-rpc-server.sh
Then connect from Java:
import io.gravitee.llama.cpp.*;
import io.gravitee.llama.cpp.nativelib.LlamaLibLoader;
import java.lang.foreign.Arena;
import java.nio.file.Path;
public class RpcExample {
public static void main(String[] args) {
var arena = Arena.ofConfined();
// Initialize runtime
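String libPath = LlamaLibLoader.load();
LlamaRuntime.llama_backend_init();
// Load the ggml backends from the native library path, as the other examples do
LlamaRuntime.ggml_backend_load_all_from_path(arena, libPath);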
// Register remote RPC servers -- returns their device handles
var rpcDevices = BackendRegistry.addRpcServer(arena, "127.0.0.1:50052");
// Print all discovered backends and devices
BackendRegistry.printSummary();
// Load model, restricting offloading to only the RPC devices
var modelParams = new LlamaModelParams(arena)
.devices(arena, rpcDevices)
.nGpuLayers(999);
var model = new LlamaModel(arena, Path.of("models/model.gguf"), modelParams);
// Everything else works exactly the same as local inference
var contextParams = new LlamaContextParams(arena).nCtx(2048).nBatch(512);
var context = new LlamaContext(model, contextParams);
var vocab = new LlamaVocab(model);
var tokenizer = new LlamaTokenizer(vocab, context);
var sampler = new LlamaSampler(arena).temperature(0.7f).seed(42);
var state = ConversationState.create(arena, context, tokenizer, sampler, 0)
.setMaxTokens(100)
.initialize("What is the capital of France?");
var iterator = new DefaultLlamaIterator(state);
while (iterator.hasNext()) {
System.out.print(iterator.next().text());
}
context.free();
sampler.free();
model.free();
LlamaRuntime.llama_backend_free();
}
}
Or from the CLI:
$ java --enable-preview --enable-native-access=ALL-UNNAMED \
-jar llamaj.cpp-<version>.jar \
--model models/model.gguf \
--rpc 127.0.0.1:50052
Multiple RPC servers:
$ java --enable-preview --enable-native-access=ALL-UNNAMED \
-jar llamaj.cpp-<version>.jar \
--model models/model.gguf \
--rpc 192.168.1.10:50052,192.168.1.11:50052
LlamaReranker is a cross-encoder wrapper that scores how relevant a document is to a
query. Pooling and attention are auto-detected from the GGUF architecture, so
Options.defaults() works for most reranker models. Use score for a single pair and
scoreAll to rank a batch of documents against one query.
import io.gravitee.llama.cpp.*;
import io.gravitee.llama.cpp.nativelib.LlamaLibLoader;
import java.lang.foreign.Arena;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
public class RerankExample {
public static void main(String[] args) {
var arena = Arena.ofConfined();
String libPath = LlamaLibLoader.load();
LlamaRuntime.llama_backend_init();
LlamaRuntime.ggml_backend_load_all_from_path(arena, libPath);
var model = new LlamaModel(arena, Path.of("models/reranker.gguf"), new LlamaModelParams(arena));
var reranker = new LlamaReranker(arena, model, LlamaReranker.Options.defaults());
String query = "What is the capital of France?";
List<String> documents = List.of(
"Paris is the capital and most populous city of France.",
"The Eiffel Tower is a landmark in Paris.",
"Bananas are a good source of potassium."
);
// One raw score array per document, in input order.
List<float[]> scores = reranker.scoreAll(query, documents);
// Rank documents by their first logit, highest first.
java.util.stream.IntStream.range(0, documents.size())
.boxed()
.sorted(Comparator.comparingDouble((Integer i) -> scores.get(i)[0]).reversed())
.forEach(i -> System.out.printf("%.4f %s%n", scores.get(i)[0], documents.get(i)));
reranker.close(); // frees only the context; the model is owned by the caller
model.free();
LlamaRuntime.llama_backend_free();
arena.close();
}
}
score(query, document) returns a float[] whose size is reranker.nClsOut(), typically
1 for BERT-style cross-encoders and 2 for chat-style rerankers. For models that need a
structured prompt, supply a RerankTemplate (a (query, document) -> String lambda) via
Options.defaults().withTemplate(...); the default is RerankTemplate.PLAIN
(query + " " + document).
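To make the template idea concrete, here is a minimal standalone sketch. It uses java.util.function.BiFunction as a stand-in for RerankTemplate so that it runs without the library; TemplateSketch and LABELLED are hypothetical names, and only the PLAIN behaviour (query + " " + document) comes from the description above.

```java
import java.util.function.BiFunction;

class TemplateSketch {

    // Stand-in for RerankTemplate.PLAIN: the pair is joined with a single space.
    static final BiFunction<String, String, String> PLAIN =
        (query, document) -> query + " " + document;

    // Hypothetical structured template that labels each part of the pair.
    static final BiFunction<String, String, String> LABELLED =
        (query, document) -> "Query: " + query + "\nDocument: " + document;

    public static void main(String[] args) {
        String query = "What is the capital of France?";
        String document = "Paris is the capital and most populous city of France.";
        System.out.println(PLAIN.apply(query, document));
        System.out.println("---");
        System.out.println(LABELLED.apply(query, document));
    }
}
```

Because a template is just a function of the pair, the same lambda shape should be usable wherever a RerankTemplate is expected, assuming it is a functional interface as the lambda description suggests.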
Chat-style rerankers like Qwen3-Reranker expect the (query, document) pair wrapped in a
system/user prompt that asks the model to judge relevance. Provide that format as a
RerankTemplate. Unlike a BERT cross-encoder (which emits a single raw logit), Qwen3-Reranker
has a 2-class classifier head: nClsOut() == 2 and score returns a softmax float[2] where
index 0 is P(relevant). That probability is still a perfectly good ranking score — sort by
it descending to rerank a candidate set.
import io.gravitee.llama.cpp.*;
import io.gravitee.llama.cpp.nativelib.LlamaLibLoader;
import java.lang.foreign.Arena;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.IntStream;
public class Qwen3RerankExample {
// Wraps the pair in Qwen3's chat format, instructing the model to answer yes/no.
static final RerankTemplate QWEN3_TEMPLATE = (query, document) ->
"<|im_start|>system\n" +
"Judge whether the Document is relevant to the Query. Answer 'yes' or 'no'." +
"<|im_end|>\n" +
"<|im_start|>user\n" +
"Query: " + query + "\n" +
"Document: " + document + "\n" +
"Relevant:<|im_end|>\n";
public static void main(String[] args) {
var arena = Arena.ofConfined();
String libPath = LlamaLibLoader.load();
LlamaRuntime.llama_backend_init();
LlamaRuntime.ggml_backend_load_all_from_path(arena, libPath);
var model = new LlamaModel(arena, Path.of("models/reranker.gguf"), new LlamaModelParams(arena));
var reranker = new LlamaReranker(
arena,
model,
LlamaReranker.Options.defaults().withTemplate(QWEN3_TEMPLATE)
);
String query = "What is the capital of France?";
List<String> documents = List.of(
"Paris is the capital and most populous city of France.",
"The Eiffel Tower is a landmark in Paris.",
"Bananas are a good source of potassium."
);
List<float[]> scores = reranker.scoreAll(query, documents);
// Rank by P(relevant) = score[0], highest first.
IntStream.range(0, documents.size())
.boxed()
.sorted(Comparator.comparingDouble((Integer i) -> scores.get(i)[0]).reversed())
.forEach(i -> System.out.printf("P(relevant)=%.4f %s%n", scores.get(i)[0], documents.get(i)));
reranker.close();
model.free();
LlamaRuntime.llama_backend_free();
arena.close();
}
}
Classifier probabilities vs. raw scores. Qwen3-Reranker's score[0] is a softmax probability in [0, 1]. BERT cross-encoders (nClsOut() == 1, Example 5) instead return a single unbounded logit: larger means more relevant, but it is not a probability. Ranking works the same way for both (sort descending); apply sigmoid(logit) only if you specifically need a [0, 1] score from a single-logit model.
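For single-logit models, the sigmoid mapping mentioned above can be sketched as follows. This is a standalone illustration with made-up logits, not library code; SigmoidSketch is a hypothetical name.

```java
class SigmoidSketch {

    /** Maps an unbounded logit to a value in [0, 1]. */
    static double sigmoid(double logit) {
        return 1.0 / (1.0 + Math.exp(-logit));
    }

    public static void main(String[] args) {
        // Hypothetical raw scores, e.g. score(query, document)[0] from a BERT-style reranker.
        double[] logits = { 4.2, 0.0, -1.3 };
        for (double logit : logits) {
            System.out.printf("logit=%.2f sigmoid=%.4f%n", logit, sigmoid(logit));
        }
    }
}
```

Sigmoid is strictly increasing, so sorting by the raw logit and sorting by sigmoid(logit) give the same order; the mapping only matters when a bounded score is needed, for example for thresholding.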
LlamaEmbedder turns text into a dense vector. Pooling and attention are auto-detected from
the GGUF architecture (CLS/NON_CAUSAL for encoders, LAST/CAUSAL for decoders), so
Options.defaults() covers most models. Use embed for a single string or embedAll to
batch many texts through a single decode. Vectors are returned un-normalised — L2-normalise
them yourself before computing cosine similarity.
import io.gravitee.llama.cpp.*;
import io.gravitee.llama.cpp.nativelib.LlamaLibLoader;
import java.lang.foreign.Arena;
import java.nio.file.Path;
import java.util.List;
public class EmbeddingExample {
static float[] l2normalize(float[] v) {
double sum = 0;
for (float f : v) sum += (double) f * f;
double norm = Math.sqrt(sum);
if (norm > 1e-9) for (int i = 0; i < v.length; i++) v[i] = (float) (v[i] / norm);
return v;
}
static double cosine(float[] a, float[] b) {
double dot = 0;
for (int i = 0; i < a.length; i++) dot += (double) a[i] * b[i];
return dot; // inputs are L2-normalised, so the dot product is the cosine
}
public static void main(String[] args) {
var arena = Arena.ofConfined();
String libPath = LlamaLibLoader.load();
LlamaRuntime.llama_backend_init();
LlamaRuntime.ggml_backend_load_all_from_path(arena, libPath);
var model = new LlamaModel(arena, Path.of("models/embedding.gguf"), new LlamaModelParams(arena));
var embedder = new LlamaEmbedder(arena, model, LlamaEmbedder.Options.defaults());
System.out.println("embedding dimension: " + embedder.nEmbdOut());
List<float[]> embeddings = embedder.embedAll(List.of(
"The capital of France is Paris.",
"Paris is France's largest city.",
"Bananas are a good source of potassium."
));
embeddings.forEach(EmbeddingExample::l2normalize);
System.out.printf("similar pair: %.4f%n", cosine(embeddings.get(0), embeddings.get(1)));
System.out.printf("unrelated pair: %.4f%n", cosine(embeddings.get(0), embeddings.get(2)));
embedder.close();
model.free();
LlamaRuntime.llama_backend_free();
arena.close();
}
}
The build uses a platform-specific Maven profile to download the correct jextract tool and pre-built llama.cpp native libraries, generate the Java FFM bindings, format the code, apply license headers, and install the artifact to your local Maven repository.
macOS (Apple Silicon):
cd llamaj.cpp/
mvn prettier:write license:format clean generate-sources -Pmacosx-aarch64 install
Linux (x86_64):
cd llamaj.cpp/
mvn prettier:write license:format clean generate-sources -Plinux-x86_64 install
On Linux, you also need to set the library path at runtime:
export LD_LIBRARY_PATH="$HOME/.llama.cpp:$LD_LIBRARY_PATH"
$ mvn exec:java -Dexec.mainClass=io.gravitee.llama.cpp.Main \
-Dexec.args="--model /path/to/model/model.gguf --system 'You are a helpful assistant. Answer question to the best of your ability'"
or
$ java --enable-preview -jar llamaj.cpp-<version>.jar \
--model models/model.gguf \
--system 'You are a helpful assistant. Answer question to the best of your ability'
On Linux, don't forget to make your libraries discoverable with the environment variable below:
$ export LD_LIBRARY_PATH="$HOME/.llama.cpp:$LD_LIBRARY_PATH"
There are plenty of models on HuggingFace; we suggest the one here
Usage: java -jar llamaj.cpp-<version>.jar --model <path_to_gguf_model> [options...]
Options:
--system <message> : System message (default: "You are a helpful AI assistant.")
--n_gpu_layers <int> : Number of GPU layers (default: 999)
--use_mlock <boolean> : Use mlock (default: true)
--use_mmap <boolean> : Use mmap (default: true)
--rpc <endpoints> : Comma-separated RPC server endpoints for distributed inference
(e.g., "127.0.0.1:50052,192.168.1.11:50052")
When set, weights are offloaded exclusively to the remote servers
--temperature <float> : Sampler temperature (default: 0.4)
--min_p <float> : Sampler min_p (default: 0.1)
--min_p_window <int> : Sampler min_p_window (default: 40)
--top_k <int> : Sampler top_k (default: 10)
--top_p <float> : Sampler top_p (default: 0.2)
--top_p_window <int> : Sampler top_p_window (default: 10)
--seed <long> : Sampler seed (default: random)
--n_ctx <int> : Context size (default: 512)
--n_batch <int> : Batch size (default: 512)
--n_seq_max <int> : Max sequence length (default: 512)
--quota <int> : Iterator quota (default: 512)
--n_keep <int> : Tokens to keep when exceeding ctx size (default: 256)
--log_level <level> : Logging level (ERROR, WARN, INFO, DEBUG, default: ERROR)
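The --rpc option takes a comma-separated list of host:port endpoints. The sketch below shows one way such a value could be validated before launching; it is an assumption-laden illustration, not the parser used by llamaj.cpp, and RpcEndpoints, Endpoint, and parse are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

class RpcEndpoints {

    /** One remote RPC server, e.g. 127.0.0.1:50052. */
    record Endpoint(String host, int port) {}

    /** Splits "host:port,host:port" into endpoints, rejecting malformed entries. */
    static List<Endpoint> parse(String value) {
        List<Endpoint> endpoints = new ArrayList<>();
        for (String part : value.split(",")) {
            String entry = part.trim();
            int colon = entry.lastIndexOf(':');
            if (colon <= 0 || colon == entry.length() - 1) {
                throw new IllegalArgumentException("expected host:port but got '" + entry + "'");
            }
            int port = Integer.parseInt(entry.substring(colon + 1));
            if (port < 1 || port > 65535) {
                throw new IllegalArgumentException("port out of range in '" + entry + "'");
            }
            endpoints.add(new Endpoint(entry.substring(0, colon), port));
        }
        return endpoints;
    }

    public static void main(String[] args) {
        for (Endpoint e : parse("127.0.0.1:50052,192.168.1.11:50052")) {
            System.out.println(e.host() + " -> " + e.port());
        }
    }
}
```

Failing fast on a malformed endpoint is cheaper than discovering it after the model has started loading.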
- Clone the llama.cpp repository
Make sure the jextract folder sits at the same directory level as your repository.
$ git clone https://github.com/ggml-org/llama.cpp
$ cd llama.cpp
- Compile sources
Make sure the gcc / g++ compilers are installed:
$ gcc --help
$ g++ --help
On Linux:
$ cmake -B build
$ cmake --build build --config Release -j $(nproc)
On macOS:
$ cmake -B build
$ cmake --build build --config Release -j $(sysctl -n hw.ncpu)
If you wish to build llama.cpp with a particular configuration (CUDA, OpenBLAS, AVX2, AVX512, ...), please refer to the llama.cpp documentation.
- Link sources
Set the environment variable LLAMA_CPP_LIB_PATH=/path/to/llama.cpp/build/bin/ to load the shared library files (.so on Linux, .dylib on macOS) directly from that directory.
You can also set LLAMA_CPP_USE_TMP_LIB_PATH=true to copy these files into a temporary folder; the shared libraries are then loaded from that temporary path.
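Before starting the application, it can help to confirm that the directory named by LLAMA_CPP_LIB_PATH really contains shared libraries. The following is a hypothetical pre-flight check, not part of llamaj.cpp; NativeLibCheck and sharedLibraries are illustrative names, and how the library itself resolves the variable is not shown here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Stream;

class NativeLibCheck {

    /** Lists files in dir whose names end with .so or .dylib, sorted by path. */
    static List<Path> sharedLibraries(Path dir) {
        try (Stream<Path> files = Files.list(dir)) {
            return files
                .filter(p -> {
                    String name = p.getFileName().toString();
                    return name.endsWith(".so") || name.endsWith(".dylib");
                })
                .sorted()
                .toList();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String value = System.getenv("LLAMA_CPP_LIB_PATH");
        if (value == null || value.isBlank()) {
            System.out.println("LLAMA_CPP_LIB_PATH is not set");
            return;
        }
        Path dir = Path.of(value);
        if (!Files.isDirectory(dir)) {
            System.out.println("Not a directory: " + dir);
            return;
        }
        List<Path> libs = sharedLibraries(dir);
        System.out.println("Found " + libs.size() + " shared libraries in " + dir);
        libs.forEach(p -> System.out.println("  " + p.getFileName()));
    }
}
```

Note that this simple filter skips versioned names such as libfoo.so.1; widen the match if your build produces them.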
To add support for other platforms (Windows, ARM, CUDA, etc.), follow this approach:
Clone and build llama.cpp for your target platform:
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build --config Release
Download jextract for your platform from the OpenJDK early-access builds, then generate the Java bindings:
# Example for Windows x86_64
jextract -t io.gravitee.llama.cpp.windows.x86_64 \
--include-dir /path/to/llama.cpp/ggml/include \
--include-dir /path/to/llama.cpp/include \
--output src/main/java \
--header-class-name llama_h \
/path/to/llama.cpp/tools/mtmd/mtmd.h \
/path/to/llama.cpp/tools/mtmd/mtmd-helper.h \
/path/to/llama.cpp/include/llama.h \
/path/to/llama.cpp/ggml/include/ggml-rpc.h
Check the generated sources and apply any necessary fixes (e.g., visibility modifiers, fully-qualified method calls).
Compile the generated sources and build a JAR using your own build system (Maven, Gradle, etc.).
Add the generated JAR to your project's classpath and ensure the native libraries from step 1 are available at runtime.
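For context when reviewing generated sources: jextract bindings call native functions through the FFM API's Linker. The sketch below illustrates that mechanism by hand with the C standard library's strlen; it does not touch llama.cpp. FfmStrlenSketch is a hypothetical name, and the code assumes Java 22 or later and a 64-bit platform where size_t fits a Java long. The JVM may print a native-access warning unless --enable-native-access is set.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;

class FfmStrlenSketch {

    /** Calls the native strlen on a NUL-terminated copy of the given string. */
    static long strlen(String text) {
        Linker linker = Linker.nativeLinker();
        // Look up the symbol in the default (libc) lookup and describe its C signature.
        MethodHandle handle = linker.downcallHandle(
            linker.defaultLookup().find("strlen").orElseThrow(),
            FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.ADDRESS)
        );
        // The confined arena frees the native string when the block exits.
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment cString = arena.allocateFrom(text);
            return (long) handle.invokeExact(cString);
        } catch (Throwable t) {
            throw new RuntimeException("native call failed", t);
        }
    }

    public static void main(String[] args) {
        System.out.println("strlen = " + strlen("llama"));
    }
}
```

Generated bindings wrap the same three steps (symbol lookup, function descriptor, downcall handle) for every function in the headers passed to jextract, which is why a missing native library typically surfaces as a failed symbol lookup at runtime.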