KV cache pre-build for RLV: question latency 35s → 5s #83

@unamedkr

Summary

Pre-compute and persist KV caches for each document chunk during indexing, eliminating prefill overhead at query time. This is the single highest-impact speed optimization for RLV.

Current State

Each RLV question requires:

Locator (BM25):    0.01s  ← fast
Lookup (LLM):     15-20s  ← prefill 8s + generate 7s
Verifier:          0-8s
Total:            ~35s/question

The 8s prefill is spent re-reading the same chunk text every time. For 20 questions on the same document, we prefill the same chunks ~20 times.

Proposed Solution

# One-time indexing (slow, ~5min for 1.3MB doc)
quantcpp index document.txt --output document.kv/

# Per-question (fast, ~5s)
quantcpp rlv --index document.kv/ "Who directed Mercury Fur?"

Implementation:

# During indexing (pseudocode over the quant.cpp calls):
for chunk in gist.chunks:
    ctx = quant_new(model, config)
    # Prefill only: null_callback discards any tokens that are produced.
    quant_generate(ctx, chunk.text, null_callback, None)
    quant_save_context(ctx, f"document.kv/chunk_{chunk.id}.kv")

# During query (best_id is the chunk chosen by the Locator):
ctx = quant_new(model, config)
quant_load_context(ctx, f"document.kv/chunk_{best_id}.kv")  # replaces the prefill
quant_generate(ctx, question, on_token, data)  # generate only (~5s)
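The query path above assumes every chunk already has a cache file. A minimal sketch of a lookup that falls back to a normal prefill when a file is missing might look like the following. The helper names (`chunk_cache_path`, `prepare_context`) and the three injected callables are hypothetical stand-ins for the quant.cpp calls, not part of its API; only the `chunk_{id}.kv` naming comes from the pseudocode.

```python
from pathlib import Path
from typing import Callable, Tuple


def chunk_cache_path(index_dir: Path, chunk_id: int) -> Path:
    """Build the cache file path, mirroring the chunk_{id}.kv naming above."""
    return Path(index_dir) / f"chunk_{chunk_id}.kv"


def prepare_context(
    index_dir: Path,
    chunk_id: int,
    chunk_text: str,
    load_cache: Callable[[Path], object],        # stand-in for quant_load_context
    prefill: Callable[[str], object],            # stand-in for a prefill-only run
    save_cache: Callable[[object, Path], None],  # stand-in for quant_save_context
) -> Tuple[object, bool]:
    """Return (context, cache_hit) for one chunk.

    On a hit the saved cache is loaded and no prefill happens. On a miss
    the chunk is prefilled once and the result is saved for later queries.
    """
    path = chunk_cache_path(index_dir, chunk_id)
    if path.is_file():
        return load_cache(path), True

    ctx = prefill(chunk_text)
    path.parent.mkdir(parents=True, exist_ok=True)
    save_cache(ctx, path)
    return ctx, False
```

Passing the engine calls in as parameters keeps the sketch testable without a model loaded. A real integration would also need to invalidate caches when the model or quantization config changes, which is not covered here.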

Impact

Metric                  Before   After
Per-question latency    35s      ~5s
20-question benchmark   12min    ~2min
First-question latency  35s      35s (indexing amortized)
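To see when the one-time indexing cost pays for itself, here is a rough worked calculation. It uses only the estimates quoted in this issue (~5min indexing, 35s before, ~5s after); the function names are illustrative, and real numbers will depend on the document and model.

```python
# Figures quoted in this issue; treat them as rough estimates.
INDEX_SECONDS = 5 * 60   # one-time indexing, ~5min for the 1.3MB doc
BEFORE_SECONDS = 35      # per-question latency today
AFTER_SECONDS = 5        # per-question latency with pre-built caches


def total_seconds(questions: int, use_index: bool) -> int:
    """Total wall time for `questions` queries against one document."""
    if use_index:
        return INDEX_SECONDS + questions * AFTER_SECONDS
    return questions * BEFORE_SECONDS


def break_even_questions() -> int:
    """Smallest question count at which indexing is no slower overall."""
    saved_per_question = BEFORE_SECONDS - AFTER_SECONDS
    # Ceiling division using integers only.
    return -(-INDEX_SECONDS // saved_per_question)


if __name__ == "__main__":
    print("break-even:", break_even_questions(), "questions")
    for n in (1, 10, 20):
        print(n, total_seconds(n, False), total_seconds(n, True))
```

Under these figures the break-even point is 10 questions per document. The ~2min figure for the 20-question benchmark excludes indexing; including it, the total is 400s versus 700s today.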

quant.cpp Advantage

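quant.cpp already exposes context persistence through quant_save_context and quant_load_context, so this feature needs no new engine primitives. Other engines offer comparable state save/load (llama.cpp, for example), so the advantage here is the combination with KV compression (6.4x): each chunk's cache is only a few hundred KB on disk.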

Priority: P0

This is the difference between "demo" and "usable product". 35s/question is a demo; 5s/question is a tool people actually use.


Proposed by ClawTeam based on RLV Day 5 benchmarking
