81 commits
8d30ad7
Add HTTP intercept proxy for transparent LLM request reordering
SecretSettler Feb 26, 2026
5d20ecb
Add OpenClaw integration: tool_result reordering, markdown_header mod…
SecretSettler Feb 26, 2026
f526544
Fix proxy breaking OpenClaw tool loop: move metadata to headers, broa…
SecretSettler Mar 1, 2026
0c35e26
Merge branch 'sglang-monkeypatch' into http_intercept
SecretSettler Mar 1, 2026
f328d6a
Sync ContextPilot request_ids with SGLang rids in proxy and intercept…
SecretSettler Mar 1, 2026
9c773af
Add DEBUG log options
SecretSettler Mar 1, 2026
ad7170d
update openclaw readme
SecretSettler Mar 1, 2026
5434e2e
Update openclaw docs
SecretSettler Mar 1, 2026
c6d8792
Update openclaw README
SecretSettler Mar 1, 2026
378ee0a
Support http intercept proxy
SecretSettler Mar 2, 2026
66deff2
Merge branch 'main' into http_intercept
SecretSettler Mar 4, 2026
c38f6d7
Merge branch 'main' into http_intercept
SecretSettler Mar 4, 2026
66a5a64
Merge branch 'main' into http_intercept
SecretSettler Mar 5, 2026
442beb9
update openclaw example
SecretSettler Mar 6, 2026
39f9bd7
Add http intercept support
SecretSettler Mar 8, 2026
6b505b4
Fix http intercept bugs
SecretSettler Mar 9, 2026
56a88e1
docs: add Context Optimizer architecture design + SVG diagram
SecretSettler Mar 10, 2026
dd7353e
Update architecture SVG: add _compat bridge, pipeline/utils, refactor…
SecretSettler Mar 10, 2026
e4efe06
Redesign SVG: horizontal layout (landscape, left-to-right data flow)
SecretSettler Mar 10, 2026
a46d1c5
SVG: resize to 950x680, larger fonts for Notion readability
SecretSettler Mar 10, 2026
de8a766
SVG: all labels in Chinese, with each step annotated to explain what it does
SecretSettler Mar 10, 2026
af3218e
SVG v5: flow-oriented vertical layout, English labels, 880x720, shows…
SecretSettler Mar 10, 2026
8ca341e
Add PNG version of architecture diagram for Notion compatibility
SecretSettler Mar 10, 2026
b10ac5a
SVG v6: all rules applied, add KV Engine layer (SGLang radix tree, kv…
SecretSettler Mar 10, 2026
f59003e
SVG v7: clarify dedup (same-session) vs prefix cache (cross-session),…
SecretSettler Mar 10, 2026
da0ab19
docs: update design docs to 4-primitive architecture (Dedup, Add, Rep…
SecretSettler Mar 13, 2026
d56ab67
docs: fix primitives — all READ-only, Add=prefix hit with cache-only …
SecretSettler Mar 13, 2026
31248b5
svg: fix Add primitive — READ prefix-hit blocks (not INSERT)
SecretSettler Mar 13, 2026
a1d1a56
docs: fix execution order — Dedup → Repartition → Add → Reorder (sequ…
SecretSettler Mar 13, 2026
c5b9103
docs: add skip conditions — each primitive may be no-op when not appl…
SecretSettler Mar 13, 2026
99e19d6
svg: add skip conditions — dashed borders + 'skip if...' for each pri…
SecretSettler Mar 13, 2026
ef3e6d6
Merge branch 'main' into http_intercept
SecretSettler Mar 14, 2026
43f50d7
feat: cloud prompt cache proxy for Anthropic, OpenAI, MiniMax
SecretSettler Mar 17, 2026
9907f40
Add block-level dedup, OpenClaw integration, cache sync docs
SecretSettler Mar 22, 2026
87230df
Add P99 wall time to OpenClaw benchmark table
SecretSettler Mar 22, 2026
488a9a0
Clarify prompt tokens is per request
SecretSettler Mar 22, 2026
fd0f568
Add cost saving estimate at GPT-5.4 rates
SecretSettler Mar 22, 2026
baa1e79
Fix cost estimate to realistic volume (100 tasks/day)
SecretSettler Mar 22, 2026
b2589b2
Adjust to 5-person team, 50 tasks/day
SecretSettler Mar 22, 2026
11b63a6
Replace specific cost estimate with general value proposition
SecretSettler Mar 22, 2026
d714e41
Add per-person cost savings across model tiers
SecretSettler Mar 22, 2026
cb775c7
Remove cost estimate
SecretSettler Mar 22, 2026
75243f6
Add news, reorder OpenClaw install first, add how_it_works.md
SecretSettler Mar 22, 2026
e0cf185
Simplify wording: supports X, Y, Z
SecretSettler Mar 22, 2026
1773bef
Fix news wording
SecretSettler Mar 22, 2026
483980c
Clean up news wording
SecretSettler Mar 22, 2026
f5976ba
Integrate block dedup into conversation_tracker.deduplicate()
SecretSettler Mar 22, 2026
fb5e58c
Make docs optional — derive from doc_contents keys if not provided
SecretSettler Mar 22, 2026
7251076
Wire block dedup into /reorder API — pass doc_contents when string in…
SecretSettler Mar 22, 2026
025a57c
Fix /reorder API: sync deduped content back to output, fix id_to_str …
SecretSettler Mar 22, 2026
8f54618
Add pipeline diagram to OpenClaw guide
SecretSettler Mar 23, 2026
4682f41
Fix image filename to openclaw-cp.png
SecretSettler Mar 23, 2026
c7502e6
Move pipeline diagram to top, replace ASCII architecture
SecretSettler Mar 23, 2026
4ccc4a4
Reorder pipeline steps: align prefix cache first, then dedup, then re…
SecretSettler Mar 23, 2026
edc04ab
Simplify: reorder + dedup are the two operations, prefix alignment is…
SecretSettler Mar 23, 2026
3ae346c
Rename: file-level → tool-level, block-level → content-level dedup
SecretSettler Mar 23, 2026
ea74a8b
Rename tool-level → document-level (works for both agentic and RAG)
SecretSettler Mar 23, 2026
3c30864
Rename content-level → ContextBlock-level dedup
SecretSettler Mar 23, 2026
07e33f9
Fix naming: ContextBlock-level (whole doc) + Content-level (block dedup)
SecretSettler Mar 23, 2026
9237c3b
Update pipeline diagram
SecretSettler Mar 23, 2026
059275c
Delete unnecessary images
SecretSettler Mar 23, 2026
1fac931
Add --chunk-modulus CLI flag for tuning content-level dedup block size
SecretSettler Mar 23, 2026
13fb948
Add OpenClaw quick start as first example in Getting Started
SecretSettler Mar 23, 2026
d337173
Show real OpenClaw usage example with contract analysis conversation
SecretSettler Mar 23, 2026
dce540e
Show openclaw agent CLI commands in example
SecretSettler Mar 23, 2026
602e1f3
Move RAG benchmarks to docs/benchmarks/rag.md, keep OpenClaw + Mem0 +…
SecretSettler Mar 23, 2026
4e3fdf4
Add 5 long context memory result to Mem0 table
SecretSettler Mar 23, 2026
1d06c8b
Update mem0.md model to Qwen3-4B tp=1
SecretSettler Mar 23, 2026
be031a0
Move RAG reference to end of Performance section
SecretSettler Mar 23, 2026
f368201
Add M5 MacBook Air results to Apple Silicon benchmark table
SecretSettler Mar 23, 2026
02da78a
Use deduplication instead of dedup in how_it_works.md
SecretSettler Mar 23, 2026
53ad199
Fix setup table header and remove dead raw data link
SecretSettler Mar 23, 2026
8131f0f
Replace jargon: arms → with and without ContextPilot
SecretSettler Mar 23, 2026
9e4b450
Rewrite docs index: add OpenClaw, how_it_works, cache_sync, benchmark…
SecretSettler Mar 23, 2026
3045101
Remove dedup section from cache_sync.md — belongs in how_it_works.md
SecretSettler Mar 24, 2026
70b8bab
Fix reorder example: both requests must start with shared prefix
SecretSettler Mar 24, 2026
843f642
Fix reorder example: only Request 2 is reordered, shared docs moved t…
SecretSettler Mar 24, 2026
f097372
Fix 13 failing tests: update for new TTL API (request_id), dedup pipe…
SecretSettler Mar 24, 2026
e5fe5c3
Extend block dedup to assistant code blocks — dedup repeated code acr…
SecretSettler Mar 26, 2026
041f46c
Fix bugs from PR review: dedup join corruption, hash determinism, TTL…
SecretSettler Mar 29, 2026
68088f3
Bump version to 0.4.0 and add CHANGELOG entry
SecretSettler Mar 29, 2026
37 changes: 37 additions & 0 deletions CHANGELOG.md
All notable changes to ContextPilot will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [0.4.0] - 2026-03-29

### Added
- **Cloud prompt cache proxy** for Anthropic, OpenAI, and MiniMax — transparent prefix caching over cloud APIs
- **HTTP intercept proxy** — drop-in reverse proxy that extracts, reorders, and deduplicates documents in LLM requests without client changes
- **Block-level dedup** — content-defined chunking within tool results and assistant code blocks to deduplicate repeated content across turns
- **OpenClaw integration** — tool_result reordering, `markdown_header` extraction mode, deployment files, and quick-start guide
- **TTL-based cache eviction** policy with configurable tiers and automatic expiry
- **Conversation tracker** for multi-turn state: parent chain tracking, per-turn document history, and cross-turn block dedup
- `--chunk-modulus` CLI flag for tuning content-level dedup block size
- Cache sync documentation and `how_it_works.md` guide
- Pipeline diagram and architecture SVGs
- M5 MacBook Air results to Apple Silicon benchmark table
- P99 wall time to OpenClaw benchmark table
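
The content-defined chunking behind block-level dedup can be sketched in a few lines. This is an illustrative reconstruction, not the ContextPilot implementation — the boundary test, window size, and `modulus` default (cf. `--chunk-modulus`) are assumptions — but it shows why identical content yields identical chunks even when surrounding text shifts:

```python
import hashlib

def chunk_boundaries(text: str, modulus: int = 64, window: int = 16):
    """Split text into content-defined chunks: a boundary falls wherever the
    MD5 digest of the trailing window is 0 mod `modulus`. Because boundaries
    depend only on local content, identical passages chunk identically even
    when text before or after them changes."""
    chunks, start = [], 0
    for i in range(window, len(text)):
        digest = hashlib.md5(text[i - window:i].encode()).digest()
        if int.from_bytes(digest[:4], "big") % modulus == 0:
            chunks.append(text[start:i])
            start = i
    chunks.append(text[start:])
    return chunks

def dedup(chunks, seen=None):
    """Replace chunks whose hash was already seen with a short reference
    marker, so repeated content crosses the wire only once."""
    seen = set() if seen is None else seen
    out = []
    for c in chunks:
        h = hashlib.md5(c.encode()).hexdigest()
        out.append(f"[dup:{h[:8]}]" if h in seen else c)
        seen.add(h)
    return out, seen
```

A larger `modulus` produces larger chunks (fewer boundaries fire), trading dedup granularity for less bookkeeping — which is presumably what `--chunk-modulus` tunes.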

### Changed
- Renamed dedup levels: file-level → document-level, block-level → content-level, content-level → ContextBlock-level
- Intercept parser supports multiple extraction formats (XML, numbered, separator, JSON results) with auto-detection
- Cloud adapters inject `cache_control` breakpoints on system prompts and tool results (limited to 4 per Anthropic API)
- Proxy forwards request metadata via headers instead of body to avoid breaking tool loops
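
The `cache_control` injection can be pictured as a small request-body transform. The sketch below is hypothetical — it assumes the Anthropic Messages API shapes (a list-form `system` prompt and `tool_result` content blocks), and the 4-breakpoint cap mirrors the documented API limit:

```python
def inject_cache_control(body: dict, max_breakpoints: int = 4) -> dict:
    """Mark the system prompt and the most recent tool results as cacheable
    prefixes, staying under the API's breakpoint limit."""
    used = 0
    # System prompt first: it is the most stable prefix.
    # (Assumes list-form system; a plain string would need wrapping first.)
    for block in body.get("system", []):
        if used >= max_breakpoints:
            break
        block["cache_control"] = {"type": "ephemeral"}
        used += 1
    # Then tool results, newest messages first.
    for msg in reversed(body.get("messages", [])):
        for part in msg.get("content", []):
            if isinstance(part, dict) and part.get("type") == "tool_result":
                if used >= max_breakpoints:
                    return body
                part["cache_control"] = {"type": "ephemeral"}
                used += 1
    return body
```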

### Fixed
- Block dedup `"\n\n".join` corrupting content at chunk boundaries (phantom blank lines)
- `hash()` non-determinism in content-defined chunking — replaced with `hashlib.md5`
- `_chunk_modulus` missing from global declaration (CLI flag silently ignored)
- Proxy hardcoding `temperature=0` overwriting user values — now uses `setdefault`
- `default_ttl_seconds=0` silently becoming 300 (falsy `or` → `is not None`)
- `default_ttl` setter not syncing `_default_ttl_seconds`
- `update_from_response` double-counting partial cache hits
- Reconstruction functions using default config instead of original extraction config
- API key leak in error responses from `aiohttp.ClientError`
- Non-JSON upstream error crashing with `JSONDecodeError`
- Streaming connection leak on client disconnect (missing `finally` cleanup)
- Redundant `copy.deepcopy` doubling memory pressure per request
- Cycle detection added to `get_conversation_chain`
- Alpha header validation (non-numeric no longer crashes)
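
Two of the fixes above are classic Python pitfalls worth a minimal illustration (function names here are illustrative, not the actual ContextPilot code):

```python
DEFAULT_TTL = 300

def ttl_buggy(default_ttl_seconds=None):
    # Bug: `or` treats 0 as "unset", silently restoring the default.
    return default_ttl_seconds or DEFAULT_TTL

def ttl_fixed(default_ttl_seconds=None):
    # Fix: substitute the default only when the value is actually absent.
    return default_ttl_seconds if default_ttl_seconds is not None else DEFAULT_TTL

def apply_params_buggy(params: dict) -> dict:
    params["temperature"] = 0  # Bug: overwrites a user-supplied value.
    return params

def apply_params_fixed(params: dict) -> dict:
    params.setdefault("temperature", 0)  # Fix: fill in only when missing.
    return params
```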

## [0.3.5.post2] - 2026-03-05

### Added
77 changes: 51 additions & 26 deletions README.md

## News

- [2026/03] Supports [OpenClaw](https://openclaw.ai) — [guide](docs/guides/openclaw.md) | [benchmark](docs/benchmarks/openclaw.md)
- [2026/03] Supports cloud APIs (OpenAI, Anthropic, MiniMax) — [cache sync](docs/guides/cache_sync.md)
- [2026/03] ContextPilot can now run on **macOS / Apple Silicon** via [llama.cpp](docs/guides/mac_llama_cpp.md).
- [2026/02] ContextPilot v0.3.2 released, supporting [PageIndex](https://github.com/VectifyAI/PageIndex) and [Mem0](https://github.com/mem0ai/mem0).
- [2026/01] ContextPilot has been accepted to MLSys 2026 🎉! See you in Bellevue, WA, USA.
Long-context workloads (RAG, memory chat, tool-augmented agents) prepend many co…
ContextPilot sits between context assembly and inference to maximize prefix reuse and remove duplicates:

1. **Higher throughput & cache hits** — boosts prefill throughput and prefix cache hit ratio via context reuse.
2. **Drop-in solutions** — supports [OpenClaw](https://openclaw.ai) ([guide](docs/guides/openclaw.md)), [PageIndex](https://github.com/VectifyAI/PageIndex), [Mem0](https://github.com/mem0ai/mem0), [LMCache](https://github.com/LMCache/LMCache), [vLLM](https://github.com/vllm-project/vllm), [SGLang](https://github.com/sgl-project/sglang), [llama.cpp](docs/guides/mac_llama_cpp.md), and cloud APIs (OpenAI, Anthropic).
3. **No compromise in reasoning quality** — can even improve with extremely long contexts.
4. **Widely tested** — validated across diverse RAG and agentic workloads.

Expand All @@ -42,53 +44,63 @@ It maintains a **Context Index** of cached content, then per request applies **R

## Performance at a Glance

**OpenClaw Agent on RTX 5090** — 60 enterprise document analysis tasks ([claw-tasks](https://github.com/EfficientContext/ClawTasks)), Qwen3-4B-Instruct via SGLang. [Full results →](docs/benchmarks/openclaw.md)

| Metric | OpenClaw + SGLang | + ContextPilot | Δ |
|--------|-------------------|----------------|---|
| Prompt tokens / request (avg) | 45,771 | 33,622 | **-26.5%** |
| Prompt tokens / request (P99) | 92,785 | 51,581 | **-44.4%** |
| Wall time (avg) | 26.1s | 20.8s | **-20.4%** |
| Wall time (P99) | 68.8s | 50.4s | **-26.6%** |
| Accuracy | 245/245 | 245/245 | ✓ |

**Qwen3-4B on 1×A6000** — multi-turn memory chat with [Mem0](https://github.com/mem0ai/mem0) on the [LoCoMo](https://github.com/snap-research/locomo) benchmark.

| Context Size | Method | TTFT (s) | LLM Judge |
|--------------|--------|----------|-----------|
| 5 (long context memory) | SGLang | 0.1051 | 0.418 |
| | **SGLang + ContextPilot** | **0.0548** | 0.414 |
| 100 memories | SGLang | 0.1012 | 0.437 |
| | **SGLang + ContextPilot** | **0.0554** | 0.420 |

> ContextPilot results in the Mem0 table are without context annotation — an optional feature that adds the original importance ranking to reordered context blocks and can further improve answer quality (see the [paper](https://arxiv.org/abs/2511.03475)).

**Llama-3.2-1B on Apple Silicon** — MultihopRAG with llama.cpp, no GPU server required.

| Device | Method | Avg Latency (ms) |
|--------|--------|-----------------|
| M3 (MacBook Air, 16 GB) | llama.cpp | 3,315 |
| | **llama.cpp + ContextPilot** | **1,378** |
| M5 (MacBook Air, 32 GB) | llama.cpp | 2,157 |
| | **llama.cpp + ContextPilot** | **911** |

Settings: `Llama-3.2-1B-Instruct-Q4_K_M.gguf`, Metal offload (`-ngl 99`), `--cache-reuse 256`, `--parallel 4`, context 32768 tokens. See the [Mac + llama.cpp guide](docs/guides/mac_llama_cpp.md).

We also evaluated on academic RAG (Qwen3-32B, 4×A6000) and production MoE inference (DeepSeek-R1-671B, 16×H20) — see [RAG benchmarks](docs/benchmarks/rag.md) and [paper](https://arxiv.org/abs/2511.03475).

## Installation

**Requirements:** Python >= 3.10

---

### OpenClaw

ContextPilot works with both CPU and GPU backends for building the context index. The `[gpu]` extra enables GPU-accelerated distance computation (via `cupy-cuda12x`) and is faster for large batches; without it, ContextPilot falls back to the CPU backend automatically.
```bash
pip install contextpilot

# Start proxy (points to your LLM backend)
python -m contextpilot.server.http_server \
--port 8765 --infer-api-url http://localhost:30000 # SGLang
# or: --infer-api-url https://api.anthropic.com # Anthropic
# or: --infer-api-url https://api.openai.com # OpenAI
```

Then set OpenClaw's base URL to `http://localhost:8765/v1`. See the [full OpenClaw integration guide](docs/guides/openclaw.md) for UI setup, config file examples, and self-hosted model instructions.
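
The same wiring works for any OpenAI-compatible client that honors the standard OpenAI SDK environment variables (an assumption — check your client's docs; the variable names below are the common SDK conventions, not OpenClaw-specific):

```shell
# Route an OpenAI-compatible client through the ContextPilot proxy.
export OPENAI_BASE_URL="http://localhost:8765/v1"
export OPENAI_API_KEY="your-upstream-key"   # forwarded to the real backend
```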

---

### vLLM / SGLang

**From PyPI** — the vLLM and SGLang hooks are installed automatically:
```bash
pip install contextpilot
```

Docker images are also available for both all-in-one and standalone deployment.

## Getting Started

### Quick Start with OpenClaw

```bash
# Ask OpenClaw to analyze vendor contracts (ContextPilot deduplicates shared content automatically)
openclaw agent --message "Read contracts/contract_alpha_cloud.txt and summarize the liability terms."
openclaw agent --message "Read contracts/contract_beta_ai.txt and compare its liability with Alpha."
openclaw agent --message "Read contracts/contract_gamma_security.txt. Rank all three by liability exposure."
```

When the agent reads multiple documents sharing content (contracts from the same template, proposals with shared methodology), ContextPilot automatically deduplicates identical blocks — reducing prefill tokens by ~27% with zero accuracy loss. See the [integration guide](docs/guides/openclaw.md) and [benchmark](docs/benchmarks/openclaw.md).

---

### Quick Start with Context Ordering

Add **one call** (`cp_instance.optimize()`) before inference to rearrange context blocks so that shared content aligns into a common prefix, enabling cache reuse. An importance ranking in the prompt preserves accuracy.
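
Conceptually, the reordering can be sketched as follows — a simplified stand-in for `optimize()` (the real API, signatures, and ranking logic differ): within each request, blocks already seen in earlier requests move to the front in a globally consistent order, so prefix caches can reuse them.

```python
def align_prefix(requests: list[list[str]]) -> list[list[str]]:
    """Reorder each request's context blocks so that blocks shared with
    earlier requests come first, in a stable global order. Requests that
    share content then start with the same prefix, which KV caches can hit."""
    seen_order: dict[str, int] = {}  # block -> rank at first sighting
    out = []
    for blocks in requests:
        shared = sorted((b for b in blocks if b in seen_order),
                        key=seen_order.get)
        unique = [b for b in blocks if b not in seen_order]
        out.append(shared + unique)
        for b in blocks:  # register this request's blocks for later requests
            seen_order.setdefault(b, len(seen_order))
    return out
```

In the real system an importance ranking is kept in the prompt alongside the reordered blocks, which is what preserves answer quality.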
51 changes: 29 additions & 22 deletions contextpilot/__init__.py

Quick Start:
>>> from contextpilot.pipeline import RAGPipeline
>>>
>>> pipeline = RAGPipeline(
... retriever="bm25",
... corpus_path="corpus.jsonl",
... model="Qwen/Qwen2.5-7B-Instruct"
... )
>>>
>>> results = pipeline.run(queries=["What is AI?"])

See docs/reference/api.md for detailed documentation.

from .server.live_index import ContextPilot

from .dedup import (
    dedup_chat_completions,
    dedup_responses_api,
    DedupResult,
)

from .api import optimize, optimize_batch

from .retriever import (
    BM25Retriever,
    FAISSRetriever,
    FAISS_AVAILABLE,
    Mem0Retriever,
    create_mem0_corpus_map,
    MEM0_AVAILABLE,
)

__all__ = [
    # High-level pipeline API
    "RAGPipeline",
    "RetrieverConfig",
    "OptimizerConfig",
    "InferenceConfig",
    "PipelineConfig",
    # Core components
    "ContextIndex",
    "IndexResult",
    "IntraContextOrderer",
    "ContextPilot",
    # Deduplication
    "dedup_chat_completions",
    "dedup_responses_api",
    "DedupResult",
    # Convenience functions
    "optimize",
    "optimize_batch",
    # Retrievers
    "BM25Retriever",
    "FAISSRetriever",
    "FAISS_AVAILABLE",
    "Mem0Retriever",
    "create_mem0_corpus_map",
    "MEM0_AVAILABLE",
]
11 changes: 11 additions & 0 deletions contextpilot/dedup/__init__.py
@@ -0,0 +1,11 @@
from .block_dedup import (
    dedup_chat_completions,
    dedup_responses_api,
    DedupResult,
)

__all__ = [
    "dedup_chat_completions",
    "dedup_responses_api",
    "DedupResult",
]