fix: prevent Tantivy segment accumulation in AsyncSearcher bulk index builds (APP-4767) #13286

Draft
warp-dev-github-integration[bot] wants to merge 1 commit into master from oz/mem-triage-tantivy-bulk-index

@warp-dev-github-integration

Description

Fixes excessive Tantivy segment accumulation in AsyncSearcher::build_index_async, which caused multi-GB memory spikes for users with many Warp Drive notebooks/workflows.

Root cause: build_index_async previously sent one SearcherEvent::DocumentInserted per document. The background consumer batches events over a 75 ms window, capped at 100 events per batch, and each batch triggered its own commit(). A user with 200 notebooks would therefore generate at least 2 separate commit() calls during each index rebuild, and each commit creates a new in-RAM Tantivy segment. Sentry breadcrumbs confirmed segment accumulation reaching "Prepared commit 23" (23 segments), driving total heap usage to 8–11 GB.
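
For reviewers who want the arithmetic spelled out, here is a minimal sketch of the lower bound on commits per rebuild under the old per-document path. The helper name `min_commits` is hypothetical and does not exist in warp_search_core. It models only the 100-event cap; the 75 ms window can flush partial batches early, so real counts may be higher.

```rust
/// Hypothetical helper (not part of warp_search_core): lower bound on the
/// number of commit() calls when every batch of at most `max_batch` events
/// ends in its own commit. Ignores the 75 ms window, which can only add
/// more (smaller) batches.
fn min_commits(doc_count: usize, max_batch: usize) -> usize {
    assert!(max_batch > 0, "batch size must be positive");
    // Ceiling division: a partially filled final batch still commits.
    (doc_count + max_batch - 1) / max_batch
}
```

With the numbers from this description, `min_commits(200, 100)` gives 2, matching the "at least 2 commits" figure for 200 notebooks.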

Fix: Introduces SearcherEvent::BulkDocumentsInserted(Vec<FullTextSearchDocumentEntry>) and rewrites build_index_async to collect all documents into a single Vec and send them as one channel message. The background consumer processes the entire Vec in a single execute_operations call → one commit() → one Tantivy segment, regardless of how many documents are indexed. insert_document_async (used for low-frequency incremental updates) is unchanged.
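
The effect of the new variant can be illustrated with a toy model. This is a sketch under stated assumptions, not the actual implementation: `Doc` stands in for FullTextSearchDocumentEntry, `consume`, `send_per_document`, and `send_bulk` are hypothetical names, the channel is std::sync::mpsc rather than whatever AsyncSearcher uses, and a "commit" is just a counter rather than a Tantivy call. The 75 ms window is not modeled; only the 100-event cap is.

```rust
use std::sync::mpsc;

// Stand-in for FullTextSearchDocumentEntry (the real type is not shown here).
type Doc = String;

// Only the two variants discussed in this PR; the real enum may have more.
enum SearcherEvent {
    DocumentInserted(Doc),
    BulkDocumentsInserted(Vec<Doc>),
}

// Assumed cap taken from the PR description ("up to 100 events per batch").
const MAX_EVENTS_PER_BATCH: usize = 100;

/// Toy consumer: drains the channel in batches of at most
/// MAX_EVENTS_PER_BATCH events and counts one simulated commit per batch.
/// Returns (commits, documents_indexed).
fn consume(rx: mpsc::Receiver<SearcherEvent>) -> (usize, usize) {
    let mut commits = 0;
    let mut docs = 0;
    let mut pending = rx.into_iter().peekable();
    // peek() blocks until an event arrives or every sender has been dropped.
    while pending.peek().is_some() {
        for event in pending.by_ref().take(MAX_EVENTS_PER_BATCH) {
            docs += match event {
                SearcherEvent::DocumentInserted(_) => 1,
                // A bulk event counts as ONE event toward the cap,
                // however many documents it carries.
                SearcherEvent::BulkDocumentsInserted(batch) => batch.len(),
            };
        }
        commits += 1; // one simulated commit (and segment) per batch
    }
    (commits, docs)
}

/// Old behavior: one channel message per document.
fn send_per_document(docs: Vec<Doc>) -> (usize, usize) {
    let (tx, rx) = mpsc::channel();
    for doc in docs {
        tx.send(SearcherEvent::DocumentInserted(doc)).expect("receiver alive");
    }
    drop(tx); // close the channel so the consumer terminates
    consume(rx)
}

/// New behavior: all documents travel in a single channel message.
fn send_bulk(docs: Vec<Doc>) -> (usize, usize) {
    let (tx, rx) = mpsc::channel();
    tx.send(SearcherEvent::BulkDocumentsInserted(docs)).expect("receiver alive");
    drop(tx);
    consume(rx)
}
```

In this model, 250 documents sent one by one are split into batches of 100, 100, and 50, giving three simulated commits, while the bulk path yields a single commit for the same 250 documents.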

This fix complements #12819, which reduces per-document size via content truncation and budget reductions. Together, the two changes address both root causes of segment bloat.

Linked Issue

Testing

  • No new tests required — the change is an internal batching optimization that does not alter the observable search API. The existing warp_search_core searcher tests continue to pass.
  • cargo check and cargo clippy -p warp_search_core -- -D warnings pass with no errors or warnings.
  • I have manually tested my changes locally with ./script/run

Agent Mode

  • Warp Agent Mode - This PR was created via Warp's AI Agent Mode

Conversation: https://staging.warp.dev/conversation/3d7980ab-21db-484a-8d10-108238390087
Run: https://oz.staging.warp.dev/runs/019f1ef4-435d-7d06-97a4-d8eac0e04d7b
This PR was generated with Oz.

… builds (APP-4767)

build_index_async previously sent one SearcherEvent::DocumentInserted per
document, so with a 75ms batch window each batch of ≤100 docs triggered a
separate commit() and created a separate Tantivy segment.  Users with many
Warp Drive objects (notebooks, workflows, env-vars) accumulated 10–23+
segments per searcher (seen in Sentry breadcrumbs: 'Prepared commit 23'),
and each in-RAM segment consumes memory proportional to its content, driving
total footprints into the 8–11 GB range.

Fix: introduce SearcherEvent::BulkDocumentsInserted(Vec<…>) and rewrite
build_index_async to collect all documents into a single Vec and send them
as one channel message.  The background consumer processes the entire Vec in
a single execute_operations call → single commit() → single Tantivy segment,
regardless of how many documents are indexed.  insert_document_async
(used for low-frequency incremental updates) is unchanged.

Co-Authored-By: Oz <oz-agent@warp.dev>