Skip to content

feat: add LanceDB version snapshot auto-cleanup#793

Open
njuboy11 wants to merge 14 commits into
CortexReach:masterfrom
njuboy11:feat/lancedb-storage-maintenance
Open

feat: add LanceDB version snapshot auto-cleanup#793
njuboy11 wants to merge 14 commits into
CortexReach:masterfrom
njuboy11:feat/lancedb-storage-maintenance

Conversation

@njuboy11
Copy link
Copy Markdown

Summary

LanceDB stores a new version snapshot on every write operation but never cleans them up. After a few days the database directory grows to 300+ MB despite only ~5-30 MB of actual data.

This PR adds a storageMaintenance.autoCleanup config block that periodically calls Table.optimize() to remove old version snapshots.

Config

{
  "storageMaintenance": {
    "autoCleanup": {
      "enabled": true,
      "intervalHours": 24,
      "retentionDays": 1
    }
  }
}

Changes

  • Add storageMaintenance.autoCleanup to plugin config schema (enabled, intervalHours, retentionDays)
  • Add MemoryStore.runMaintenance(retentionDays) → calls table.optimize({ cleanupOlderThan })
  • Add non-blocking setInterval timer in registerService.start (fires 5 min after startup, then every intervalHours)
  • Latest version is NEVER deleted by LanceDB

Closes #792

- Add storageMaintenance.autoCleanup config:
  - enabled: bool (default false)
  - intervalHours: int (default 24)
  - retentionDays: int (default 1)
- Add MemoryStore.runMaintenance() to call table.optimize()
- Add non-blocking setInterval timer for periodic cleanup
- Add JSON schema for new config in openclaw.plugin.json
- Closes CortexReach#792
@TurboTheTurtle
Copy link
Copy Markdown
Contributor

Review findings:

P1: maintenanceTimer is scoped inside start() but used in stop()

maintenanceTimer is declared inside the start callback, but stop is a sibling callback and references that variable later (index.ts:4176, index.ts:4210). That variable is out of scope for stop, so this either fails type checking or throws when the service stops. Please hoist maintenanceTimer to the same scope as backupTimer so start and stop share the same timer handle.

P1: accidental /tmp/rerank.log write leaks recall queries

The auto-recall path now does a synchronous append to /tmp/rerank.log with the recall query text (index.ts:2596). That leaks user memory queries to a fixed temp file and adds sync filesystem I/O on a hot recall path. This looks like debug instrumentation and should be removed before merge.

P1: lock ENOENT falls back to lock-free writes

runWithFileLock() now catches ENOENT from proper-lockfile.lock() and proceeds with the write without acquiring any cross-process lock (src/store.ts:242-272). That breaks the serialization guarantee around LanceDB writes, and it is unrelated to the storage-cleanup feature. If lock acquisition fails, the write should fail or retry under the lock rather than entering the critical section lock-free.

Copy link
Copy Markdown
Collaborator

@rwmjhb rwmjhb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR #793 Review: feat: add LanceDB version snapshot auto-cleanup

Verdict: RESOLVE-CONFLICTS-FIRST | Author: njuboy11 | Merge state: DIRTY

Pipeline short-circuited at the conflict gate after R0 verification. Deep review deferred until the branch rebases cleanly onto the base.

Problem Statement (R1)

LanceDB creates version snapshots on writes and the plugin does not currently clean them up, so active installations can see the database directory grow far beyond the live data size over days of use. The PR attempts to add configurable periodic cleanup via LanceDB table optimization.

Why This Stopped Here

GitHub reports mergeable=CONFLICTING (merge_state_status=DIRTY) for this PR. Reviewing the diff now would:

  1. Give feedback against a branch the author must rewrite anyway
  2. Produce findings that may be invalidated by the conflict resolution
  3. Waste review cycles on code that cannot be merged as-is

Recommended Action

Author should:

  1. Rebase this branch onto the latest base (or merge the base into this branch)
  2. Resolve all merge conflicts
  3. Push the rebased branch — the re-review will be picked up automatically

Reviewed at 2026-05-17T08:54:39Z | R0+R1 gate | Conflict gate

Copy link
Copy Markdown
Collaborator

@rwmjhb rwmjhb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requesting changes. This PR has several P1 issues that need to be fixed before merge, independent of the current merge conflict.

  • maintenanceTimer appears to be scoped inside start() but referenced from sibling stop(). Please hoist it to shared plugin scope so both lifecycle callbacks use the same handle.
  • The auto-recall path writes recall query text to /tmp/rerank.log. That looks like debug instrumentation, leaks user memory queries to a fixed temp path, and adds sync I/O on a hot path. Please remove it.
  • runWithFileLock() appears to catch ENOENT from lock acquisition and continue into the write path without a cross-process lock. That weakens the LanceDB write serialization guarantee and is unrelated to storage cleanup. Please fail/retry under lock rather than performing lock-free writes.

After fixing, please rebase onto latest origin/master, preserve all current scripts/ci-test-manifest.mjs entries, and run:

npm run test:packaging-and-workflow
npm run test:core-regression

ErGouzi added 13 commits May 19, 2026 02:07
…d deadlock

When 5+ concurrent write calls hit the same proper-lockfile lock
simultaneously, all throw ENOENT due to graceful-fs race. The fixed
200ms retry delay causes all calls to retry synchronously, repeating
the race until all exhausted -> lock-free fallback -> concurrent
LanceDB writes deadlock (ECOMPROMISED).

Fix: add random jitter (200-600ms) to scatter retry timing so the
first caller acquires the lock while others wait in proper-lockfile's
internal retry queue.

Fixes the 5-minute gateway hang during auto-compaction under heavy
memory write load.
…ore file lock

Previously, runWithFileLock called proper-lockfile directly for every write.
With 5+ concurrent writes (smart-extract, regex capture, auto-capture,
inject, contextualize all firing at once), all hit proper-lockfile
simultaneously → thundering-herd ENOENT on every call → fallback to
lock-free → risk of deadlock under heavy load.

Fix: wrap runWithFileLock with runSerializedUpdate, so only ONE caller
attempts the file lock at a time. Others wait in the in-process Promise
queue. File lock remains as cross-process safety net for multi-instance
scenarios.

Root cause: runSerializedUpdate already existed but was only used by
patchMetadata; all other write paths (add, update, smartUpdate, etc.)
called runWithFileLock directly, bypassing the queue.

This eliminates the recurring ENOENT storm seen on every run completion.
…g API saturation)

The queue fix eliminated ENOENT but serialized all writes through the
queue. Since each write holds the lock during embedding API calls
(1-3s each), 5 sequential writes saturate the embedding API key for
5-10s. Auto-recall's embedding call then queues up at MiniMax and
times out at 5s.

The jitter fix alone (commit 8b9ec9c) is sufficient to prevent the
thundering-herd deadlock. ENOENT warnings are cosmetic noise; the
fallback to lock-free is safe for LanceDB's MVCC writes.

This reverts the queue wrapper added in f2ab00b while keeping the
jitter retry fix.
…API saturation)

Re-apply commit f2ab00b (reverted by a922205 due to perceived embedding
API saturation). Analysis on 2026-05-20 confirmed that embedding calls
already happen OUTSIDE runWithFileLock (callers pass pre-computed
vectors), so the lock only gates fast LanceDB writes (~50ms each).

The in-process Promise queue ensures only ONE caller hits proper-lockfile
at a time, eliminating the 6-way thundering-herd ENOENT that caused
deadlock under heavy write load (compaction + auto-capture +
smart-extraction firing concurrently).

Root cause: runSerializedUpdate already existed but was only used by
patchMetadata; all other write paths (store, bulkStore, update,
importEntry, smartUpdate, delete) called runWithFileLock directly,
bypassing the queue.

v1.1.0-beta.11
…deadlock

runWithFileLock now wraps with runSerializedUpdate (commit 5d953d6),
but update() still had its own inner runSerializedUpdate wrapper.
This created a nested deadlock:
  outer queue waits for inner → inner waits for outer → deadlock

Remove the inner wrapper; the outer queue in runWithFileLock is sufficient.
- Remove inner runSerializedUpdate from update() to prevent double-queue deadlock
- Remove debug logging, clean production code
- Queue confirmed working: single caller observed, no concurrent ENOENT, recall no longer times out
Consolidates v1.0.0 (broken) + v1.0.1 (deadlock fix) into single v1.0.0.
Changes:
- In-process Promise queue serializes LanceDB writes
- Fixes proper-lockfile ENOENT thundering herd → compaction deadlock
- Fixes update() nested runSerializedUpdate deadlock
- Clean production code, no debug logging
PR CortexReach#748 — realpath:false in lockfile.lock() to prevent ENOENT after
proactive stale lock cleanup. proper-lockfile v4 calls realpath() by
default; when the stale lock artifact is deleted, realpath() throws
ENOENT. realpath:false bypasses this entirely.

PR CortexReach#790 — drop late auto-recall results after timeout. If auto-recall
times out (5000ms), any late-arriving results from embed/rerank are
dropped at three checkpoints (post-retrieve, pre-metadata, pre-context)
to prevent stale/incomplete data injection.

备份: /root/.openclaw/npm/memory-lancedb-pro.bak.0520
@rwmjhb
Copy link
Copy Markdown
Collaborator

rwmjhb commented May 23, 2026

PR #793 Review: feat: add LanceDB version snapshot auto-cleanup

Verdict: RESOLVE-CONFLICTS-FIRST | Author: njuboy11 | Merge state: DIRTY

Pipeline short-circuited at the conflict gate after R0 verification. Deep review deferred until the branch rebases cleanly onto the base.

Problem Statement (R1)

LanceDB write operations create version snapshots that the plugin never cleans up, causing active database directories to grow far beyond the live data size over days. The PR proposes configurable periodic LanceDB table optimization to reclaim old snapshots.

Why This Stopped Here

GitHub reports mergeable=CONFLICTING (merge_state_status=DIRTY) for this PR. Reviewing the diff now would:

  1. Give feedback against a branch the author must rewrite anyway
  2. Produce findings that may be invalidated by the conflict resolution
  3. Waste review cycles on code that cannot be merged as-is

Recommended Action

Author should:

  1. Rebase this branch onto the latest base (or merge the base into this branch)
  2. Resolve all merge conflicts
  3. Push the rebased branch — the re-review will be picked up automatically

Reviewed at 2026-05-23T08:59:36Z | R0+R1 gate | Conflict gate

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: add LanceDB version snapshot auto-cleanup

3 participants