feat: add LanceDB version snapshot auto-cleanup#793
Conversation
- Add storageMaintenance.autoCleanup config: - enabled: bool (default false) - intervalHours: int (default 24) - retentionDays: int (default 1) - Add MemoryStore.runMaintenance() to call table.optimize() - Add non-blocking setInterval timer for periodic cleanup - Add JSON schema for new config in openclaw.plugin.json - Closes CortexReach#792
|
Review findings: P1:
P1: accidental The auto-recall path now does a synchronous append to P1: lock
|
rwmjhb
left a comment
There was a problem hiding this comment.
PR #793 Review: feat: add LanceDB version snapshot auto-cleanup
Verdict: RESOLVE-CONFLICTS-FIRST | Author: njuboy11 | Merge state: DIRTY
Pipeline short-circuited at the conflict gate after R0 verification. Deep review deferred until the branch rebases cleanly onto the base.
Problem Statement (R1)
LanceDB creates version snapshots on writes and the plugin does not currently clean them up, so active installations can see the database directory grow far beyond the live data size over days of use. The PR attempts to add configurable periodic cleanup via LanceDB table optimization.
Why This Stopped Here
GitHub reports mergeable=CONFLICTING (merge_state_status=DIRTY) for this PR. Reviewing the diff now would:
- Give feedback against a branch the author must rewrite anyway
- Produce findings that may be invalidated by the conflict resolution
- Waste review cycles on code that cannot be merged as-is
Recommended Action
Author should:
- Rebase this branch onto the latest base (or merge the base into this branch)
- Resolve all merge conflicts
- Push the rebased branch — the re-review will be picked up automatically
Reviewed at 2026-05-17T08:54:39Z | R0+R1 gate | Conflict gate
rwmjhb
left a comment
There was a problem hiding this comment.
Requesting changes. This PR has several P1 issues that need to be fixed before merge, independent of the current merge conflict.
maintenanceTimerappears to be scoped insidestart()but referenced from siblingstop(). Please hoist it to shared plugin scope so both lifecycle callbacks use the same handle.- The auto-recall path writes recall query text to
/tmp/rerank.log. That looks like debug instrumentation, leaks user memory queries to a fixed temp path, and adds sync I/O on a hot path. Please remove it. runWithFileLock()appears to catchENOENTfrom lock acquisition and continue into the write path without a cross-process lock. That weakens the LanceDB write serialization guarantee and is unrelated to storage cleanup. Please fail/retry under lock rather than performing lock-free writes.
After fixing, please rebase onto latest origin/master, preserve all current scripts/ci-test-manifest.mjs entries, and run:
npm run test:packaging-and-workflow
npm run test:core-regression…d deadlock When 5+ concurrent write calls hit the same proper-lockfile lock simultaneously, all throw ENOENT due to graceful-fs race. The fixed 200ms retry delay causes all calls to retry synchronously, repeating the race until all exhausted -> lock-free fallback -> concurrent LanceDB writes deadlock (ECOMPROMISED). Fix: add random jitter (200-600ms) to scatter retry timing so the first caller acquires the lock while others wait in proper-lockfile's internal retry queue. Fixes the 5-minute gateway hang during auto-compaction under heavy memory write load.
…ore file lock Previously, runWithFileLock called proper-lockfile directly for every write. With 5+ concurrent writes (smart-extract, regex capture, auto-capture, inject, contextualize all firing at once), all hit proper-lockfile simultaneously → thundering-herd ENOENT on every call → fallback to lock-free → risk of deadlock under heavy load. Fix: wrap runWithFileLock with runSerializedUpdate, so only ONE caller attempts the file lock at a time. Others wait in the in-process Promise queue. File lock remains as cross-process safety net for multi-instance scenarios. Root cause: runSerializedUpdate already existed but was only used by patchMetadata; all other write paths (add, update, smartUpdate, etc.) called runWithFileLock directly, bypassing the queue. This eliminates the recurring ENOENT storm seen on every run completion.
…g API saturation) The queue fix eliminated ENOENT but serialized all writes through the queue. Since each write holds the lock during embedding API calls (1-3s each), 5 sequential writes saturate the embedding API key for 5-10s. Auto-recall's embedding call then queues up at MiniMax and times out at 5s. The jitter fix alone (commit 8b9ec9c) is sufficient to prevent the thundering-herd deadlock. ENOENT warnings are cosmetic noise; the fallback to lock-free is safe for LanceDB's MVCC writes. This reverts the queue wrapper added in f2ab00b while keeping the jitter retry fix.
…API saturation) Re-apply commit f2ab00b (reverted by a922205 due to perceived embedding API saturation). Analysis on 2026-05-20 confirmed that embedding calls already happen OUTSIDE runWithFileLock (callers pass pre-computed vectors), so the lock only gates fast LanceDB writes (~50ms each). The in-process Promise queue ensures only ONE caller hits proper-lockfile at a time, eliminating the 6-way thundering-herd ENOENT that caused deadlock under heavy write load (compaction + auto-capture + smart-extraction firing concurrently). Root cause: runSerializedUpdate already existed but was only used by patchMetadata; all other write paths (store, bulkStore, update, importEntry, smartUpdate, delete) called runWithFileLock directly, bypassing the queue. v1.1.0-beta.11
…deadlock runWithFileLock now wraps with runSerializedUpdate (commit 5d953d6), but update() still had its own inner runSerializedUpdate wrapper. This created a nested deadlock: outer queue waits for inner → inner waits for outer → deadlock Remove the inner wrapper; the outer queue in runWithFileLock is sufficient.
- Remove inner runSerializedUpdate from update() to prevent double-queue deadlock - Remove debug logging, clean production code - Queue confirmed working: single caller observed, no concurrent ENOENT, recall no longer times out
Consolidates v1.0.0 (broken) + v1.0.1 (deadlock fix) into single v1.0.0. Changes: - In-process Promise queue serializes LanceDB writes - Fixes proper-lockfile ENOENT thundering herd → compaction deadlock - Fixes update() nested runSerializedUpdate deadlock - Clean production code, no debug logging
PR CortexReach#748 — realpath:false in lockfile.lock() to prevent ENOENT after proactive stale lock cleanup. proper-lockfile v4 calls realpath() by default; when the stale lock artifact is deleted, realpath() throws ENOENT. realpath:false bypasses this entirely. PR CortexReach#790 — drop late auto-recall results after timeout. If auto-recall times out (5000ms), any late-arriving results from embed/rerank are dropped at three checkpoints (post-retrieve, pre-metadata, pre-context) to prevent stale/incomplete data injection. 备份: /root/.openclaw/npm/memory-lancedb-pro.bak.0520
PR #793 Review: feat: add LanceDB version snapshot auto-cleanupVerdict: RESOLVE-CONFLICTS-FIRST | Author: njuboy11 | Merge state: DIRTY
Problem Statement (R1)LanceDB write operations create version snapshots that the plugin never cleans up, causing active database directories to grow far beyond the live data size over days. The PR proposes configurable periodic LanceDB table optimization to reclaim old snapshots. Why This Stopped HereGitHub reports
Recommended ActionAuthor should:
Reviewed at 2026-05-23T08:59:36Z | R0+R1 gate | Conflict gate |
Summary
LanceDB stores a new version snapshot on every write operation but never cleans them up. After a few days the database directory grows to 300+ MB despite only ~5-30 MB of actual data.
This PR adds a
storageMaintenance.autoCleanupconfig block that periodically callsTable.optimize()to remove old version snapshots.Config
{ "storageMaintenance": { "autoCleanup": { "enabled": true, "intervalHours": 24, "retentionDays": 1 } } }Changes
storageMaintenance.autoCleanupto plugin config schema (enabled,intervalHours,retentionDays)MemoryStore.runMaintenance(retentionDays)→ callstable.optimize({ cleanupOlderThan })setIntervaltimer inregisterService.start(fires 5 min after startup, then everyintervalHours)Closes #792