feat: add NIP-50 support by dskvr · Pull Request #160 · hoytech/strfry

dskvr · 2025-11-12T13:35:01Z

Overview

This PR implements NIP-50 (Search Capability) for strfry, enabling full-text search across Nostr events using BM25 ranking. The implementation includes:

Full-text search with relevance ranking (BM25 algorithm)
Configurable search backends (LMDB, Noop)
Background indexer with catch-up mechanism
Production-ready performance optimizations
Benchmark suite (unfinished)

Architecture

Core Components

Search Provider Interface (src/search/SearchProvider.h)

Abstract interface allowing pluggable search backends
Supports index creation, document insertion, and search queries

LMDB Search Backend (src/search/LmdbSearchProvider.h)

Inverted index stored in LMDB tables
Token-based posting lists with term frequency data
Document metadata for BM25 scoring (document length, kind)
Efficient packed binary format for postings

Background Indexer (in LmdbSearchProvider::runCatchupIndexer())

Async worker thread that catches up indexing of historical events
Clean shutdown and progress persistence via SearchState.lastIndexedLevId
Complemented by on-write indexing in the writer path (new events are indexed immediately)

Search Runner (src/search/SearchRunner.h)

Executes search queries within the existing query scheduler
Integrates alongside traditional index scans
Validates content by requiring presence of all parsed query tokens in event text
BM25 scoring (k1=1.2, b=0.75)
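
For reference, a minimal sketch of what a per-term BM25 contribution looks like with k1=1.2 and b=0.75. The function name, parameter names, and the IDF variant (the form with +1 inside the logarithm, which never goes negative) are assumptions for illustration, not necessarily what SearchRunner.h computes.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helper: BM25 contribution of one query term to one document.
// tf        = occurrences of the term in the document
// docLen    = document length in tokens
// avgDocLen = average document length across the index
// numDocs   = total indexed documents, docFreq = documents containing the term
// The IDF form used here is an assumption; the PR text only fixes k1 and b.
inline double bm25TermScore(uint32_t tf, uint32_t docLen, double avgDocLen,
                            uint64_t numDocs, uint64_t docFreq,
                            double k1 = 1.2, double b = 0.75) {
    if (tf == 0 || numDocs == 0 || avgDocLen <= 0.0) return 0.0;

    // Rare terms get a larger weight than common ones.
    double idf = std::log(1.0 + (double(numDocs) - double(docFreq) + 0.5) /
                                    (double(docFreq) + 0.5));

    // b controls how strongly long documents are penalized.
    double lengthNorm = 1.0 - b + b * (double(docLen) / avgDocLen);

    // k1 controls how quickly repeated occurrences saturate.
    return idf * (double(tf) * (k1 + 1.0)) / (double(tf) + k1 * lengthNorm);
}
```

A document's score for a multi-token query would be the sum of this value over the query tokens it contains.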

Database Schema

New LMDB tables (defined in golpe.yaml):

SearchIndex (DUPSORT)
  keys: tokens (lowercase, normalized strings)
  vals: postings [levId:48 bits][tf:16 bits] packed as host-endian uint64

SearchDocMeta (INTEGERKEY)
  keys: levIds (uint64)
  vals: packed [docLen:16][kind:16][reserved:32] as uint64

SearchState
  - lastIndexedLevId: tracks indexing progress
  - indexVersion: schema version for future migrations
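
To make the packed layouts concrete, here is a small sketch of pack/unpack helpers. The helper names are hypothetical, and the bit order (first-listed field in the most significant bits) is an assumption read off the bracket notation above; the actual code may order the fields differently.

```cpp
#include <cstdint>
#include <stdexcept>

constexpr uint64_t kMaxLevId = (1ULL << 48) - 1;

// Posting: [levId:48][tf:16], assumed levId in the upper 48 bits.
inline uint64_t packPosting(uint64_t levId, uint32_t tf) {
    if (levId > kMaxLevId) throw std::out_of_range("levId does not fit in 48 bits");
    uint64_t clampedTf = tf > 0xFFFF ? 0xFFFF : tf;  // saturate instead of wrapping
    return (levId << 16) | clampedTf;
}
inline uint64_t postingLevId(uint64_t posting) { return posting >> 16; }
inline uint16_t postingTf(uint64_t posting) { return uint16_t(posting & 0xFFFF); }

// Doc meta: [docLen:16][kind:16][reserved:32], assumed high to low.
inline uint64_t packDocMeta(uint32_t docLen, uint16_t kind) {
    uint64_t clampedLen = docLen > 0xFFFF ? 0xFFFF : docLen;
    return (clampedLen << 48) | (uint64_t(kind) << 32);  // reserved bits stay zero
}
inline uint16_t docMetaLen(uint64_t meta) { return uint16_t(meta >> 48); }
inline uint16_t docMetaKind(uint64_t meta) { return uint16_t((meta >> 32) & 0xFFFF); }
```

With levId in the high bits, the numeric order of packed postings follows levId order, so a posting for a lower levId always compares smaller regardless of its tf.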

Configuration

Key settings in strfry.conf (relay.search):

relay {
  search {
    enabled = true                  # Enable NIP‑50 search
    backend = "lmdb"                # or "noop"

    # Indexing/Query controls
    indexedKinds = "1, 30023"       # Kind pattern: numbers, ranges, '*', exclusions (-A-B)
    maxQueryTerms = 16              # Max terms parsed from a query
    maxPostingsPerToken = 100000    # Cap per token (pruning/vacuum TBD)
    maxCandidateDocs = 1000         # Max candidate docs before scoring
    overfetchFactor = 5             # Fetch limit × factor, bounded by maxCandidateDocs

    # Recency tie-breaker (optional)
    recencyBoostPercent = 0         # Integer percent (0–100); 1 = 1%

    # Candidate pre-scoring ranking
    candidateRankMode = "order"     # "order" | "weighted"
    candidateRanking = "terms-tf-recency"  # When mode="order": see supported orders below
    rankWeightTerms = 100           # When mode="weighted": weight for matched terms
    rankWeightTf = 50               # When mode="weighted": weight for aggregate TF
    rankWeightRecency = 10          # When mode="weighted": weight for recency
  }
}
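
As an illustration of the indexedKinds pattern syntax (numbers, ranges, `*`, and `-A-B` exclusions), here is a sketch of a matcher. The class name `KindPattern` is hypothetical and is not the PR's KindMatcher; the semantics assumed here are that exclusions always win and that a kind otherwise matches if it is covered by `*` or by any listed number or range.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical matcher for patterns such as "1, 30023", "1000-1999", "*", "-5000-5999".
class KindPattern {
  public:
    explicit KindPattern(const std::string &pattern) {
        size_t pos = 0;
        while (pos <= pattern.size()) {
            size_t comma = pattern.find(',', pos);
            if (comma == std::string::npos) comma = pattern.size();
            addItem(trim(pattern.substr(pos, comma - pos)));
            pos = comma + 1;
        }
    }

    bool matches(uint64_t kind) const {
        // Assumed precedence: an exclusion overrides any inclusion or wildcard.
        for (const auto &r : excluded)
            if (kind >= r.first && kind <= r.second) return false;
        if (wildcard) return true;
        for (const auto &r : included)
            if (kind >= r.first && kind <= r.second) return true;
        return false;
    }

  private:
    using Range = std::pair<uint64_t, uint64_t>;
    bool wildcard = false;
    std::vector<Range> included, excluded;

    static std::string trim(const std::string &s) {
        size_t b = s.find_first_not_of(" \t");
        if (b == std::string::npos) return "";
        size_t e = s.find_last_not_of(" \t");
        return s.substr(b, e - b + 1);
    }

    void addItem(std::string item) {
        if (item.empty()) return;
        if (item == "*") { wildcard = true; return; }
        bool exclude = item[0] == '-';
        if (exclude) item.erase(0, 1);  // drop the leading '-' of an exclusion
        size_t dash = item.find('-');
        // std::stoull throws on malformed input, which surfaces config typos early.
        uint64_t lo = std::stoull(item.substr(0, dash));
        uint64_t hi = dash == std::string::npos ? lo : std::stoull(item.substr(dash + 1));
        (exclude ? excluded : included).emplace_back(lo, hi);
    }
};
```

For example, `KindPattern("*, -5000-5999")` would accept kind 1 and reject kind 5500 under these assumed semantics.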

Supported candidateRanking orders (desc for each component):

terms-tf-recency (default)
terms-recency-tf
tf-terms-recency
tf-recency-terms
recency-terms-tf
recency-tf-terms
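
A sketch of how an order such as terms-tf-recency can be read: compare candidates lexicographically on the named components, each descending. The `Candidate` struct and the use of created_at as the recency signal are assumptions for illustration.

```cpp
#include <algorithm>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical per-candidate summary gathered before BM25 scoring.
struct Candidate {
    uint64_t levId;
    uint32_t matchedTerms;  // number of distinct query terms found in the document
    uint32_t tfSum;         // aggregate term frequency across matched terms
    uint64_t createdAt;     // recency signal (assumed to be the event timestamp)
};

// "terms-tf-recency": matched terms first, then aggregate TF, then recency,
// all in descending order.
inline void rankTermsTfRecency(std::vector<Candidate> &cands) {
    std::sort(cands.begin(), cands.end(), [](const Candidate &a, const Candidate &b) {
        return std::tie(a.matchedTerms, a.tfSum, a.createdAt) >
               std::tie(b.matchedTerms, b.tfSum, b.createdAt);
    });
}
```

The other five orders would permute the fields passed to `std::tie`.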

Configuration Parameters

enabled: Master switch for search functionality
backend: Search provider implementation ("lmdb" or "noop")
indexedKinds: Pattern of kinds to index (numbers/ranges/*/exclusions)
maxQueryTerms: Maximum query terms parsed
maxPostingsPerToken: Max postings per token key (upper bound during fetch; pruning TBD)
maxCandidateDocs: Maximum candidates for scoring
overfetchFactor: Candidate over-fetch before post-filtering
recencyBoostPercent: Recency tie-breaker percent (0–100; 1 = 1%)
candidateRankMode: order or weighted
candidateRanking: Order used when mode=order (list above)
rankWeightTerms/rankWeightTf/rankWeightRecency: Weights for mode=weighted
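
The interaction between limit, overfetchFactor, and maxCandidateDocs can be sketched as follows. The function name is hypothetical; the formula is the one stated in the config comment above (limit × factor, bounded by maxCandidateDocs).

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper: how many candidate documents to fetch before post-filtering.
inline uint64_t candidateBudget(uint64_t limit, uint64_t overfetchFactor,
                                uint64_t maxCandidateDocs) {
    // If limit * factor would exceed the cap (or overflow), the cap applies.
    if (overfetchFactor != 0 && limit > maxCandidateDocs / overfetchFactor)
        return maxCandidateDocs;
    return std::min(limit * overfetchFactor, maxCandidateDocs);
}
```

With the defaults shown above (overfetchFactor = 5, maxCandidateDocs = 1000), a query with limit 100 fetches up to 500 candidates, while any limit above 200 is capped at 1000.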

Usage

Enabling Search

Build strfry:
```
make -j$(nproc)
```

Update strfry.conf:

relay {
    search {
        enabled = true
        backend = "lmdb"
    }
}

Start strfry:
```
./build/strfry relay
```

Indexing behavior:

New events are indexed on write (writer path)
Background indexer catches up historical events and updates SearchState
NIP-11 advertises NIP-50 (50 in supported_nips) when the provider is healthy (index present and near head)

Search Queries

Clients can issue NIP-50 search queries using the search filter field:

{
  "kinds": [1],
  "search": "bitcoin lightning network",
  "limit": 100
}

Search features:

Multi-token queries with BM25 relevance scoring
Case-insensitive matching
Results ranked by relevance
Combines with other filter criteria (kinds, authors, tags, etc.)
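
To illustrate case-insensitive multi-token matching, here is a sketch of a tokenizer plus the "all query tokens must be present" check. It assumes ASCII-only splitting on non-alphanumeric characters and drops tokens outside the 2-48 character range mentioned in the commit notes later in this thread; whether the real Tokenizer.h drops or truncates long tokens, and how it handles non-ASCII text, is not stated, so treat those choices as assumptions.

```cpp
#include <cctype>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical tokenizer: lowercase ASCII alphanumeric runs of length 2..48.
inline std::vector<std::string> tokenize(const std::string &text,
                                         size_t minLen = 2, size_t maxLen = 48) {
    std::vector<std::string> tokens;
    std::string cur;
    auto flush = [&] {
        if (cur.size() >= minLen && cur.size() <= maxLen) tokens.push_back(cur);
        cur.clear();
    };
    for (unsigned char c : text) {
        if (std::isalnum(c)) cur.push_back(char(std::tolower(c)));
        else flush();  // any other byte ends the current token
    }
    flush();
    return tokens;
}

// True when every query token also occurs as a token of the content.
inline bool containsAllTokens(const std::vector<std::string> &queryTokens,
                              const std::string &content) {
    std::vector<std::string> docTokens = tokenize(content);
    std::unordered_set<std::string> seen(docTokens.begin(), docTokens.end());
    for (const auto &t : queryTokens)
        if (!seen.count(t)) return false;
    return true;
}
```

Under this sketch, the query "bitcoin lightning" matches content "Lightning on Bitcoin" but not "just bitcoin".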

Monitoring

Background indexer logs:

Search indexer catching up: <startLevId> to <endLevId> (head: <mostRecent>)

Query metrics include search-specific timings when relay.logging.dbScanPerf = true (scan=Search).

Performance Characteristics

Indexing Performance

Tokenization: ~10-15 us/event (depends on content length)
Index insertion: ~50-100 us/event (LMDB commit overhead)
Catch-up rate: ~5000-10000 events/sec on NVMe SSDs

Query Performance

Simple queries (1-2 tokens): 5-20 ms (p50), 30-60 ms (p95)
Complex queries (3+ tokens): 10-40 ms (p50), 50-100 ms (p95)
Performance scales with maxCandidateDocs and result set size

Tuning guidelines:

Lower maxCandidateDocs for faster queries with slightly lower recall
Increase overfetchFactor to improve recall for multi-token queries

Benchmark Suite

I put together a benchmark suite but did not finish it. I will likely remove it before marking the PR ready for review.

A comprehensive benchmark suite is included under `bench/`:

bench/
├── README.md              # Benchmark plan and structure
├── SCENARIOS.md           # Scenario creation guide
├── scenarios/
│   ├── small.yml         # 100k events
│   └── medium.yml        # 1M events
└── scripts/
    ├── prepare.sh        # Generate and populate test databases
    ├── run.sh            # Execute benchmarks
    ├── sysinfo.sh        # Collect system info (sanitized)
    └── report.py         # Generate Markdown reports

Running Benchmarks

Prepare a test database:
```
bench/scripts/prepare.sh -s scenarios/small.yml --workers 4
```
This generates cryptographically valid Nostr events using nak and ingests them into a fresh database.

Run the benchmark:

bench/scripts/run.sh -s scenarios/small.yml --out bench/results/raw/small-$(date +%s)

Generate reports:

bench/scripts/report.py bench/results/raw/* > bench/results/summary.md

Benchmark Metrics

Throughput: events/s sent and delivered
Latency: p50/p95/p99 for REQ scan, EVENT->OK, search queries
Resource usage: RSS memory, CPU utilization, disk I/O
Search-specific: index catch-up state, results cardinality
System profile: CPU model, memory, storage type (sanitized)

Testing

Manual Testing

Index a test database:

# Import some events
cat events.ndjson | ./build/strfry import

# Start relay with search enabled
./build/strfry relay

Issue search queries via WebSocket:

["REQ", "test-sub", {"kinds": [1], "search": "nostr bitcoin", "limit": 50}]

Verify results are returned in relevance order

Integration Points

DBQuery.h: Search queries execute alongside traditional index scans
ActiveMonitors.h: Search filters excluded from live subscription indexes (one-shot queries)
QueryScheduler.h: Search provider injected into query execution path
cmd_relay.cpp: Background indexer lifecycle management

Migration Notes

Existing Databases

For existing strfry installations:

Stop the relay
Rebuild with updated schema: cd golpe && ./build.sh && cd .. && make
Enable search in config
Restart relay

The indexer will automatically catch up on all existing events. Monitor logs for progress.

Rollback

To disable search without data loss:

Set relay.search.enabled = false in config
Restart relay

The search tables remain in the database but are not used. They can be manually removed using the mdb command-line tools if desired.

Known Limitations

Search is limited to content field of events (does not index tags or metadata)
No phrase matching or proximity operators (only individual tokens)
No stemming or lemmatization (exact token matching)
Large result sets may require tuning maxCandidateDocs for optimal performance
Search filters are one-shot queries and do not support live subscriptions

Future Enhancements

Potential improvements for future iterations:

Phrase search and proximity operators
Stemming and language-specific analyzers
Alternative backends (e.g., external Elasticsearch/MeiliSearch)
Search query cost accounting for rate limiting

Related Issues

Potentially resolves #40 (Request: NIP-50 Support)
Implements NIP-50 as specified at: https://github.com/nostr-protocol/nips/blob/master/50.md

TODO List before eligible as candidate for merge

Run code cleanup pass to remove any code-smell that may have been introduced during debug iterations
Squash PR into a single commit for a clean history

leesalminen · 2025-11-18T12:57:27Z

I've been working on testing this with @dskvr and have some feedback:

My relay has ~20m events, so this is a good test of the indexing functionality. We ran into some troubles with indexing (it stalled out after ~8m events), so @dskvr added some additional improvements in sandwichfarm/feature/nip-50-indexertweaks, which is the branch I've continued testing on.

I started indexing the db with this config:

    search {
        # Enable NIP-50 search capability (requires search backend)
        enabled = true

        # Search backend to use: lmdb, noop (or external in future)
        backend = "lmdb"

        # Maximum number of search terms allowed in a query
        maxQueryTerms = 6

        # Comma-separated kinds/ranges to index. Supports: single (1), ranges (1000-1999), wildcard (*), exclusions (-5000-5999)
        indexedKinds = "0,1,34236,30000-30003,30023,34550"

        # Maximum number of postings (documents) per search token
        maxPostingsPerToken = 100000

        # Maximum candidate documents to fetch during search (multiple of limit)
        maxCandidateDocs = 1000

        # Recency tie-breaker percent (0–100); 1 = 1% boost for newest events
        recencyBoostPercent = 1

        # Over-fetch multiplier to compensate for post-filtering (candidates = limit × factor, bounded by maxCandidateDocs)
        overfetchFactor = 5

        # Candidate ranking order before scoring: terms-tf-recency | terms-recency-tf | tf-terms-recency | tf-recency-terms | recency-terms-tf | recency-tf-terms
        candidateRanking = "terms-tf-recency"

        # Candidate ranking mode: order | weighted
        candidateRankMode = "weighted"

        # Weighted ranking weights (only used when candidateRankMode = "weighted")
        rankWeightTerms = 100
        rankWeightTf = 50
        rankWeightRecency = 10
    }

Indexing started out running great, but when I came back this morning my logs were getting spammed with:

[ 8B7FE6C0]INFO| Search indexer catching up: 13070001 to 13071000 (head: 18740192)

The counter never increments; it just keeps emitting this same log line over and over.

I tried using search_set_state to increment lastIndexedLevId by 1 and restarting the relay, but the logging issue persists.

It's possible this is a red herring log, where because of my indexedKinds filter, it's not counting up correctly.

My search_index_stats are:

Search index LMDB statistics:
  SearchIndex:
  entries        : 6375268
  depth          : 4
  branch pages   : 1430
  leaf pages     : 115305
  overflow pages : 0
  page size      : 4096 bytes
  approx size    : 478146560 bytes (456.00 MiB)
  SearchDocMeta:
  entries        : 6331151
  depth          : 4
  branch pages   : 687
  leaf pages     : 78768
  overflow pages : 0
  page size      : 4096 bytes
  approx size    : 325447680 bytes (310.37 MiB)
SearchState:
  lastIndexedLevId : 13070000
  indexVersion     : 1

On the bright side, query performance is great. Querying ["REQ", "test", { "search": "taylor swift" } ] is nearly instant, barely noticeable performance hit.

I think this PR is on the right track here, just needs some tweaking on rebuilding the index on large datasets.

Just my 2 sats.

hoytech · 2026-02-27T16:13:37Z

This is very impressive, thank you! Sounds like more testing is necessary, but yes this looks broadly like it's on the right track.

dskvr · 2026-03-02T18:14:17Z

@leesalminen Are you still running this branch? If so, have you seen any issues other than the debugging output bug?

@hoytech The only issue I am aware of is that when interrupting an indexing operation, the debugging output is not correct when resuming.

Also, there needs to be a method to destroy the index.

dskvr · 2026-03-03T12:03:43Z

Fixed infinite loop on missing events: The catch-up indexer now always advances lastProcessedLevId regardless of whether an event's payload exists, and persists progress at batch end, preventing the indexer from looping forever on sparse levId ranges.
Eliminated per-event write transactions for skipped events: Replaced individual LMDB write+fsync operations for each filtered/errored event with a single batch-end persist, dramatically reducing I/O on relays with restrictive indexedKinds.
Always log batch progress: Changed logging from only reporting when indexed > 0 to always showing indexed=N skipped=M range=[start..end] head=H, so the indexer no longer appears stuck when filtering large batches.
Added duplicate-detection to prevent MDB_APPENDDUP conflicts: indexEventWithTxnHook now checks if an event was already indexed (by the on-write path) before inserting postings, avoiding MDB_KEYEXIST errors when the catch-up indexer re-encounters events the live writer already handled.
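
The shape of the first and third fixes can be sketched with an in-memory stand-in for the event table. All names here are hypothetical and the real indexer works against LMDB; the point is only that the progress marker advances for every levId in the batch, including missing or filtered ones, and is persisted once by the caller.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <map>

struct BatchResult {
    uint64_t indexed = 0;
    uint64_t skipped = 0;
    uint64_t lastProcessedLevId = 0;
};

// events maps levId -> kind and may be sparse (deleted events leave gaps).
inline BatchResult runCatchupBatch(const std::map<uint64_t, uint64_t> &events,
                                   uint64_t lastIndexedLevId, uint64_t head,
                                   uint64_t batchSize,
                                   const std::function<bool(uint64_t)> &kindIsIndexed) {
    BatchResult r;
    r.lastProcessedLevId = lastIndexedLevId;
    uint64_t start = lastIndexedLevId + 1;
    uint64_t end = std::min(head, lastIndexedLevId + batchSize);

    for (uint64_t levId = start; levId <= end; levId++) {
        auto it = events.find(levId);
        if (it != events.end() && kindIsIndexed(it->second)) {
            r.indexed++;  // real code would insert postings here
        } else {
            r.skipped++;  // missing payload or kind not in indexedKinds
        }
        r.lastProcessedLevId = levId;  // always advance, even when nothing was indexed
    }
    return r;  // caller persists lastProcessedLevId once, at batch end
}
```

With lastIndexedLevId = 13070000, a batch size of 1000, and no indexable events in range, this returns indexed=0, skipped=1000, lastProcessedLevId=13071000, so the next batch starts at 13071001 rather than repeating the same range.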

@leesalminen The bugs you reported in feature/nip-50-indexertweaks should now be resolved, and that branch is now merged into this branch (feature/nip-50).

Adds ISearchProvider interface, NoopSearchProvider stub, SearchProvider factory (makeSearchProvider()), and KindMatcher for kind-range filtering. RelayServer grows a unique_ptr<ISearchProvider> searchProvider field plus searchIndexerThread/searchIndexerRunning members. RelayWebsocket advertises NIP-50 in supported_nips when the provider is healthy. QueryScheduler gains the searchProvider field to pass through to DBQuery. Initialization order in cmd_relay.cpp ensures searchProvider is set before websocket threads start, avoiding any data race. Co-Authored-By: sandwich <dskvr@users.noreply.github.com>

Full LMDB-backed full-text search implementation. Tokenizer.h splits text into normalized lowercase tokens (2-48 chars). LmdbSearchProvider.h implements BM25 scoring over SearchIndex (DUPSORT inverted index with packed levId:48/tf:16 postings) and SearchDocMeta (per-doc len+kind metadata). Supports configurable candidate ranking strategies and recency boost. Uses lmdb::from_sv<uint64_t>/to_sv<uint64_t> throughout for alignment-safe LMDB value reads/writes (MERGE-03: eliminates all reinterpret_cast UB). Includes dup-detection guard to prevent re-indexing already-indexed events and batch-process logging. Co-Authored-By: sandwich <dskvr@users.noreply.github.com>

SearchRunner.h integrates ISearchProvider with DBQuery for NIP-50 query execution. RelayWriter passes searchProvider to writeEvents() and includes a search indexing loop after write commit. RelayCron registers search index cleanup hooks for event expiration. cmd_relay.cpp initializes the search provider via makeSearchProvider() before websocket threads start, starts/joins the searchIndexer background thread. Co-Authored-By: sandwich <dskvr@users.noreply.github.com>

Adds three search maintenance commands (auto-discovered by golpe):
- search-reindex: catch-up indexer with checkpoint support, manual levId override, and batch-progress logging
- search-set-state: manually set lastIndexedLevId and indexVersion in SearchState table
- search-index-stats: report index size (token count, doc count, table sizes)

cmd_delete.cpp gains search index cleanup to remove indexed events on manual deletion.
Co-Authored-By: sandwich <dskvr@users.noreply.github.com>

golpe.yaml: add SearchIndex (DUPSORT inverted index), SearchDocMeta (per-doc BM25 metadata, MDB_INTEGERKEY), and SearchState (index progress tracking) tables.
src/apps/relay/golpe.yaml: add 13 relay__search__* config keys covering enabled flag, backend, indexedKinds, BM25 ranking weights, candidate ranking strategy, and query limits. Preserves all upstream additions: relay__auth__enabled, relay__auth__serviceUrl, relay__maxTagsPerFilter, relay__filterValidation__* block.
strfry.conf: add search { } config block after filterValidation block.
DBQuery.h: integrate SearchRunner and ISearchProvider; constructor accepts optional searchProvider parameter; process() dispatches to searchRunner when filter has search field and provider is healthy.
ActiveMonitors.h: add hasSearch() guard to skip search-only filters in the monitor fast path.
Co-Authored-By: sandwich <dskvr@users.noreply.github.com>

filters.h: add std::optional<std::string> search field and search key parsing to NostrFilter constructor; add hasSearch() predicate. Preserves upstream's try-catch wrapping for filter parse errors (MERGE-05), and relay__maxTagsPerFilter config key replacing hardcoded limit.
events.h/events.cpp: merge writeEvents() signature to accept both logLevel (uint64_t, replaces upstream's bool logDeletions) and ISearchProvider *searchProvider. Preserves upstream's a-tag deletion handling for kind-5 parameterized replaceable events (parseATag call, replaceDeletion index). Adds searchProvider->deleteEvent() in the deletion loop for search index consistency.
RelayReqWorker.cpp: pass searchProvider to QueryScheduler so DBQuery receives it at construction time for search-aware query dispatch.
Co-Authored-By: sandwich <dskvr@users.noreply.github.com>

Adds bench/scenarios/ (small.yml, medium.yml) for 10k and 1M event benchmark scenarios, and bench/scripts/ (prepare.sh, run.sh, report.py, sysinfo.sh) for reproducible benchmark runs with search enabled/disabled. bench/SCENARIOS.md documents the methodology. Co-Authored-By: sandwich <dskvr@users.noreply.github.com>

dskvr · 2026-04-27T23:26:33Z

Correction to my prior comment: the squashed branch from earlier today silently regressed several pieces of upstream evolution (the rebase used a diff-apply approach that overwrote upstream's later changes in non-conflicting files). The PR has now been re-rebased using cherry-pick onto current master, which preserves upstream's evolution properly. New commit chain:

47d29e4 feat: add SearchProvider abstraction and NIP-50 relay integration
4eb2b4d feat: add LmdbSearchProvider with BM25 scoring
ee8d095 feat: wire NIP-50 search indexer and runner into relay
792e61d feat: add search dbutils commands
5df178d feat: add search LMDB schema and config plumbing
d74fc58 feat: add NIP-50 filter parsing and query path integration
c67cd48 chore: add bench scenarios and search benchmark scripts

Verification on the current branch:

| Test | Result |
| --- | --- |
| `make -j16` | clean, exit 0 |
| `perl test/writeTest.pl` (full 408-line upstream version) | 30/30 pass, including all 11 `a`-tag deletion sub-tests |
| `perl test/filterFuzzTest.pl scan-limit` | 767 MATCH OK / 0 MISMATCH (45s) |
| `perl test/filterFuzzTest.pl scan` | 500 MATCH OK / 0 MISMATCH (30s) |
| `perl test/filterFuzzTest.pl monitor` | 202 MATCH OK / 0 MISMATCH (30s) |
| End-to-end NIP-50 search via `nak --search` | returns expected hits, no false positives |
| NIP-11 `supported_nips` | `[1,2,4,9,11,22,28,40,45,50,70,77]` |

Apologies for the noise; the diff against master should now be just the NIP-50 surface plus the bench scripts.

leesalminen · 2026-05-21T14:08:06Z

Testing report from production relay (~26M events, 52 GiB LMDB, 2 weeks uptime)

Tested the PR on a live nostr relay over ~2 weeks. Lots of good news, some real bugs, and (importantly) a clear workaround that gets writes flowing on prod again.

Build/env:

strfry 1.1.0-85-gc04ed93 (PR HEAD at start of testing)
Debian 12 bookworm, gcc 12.2.0, glibc 2.36, liblmdb 0.9.24
Default flags (-O3 -g), make -j8
8 cores / 15 GiB RAM
indexedKinds = "0,1,34236,30000-30003,30023,34550"
Steady production write/read traffic throughout

TL;DR

Search works (queries in ~250 ms; new events searchable within seconds after a clean restart).
Reindex crashes on a specific event in our dataset — but the live catch-up indexer cleanly resumes past it after the LMDB atomic commit, so no manual search_set_state needed in practice.
Long-term, the relay crashes every 2-3 days under prod load (SIGABRT double-free + one SIGSEGV).
With relay.search.enabled = true and indexer caught up to head, the Writer thread wedges within seconds. Confirmed unrelated to the policy plugin. Flipping search.enabled = false immediately restores normal writes. So search and writes are currently mutually exclusive on this build.

✅ What works well

Search reads: NIP-50 query for "bitcoin" returned 3 ranked, relevant matches in 253 ms on the warm index.
Live indexing of fresh writes: imported a unique-marker kind:1 via strfry import, restarted relay (search enabled), catch-up indexer picked it up in ~5 s, search for the marker returned the event in 194 ms.
search_set_state is a great escape hatch for the reindex crash below — though it turned out we didn't even need it (see Bug 1).

🐛 Bug 1 — `search_reindex --restart` aborts on bad LevId

Running offline search_reindex --restart against this dataset reproducibly crashes:

Progress: 7582000/25856845 events scanned, 3657372 indexed, 3924628 skipped
Progrdouble free or corruption (!prev)

Loguru caught a signal: SIGABRT

Post-mortem search_index_stats showed:

SearchDocMeta entries : 3,693,755
SearchState lastIndexedLevId : 7,625,146
indexVersion : 0   (in-progress)

So the actual offending LevId is ~7,625,146 (the truncated Progress line was about to log advancing past it; LMDB had already committed up to that point atomically). Single-threaded, no concurrent traffic — the bug is in the indexer itself, not a race.

What worked: restarting the relay with relay.search.enabled = true resumed the catch-up indexer at LevId 7,625,147 and it processed everything past that without issue. No need to search_set_state --lev-id since the LMDB partial commit had already advanced. The remaining ~18M events indexed fine and search became fully usable.

Suggestion: search_reindex should catch the abort signal, log the offending event ID and LevId, advance past it, and continue, optionally reporting a count of skipped events at the end. On a 26M-event DB, having to manually intervene per crashing event is rough.

I can extract LevId range 7,625,000–7,625,200 from our DB and send a minimal repro if you suggest a recipe (e.g., strfry scan a small JSONL slice through search_reindex against a throwaway DB).

🐛 Bug 2 — intermittent crashes during long-term running

Over ~14 days uptime on this PR, the relay crashed 4 times (all auto-restarted by systemd; LMDB transactional integrity kept indexed state intact across each):

| Date | Signal | Signature |
| --- | --- | --- |
| May 7 | SIGABRT | `free(): double free detected in tcache 2` |
| May 15 | SIGABRT | `free(): double free detected in tcache 2` |
| May 17 | SIGSEGV | (no extra info) |
| May 21 | SIGABRT | `free(): double free detected in tcache 2` |

The May 7 crash happened with relay.search.enabled = false too, so the bug isn't strictly gated by the flag — Search provider initialized: backend=lmdb enabled=0 healthy=1 shows the provider inits regardless of enabled, so shared code may still be reachable.

The SIGSEGV on May 17 (vs. the otherwise consistent SIGABRT) suggests broader memory-management issues, not a single specific double-free site. Cadence: roughly one crash every 2-3 days under steady prod traffic.

No core dumps captured — this host had ulimit -c 0 and no systemd-coredump. Happy to re-run with cores enabled if a backtrace would help.

🐛 Bug 3 — wildly inconsistent no-match query latency

Searching for unique tokens not in the index returned EOSE with 0 events but at very different speeds across 5 trials:

marker=klaudejw50ahbz  → EOSE in 20,008 ms  (0 results)
marker=klaudewlutbdvg  → EOSE in     192 ms  (0 results)
marker=klaudeefplccci  → EOSE in 16,572 ms  (0 results)
marker=klaude34rlwdk8  → EOSE in     191 ms  (0 results)

For 6-char token prefixes that should miss the inverted index entirely, an alternating ~190 ms vs ~17–20 s pattern is surprising. Could be tokenizer behavior on specific prefixes, posting-list fanout, or a cold-cache effect. From an end-user perspective, 20s for zero results is rough.

🐛 Bug 4 — search indexer wedges the Writer thread after catch-up

This is the show-stopper for us on prod, but the cause is now well-isolated:

Symptom: with relay.search.enabled = true and the indexer caught up to head, the Writer thread goes silent shortly after restart:

Right after restart: Writer logs a flurry of Inserted event lines for ~10–20 seconds.
Then: Writer log lines drop to 0–3 per minute on a relay that should be ingesting many events per second.
DB size stops growing.
New publishes from external clients (tested from a separate machine to wss://no.str.cr) get no OK response, 30 s timeout. The events never appear in [Writer] logs.
Reads/queries continue to work normally throughout.

Isolation:

Initially suspected our writevalidate.js policy plugin (it has a known sin: silently continue-ing on JSON parse errors in its catch block, which would leave strfry waiting forever for a policy response that never comes). Disabled the plugin entirely and restarted → same wedge symptoms appeared in the same timeframe. Not the plugin.
Then disabled search (relay.search.enabled = false) and restarted → writes flow normally: 281 [Writer] log lines in 100 s vs. 3 lines in 90 s with search enabled. External publish from a separate machine gets OK accepted=true in ~25 ms after WS open.

So the indexer thread, in steady state after catch-up completes, appears to be contending with the Writer thread for the LMDB write txn and the writer either deadlocks or is starved out. The mutual exclusivity of search vs. writes makes the PR unshippable for prod as-is, but the failure mode is well-defined and the workaround is trivial (just don't turn search on).

Misc observations / doc asks

relay.search.enabled = false silently no-ops the search filter field instead of CLOSED-ing the REQ. With search disabled, ["REQ", id, {"search": "foo", "limit": 5}] returned 5 random recent events as if no filter were present. Most relays would close the sub with an "unsupported" message — this behavior could confuse NIP-50 clients into thinking the relay supports search. Either advertise it correctly in NIP-11 or CLOSED the sub when search is off.
Search provider initialized: ... enabled=0 healthy=1 appears even when relay.search.enabled = false. Confirm whether the disabled-path is truly inert vs. just gates queries.
Reindex throughput: ~6,500 events/sec offline (~1.6 MB/s). 66 min projected for a 26M-event DB. Worth calling out in the PR description so operators can plan downtime.
Live catch-up throughput: ~400–500 LevIds/sec sustained, with ~60% match rate against our indexedKinds. Caught up 18M events in roughly 10–14 hours under live load.
bm25 { k1, b } sub-block: existing configs predating the PR won't have it. golpe's parser tolerates the missing block (defaults used), but the migration notes should call this out: either "block is optional" or "recommended values: k1=1.2, b=0.75".
search_index_stats hangs when the relay is running: command blocked indefinitely until SIGTERM. Worked fine when relay was stopped. Might be read-txn related.
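
As a sanity check on the two throughput figures above, using the rounded numbers from this report (26M events offline, 18M LevIds of live catch-up):

$$
t_{\text{reindex}} \approx \frac{26 \times 10^{6}\ \text{events}}{6500\ \text{events/s}} = 4000\ \text{s} \approx 66.7\ \text{min}
$$

$$
t_{\text{catch-up}} \approx \frac{18 \times 10^{6}\ \text{LevIds}}{500\ \text{to}\ 400\ \text{LevIds/s}} = 36000\ \text{to}\ 45000\ \text{s} \approx 10\ \text{to}\ 12.5\ \text{h}
$$

Both results line up with the figures quoted above (66 min projected offline, roughly 10–14 hours observed live), which suggests the live path is about 13 to 16 times slower than offline reindexing on this host.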

Happy to dig deeper on any of these — get a coredump for Bug 2, isolate the bad event for Bug 1, run more search latency experiments for Bug 3, or test specific theories for Bug 4. Just let me know what would be most useful.

leesalminen · 2026-05-21T14:57:41Z

Bug 4 fix — verified working on prod

Patched the build on our prod relay and Bug 4 (Writer self-deadlock) is resolved. Confirmed end-to-end with search enabled and live traffic:

Writer activity: 82 log lines in 120 s with relay.search.enabled = true (vs. 3 lines in 90 s pre-fix — the wedge).

Live publish + search round-trip from an external client to wss://no.str.cr:

open 169ms
OK accepted=true at 283ms (event published)
REQ {"search": "<unique-marker>"} sent
✓ FOUND OUR EVENT at 2486ms (search query latency: 203ms)

I repeated the test multiple times and consistently saw both new writes and NIP-50 searches working simultaneously.

The diff is small. It defers searchProvider->deleteEvent() until after the outer write txn commits, mirroring the existing inline indexEvent pattern in RelayWriter.cpp:94. It also skips re-indexing events that are getting deleted in the same batch (per the earlier review feedback).
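
To show the pattern in isolation, here is a self-contained mock. None of these types exist in strfry; `MockEnv` stands in for an LMDB environment that allows one write transaction at a time, and it throws on a nested attempt so the problem is observable (per the report above, the real Writer thread blocks instead).

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// Mock environment: at most one open write transaction.
struct MockEnv {
    bool writerOpen = false;
    std::vector<uint64_t> removedFromSearch;
};

// Mock write txn: acquiring while another is open is the self-deadlock case.
struct MockWriteTxn {
    MockEnv &env;
    explicit MockWriteTxn(MockEnv &e) : env(e) {
        if (env.writerOpen) throw std::runtime_error("nested write txn (would deadlock)");
        env.writerOpen = true;
    }
    ~MockWriteTxn() { env.writerOpen = false; }
};

// Stand-in for searchProvider->deleteEvent(): opens its own top-level write txn.
inline void searchDeleteEvent(MockEnv &env, uint64_t levId) {
    MockWriteTxn txn(env);
    env.removedFromSearch.push_back(levId);
}

// Problematic shape: search deletion runs while the outer txn is still open.
inline void writeBatchInline(MockEnv &env, const std::vector<uint64_t> &deleted) {
    MockWriteTxn outer(env);
    for (uint64_t levId : deleted) searchDeleteEvent(env, levId);
}

// Deferred shape: collect levIds under the outer txn, process them after it ends.
inline void writeBatchDeferred(MockEnv &env, const std::vector<uint64_t> &deleted) {
    std::vector<uint64_t> pending;
    {
        MockWriteTxn outer(env);
        for (uint64_t levId : deleted) pending.push_back(levId);
    }  // outer txn ends here (txn.commit() in the real code)
    for (uint64_t levId : pending) searchDeleteEvent(env, levId);
}
```

One trade-off of deferring: the event deletion and the search-index removal are no longer in the same transaction, so a crash between the two can leave a stale posting behind until some later cleanup.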

I'm not opening a PR for this — sharing as a starting point in case it's useful. Treat as a sketch, not as a clean upstreamable commit; you may want to handle WriterPipeline.h callers (mesh sync / strfry stream) similarly, depending on how you want search to behave for those write paths.

Diff against current PR HEAD (`c04ed93`)

diff --git a/src/events.h b/src/events.h
index 6757cd8..3d84124 100644
--- a/src/events.h
+++ b/src/events.h
@@ -87,7 +87,7 @@ struct EventToWrite {
 };
 
 
-void writeEvents(lmdb::txn &txn, NegentropyFilterCache &neFilterCache, std::vector<EventToWrite> &evs, uint64_t logLevel = 1, class ISearchProvider *searchProvider = nullptr);
+void writeEvents(lmdb::txn &txn, NegentropyFilterCache &neFilterCache, std::vector<EventToWrite> &evs, uint64_t logLevel = 1, class ISearchProvider *searchProvider = nullptr, std::vector<uint64_t> *outLevIdsToRemoveFromSearch = nullptr);
 bool deleteEventBasic(lmdb::txn &txn, uint64_t levId);
 
 template <typename C>
diff --git a/src/events.cpp b/src/events.cpp
index 9c98c99..9a75971 100644
--- a/src/events.cpp
+++ b/src/events.cpp
@@ -249,7 +249,7 @@ static bool isEventABeforeEventB(const PackedEventView &a, const PackedEventView
     return a.created_at() < b.created_at() || (a.created_at() == b.created_at() && a.id() > b.id());
 }
 
-void writeEvents(lmdb::txn &txn, NegentropyFilterCache &neFilterCache, std::vector<EventToWrite> &evs, uint64_t logLevel, ISearchProvider *searchProvider) {
+void writeEvents(lmdb::txn &txn, NegentropyFilterCache &neFilterCache, std::vector<EventToWrite> &evs, uint64_t logLevel, ISearchProvider *searchProvider, std::vector<uint64_t> *outLevIdsToRemoveFromSearch) {
     bool logDeletions = logLevel > 0;
     std::sort(evs.begin(), evs.end(), [](auto &a, auto &b) {
         auto aC = a.createdAt();
@@ -386,14 +386,14 @@ void writeEvents(lmdb::txn &txn, NegentropyFilterCache &neFilterCache, std::vect
                     updateNegentropy(PackedEventView(evToDel->buf), false);
                     deleteEventBasic(txn, levId);
 
-                    // Remove from search index
-                    if (searchProvider && searchProvider->healthy()) {
-                        try {
-                            searchProvider->deleteEvent(levId);
-                        } catch (std::exception &e) {
-                            // Don't fail deletions if search removal fails, just log
-                            LE << "Search delete failed for levId=" << levId << ": " << e.what();
-                        }
+                    // Stash levId for search-index removal AFTER outer txn commits.
+                    // Calling searchProvider->deleteEvent() here would open a second
+                    // top-level LMDB write txn while we still hold this one, causing
+                    // the Writer thread to self-deadlock on the LMDB writer mutex.
+                    // The actual deleteEvent call happens in the caller (e.g. RelayWriter)
+                    // after txn.commit(), mirroring the existing inline indexEvent pattern.
+                    if (outLevIdsToRemoveFromSearch) {
+                        outLevIdsToRemoveFromSearch->push_back(levId);
                     }
                 }
 
diff --git a/src/apps/relay/RelayWriter.cpp b/src/apps/relay/RelayWriter.cpp
index ba3e6d2..785605a 100644
--- a/src/apps/relay/RelayWriter.cpp
+++ b/src/apps/relay/RelayWriter.cpp
@@ -1,5 +1,7 @@
 #include "RelayServer.h"
 
+#include <unordered_set>
+
 #include "PluginEventSifter.h"
 #include "PrometheusMetrics.h"
 
@@ -62,10 +64,15 @@ void RelayServer::runWriter(ThreadPool<MsgWriter>::Thread &thr) {
 
         // Do write
 
+        // Collected inside writeEvents (under the outer write txn) and processed
+        // AFTER that txn commits, to avoid a self-deadlock: searchProvider->deleteEvent
+        // opens its own top-level write txn and LMDB only allows one per env/thread.
+        std::vector<uint64_t> deletedLevIdsForSearch;
+
         try {
             auto t0 = std::chrono::steady_clock::now();
             auto txn = env.txn_rw();
-            writeEvents(txn, neFilterCache, newEvents, 1, searchProvider.get());
+            writeEvents(txn, neFilterCache, newEvents, 1, searchProvider.get(), &deletedLevIdsForSearch);
             txn.commit();
             auto t1 = std::chrono::steady_clock::now();
             auto us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
@@ -90,10 +97,18 @@ void RelayServer::runWriter(ThreadPool<MsgWriter>::Thread &thr) {
 
         // Index events for search (NIP-50)
         // Always index and run search when provider exists, regardless of healthy() status
-        // healthy() only gates NIP-11 advertisement
+        // healthy() only gates NIP-11 advertisement.
+        // Both indexEvent and the deleteEvent loop below run AFTER txn.commit() because
+        // each one opens its own top-level LMDB write txn, which would self-deadlock
+        // if attempted inside the outer write txn.
         if (searchProvider) {
+            // Skip indexing for events that were also marked for deletion in this same
+            // batch (e.g. inserted then immediately superseded by a NIP-09 delete or a
+            // replaceable/d-tag replacement later in the batch). Avoids wasted index+delete.
+            std::unordered_set<uint64_t> deletedSet(deletedLevIdsForSearch.begin(), deletedLevIdsForSearch.end());
+
             for (auto &newEvent : newEvents) {
-                if (newEvent.status == EventWriteStatus::Written) {
+                if (newEvent.status == EventWriteStatus::Written && !deletedSet.count(newEvent.levId)) {
                     PackedEventView packed(newEvent.packedStr);
                     try {
                         searchProvider->indexEvent(newEvent.levId, newEvent.jsonStr, packed.kind(), packed.created_at());
@@ -103,6 +118,17 @@ void RelayServer::runWriter(ThreadPool<MsgWriter>::Thread &thr) {
                     }
                 }
             }
+
+            // Deferred search-index removal for events deleted/replaced in this batch.
+            if (searchProvider->healthy()) {
+                for (auto levId : deletedLevIdsForSearch) {
+                    try {
+                        searchProvider->deleteEvent(levId);
+                    } catch (std::exception &e) {
+                        LE << "Search delete failed for levId=" << levId << ": " << e.what();
+                    }
+                }
+            }
         }
 
         // Log
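
The deferral pattern in this diff can be looked at in isolation. What follows is a minimal sketch, not strfry code: `FakeEnv`, `FakeWriteTxn`, `FakeSearchIndex`, `writeBatch` and `runWriterOnce` are hypothetical stand-ins, and the fake env throws on a nested top-level write txn where real LMDB would block on the writer mutex, so the misuse is observable instead of hanging.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for an LMDB environment: one write txn at a time.
struct FakeEnv {
    bool writerActive = false;
};

// Hypothetical write txn. Real LMDB would block on a second writer;
// this sketch throws so the problem shows up immediately.
struct FakeWriteTxn {
    FakeEnv &env;
    bool open = true;
    explicit FakeWriteTxn(FakeEnv &e) : env(e) {
        if (env.writerActive) throw std::logic_error("nested top-level write txn");
        env.writerActive = true;
    }
    FakeWriteTxn(const FakeWriteTxn &) = delete;
    FakeWriteTxn &operator=(const FakeWriteTxn &) = delete;
    void commit() {
        if (open) { env.writerActive = false; open = false; }
    }
    ~FakeWriteTxn() { commit(); }  // release the writer slot if still held
};

// Hypothetical search index whose deleteEvent opens its own write txn.
struct FakeSearchIndex {
    FakeEnv &env;
    std::vector<uint64_t> removed;
    void deleteEvent(uint64_t levId) {
        FakeWriteTxn txn(env);
        removed.push_back(levId);
        txn.commit();
    }
};

// Phase 1: runs under the outer txn and only records what must be removed.
void writeBatch(FakeWriteTxn &, const std::vector<uint64_t> &superseded,
                std::vector<uint64_t> *outToRemove) {
    for (uint64_t levId : superseded) {
        if (outToRemove) outToRemove->push_back(levId);
    }
}

// Phase 2 happens only after the outer txn has committed.
std::size_t runWriterOnce(FakeEnv &env, FakeSearchIndex &search,
                          const std::vector<uint64_t> &superseded) {
    std::vector<uint64_t> deferred;
    {
        FakeWriteTxn txn(env);
        writeBatch(txn, superseded, &deferred);
        txn.commit();
    }
    for (uint64_t levId : deferred) search.deleteEvent(levId);
    return deferred.size();
}
```

Calling `search.deleteEvent()` while a `FakeWriteTxn` is still open throws in this model, which is the analogue of the Writer thread blocking on itself.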

leesalminen · 2026-05-22T12:46:45Z

Patched build — 21 h prod soak

Almost a day in on the patched build. No new crashes, no re-wedges, writes flowing under steady production load:

Uptime: 21 h since the restart that loaded the patch. The only signal in the journal in that window is the SIGTERM from that restart itself — no SIGABRT, no SIGSEGV.
Writer activity: 531 [Writer] log lines over a 5-minute window. Pre-fix steady-state was ~3 lines / 90 s, so the wedge is decisively gone.

The exact operation that used to trigger the self-deadlock is now sailing through. Picked one out of the recent journal:

May 22 06:43:05 ... [Writer]INFO| Deleting event (d-tag). id=9953236155052e0d5269af5b1b89d4b2724c94982d6ffcab1704664344e6eeb1

End-to-end publish + search from an external client still works:

open 149 ms
OK accepted=true at 3036 ms
REQ {"search":"<unique-marker>"} sent
✓ FOUND OUR EVENT at 5252 ms (search query latency: 215 ms)

One observation worth flagging — not a regression from the patch, just something to keep an eye on: write OK ack latency was ~100 ms right after restart and climbed to ~2.9 s under sustained load + active indexer. Reads are unaffected (~200 ms). Could be writer-queue depth, the writevalidate.js policy plugin under load, or indexer-thread contention with the writer when both are doing real per-event work. Will report back if it gets worse.

dskvr · 2026-05-23T17:11:03Z

@leesalminen 2.9s isn't acceptable. I'm going to see what is adding latency to the write path; index writes are supposed to be non-blocking. I'll see if I can trace it down and either fix or optimize the implementation.

Many thanks for reporting back!

…hProvider::query

- Move per-token df computation before getTotalDocs/getAvgDocLen calls
- Return empty results immediately when any token has df==0 (AND semantics)
- Eliminates O(N) corpus-stat cursor walks for no-match queries
- Add dfs.reserve() and idfs.reserve() for minor vector allocation improvement
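
The reordering in this commit can be sketched on a toy in-memory index. This is an illustration under assumptions, not the LmdbSearchProvider code: `ToyIndex`, `totalDocs` and `queryAll` are hypothetical names, the corpus-stat walk is modelled as a counter, and BM25 scoring is left out so that only the df-first control flow remains.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical toy index: token -> ascending levIds.
struct ToyIndex {
    std::unordered_map<std::string, std::vector<uint64_t>> postings;
    std::size_t totalDocCount = 0;
    std::size_t corpusStatCalls = 0;  // counts the "expensive" stat lookups

    std::size_t df(const std::string &token) const {
        auto it = postings.find(token);
        return it == postings.end() ? 0 : it->second.size();
    }

    // Stands in for an O(N) cursor walk over document metadata.
    std::size_t totalDocs() {
        ++corpusStatCalls;
        return totalDocCount;
    }
};

// AND semantics: a document must contain every query token.
std::vector<uint64_t> queryAll(ToyIndex &idx, const std::vector<std::string> &tokens) {
    if (tokens.empty()) return {};

    // Step 1: per-token document frequency, before any corpus stats.
    std::vector<std::size_t> dfs;
    dfs.reserve(tokens.size());
    for (const auto &token : tokens) {
        std::size_t d = idx.df(token);
        if (d == 0) return {};  // one missing token means no document can match
        dfs.push_back(d);
    }

    // Step 2: only now pay for corpus statistics (a scorer would use n and dfs).
    std::size_t n = idx.totalDocs();
    (void)n;

    // Step 3: intersect the posting lists.
    std::vector<uint64_t> result = idx.postings.at(tokens[0]);
    for (std::size_t i = 1; i < tokens.size(); ++i) {
        const auto &next = idx.postings.at(tokens[i]);
        std::vector<uint64_t> merged;
        std::set_intersection(result.begin(), result.end(), next.begin(), next.end(),
                              std::back_inserter(merged));
        result.swap(merged);
    }
    return result;
}
```

A query containing an unknown token returns before `totalDocs()` is ever called, which is the behaviour the commit message describes for no-match queries.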

…ce in search indexing

Bug 1 (cmd_search_reindex): held an LMDB read txn open while indexEventWithTxnHook opened a write txn on the same thread, violating LMDB's one-txn-per-thread rule (MDB_NOTLS not set). The decoded std::string_view was also used after the write txn could invalidate the underlying LMDB page. Fix: wrap the read phase in a nested scope that closes before indexEventWithTxnHook is called; copy decoded JSON view to std::string inside that scope.

Bug 2a (runCatchupIndexer): same use-after-txn-close pattern: rtxn.commit() was called while the decoded std::string_view was still live, freeing LMDB-managed pages for uncompressed payloads. Fix: same owned-string copy pattern, rtxn closes inside its scope.

Bug 2b (KindMatcher lazy init): mutable bool kindMatcherInitialized guarded by a bare if-check was a data race between the Writer thread and the catch-up indexer thread on concurrent first-use. Fix: replace with std::once_flag + std::call_once().
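
The Bug 2b fix follows a standard C++ pattern that can be shown on its own. The sketch below is not strfry's KindMatcher: `LazyKindMatcher` and its spec format (a comma-separated list of kinds, or `*` for all) are assumptions made for illustration. The point is only that `std::call_once` runs the lazy parse exactly once, even on concurrent first use from a const method.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <set>
#include <string>
#include <utility>

// Hypothetical matcher with thread-safe lazy initialisation.
class LazyKindMatcher {
public:
    explicit LazyKindMatcher(std::string spec) : spec_(std::move(spec)) {}

    bool matches(int kind) const {
        // Replaces "if (!initialized) { parse(); initialized = true; }",
        // which races when two threads hit first use together.
        std::call_once(initFlag_, [this] { parse(); });
        return all_ || kinds_.count(kind) > 0;
    }

    int parseCount() const { return parseCount_.load(); }

private:
    void parse() const {
        ++parseCount_;
        if (spec_ == "*") { all_ = true; return; }
        std::size_t pos = 0;
        while (pos < spec_.size()) {
            std::size_t comma = spec_.find(',', pos);
            if (comma == std::string::npos) comma = spec_.size();
            if (comma > pos) kinds_.insert(std::stoi(spec_.substr(pos, comma - pos)));
            pos = comma + 1;
        }
    }

    std::string spec_;
    mutable std::once_flag initFlag_;
    mutable bool all_ = false;
    mutable std::set<int> kinds_;
    mutable std::atomic<int> parseCount_{0};  // only here to make the "once" visible
};
```

After `call_once` returns, every caller sees the fully initialised state, so the reads of `all_` and `kinds_` need no further locking as long as nothing mutates them afterwards.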

leesalminen · 2026-05-25T02:20:40Z

Pulled 13b36f5 onto prod (wss://no.str.cr, ~26M events, 76G LMDB) and ran targeted tests. Quick report:

Verified fixed

Bug 3 — no-match query latency (4ed4d5d df==0 early return):

"zxqvwbnmkjhfdsapoiuytrewq"   → eose  33ms
"qqqzzz999impossibletoken"    → eose  26ms
"xxxneverappearsanywhere2026" → eose  28ms
"aaabbbcccdddneverexisted"    → eose  26ms
"fakefakefakefakefaketoken"   → eose  27ms

Previously ranged 191ms → 20s+. Fast path is doing exactly what it should.

Bug 4 — writer wedge on d-tag deletes (5542a6f):

The journal shows Deleting event (d-tag). id=... lines processing live without stalling subsequent writes — that's the exact operation that used to self-deadlock the writer thread. Stress test of 10 concurrent publishes completed in 98ms (29–80ms per OK), all accepted. Writer is healthy under load.

Likely fixed (need full reindex to prove)

Bug 1 — free(): double free during search_reindex at LevId ~7.6M:

The 13b36f5 commit description is a clean match for the failure mode: a std::string_view into LMDB-managed memory used after the page could be invalidated by the write txn opened on the same thread. Owning the bytes before the write txn opens is the right fix.
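
As a rough illustration of "owning the bytes before the write txn opens", here is a sketch with stand-in types. `FakeStore`, `FakeReadTxn`, `readOwned` and `writeTxnOverwrite` are hypothetical and do not model LMDB page reuse faithfully; they only show the scoping: the view is copied into a `std::string` while the read txn is open, and the read txn is closed before any write happens.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical store: `page` plays the role of database-managed memory.
struct FakeStore {
    std::string page;
    int openReaders = 0;
};

// Hypothetical read txn: views into the store are only trusted while it is open.
class FakeReadTxn {
public:
    explicit FakeReadTxn(FakeStore &s) : store_(s) { ++store_.openReaders; }
    FakeReadTxn(const FakeReadTxn &) = delete;
    FakeReadTxn &operator=(const FakeReadTxn &) = delete;
    ~FakeReadTxn() { --store_.openReaders; }
    std::string_view get() const { return store_.page; }
private:
    FakeStore &store_;
};

// Read phase in its own scope: copy the view, then let the read txn close.
std::string readOwned(FakeStore &store) {
    std::string owned;
    {
        FakeReadTxn rtxn(store);
        std::string_view view = rtxn.get();
        owned.assign(view.data(), view.size());  // own the bytes
    }  // read txn closes here, before any write txn is opened
    return owned;
}

// Write phase: may rewrite the underlying memory, so no reader may be open.
void writeTxnOverwrite(FakeStore &store, const std::string &newBytes) {
    assert(store.openReaders == 0);
    store.page = newBytes;
}
```

Because `readOwned` returns a copy, the later overwrite cannot affect the JSON handed to the indexing step.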

I haven't re-run a full reindex yet to confirm (multi-hour on 26M events, and the tool only supports --restart from levId=1 — no --from N flag, which would make this much cheaper to validate). If you'd be open to it, a --from-levid option on search_reindex would let operators verify fixes against known-bad ranges without doing a full rebuild.

Bug 2 — intermittent crashes every 2–3 days:

Was previously seeing SIGABRT/SIGSEGV restarts roughly every 2–3 days even without reindex activity. The 13b36f5 fixes (same string_view lifetime issue in runCatchupIndexer, plus std::once_flag for KindMatcher) look like plausible root causes. I'll let this soak for a week and report back.

Other observations

Indexer is keeping up well in live mode — batches of 100–700 events processed in <20ms, 10s tick interval. Most events get skipped (outside indexed kinds), small indexed=N counts each tick. No backlog growth observed.

Publish → searchable latency: new event findable in search ~500ms after OK. Great.

Common-term query latency is mostly 200–310ms, with one outlier:

"bitcoin"   → first 11435ms / eose 11436ms
"nostr"     → first  301ms / eose  310ms
"hello"     → first  296ms / eose  297ms
"good"      → first  224ms / eose  224ms
"love"      → first  209ms / eose  234ms
"lightning" → first  215ms / eose  226ms

bitcoin is presumably a very high-df term with a massive inverted list. Worth checking whether the BM25 candidate ranking has an upper-bound on docs scored, or if there's room for a top-k early termination over the postings list. Not a regression from the previous build — just calling it out as the next perf hotspot worth looking at.

Parallel search: 20 concurrent queries, median eose 1.8s, max 2.7s — reasonable. No errors.

search_index_stats still hangs when relay is running — same as before, not a new issue, just a reminder.

Otherwise: clean startup, 0 restarts since deploy, no signals/aborts in journal. Will let it soak and report any regressions. Nice work on the fixes — the diagnosis on Bug 1 was sharp.

dskvr · 2026-05-25T10:18:59Z

bitcoin is presumably a very high-df term with a massive inverted list

It should be capped to maxPostingsPerToken

Worth checking whether the BM25 candidate ranking has an upper-bound on docs scored

@leesalminen Could you try tweaking the config values, primarily maxPostingsPerToken, drop it to 5000 and see if that helps.

        # Maximum number of postings (documents) per search token
        maxPostingsPerToken = 100000

        # Maximum candidate documents to fetch during search (multiple of limit)
        maxCandidateDocs = 1000

Notes:

The defaults provided for the search config are a bit high; they should be much lower.
Need to verify that the behavior is idempotent when maxPostingsPerToken is modified.

dskvr · 2026-05-25T10:23:17Z

If you'd be open to it, a --from-levid option on search_reindex

I'll add a --from-levid argument on search_reindex, very minor feature.

- Re-index from a specific levId without clearing the existing index
- Does not touch SearchState so normal resume continues to work
- Mutually exclusive with --restart
- Enables operators to validate fixes against known-bad ranges without a multi-hour full rebuild (PR hoytech#160 reviewer request)

dskvr · 2026-05-25T12:29:02Z

@leesalminen --from-levid is on feature/nip-50 (c9ba268):

./strfry search_reindex --from-levid=7500000

Doesn't clear the index, doesn't touch SearchState, your background catch-up keeps its place. Mutually exclusive with --restart. No --to-levid for now; ping if you want one.

Looking at the bitcoin 11.4s outlier next.

dskvr · 2026-05-25T12:59:48Z

@leesalminen pause the perf work; I found something bigger.

Probing the indexer: SearchIndex is plain MDB_DUPSORT but postings are stored as host-endian uint64. On x86_64 that means dup ordering is memcmp over little-endian bytes, which isn't numeric. Whenever a new levId crosses a 256/65536/16M boundary, MDB_APPENDDUP returns MDB_KEYEXIST and lmdbxx::dbi_put silently swallows it (returns false, doesn't throw; caller doesn't check).

Minimal probe with production packing, levIds 1..16777216:

APPENDDUP errors: 4 / 10
FIRST_DUP order: 1, 2, 254, 255, 65535, 16777215

So your SearchIndex is sparse, every token with postings spanning byte boundaries is missing matches. bitcoin's 11s is walking an incomplete, wrong-ordered posting list.
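
The ordering problem can be reproduced without LMDB. The sketch below is a model, not the probe that was actually run: `packLittleEndian` writes the [levId:48][tf:16] value least-significant byte first (what a host-endian store yields on x86_64), and `simulateAppendDup` stands in for MDB_APPENDDUP by rejecting any entry that does not memcmp-compare greater than the last accepted one. The nine levIds in the usage note are an illustrative choice.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

using Bytes = std::array<unsigned char, 8>;

// Pack [levId:48][tf:16] and serialise least-significant byte first.
Bytes packLittleEndian(uint64_t levId, uint16_t tf) {
    const uint64_t v = (levId << 16) | tf;
    Bytes out{};
    for (int i = 0; i < 8; ++i) out[i] = static_cast<unsigned char>(v >> (8 * i));
    return out;
}

struct AppendResult {
    std::vector<uint64_t> stored;  // levIds accepted, in insertion order
    std::size_t rejected = 0;      // inserts that would have hit MDB_KEYEXIST
};

// Model of append-only dup insertion under plain memcmp ordering:
// a new entry is accepted only if it compares greater than the last one.
AppendResult simulateAppendDup(const std::vector<uint64_t> &levIds) {
    AppendResult r;
    Bytes last{};
    bool haveLast = false;
    for (uint64_t levId : levIds) {
        const Bytes cur = packLittleEndian(levId, 1);
        if (haveLast && std::memcmp(cur.data(), last.data(), cur.size()) <= 0) {
            ++r.rejected;
            continue;
        }
        last = cur;
        haveLast = true;
        r.stored.push_back(levId);
    }
    return r;
}
```

Feeding it 1, 2, 254, 255, 256, 65535, 65536, 16777215, 16777216 rejects 256, 65536 and 16777216, and the survivors are 1, 2, 254, 255, 65535, 16777215, the same sequence as the FIRST_DUP order reported above.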

Plan:

Add MDB_INTEGERDUP | MDB_DUPFIXED (or big-endian pack), bump kIndexVersion so existing indexes are flagged stale.
Make dbi_put failures loud at the indexer.
Then revisit bitcoin latency on a correct index.

Hold off on a fresh reindex and on lowering maxPostingsPerToken, you'll need to rebuild once the fix lands. I'll ping back when it's on feature/nip-50.

Postings were packed as host-endian uint64 and stored in a plain MDB_DUPSORT DBI, so dup ordering used memcmp on little-endian bytes, which is not numeric. MDB_APPENDDUP returned MDB_KEYEXIST whenever a new levId's byte sequence was less than the prior stored entry (every 256/65536/16M boundary crossing), and lmdb::dbi_put silently swallows MDB_KEYEXIST. The caller didn't check the return value, so an unknown fraction of postings was silently dropped on every existing index. Empirical probe: 4/10 inserts dropped on monotonic levIds 1..16M.

This commit:

- packPosting/unpackPosting now use htobe64/be64toh so the stored bytes memcmp-compare in numeric order. APPENDDUP succeeds and postings are stored in levId-ascending order as the read and trim paths assume.
- kIndexVersion bumped 1 -> 2. The old format is incompatible with the new packing, so existing operators must run `search_reindex --restart` to rebuild.
- LmdbSearchProvider gains a lazy stale-detection guard (std::once_flag). If the on-disk indexVersion doesn't match kIndexVersion (or indexVersion==0 with leftover data from an interrupted pre-upgrade rebuild), staleIndex is set and indexEvent / indexEventWithTxnHook / deleteEvent / query / runCatchupIndexer / healthy() all short-circuit with a loud log instructing the operator to run search_reindex --restart.
- Both dbi_SearchIndex.put and dbi_SearchDocMeta.put callsites now check the return value and log via LE on unexpected MDB_KEYEXIST. With big-endian packing in place these should never fire; any log is a real bug signal.
- cmd_search_reindex gains an explicit stale guard: without --restart, on a stale or partially-rebuilt index, it errors out cleanly instead of silently no-op'ing through the provider's stale gate.
- cmd_search_reindex --restart now uses dbi.drop(txn, false) to empty SearchIndex and SearchDocMeta. The previous cursor-delete loop walked MDB_NEXT_NODUP and called cursor.del() with no flag, which only deleted one duplicate per key, leaving most DUPSORT data behind. This bug pre-dates the byte-order fix but was surfaced by the new stale-detection logic finding leftover data after a supposed clear.

dskvr · 2026-05-25T13:15:54Z

@leesalminen Schema fix is on feature/nip-50 (fba11d7).

Changes:

Postings now stored big-endian, so MDB_DUPSORT memcmp ordering matches numeric. MDB_APPENDDUP succeeds for every levId.
kIndexVersion bumped 1 to 2. Existing index is detected as stale.
Stale gate refuses to index, refuses to serve search, and logs a loud LE line telling you to run search_reindex --restart.
dbi_put return values now checked and logged on failure. No more silent drops.
Also fixed a latent bug in --restart: the clear loop was only deleting one dup per key. Now uses dbi.drop(txn, false).

Upgrade path on your relay:

Stop strfry, pull feature/nip-50, rebuild, restart.
You'll see Search index is stale (stored version 1, expected 2) in the log. Search returns empty, indexer paused, by design.
./strfry search_reindex --restart to rebuild in the new format.
Restart strfry. Done.

After the rebuild, please re-measure the bitcoin query. Your prior numbers were against incomplete data. The corpus stats (getTotalDocs / getAvgDocLen floor) and the BM25 candidate ranking are unchanged, so I expect the floor to look similar but the high-DF spike to look quite different on a complete posting list.

Covers the property that MDB_DUPSORT depends on without a custom comparator: memcmp on packed posting bytes must match numeric order on the underlying levId. Catches the silent posting-drop regression just fixed in fba11d7.

Four test layers:

- Pack/unpack roundtrip across adversarial levIds (boundary values spanning every 8-bit byte: 0, 254, 255, 256, 65535, 65536, 16M-1, 16M, 2^47) crossed with tfs in {1, 2, 65535}.
- memcmp ordering: sgn(memcmp(pack(a), pack(b))) == sgn(a-b) over every pair from the same boundary set.
- Tripwire constant: packPosting(0x1234, 0xABCD) bytes must equal 00 00 00 00 12 34 AB CD. Any change to the pack formula is loud.
- Live MDB_DUPSORT integration in a temp env via mkdtemp: MDB_APPENDDUP-insert for all boundary levIds, assert zero MDB_KEYEXIST returns, walk FIRST_DUP -> NEXT_DUP, assert numeric iteration order.

Smoke-tested by temporarily reverting packPosting to host-endian: test produced 321 failures including 8 live MDB_APPENDDUP rejections matching the original empirically-discovered bug.

Build target follows the existing test/SubIdTests.cpp pattern: `make test-search-posting`.
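
The first three test layers can be approximated without LMDB. This is a sketch, not the test file from the commit: `packPostingSketch` and `unpackPostingSketch` are hypothetical reimplementations that serialise big-endian by hand (rather than calling htobe64/be64toh) so the snippet stays portable, and the live MDB_DUPSORT layer is out of scope here.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

using PostingBytes = std::array<unsigned char, 8>;

// Pack [levId:48][tf:16], most-significant byte first.
PostingBytes packPostingSketch(uint64_t levId, uint16_t tf) {
    const uint64_t v = (levId << 16) | tf;
    PostingBytes out{};
    for (int i = 0; i < 8; ++i) out[i] = static_cast<unsigned char>(v >> (8 * (7 - i)));
    return out;
}

void unpackPostingSketch(const PostingBytes &in, uint64_t &levId, uint16_t &tf) {
    uint64_t v = 0;
    for (unsigned char byte : in) v = (v << 8) | byte;
    levId = v >> 16;
    tf = static_cast<uint16_t>(v & 0xFFFF);
}

int sgn(int x) { return (x > 0) - (x < 0); }

// Property: byte-wise comparison agrees with numeric comparison of levIds.
bool memcmpMatchesNumericOrder(const std::vector<uint64_t> &levIds) {
    for (uint64_t a : levIds) {
        for (uint64_t b : levIds) {
            const PostingBytes pa = packPostingSketch(a, 1);
            const PostingBytes pb = packPostingSketch(b, 1);
            const int byBytes = sgn(std::memcmp(pa.data(), pb.data(), pa.size()));
            const int byValue = (a > b) - (a < b);
            if (byBytes != byValue) return false;
        }
    }
    return true;
}

// Boundary values named in the commit message.
std::vector<uint64_t> boundaryLevIds() {
    return {0, 254, 255, 256, 65535, 65536, 16777215, 16777216, 1ULL << 47};
}
```

With this packing, `packPostingSketch(0x1234, 0xABCD)` yields the bytes 00 00 00 00 12 34 AB CD, matching the tripwire constant quoted above.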

… as exit 1

Adds test/searchReindexTest.pl covering three CLI-level regressions surfaced during PR hoytech#160 testing:

1. Stale-index detection: search_reindex without --restart must refuse to operate on a v1 stored index, exit non-zero, and instruct the operator to re-run with --restart. The test plants a v1 indexVersion via `search_set_state --index-version=1 --allow-lower` and asserts the error message and exit code.
2. --restart idempotency: running search_reindex --restart twice in a row must index the same number of events. The latent cursor-delete bug (only deleting one dup per key, fixed in fba11d7) would cause the second pass to differ. Catches future regressions in the clear path.
3. --from-levid does not touch SearchState: build a v2 index, plant a sentinel lastIndexedLevId via search_set_state, run search_reindex --from-levid=1, then verify the sentinel survives by re-running search_set_state and asserting on the "previous:" echo. Smoke-tested by toggling persistState=true in the partial branch -- the test catches the regression as expected.

Also converts cmd_search_reindex's error paths (search disabled, wrong backend, --restart/--from-levid mutex, stale index, in-progress mismatch, --from-levid<1) from `std::cerr; return` to `throw herr(...)`. main() catches std::exception, prints, and exits 1, so this is required for scripting/tests to detect failures. Matches the existing convention in cmd_search_set_state.

Test config added at test/cfgs/searchTest.conf (search enabled, indexedKinds = "*"). Mirrors test/writeTest.pl style: nostril-signed kind-1 events imported via `./strfry import`, no WebSocket harness.

Run with: perl test/searchReindexTest.pl
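
Why the switch from `std::cerr; return` to throwing matters for scripts can be shown with a small sketch. It uses `std::runtime_error` in place of strfry's `herr(...)`, and `ReindexArgs`, `validateReindexArgs` and `runCommand` are hypothetical names; only the two checks (the --restart/--from-levid mutex and --from-levid<1) are taken from the commit message. Requires C++17 for `std::optional`.

```cpp
#include <cassert>
#include <cstdint>
#include <exception>
#include <iostream>
#include <optional>
#include <stdexcept>

// Hypothetical parsed arguments for a reindex command.
struct ReindexArgs {
    bool restart = false;
    std::optional<int64_t> fromLevId;
};

// Error paths throw instead of printing and returning, so the caller
// cannot mistake a refused run for a successful one.
void validateReindexArgs(const ReindexArgs &args) {
    if (args.restart && args.fromLevId) {
        throw std::runtime_error("--restart and --from-levid are mutually exclusive");
    }
    if (args.fromLevId && *args.fromLevId < 1) {
        throw std::runtime_error("--from-levid must be >= 1");
    }
}

// Stand-in for the top-level handler: any std::exception becomes exit code 1.
int runCommand(const ReindexArgs &args) {
    try {
        validateReindexArgs(args);
    } catch (const std::exception &e) {
        std::cerr << "error: " << e.what() << "\n";
        return 1;
    }
    return 0;
}
```

With a plain `return` after printing, the process would still exit 0 and a test harness checking the exit code could not tell a refusal from success.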

dskvr marked this pull request as ready for review November 12, 2025 14:18

dskvr force-pushed the feature/nip-50 branch from a2d41cc to 0aa429c Compare April 27, 2026 22:58

dskvr marked this pull request as draft April 27, 2026 23:02

dskvr force-pushed the feature/nip-50 branch from 0aa429c to a2d41cc Compare April 27, 2026 23:04

dskvr added 7 commits April 28, 2026 01:14

dskvr force-pushed the feature/nip-50 branch from a2d41cc to c67cd48 Compare April 27, 2026 23:25

dskvr marked this pull request as ready for review April 27, 2026 23:36

Merge branch 'master' into feature/nip-50

c04ed93

dskvr added 4 commits May 24, 2026 11:50

bugfix: wait for transaction close before deleting from index

5542a6f

chore: merge quick task worktree (worktree-agent-a49b022c0d9fe433b)

24cbec8

dskvr added 2 commits May 25, 2026 15:26

dskvr marked this pull request as draft May 26, 2026 15:59


leesalminen commented May 21, 2026

Testing report from production relay (~26M events, 52 GiB LMDB, 2 weeks uptime)

✅ What works well

🐛 Bug 1 — search_reindex --restart aborts on bad LevId

🐛 Bug 2 — intermittent crashes during long-term running

🐛 Bug 3 — wildly inconsistent no-match query latency

🐛 Bug 4 — search indexer wedges the Writer thread after catch-up

Misc observations / doc asks


leesalminen commented May 21, 2026

Bug 4 fix — verified working on prod
