feat(tantivy): add Tantivy full-text global index via Rust FFI#346
feat(tantivy): add Tantivy full-text global index via Rust FFI#346spaces-X wants to merge 21 commits into
Conversation
aa9c415 to
c63e4b7
Compare
3f85996 to
2741953
Compare
Add a Rust tantivy-based FTS global index as a second backend alongside Lucene, wired into CMake via cbindgen + Corrosion, with 10 functional unit tests.
…tests Cross-read tests for tantivy archives shared between paimon-java and paimon-cpp, using fixtures from paimon-java's TantivyIndexFixtureGen and covering both directions.
Companion infra for the tantivy-fts integration (no production logic): devcontainer, CI workflows, sanitizer flags, and cross-platform build fixes.
Fix io_meta being null on the reader path and the jieba dictionary directory not being set when constructing the tantivy index.
Install the log bridge once on first reader Create so Rust log records surface through glog in production binaries, not only in unit tests.
…ctor for unscored search Replace the DocSetCollector + HashSet + per-doc fast-field path with a RowIdCollector that opens the row_id column once per segment and reads it inline.
Repurpose Path B as a true unscored LIMIT N: LimitedDocSetCollector stops collecting past N via a shared atomic, skipping BM25 scoring entirely.
Add an optional min_score applied after scoring but before sort/truncate, letting FE push `score() > X` down through the FFI into the tantivy engine.
Adapt to base AddBatch gaining relative_row_ids and GlobalIndexIOMeta dropping range_end, mirroring lucene; update the 8 affected tantivy test files.
setup_rust.sh pins rustc 1.88.0 (min required by the transitive time crate); build_paimon.sh turns off PAIMON_ENABLE_TANTIVY on the gcc-8 image (no Rust there), mirroring the existing LUMINA/LANCE handling.
Expand the abbreviated 'Licensed under the Apache License, Version 2.0.' line to the full Apache 2.0 boilerplate so the RAT license check recognizes it.
…nt, codespell) Apply clang-format/cmake-format; fix cpplint (functional char-casts -> static_cast, int64_t/PRId64 instead of long, NOLINT for the cbindgen-generated header include) and a codespell typo.
testharness.cpp includes <gtest/gtest.h> but the objlib compile step has no ordering dependency on the googletest ExternalProject, so it can race ahead of header extraction (flaky 'gtest/gtest.h: No such file' in Release). Add an explicit add_dependencies.
Address review feedback (avoid Chinese in code): translate comments/docs in CMake, the smoke script, the Rust FFI crate (Cargo.toml / cbindgen.toml / build.rs / tokenizer.rs / callback_directory.rs) and the test-fixture READMEs. Chinese is kept only in tokenizer test data (the jieba CJK tokenization inputs/expectations).
…eaders Move the kJiebaDictDirEnv lookup into a single GetJiebaDictionaryDirFromEnv helper in tantivy_defs.h so the writer and reader stop defining their own copies; each call site keeps its own missing-dir policy. Expand the remaining short-form Apache headers in the tantivy sources, devcontainer files and CorrosionFetch.cmake to the full boilerplate.
71f1670 to
8aa3179
Compare
|
Thank you very much for this contribution. Adding support for Tantivy is a very meaningful improvement and will help us better support full-text search scenarios. While since this PR is quite large, we’ll need some time to review it. Thanks for your understanding and patience. |
Thanks for taking the time to review! |
- default PAIMON_ENABLE_TANTIVY off (CI still builds it on) - add with_score/min_score to FullTextSearch ctor - ParseArchiveHeader -> ArchiveLayout::Parse; validate via InRange - bulk-insert row_ids with AddMany on the unscored read path - lucene: return NotImplemented when min_score is set - writer stream buffer 64KB -> 1MB (heap-allocated) - tests use ASSERT for guard checks; drop process comments and [BUG_*] markers
| for (int64_t i = 0; i < array->length(); ++i) relative_row_ids[i] = i; | ||
| EXPECT_TRUE(writer_res.value()->AddBatch(&c_array, std::move(relative_row_ids)).ok()); | ||
| auto metas_res = writer_res.value()->Finish(); | ||
| EXPECT_TRUE(metas_res.ok()) << metas_res.status().ToString(); |
There was a problem hiding this comment.
Please prefer the test macros from testharness.h in tests, such as EXPECT_OK / ASSERT_OK and EXPECT_OK_AND_ASSIGN / ASSERT_OK_AND_ASSIGN.
|
This PR looks very solid overall, and the tests are also quite thorough. Bridging Rust and C++ is not easy, so thank you for the contribution. Once the formatting issues mentioned above are addressed, we’ll take another look and continue a quick review. |
- fix StreamCtx double-free: Rust owns ctx on entry to paimon_tantivy_reader_new_streaming and releases it on every failure path; C++ no longer releases on failure. Add a Rust regression test. - tests: prefer ASSERT over EXPECT in test bodies (keep EXPECT in value-returning helpers / thread lambdas where ASSERT is invalid) - drop test-only setenv: GetJiebaDictionaryDirFromEnv falls back to the JIEBA_TEST_DICT_DIR macro (moved to tantivy_defs.cpp, single TU) - remove leftover AI/process comments from the test files - brace single-line for-loops; use testharness OK macros in equivalence test
| std::string json = "["; | ||
| for (int i = 0; i < kDocCount; ++i) { | ||
| json += "[\""; | ||
| int n = word_count(rng); |
There was a problem hiding this comment.
Could you please prefer int32_t over int in paimon-cpp.
| const std::string& w = vocab[word_pick(rng)]; | ||
| auto r = lreader->VisitFullTextSearch(std::make_shared<FullTextSearch>( | ||
| "f0", std::nullopt, w, FullTextSearch::SearchType::MATCH_ALL, std::nullopt)); | ||
| ASSERT_TRUE(r.ok()); |
| auto r = treader->VisitFullTextSearch(std::make_shared<FullTextSearch>( | ||
| "f0", std::nullopt, w, FullTextSearch::SearchType::MATCH_ALL, std::nullopt)); | ||
| ASSERT_TRUE(r.ok()); | ||
| } |
| fts->with_score = true; // v0.2: explicit score opt-in | ||
| auto res = reader->VisitFullTextSearch(fts); | ||
| ASSERT_TRUE(res.ok()) << res.status().ToString(); | ||
| auto scored = std::dynamic_pointer_cast<BitmapScoredGlobalIndexResult>(res.value()); |
There was a problem hiding this comment.
Please prefer ASSERT_OK_AND_ASSIGN.
| auto b = plain->GetBitmap(); | ||
| EXPECT_OK(b.status()) << b.status().ToString(); | ||
| if (b.ok()) bitmap = b.value(); | ||
| } else if (auto scored = std::dynamic_pointer_cast<BitmapScoredGlobalIndexResult>(r)) { |
There was a problem hiding this comment.
Prefer EXPECT_OK_AND_ASSIGN rather than EXPECT_OK(b.status()) and use {} for single line if and for.
There was a problem hiding this comment.
Please help go through the codebase and fix similar issues as well.
| std::cerr << " [" << i << "] " << layout.names[i] << " offset=" << layout.offsets[i] | ||
| << " length=" << layout.lengths[i] << "\n"; | ||
| } | ||
|
|
There was a problem hiding this comment.
Could you please remove debug cerr?
| } | ||
|
|
||
| TEST_F(TantivyReaderTest, ChineseQueryMode) { | ||
| auto array = arrow::ipc::internal::json::ArrayFromJSON(DataType(), R"([ |
There was a problem hiding this comment.
I want to confirm whether reader_test and index_test are redundant. It seems that index_test already provides more complete coverage. If so, could we remove reader_test?
| } | ||
|
|
||
| namespace paimon::tantivy { | ||
|
|
There was a problem hiding this comment.
For tests, please use the namespace paimon::tantivy::test.
| auto r = ArchiveLayout::Parse(&in); | ||
| ASSERT_FALSE(r.ok()); | ||
| ASSERT_NE(r.status().message().find("bad file_count"), std::string::npos) | ||
| << r.status().ToString(); |
| // ========================================================================= | ||
|
|
||
| TEST_F(StreamingTestFixture, StreamingBenchmarkLog) { | ||
| auto rss_kb = []() { |
There was a problem hiding this comment.
Is this case really necessary? TantivyEquivalenceTest already includes BenchmarkBuildAndQuery.
| * `hmm` mode is tested separately: FFI must return Unsupported. | ||
| */ | ||
|
|
||
| #include <algorithm> |
There was a problem hiding this comment.
I understand that cppjieba and jieba-rs may produce different token sequences, so we do not need to require parity between them.
However, could we make the expected output explicit for each input instead? As it stands, these tests look like regular assertions, but they are effectively reporting diffs to stderr and always succeeding. That means if the number of differences increases or decreases, CI will still pass and the diff output may be easy to miss in the logs.
A more robust approach might be to keep the cppjieba-vs-jieba-rs diff as a separate advisory report, while making the unit test assert stable expected outputs for jieba-rs on curated inputs.
| if (files.size() != 1) { | ||
| return Status::Invalid("tantivy index only has one index file per shard, now num: {}", | ||
| files.size()); | ||
| } |
There was a problem hiding this comment.
Bug: Status::Invalid() does not perform fmt::format-style placeholder substitution internally — the literal {} will appear in the error message at runtime instead of the actual files.size() value.
Suggested fix:
return Status::Invalid(fmt::format("tantivy index only has one index file per shard, now num: {}",
files.size()));| namespace { | ||
|
|
||
| /// Level mapping matches Rust side (0=trace..4=error). | ||
| extern "C" void PaimonTantivyLogAdapter(int32_t level, const char* msg, std::size_t len) { |
There was a problem hiding this comment.
Nit: extern "C" function inside an anonymous namespace has contradictory linkage — anonymous namespace gives internal linkage while extern "C" implies external linkage. Some compilers (e.g. clang with -Wpedantic) will emit a warning.
Consider moving PaimonTantivyLogAdapter out of the anonymous namespace and marking it static instead, or placing it directly in namespace paimon::tantivy (since it is only referenced via function pointer, symbol visibility is not a concern).
|
|
||
| FROM ubuntu:24.04 | ||
|
|
||
| # Switch apt to Aliyun mirror for faster downloads (covers both |
There was a problem hiding this comment.
Hardcoded Aliyun/USTC mirrors will break or slow down builds for contributors outside mainland China. Consider parameterizing via ARG so the mirror URL can be overridden at build time without editing the Dockerfile:
ARG APT_MIRROR=http://archive.ubuntu.com/ubuntu
RUN sed -i "s|http://archive.ubuntu.com/ubuntu|${APT_MIRROR}|g" ...Alternatively, move the mirror setup into a separate optional script.
| // This is a reportable baseline, NOT a perf gate — assertions only check | ||
| // semantic correctness (each query returns >= 0 docs without erroring). | ||
| constexpr int kDocCount = 200; | ||
| constexpr int kQueryCount = 100; |
There was a problem hiding this comment.
Style: per docs/code-style.md, fixed-width integer types should be used instead of plain int. This applies to test files as well.
// Current:
constexpr int kDocCount = 200;
constexpr int kQueryCount = 100;
for (int i = 0; i < kDocCount; ++i) {
// Suggested:
constexpr int32_t kDocCount = 200;
constexpr int32_t kQueryCount = 100;
for (int32_t i = 0; i < kDocCount; ++i) {Same issue exists in tantivy_ffi_test.cpp, tantivy_lucene_coexist_test.cpp, and tantivy_streaming_test.cpp.
Purpose
Add an experimental
tantivy-fulltextglobal index backend alongsidelucene-fts.This change:
tantivy/jieba-rsFFI crate into CMake via Corrosion and cbindgenlimit,pre_filter, BM25 score opt-in, andmin_scorefilteringTests
Added / covered by:
cargo test --manifest-path third_party/tantivy_ffi/Cargo.tomlpaimon-tantivy-smoke-testpaimon-tantivy-ffi-testpaimon-tantivy-tokenizer-testpaimon-tantivy-writer-testpaimon-tantivy-reader-testpaimon-tantivy-filter-limit-testpaimon-tantivy-index-testpaimon-tantivy-streaming-testpaimon-tantivy-java-compat-testpaimon-tantivy-lucene-coexist-testpaimon-tantivy-equivalence-testpaimon-global-index-testAPI and Format
Yes.
API:
include/paimon/predicate/full_text_search.haddswith_scoreandmin_score.limitis now a truncation switch and no longer implies BM25 score computation.ReplacePreFilterpreserves scoring-related flags.Format:
tantivy-fulltextpacked archive format compatible with paimon-java Tantivy archives.lucene-ftsstorage format is not changed.Protocol:
Documentation
Yes, this introduces a new experimental
tantivy-fulltextglobal index feature.This patch includes fixture READMEs and smoke-test script usage, but no separate user-facing documentation page.
Generative AI tooling
Generated-by: Codex (GPT-5) and Claude Opus 4.8