feat(tantivy): add Tantivy full-text global index via Rust FFI by spaces-X · Pull Request #346 · alibaba/paimon-cpp

spaces-X · 2026-06-08T06:04:43Z

Purpose

Add an experimental tantivy-fulltext global index backend alongside lucene-fts.

This change:

- wires a Rust tantivy / jieba-rs FFI crate into CMake via Corrosion and cbindgen
- adds C++ Tantivy global index writer, reader, factory registration, archive parsing, streaming I/O callbacks, and Rust log bridging
- supports full-text search query types with limit, pre_filter, BM25 score opt-in, and min_score filtering
- adds Java <-> C++ Tantivy archive compatibility fixtures and cross-read coverage
- adds CI/devcontainer Rust setup and a targeted Tantivy smoke test script

Tests

Added / covered by:

- cargo test --manifest-path third_party/tantivy_ffi/Cargo.toml
- paimon-tantivy-smoke-test
- paimon-tantivy-ffi-test
- paimon-tantivy-tokenizer-test
- paimon-tantivy-writer-test
- paimon-tantivy-reader-test
- paimon-tantivy-filter-limit-test
- paimon-tantivy-index-test
- paimon-tantivy-streaming-test
- paimon-tantivy-java-compat-test
- paimon-tantivy-lucene-coexist-test
- paimon-tantivy-equivalence-test
- paimon-global-index-test

API and Format

Yes.

API:

- include/paimon/predicate/full_text_search.h adds with_score and min_score.
- limit is now a truncation switch and no longer implies BM25 score computation.
- ReplacePreFilter preserves scoring-related flags.
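
The way limit, with_score, and min_score interact can be sketched roughly as below. This is only an illustration, not the code in this PR: `Hit`, `SearchOptionsSketch`, and `ApplyOptionsSketch` are hypothetical names. The strict `>` comparison follows the `score() > X` wording used in the commit messages, and the sketch assumes min_score is only honored when with_score is set; neither detail is confirmed by this description.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical types for illustration; not the paimon-cpp API.
struct Hit {
    int64_t row_id;
    float score;  // only meaningful when scoring was requested
};

struct SearchOptionsSketch {
    std::optional<int32_t> limit;    // truncation only, does not imply scoring
    bool with_score = false;         // explicit BM25 opt-in
    std::optional<float> min_score;  // threshold applied to scored hits (assumed strict >)
};

// Order of operations: score filter, then sort by score, then truncate.
std::vector<Hit> ApplyOptionsSketch(std::vector<Hit> hits, const SearchOptionsSketch& opts) {
    if (opts.with_score) {
        if (opts.min_score) {
            const float threshold = *opts.min_score;
            hits.erase(std::remove_if(hits.begin(), hits.end(),
                                      [threshold](const Hit& h) { return !(h.score > threshold); }),
                       hits.end());
        }
        // Highest score first; stable so ties keep their input order.
        std::stable_sort(hits.begin(), hits.end(),
                         [](const Hit& a, const Hit& b) { return a.score > b.score; });
    }
    if (opts.limit && *opts.limit >= 0 &&
        hits.size() > static_cast<std::size_t>(*opts.limit)) {
        hits.resize(static_cast<std::size_t>(*opts.limit));
    }
    return hits;
}
```

With with_score left false, the hits are truncated in their original order and no score is consulted, which matches the "truncation switch" wording above.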

Format:

- Adds a new tantivy-fulltext packed archive format compatible with paimon-java Tantivy archives.
- Existing lucene-fts storage format is not changed.

Protocol:

No external protocol change. The new Rust/C FFI boundary is internal to the Tantivy backend.

Documentation

Yes, this introduces a new experimental tantivy-fulltext global index feature.

This patch includes fixture READMEs and smoke-test script usage, but no separate user-facing documentation page.

Generative AI tooling

Generated-by: Codex (GPT-5) and Claude Opus 4.8

CLAassistant · 2026-06-08T06:04:51Z

All committers have signed the CLA.

testharness.cpp includes <gtest/gtest.h> but the objlib compile step has no ordering dependency on the googletest ExternalProject, so it can race ahead of header extraction (flaky 'gtest/gtest.h: No such file' in Release). Add an explicit add_dependencies.

Address review feedback (avoid Chinese in code): translate comments/docs in CMake, the smoke script, the Rust FFI crate (Cargo.toml / cbindgen.toml / build.rs / tokenizer.rs / callback_directory.rs) and the test-fixture READMEs. Chinese is kept only in tokenizer test data (the jieba CJK tokenization inputs/expectations).

…eaders Move the kJiebaDictDirEnv lookup into a single GetJiebaDictionaryDirFromEnv helper in tantivy_defs.h so the writer and reader stop defining their own copies; each call site keeps its own missing-dir policy. Expand the remaining short-form Apache headers in the tantivy sources, devcontainer files and CorrosionFetch.cmake to the full boilerplate.

lxy-9602 · 2026-06-11T05:18:48Z

Thank you very much for this contribution. Adding support for Tantivy is a very meaningful improvement and will help us better support full-text search scenarios. Since this PR is quite large, we'll need some time to review it. Thanks for your understanding and patience.

spaces-X · 2026-06-11T07:29:24Z

> Thank you very much for this contribution. Adding support for Tantivy is a very meaningful improvement and will help us better support full-text search scenarios. Since this PR is quite large, we'll need some time to review it. Thanks for your understanding and patience.

Thanks for taking the time to review!
I completely understand it's a large change and will take time to review. Please don't hesitate to let me know if you have any questions on specific parts - happy to help make the review easier.

- default PAIMON_ENABLE_TANTIVY off (CI still builds it on)
- add with_score/min_score to FullTextSearch ctor
- ParseArchiveHeader -> ArchiveLayout::Parse; validate via InRange
- bulk-insert row_ids with AddMany on the unscored read path
- lucene: return NotImplemented when min_score is set
- writer stream buffer 64KB -> 1MB (heap-allocated)
- tests use ASSERT for guard checks; drop process comments and [BUG_*] markers

lxy-9602 · 2026-06-13T03:35:55Z

+        for (int64_t i = 0; i < array->length(); ++i) relative_row_ids[i] = i;
+        EXPECT_TRUE(writer_res.value()->AddBatch(&c_array, std::move(relative_row_ids)).ok());
+        auto metas_res = writer_res.value()->Finish();
+        EXPECT_TRUE(metas_res.ok()) << metas_res.status().ToString();


Please prefer the test macros from testharness.h in tests, such as EXPECT_OK / ASSERT_OK and EXPECT_OK_AND_ASSIGN / ASSERT_OK_AND_ASSIGN.

lxy-9602 · 2026-06-13T03:40:06Z

This PR looks very solid overall, and the tests are also quite thorough. Bridging Rust and C++ is not easy, so thank you for the contribution. Once the formatting issues mentioned above are addressed, we’ll take another look and continue a quick review.

- fix StreamCtx double-free: Rust owns ctx on entry to paimon_tantivy_reader_new_streaming and releases it on every failure path; C++ no longer releases on failure. Add a Rust regression test.
- tests: prefer ASSERT over EXPECT in test bodies (keep EXPECT in value-returning helpers / thread lambdas where ASSERT is invalid)
- drop test-only setenv: GetJiebaDictionaryDirFromEnv falls back to the JIEBA_TEST_DICT_DIR macro (moved to tantivy_defs.cpp, single TU)
- remove leftover AI/process comments from the test files
- brace single-line for-loops; use testharness OK macros in equivalence test
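
The ownership rule behind the StreamCtx fix (the callee owns the context from the moment it is called, so the caller must not free it on failure) can be sketched in plain C++. Everything here is a stand-in: `StreamCtxSketch`, `CalleeTakesOwnershipSketch`, and the other names are hypothetical, and a C++ function plays the role of the Rust FFI entry point, whose real signature is not shown in this thread.

```cpp
#include <cstdint>
#include <memory>

// Hypothetical context that counts destructions so a double free or a leak is observable.
struct StreamCtxSketch {
    explicit StreamCtxSketch(int32_t* counter) : destroyed(counter) {}
    ~StreamCtxSketch() { ++(*destroyed); }
    int32_t* destroyed;
};

// Stand-in for the FFI callee: it owns ctx on entry and frees it on every
// failure path. Returns nullptr on failure, otherwise a handle that keeps ctx.
void* CalleeTakesOwnershipSketch(StreamCtxSketch* ctx, bool fail) {
    std::unique_ptr<StreamCtxSketch> owned(ctx);
    if (fail) {
        return nullptr;  // owned goes out of scope and frees ctx exactly once
    }
    return owned.release();  // in this sketch the handle is simply the ctx
}

// Stand-in for the matching "free reader" call.
void ReleaseHandleSketch(void* handle) { delete static_cast<StreamCtxSketch*>(handle); }

// Caller side: release() hands ownership over before the call, and the
// failure branch deliberately does not delete anything.
bool OpenReaderSketch(int32_t* destroyed, bool fail) {
    auto ctx = std::make_unique<StreamCtxSketch>(destroyed);
    void* handle = CalleeTakesOwnershipSketch(ctx.release(), fail);
    if (handle == nullptr) {
        return false;
    }
    ReleaseHandleSketch(handle);
    return true;
}
```

In both the success and the failure path the destruction counter ends at exactly one, which is the invariant a regression test for this kind of bug would typically check.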

lxy-9602 · 2026-06-15T03:59:53Z

+    std::string json = "[";
+    for (int i = 0; i < kDocCount; ++i) {
+        json += "[\"";
+        int n = word_count(rng);


Could you please prefer int32_t over int in paimon-cpp?

lxy-9602 · 2026-06-15T04:01:07Z

+            const std::string& w = vocab[word_pick(rng)];
+            auto r = lreader->VisitFullTextSearch(std::make_shared<FullTextSearch>(
+                "f0", std::nullopt, w, FullTextSearch::SearchType::MATCH_ALL, std::nullopt));
+            ASSERT_TRUE(r.ok());


lxy-9602 · 2026-06-15T04:01:20Z

+            auto r = treader->VisitFullTextSearch(std::make_shared<FullTextSearch>(
+                "f0", std::nullopt, w, FullTextSearch::SearchType::MATCH_ALL, std::nullopt));
+            ASSERT_TRUE(r.ok());
+        }


lxy-9602 · 2026-06-15T05:35:18Z

+    fts->with_score = true;  // v0.2: explicit score opt-in
+    auto res = reader->VisitFullTextSearch(fts);
+    ASSERT_TRUE(res.ok()) << res.status().ToString();
+    auto scored = std::dynamic_pointer_cast<BitmapScoredGlobalIndexResult>(res.value());


Please prefer ASSERT_OK_AND_ASSIGN.

lxy-9602 · 2026-06-15T05:48:22Z

+            auto b = plain->GetBitmap();
+            EXPECT_OK(b.status()) << b.status().ToString();
+            if (b.ok()) bitmap = b.value();
+        } else if (auto scored = std::dynamic_pointer_cast<BitmapScoredGlobalIndexResult>(r)) {


Prefer EXPECT_OK_AND_ASSIGN rather than EXPECT_OK(b.status()) and use {} for single line if and for.

Please help go through the codebase and fix similar issues as well.

lxy-9602 · 2026-06-15T05:52:09Z

+        std::cerr << "  [" << i << "] " << layout.names[i] << "  offset=" << layout.offsets[i]
+                  << "  length=" << layout.lengths[i] << "\n";
+    }
+


Could you please remove debug cerr?

lxy-9602 · 2026-06-15T06:27:18Z

+}
+
+TEST_F(TantivyReaderTest, ChineseQueryMode) {
+    auto array = arrow::ipc::internal::json::ArrayFromJSON(DataType(), R"([


I want to confirm whether reader_test and index_test are redundant. It seems that index_test already provides more complete coverage. If so, could we remove reader_test?

lxy-9602 · 2026-06-15T06:32:43Z

+}
+
+namespace paimon::tantivy {
+


For tests, please use the namespace paimon::tantivy::test.

lxy-9602 · 2026-06-15T06:35:04Z

+    auto r = ArchiveLayout::Parse(&in);
+    ASSERT_FALSE(r.ok());
+    ASSERT_NE(r.status().message().find("bad file_count"), std::string::npos)
+        << r.status().ToString();


Please use ASSERT_NOK_WITH_MSG here.

lxy-9602 · 2026-06-15T06:41:31Z

+// =========================================================================
+
+TEST_F(StreamingTestFixture, StreamingBenchmarkLog) {
+    auto rss_kb = []() {


Is this case really necessary? TantivyEquivalenceTest already includes BenchmarkBuildAndQuery.

lxy-9602 · 2026-06-15T06:57:04Z

+ * `hmm` mode is tested separately: FFI must return Unsupported.
+ */
+
+#include <algorithm>


I understand that cppjieba and jieba-rs may produce different token sequences, so we do not need to require parity between them.

However, could we make the expected output explicit for each input instead? As it stands, these tests look like regular assertions, but they are effectively reporting diffs to stderr and always succeeding. That means if the number of differences increases or decreases, CI will still pass and the diff output may be easy to miss in the logs.

A more robust approach might be to keep the cppjieba-vs-jieba-rs diff as a separate advisory report, while making the unit test assert stable expected outputs for jieba-rs on curated inputs.

zjw1111 · 2026-06-16T05:31:35Z

+    if (files.size() != 1) {
+        return Status::Invalid("tantivy index only has one index file per shard, now num: {}",
+                               files.size());
+    }


Bug: Status::Invalid() does not perform fmt::format-style placeholder substitution internally — the literal {} will appear in the error message at runtime instead of the actual files.size() value.

Suggested fix:

return Status::Invalid(fmt::format("tantivy index only has one index file per shard, now num: {}", files.size()));

zjw1111 · 2026-06-16T05:31:35Z

+namespace {
+
+/// Level mapping matches Rust side (0=trace..4=error).
+extern "C" void PaimonTantivyLogAdapter(int32_t level, const char* msg, std::size_t len) {


Nit: extern "C" function inside an anonymous namespace has contradictory linkage — anonymous namespace gives internal linkage while extern "C" implies external linkage. Some compilers (e.g. clang with -Wpedantic) will emit a warning.

Consider moving PaimonTantivyLogAdapter out of the anonymous namespace and marking it static instead, or placing it directly in namespace paimon::tantivy (since it is only referenced via function pointer, symbol visibility is not a concern).
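
A minimal sketch of the second option (an extern "C" adapter defined in a named namespace and handed out only as a function pointer) is shown below. All names are hypothetical and the adapter merely records the last message instead of forwarding to glog; the parameter list mirrors the one in the quoted hunk.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

extern "C" {
// Callback type with C language linkage, as an FFI layer would expect.
using LogCallbackSketch = void (*)(int32_t level, const char* msg, std::size_t len);
}

namespace sketch {

// Holds the most recent record so the effect of the callback is observable.
struct LastRecord {
    int32_t level = -1;
    std::string text;
};

inline LastRecord& LastRecordSketch() {
    static LastRecord record;
    return record;
}

// Defined in a named namespace rather than an anonymous one, so the
// declaration does not combine extern "C" with internal linkage.
extern "C" void LogAdapterSketch(int32_t level, const char* msg, std::size_t len) {
    LastRecord& record = LastRecordSketch();
    record.level = level;
    record.text.assign(msg, len);  // msg is not assumed to be NUL-terminated
}

// The adapter is only ever exposed through this pointer.
inline LogCallbackSketch GetLogCallbackSketch() { return &LogAdapterSketch; }

}  // namespace sketch
```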

zjw1111 · 2026-06-16T05:31:35Z


 FROM ubuntu:24.04

+# Switch apt to Aliyun mirror for faster downloads (covers both


Hardcoded Aliyun/USTC mirrors will break or slow down builds for contributors outside mainland China. Consider parameterizing via ARG so the mirror URL can be overridden at build time without editing the Dockerfile:

ARG APT_MIRROR=http://archive.ubuntu.com/ubuntu
RUN sed -i "s|http://archive.ubuntu.com/ubuntu|${APT_MIRROR}|g" ...

Alternatively, move the mirror setup into a separate optional script.

zjw1111 · 2026-06-16T05:31:35Z

+    // This is a reportable baseline, NOT a perf gate — assertions only check
+    // semantic correctness (each query returns >= 0 docs without erroring).
+    constexpr int kDocCount = 200;
+    constexpr int kQueryCount = 100;


Style: per docs/code-style.md, fixed-width integer types should be used instead of plain int. This applies to test files as well.

// Current:
constexpr int kDocCount = 200;
constexpr int kQueryCount = 100;
for (int i = 0; i < kDocCount; ++i) {

// Suggested:
constexpr int32_t kDocCount = 200;
constexpr int32_t kQueryCount = 100;
for (int32_t i = 0; i < kDocCount; ++i) {

Same issue exists in tantivy_ffi_test.cpp, tantivy_lucene_coexist_test.cpp, and tantivy_streaming_test.cpp.

spaces-X force-pushed the baseline-tantivy branch 2 times, most recently from aa9c415 to c63e4b7 Compare June 9, 2026 01:56

lszskye reviewed Jun 9, 2026

Comment thread cmake_modules/BuildUtils.cmake Outdated

lszskye reviewed Jun 10, 2026

Comment thread src/paimon/global_index/tantivy/tantivy_archive_layout.cpp

lszskye reviewed Jun 10, 2026

Comment thread src/paimon/global_index/tantivy/tantivy_global_index_writer.cpp Outdated

spaces-X force-pushed the baseline-tantivy branch from 3f85996 to 2741953 Compare June 10, 2026 05:10

spaces-X and others added 17 commits June 10, 2026 17:08

feat(tantivy): Tantivy-fts global index integration via Rust FFI

af6169a

Add a Rust tantivy-based FTS global index as a second backend alongside Lucene, wired into CMake via cbindgen + Corrosion, with 10 functional unit tests.

test(tantivy): Java <-> C++ tantivy archive cross-read compatibility tests

1809c55

Cross-read tests for tantivy archives shared between paimon-java and paimon-cpp, using fixtures from paimon-java's TantivyIndexFixtureGen and covering both directions.

chore: CI / dev container / sanitizer + cross-platform fixes

a7346ad

Companion infra for the tantivy-fts integration (no production logic): devcontainer, CI workflows, sanitizer flags, and cross-platform build fixes.

fix(tantivy): fix io_meta is null and jieba dir not be set

35711b9

Fix io_meta being null on the reader path and the jieba dictionary directory not being set when constructing the tantivy index.

chore(tantivy_ffi): install log bridge

56ae38d

Install the log bridge once on first reader Create so Rust log records surface through glog in production binaries, not only in unit tests.

refactor(tantivy_ffi): Read row_id fast field inline via custom collector for unscored search

be16850

Replace the DocSetCollector + HashSet + per-doc fast-field path with a RowIdCollector that opens the row_id column once per segment and reads it inline.

feat(tantivy_ffi): unscored LIMIT pushdown via LimitedDocSetCollector

c513145

Repurpose Path B as a true unscored LIMIT N: LimitedDocSetCollector stops collecting past N via a shared atomic, skipping BM25 scoring entirely.
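
The idea of stopping collection past N through a shared atomic can be sketched as follows. The actual LimitedDocSetCollector lives in the Rust FFI crate and is not shown in this thread, so this C++ version with hypothetical names (`LimitedCollectorSketch`, `CollectFromSegmentsSketch`) only illustrates the technique. The driver walks segments sequentially for simplicity; the collector itself is written so that each segment loop could run on its own thread.

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical collector shared by all segments of one search.
class LimitedCollectorSketch {
 public:
    explicit LimitedCollectorSketch(int64_t limit) : limit_(limit) {}

    // Returns false once the limit is reached so the caller can stop
    // scanning its segment early. No score is ever computed.
    bool Collect(int64_t row_id) {
        // Reserve a slot first; only slots below the limit may store a row id.
        const int64_t slot = reserved_.fetch_add(1, std::memory_order_relaxed);
        if (slot >= limit_) {
            return false;
        }
        std::lock_guard<std::mutex> lock(mu_);
        row_ids_.push_back(row_id);
        return true;
    }

    std::vector<int64_t> TakeRowIds() {
        std::lock_guard<std::mutex> lock(mu_);
        return std::move(row_ids_);
    }

 private:
    const int64_t limit_;
    std::atomic<int64_t> reserved_{0};
    std::mutex mu_;
    std::vector<int64_t> row_ids_;
};

// Sequential driver: each inner vector stands for the matching row ids of one segment.
std::vector<int64_t> CollectFromSegmentsSketch(
    const std::vector<std::vector<int64_t>>& segments, int64_t limit) {
    LimitedCollectorSketch collector(limit);
    for (const auto& segment : segments) {
        for (int64_t row_id : segment) {
            if (!collector.Collect(row_id)) {
                break;  // limit reached, skip the rest of this segment
            }
        }
    }
    return collector.TakeRowIds();
}
```

Reserving the slot with fetch_add before storing keeps the total at or below the limit even when several segments race for the last few slots.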

feat(tantivy): add min_score threshold filtering to FullTextSearch

2432970

Add an optional min_score applied after scoring but before sort/truncate, letting FE push `score() > X` down through the FFI into the tantivy engine.

fix(tantivy): adapt to GlobalIndexWriter / GlobalIndexIOMeta API change

0b1b4f0

Adapt to base AddBatch gaining relative_row_ids and GlobalIndexIOMeta dropping range_end, mirroring lucene; update the 8 affected tantivy test files.

fix(tantivy): preserve full-text pre-filter score semantics

397bead

ci(tantivy): use Rust 1.88 and skip tantivy on gcc-8

0fe910b

setup_rust.sh pins rustc 1.88.0 (min required by the transitive time crate); build_paimon.sh turns off PAIMON_ENABLE_TANTIVY on the gcc-8 image (no Rust there), mirroring the existing LUMINA/LANCE handling.

chore(tantivy): use the full Apache license header in tantivy sources

ed466a9

Expand the abbreviated 'Licensed under the Apache License, Version 2.0.' line to the full Apache 2.0 boilerplate so the RAT license check recognizes it.

style(tantivy): satisfy pre-commit (clang-format, cmake-format, cpplint, codespell)

5958931

Apply clang-format/cmake-format; fix cpplint (functional char-casts -> static_cast, int64_t/PRId64 instead of long, NOLINT for the cbindgen-generated header include) and a codespell typo.

fix(build): fix clang error and sanitizer error

d749ead

spaces-X force-pushed the baseline-tantivy branch from 71f1670 to 8aa3179 Compare June 10, 2026 09:08

lxy-9602 reviewed Jun 11, 2026

Comment thread CMakeLists.txt

Merge branch 'main' into baseline-tantivy

f7b18b7

lxy-9602 reviewed Jun 13, 2026

Comment thread src/paimon/global_index/tantivy/tantivy_equivalence_test.cpp Outdated

spaces-X and others added 2 commits June 14, 2026 19:24

Merge branch 'main' into baseline-tantivy

719597f

lxy-9602 reviewed Jun 15, 2026

zjw1111 reviewed Jun 16, 2026
