Skip to content

feat(tantivy): add Tantivy full-text global index via Rust FFI#346

Open
spaces-X wants to merge 21 commits into
alibaba:mainfrom
spaces-X:baseline-tantivy
Open

feat(tantivy): add Tantivy full-text global index via Rust FFI#346
spaces-X wants to merge 21 commits into
alibaba:mainfrom
spaces-X:baseline-tantivy

Conversation

@spaces-X

@spaces-X spaces-X commented Jun 8, 2026

Copy link
Copy Markdown

Purpose

Add an experimental tantivy-fulltext global index backend alongside lucene-fts.

This change:

  • wires a Rust tantivy / jieba-rs FFI crate into CMake via Corrosion and cbindgen
  • adds C++ Tantivy global index writer, reader, factory registration, archive parsing, streaming I/O callbacks, and Rust log bridging
  • supports full-text search query types with limit, pre_filter, BM25 score opt-in, and min_score filtering
  • adds Java <-> C++ Tantivy archive compatibility fixtures and cross-read coverage
  • adds CI/devcontainer Rust setup and a targeted Tantivy smoke test script

Tests

Added / covered by:

  • cargo test --manifest-path third_party/tantivy_ffi/Cargo.toml
  • paimon-tantivy-smoke-test
  • paimon-tantivy-ffi-test
  • paimon-tantivy-tokenizer-test
  • paimon-tantivy-writer-test
  • paimon-tantivy-reader-test
  • paimon-tantivy-filter-limit-test
  • paimon-tantivy-index-test
  • paimon-tantivy-streaming-test
  • paimon-tantivy-java-compat-test
  • paimon-tantivy-lucene-coexist-test
  • paimon-tantivy-equivalence-test
  • paimon-global-index-test

API and Format

Yes.

API:

  • include/paimon/predicate/full_text_search.h adds with_score and min_score.
  • limit is now a truncation switch and no longer implies BM25 score computation.
  • ReplacePreFilter preserves scoring-related flags.

Format:

  • Adds a new tantivy-fulltext packed archive format compatible with paimon-java Tantivy archives.
  • Existing lucene-fts storage format is not changed.

Protocol:

  • No external protocol change. The new Rust/C FFI boundary is internal to the Tantivy backend.

Documentation

Yes, this introduces a new experimental tantivy-fulltext global index feature.

This patch includes fixture READMEs and smoke-test script usage, but no separate user-facing documentation page.

Generative AI tooling

Generated-by: Codex (GPT-5) and Claude Opus 4.8

@CLAassistant

CLAassistant commented Jun 8, 2026

Copy link
Copy Markdown

CLA assistant check
All committers have signed the CLA.

@spaces-X spaces-X force-pushed the baseline-tantivy branch 2 times, most recently from aa9c415 to c63e4b7 Compare June 9, 2026 01:56
Comment thread cmake_modules/BuildUtils.cmake Outdated
Comment thread src/paimon/global_index/tantivy/tantivy_archive_layout.cpp
Comment thread src/paimon/global_index/tantivy/tantivy_global_index_writer.cpp Outdated
spaces-X and others added 17 commits June 10, 2026 17:08
Add a Rust tantivy-based FTS global index as a second backend alongside Lucene,
wired into CMake via cbindgen + Corrosion, with 10 functional unit tests.
…tests

Cross-read tests for tantivy archives shared between paimon-java and paimon-cpp,
using fixtures from paimon-java's TantivyIndexFixtureGen and covering both directions.
Companion infra for the tantivy-fts integration (no production logic): devcontainer,
CI workflows, sanitizer flags, and cross-platform build fixes.
Fix io_meta being null on the reader path and the jieba dictionary directory
not being set when constructing the tantivy index.
Install the log bridge once on first reader Create so Rust log records surface
through glog in production binaries, not only in unit tests.
…ctor for unscored search

Replace the DocSetCollector + HashSet + per-doc fast-field path with a RowIdCollector
that opens the row_id column once per segment and reads it inline.
Repurpose Path B as a true unscored LIMIT N: LimitedDocSetCollector stops collecting
past N via a shared atomic, skipping BM25 scoring entirely.
Add an optional min_score applied after scoring but before sort/truncate, letting FE
push `score() > X` down through the FFI into the tantivy engine.
Adapt to base AddBatch gaining relative_row_ids and GlobalIndexIOMeta dropping
range_end, mirroring lucene; update the 8 affected tantivy test files.
setup_rust.sh pins rustc 1.88.0 (min required by the transitive time crate); build_paimon.sh turns off PAIMON_ENABLE_TANTIVY on the gcc-8 image (no Rust there), mirroring the existing LUMINA/LANCE handling.
Expand the abbreviated 'Licensed under the Apache License, Version 2.0.' line to the full Apache 2.0 boilerplate so the RAT license check recognizes it.
…nt, codespell)

Apply clang-format/cmake-format; fix cpplint (functional char-casts -> static_cast, int64_t/PRId64 instead of long, NOLINT for the cbindgen-generated header include) and a codespell typo.
testharness.cpp includes <gtest/gtest.h> but the objlib compile step has no ordering dependency on the googletest ExternalProject, so it can race ahead of header extraction (flaky 'gtest/gtest.h: No such file' in Release). Add an explicit add_dependencies.
Address review feedback (avoid Chinese in code): translate comments/docs in CMake, the smoke script, the Rust FFI crate (Cargo.toml / cbindgen.toml / build.rs / tokenizer.rs / callback_directory.rs) and the test-fixture READMEs. Chinese is kept only in tokenizer test data (the jieba CJK tokenization inputs/expectations).
…eaders

Move the kJiebaDictDirEnv lookup into a single GetJiebaDictionaryDirFromEnv
helper in tantivy_defs.h so the writer and reader stop defining their own
copies; each call site keeps its own missing-dir policy. Expand the remaining
short-form Apache headers in the tantivy sources, devcontainer files and
CorrosionFetch.cmake to the full boilerplate.
@lxy-9602

Copy link
Copy Markdown
Collaborator

Thank you very much for this contribution. Adding support for Tantivy is a very meaningful improvement and will help us better support full-text search scenarios. While since this PR is quite large, we’ll need some time to review it. Thanks for your understanding and patience.

@spaces-X

Copy link
Copy Markdown
Author

Thank you very much for this contribution. Adding support for Tantivy is a very meaningful improvement and will help us better support full-text search scenarios. While since this PR is quite large, we’ll need some time to review it. Thanks for your understanding and patience.

Thanks for taking the time to review!
I completely understand it's a large change and will take time to review. Please don't hesitate to let me know if you have any questions on specific parts - happy to help make the review easier.

Comment thread include/paimon/predicate/full_text_search.h
Comment thread include/paimon/predicate/full_text_search.h
Comment thread include/paimon/predicate/full_text_search.h
Comment thread include/paimon/predicate/full_text_search.h
Comment thread src/paimon/common/global_index/offset_global_index_reader_test.cpp
Comment thread src/paimon/global_index/tantivy/tantivy_global_index_writer.h Outdated
Comment thread src/paimon/global_index/tantivy/tantivy_global_index_writer.cpp
Comment thread src/paimon/global_index/tantivy/tantivy_global_index_reader.cpp
Comment thread src/paimon/global_index/tantivy/tantivy_global_index_reader.cpp
Comment thread src/paimon/global_index/tantivy/tantivy_global_index_reader.cpp
Comment thread CMakeLists.txt
- default PAIMON_ENABLE_TANTIVY off (CI still builds it on)
- add with_score/min_score to FullTextSearch ctor
- ParseArchiveHeader -> ArchiveLayout::Parse; validate via InRange
- bulk-insert row_ids with AddMany on the unscored read path
- lucene: return NotImplemented when min_score is set
- writer stream buffer 64KB -> 1MB (heap-allocated)
- tests use ASSERT for guard checks; drop process comments and [BUG_*] markers
Comment thread src/paimon/global_index/tantivy/tantivy_global_index_reader.cpp
Comment thread src/paimon/global_index/tantivy/tantivy_equivalence_test.cpp
Comment thread src/paimon/global_index/tantivy/tantivy_equivalence_test.cpp
Comment thread src/paimon/global_index/tantivy/tantivy_equivalence_test.cpp Outdated
for (int64_t i = 0; i < array->length(); ++i) relative_row_ids[i] = i;
EXPECT_TRUE(writer_res.value()->AddBatch(&c_array, std::move(relative_row_ids)).ok());
auto metas_res = writer_res.value()->Finish();
EXPECT_TRUE(metas_res.ok()) << metas_res.status().ToString();

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please prefer the test macros from testharness.h in tests, such as EXPECT_OK / ASSERT_OK and EXPECT_OK_AND_ASSIGN / ASSERT_OK_AND_ASSIGN.

@lxy-9602

Copy link
Copy Markdown
Collaborator

This PR looks very solid overall, and the tests are also quite thorough. Bridging Rust and C++ is not easy, so thank you for the contribution. Once the formatting issues mentioned above are addressed, we’ll take another look and continue a quick review.

Comment thread src/paimon/global_index/tantivy/tantivy_equivalence_test.cpp Outdated
spaces-X and others added 2 commits June 14, 2026 19:24
- fix StreamCtx double-free: Rust owns ctx on entry to
  paimon_tantivy_reader_new_streaming and releases it on every failure
  path; C++ no longer releases on failure. Add a Rust regression test.
- tests: prefer ASSERT over EXPECT in test bodies (keep EXPECT in
  value-returning helpers / thread lambdas where ASSERT is invalid)
- drop test-only setenv: GetJiebaDictionaryDirFromEnv falls back to the
  JIEBA_TEST_DICT_DIR macro (moved to tantivy_defs.cpp, single TU)
- remove leftover AI/process comments from the test files
- brace single-line for-loops; use testharness OK macros in equivalence test
std::string json = "[";
for (int i = 0; i < kDocCount; ++i) {
json += "[\"";
int n = word_count(rng);

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please prefer int32_t over int in paimon-cpp.

const std::string& w = vocab[word_pick(rng)];
auto r = lreader->VisitFullTextSearch(std::make_shared<FullTextSearch>(
"f0", std::nullopt, w, FullTextSearch::SearchType::MATCH_ALL, std::nullopt));
ASSERT_TRUE(r.ok());

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ASSERT_OK

auto r = treader->VisitFullTextSearch(std::make_shared<FullTextSearch>(
"f0", std::nullopt, w, FullTextSearch::SearchType::MATCH_ALL, std::nullopt));
ASSERT_TRUE(r.ok());
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ASSERT_OK

fts->with_score = true; // v0.2: explicit score opt-in
auto res = reader->VisitFullTextSearch(fts);
ASSERT_TRUE(res.ok()) << res.status().ToString();
auto scored = std::dynamic_pointer_cast<BitmapScoredGlobalIndexResult>(res.value());

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please prefer ASSERT_OK_AND_ASSIGN.

auto b = plain->GetBitmap();
EXPECT_OK(b.status()) << b.status().ToString();
if (b.ok()) bitmap = b.value();
} else if (auto scored = std::dynamic_pointer_cast<BitmapScoredGlobalIndexResult>(r)) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prefer EXPECT_OK_AND_ASSIGN rather than EXPECT_OK(b.status()) and use {} for single line if and for.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please help go through the codebase and fix similar issues as well.

std::cerr << " [" << i << "] " << layout.names[i] << " offset=" << layout.offsets[i]
<< " length=" << layout.lengths[i] << "\n";
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please remove debug cerr?

}

TEST_F(TantivyReaderTest, ChineseQueryMode) {
auto array = arrow::ipc::internal::json::ArrayFromJSON(DataType(), R"([

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to confirm whether reader_test and index_test are redundant. It seems that index_test already provides more complete coverage. If so, could we remove reader_test?

}

namespace paimon::tantivy {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For tests, please use the namespace paimon::tantivy::test.

auto r = ArchiveLayout::Parse(&in);
ASSERT_FALSE(r.ok());
ASSERT_NE(r.status().message().find("bad file_count"), std::string::npos)
<< r.status().ToString();

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ASSERT_NOK_WITH_MSG

// =========================================================================

TEST_F(StreamingTestFixture, StreamingBenchmarkLog) {
auto rss_kb = []() {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this case really necessary? TantivyEquivalenceTest already includes BenchmarkBuildAndQuery.

* `hmm` mode is tested separately: FFI must return Unsupported.
*/

#include <algorithm>

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand that cppjieba and jieba-rs may produce different token sequences, so we do not need to require parity between them.

However, could we make the expected output explicit for each input instead? As it stands, these tests look like regular assertions, but they are effectively reporting diffs to stderr and always succeeding. That means if the number of differences increases or decreases, CI will still pass and the diff output may be easy to miss in the logs.

A more robust approach might be to keep the cppjieba-vs-jieba-rs diff as a separate advisory report, while making the unit test assert stable expected outputs for jieba-rs on curated inputs.

if (files.size() != 1) {
return Status::Invalid("tantivy index only has one index file per shard, now num: {}",
files.size());
}

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Status::Invalid() does not perform fmt::format-style placeholder substitution internally — the literal {} will appear in the error message at runtime instead of the actual files.size() value.

Suggested fix:

return Status::Invalid(fmt::format("tantivy index only has one index file per shard, now num: {}",
                                   files.size()));

namespace {

/// Level mapping matches Rust side (0=trace..4=error).
extern "C" void PaimonTantivyLogAdapter(int32_t level, const char* msg, std::size_t len) {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: extern "C" function inside an anonymous namespace has contradictory linkage — anonymous namespace gives internal linkage while extern "C" implies external linkage. Some compilers (e.g. clang with -Wpedantic) will emit a warning.

Consider moving PaimonTantivyLogAdapter out of the anonymous namespace and marking it static instead, or placing it directly in namespace paimon::tantivy (since it is only referenced via function pointer, symbol visibility is not a concern).


FROM ubuntu:24.04

# Switch apt to Aliyun mirror for faster downloads (covers both

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hardcoded Aliyun/USTC mirrors will break or slow down builds for contributors outside mainland China. Consider parameterizing via ARG so the mirror URL can be overridden at build time without editing the Dockerfile:

ARG APT_MIRROR=http://archive.ubuntu.com/ubuntu
RUN sed -i "s|http://archive.ubuntu.com/ubuntu|${APT_MIRROR}|g" ...

Alternatively, move the mirror setup into a separate optional script.

// This is a reportable baseline, NOT a perf gate — assertions only check
// semantic correctness (each query returns >= 0 docs without erroring).
constexpr int kDocCount = 200;
constexpr int kQueryCount = 100;

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style: per docs/code-style.md, fixed-width integer types should be used instead of plain int. This applies to test files as well.

// Current:
constexpr int kDocCount = 200;
constexpr int kQueryCount = 100;
for (int i = 0; i < kDocCount; ++i) {

// Suggested:
constexpr int32_t kDocCount = 200;
constexpr int32_t kQueryCount = 100;
for (int32_t i = 0; i < kDocCount; ++i) {

Same issue exists in tantivy_ffi_test.cpp, tantivy_lucene_coexist_test.cpp, and tantivy_streaming_test.cpp.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants