
feat: performance, adding pcre2 backend + regex-shards (5-15% speedup) #1968

Open
michaelfeil wants to merge 7 commits into huggingface:main from michaelfeil:mf/pcre2-jit-backend

Conversation

@michaelfeil
Contributor

@michaelfeil michaelfeil commented Mar 19, 2026

Implements tips and tricks from a blog post describing how this library is 4-10x slower than it needs to be: https://www.crusoe.ai/resources/blog/reducing-ttft-by-cpumaxxing-tokenization

Performance: PCRE2 JIT regex backend

Benchmarked on LLaMA 3 tokenizer (data/llama-3-tokenizer.json) with data/big.txt (6.5 MB, 128K lines).

Setup

  • Baseline: main branch with default onig (Oniguruma) regex backend
  • Test: same branch with --no-default-features --features pcre2
  • Hardware: Linux, Intel Xeon
  • Benchmark: cargo bench --bench llama3_benchmark

Results

| Benchmark | onig (main) | pcre2 JIT | Change |
|---|---|---|---|
| Sequential encode (full file) | 1.82 s | 1.63 s | -10 to -14% |
| Batch encode (1000 items) | 737 ms | 658 ms | ~-5 to -11% |
| Offsets (batch + char offsets) | 295 ms | 288 ms | ~-3% |
| BPE Train | ~10 s | ~10 s | no change |

Sequential encode is the most stable measurement. Batch/offsets show improvement but have higher run-to-run variance due to thread scheduling.

Why

PCRE2 with JIT compiles regex patterns to native machine code at initialization. The GPT-2 byte-level pre-tokenization regex (`'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+`) is the hot path for every `encode()` call and benefits directly from JIT compilation vs Oniguruma's interpreted matching.

Usage

# Cargo.toml
[dependencies]
tokenizers = { version = "0.22", features = ["pcre2"] }

The pcre2 feature is opt-in. When enabled, it takes priority over onig and fancy-regex. Cross-compilation works the same as onig: pcre2-sys bundles and compiles PCRE2 from C source via the cc crate. Not available on WASM (use fancy-regex there, as before).

@wheynelau
Contributor

@michaelfeil Thanks for providing more details! I didn't know that fasttokens used pcre2 as well, just learnt about it yesterday.

@michaelfeil michaelfeil changed the title feat: performance, adding pcre2 backend feat: performance, adding pcre2 backend + regex-shards (5-15% speedup) Mar 20, 2026
@wheynelau
Contributor

wheynelau commented Mar 20, 2026

Curious as you are active on this, have you tried thread local caches to see if they help? Or other distributed caches like dashmap and parking lot? Thank you!

@michaelfeil
Copy link
Contributor Author

michaelfeil commented Mar 20, 2026

@wheynelau that's the other PR: #1967

I think thread_local caches are most useful if the threads are re-used in a rayon pool, or in some other kind of thread pool.
For the L1 cache in the blog post, I tried to implement something myself and also gave it to Claude, but neither attempt worked. Implementing the shared lookup is a bit better IMO, since hashing is fast and you only need to look up by hash, so the RwLock is only taken in that one guarded place.
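A minimal sketch of that shared-lookup idea (hypothetical names, std-only; not the PR's actual implementation): a map keyed by the word's hash behind an `RwLock`, so the common repeated-word case takes only the cheap shared read lock, and the write lock is held just long enough to insert.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::RwLock;

/// Hypothetical shared token cache: hits take only the shared read lock.
struct WordCache {
    map: RwLock<HashMap<u64, Vec<u32>>>,
}

impl WordCache {
    fn new() -> Self {
        Self { map: RwLock::new(HashMap::new()) }
    }

    fn key(word: &str) -> u64 {
        let mut h = DefaultHasher::new();
        word.hash(&mut h);
        // NB: keying by hash alone ignores collisions; a real cache
        // would also store the word and compare it on lookup.
        h.finish()
    }

    /// Return cached ids for `word`, running `compute` (e.g. the BPE
    /// merge loop) only on a miss.
    fn encode_with<F: FnOnce(&str) -> Vec<u32>>(&self, word: &str, compute: F) -> Vec<u32> {
        let k = Self::key(word);
        if let Some(ids) = self.map.read().unwrap().get(&k) {
            return ids.clone(); // fast path: read lock only
        }
        let ids = compute(word); // slow path, computed outside any lock
        self.map.write().unwrap().insert(k, ids.clone());
        ids
    }
}

fn main() {
    let cache = WordCache::new();
    let mut calls = 0;
    let a = cache.encode_with("hello", |w| { calls += 1; w.bytes().map(u32::from).collect() });
    let b = cache.encode_with("hello", |_| { calls += 1; Vec::new() });
    assert_eq!(a, b); // second call served from the cache
    assert_eq!(calls, 1);
}
```

Because readers never block each other, many rayon workers can hit the cache concurrently; only the first encounter of a word contends on the write lock.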


@McPatate McPatate left a comment


Good stuff! Didn't repro the benchmarks yet, but looking good so far! Thank you for taking the time.

pub fn find_matches(
&self,
inside: &str,
) -> Result<Vec<(Offsets, bool)>, Box<dyn Error + Send + Sync + 'static>> {

this doesn't need to be a Result or am I missing something?

@@ -0,0 +1,4 @@
[env]
# Only takes effect when the optional `pcre2` feature is enabled and `pcre2-sys`
# is built. Keeps runtime free of dynamic libpcre2 dependency.

and pcre2-sys is built

which is always?

})
}

pub fn find_matches(&self, inside: &str) -> Result<Vec<(Offsets, bool)>> {

same here for the Result ?

}

impl Pcre2Regex {
fn compile(pattern: &str) -> Result<Self, Box<dyn Error + Send + Sync + 'static>> {

I think we have the dependency on thiserror, would much rather have a proper defined error type


or perhaps at least use crate::Result instead

}

// Safety: pcre2::bytes::Regex is Send+Sync. Each Pcre2Regex instance has its
// own match context, so concurrent find_at calls on *different* instances are safe.

But compile is not thread safe I assume?


[features]
default = ["progressbar", "onig", "esaxx_fast"]
pcre2 = ["dep:pcre2", "fancy-regex"]

why is one declared with the dep: prefix and not the other?


I'd add at least one extra test to make sure the fallback logic is sane here.

@McPatate McPatate force-pushed the mf/pcre2-jit-backend branch from ddcf798 to 14204e6 Compare March 25, 2026 13:14
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>