feat: performance, adding pcre2 backend + regex-shards (5-15% speedup) #1968
michaelfeil wants to merge 7 commits into huggingface:main
@michaelfeil Thanks for providing more details! I didn't know that fasttokens used pcre2 as well; I just learnt about it yesterday.

Since you are active on this, have you tried thread-local caches to see if they help? Or other concurrent caches like dashmap and parking lot? Thank you!

@wheynelau that's the other PR, #1967. I think thread_local caches are most useful if they are re-used in a rayon pool, or if the threads live in some other kind of pool.
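For readers following along, here is a minimal sketch of the thread-local cache idea, using only the standard library. It is not code from this PR or from #1967; `CACHE`, `expensive_encode` and `encode_cached` are hypothetical names, and the "encoding" is a trivial stand-in.

```rust
use std::cell::RefCell;
use std::collections::HashMap;

thread_local! {
    // One cache per thread: no locking, but entries are only re-used when the
    // thread itself is re-used (for example a worker in a rayon-style pool).
    static CACHE: RefCell<HashMap<String, Vec<u32>>> = RefCell::new(HashMap::new());
}

// Hypothetical stand-in for an expensive tokenization step.
fn expensive_encode(word: &str) -> Vec<u32> {
    word.bytes().map(u32::from).collect()
}

/// Returns the ids for `word` and whether they were served from this thread's cache.
fn encode_cached(word: &str) -> (Vec<u32>, bool) {
    CACHE.with(|cache| {
        let mut cache = cache.borrow_mut();
        if let Some(ids) = cache.get(word) {
            return (ids.clone(), true);
        }
        let ids = expensive_encode(word);
        cache.insert(word.to_owned(), ids.clone());
        (ids, false)
    })
}
```

A freshly spawned thread starts with an empty cache, which is why the benefit depends on threads being pooled rather than created per call.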
McPatate left a comment
Good stuff! Didn't repro the benchmarks yet, but looking good so far! Thank you for taking the time.
pub fn find_matches(
    &self,
    inside: &str,
) -> Result<Vec<(Offsets, bool)>, Box<dyn Error + Send + Sync + 'static>> {
this doesn't need to be a Result or am I missing something?
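As a reference for the return shape under discussion, the sketch below shows an infallible function that returns `Vec<(Offsets, bool)>`: contiguous spans plus a flag saying whether each span is a match. It assumes `Offsets` is a `(usize, usize)` byte range and uses a whitespace check as a stand-in for the regex; whether the real backend call can fail is a separate question that this sketch does not answer.

```rust
/// Assumed shape of `Offsets`: a (start, end) byte range.
type Offsets = (usize, usize);

/// Splits `inside` into contiguous spans and flags the whitespace runs as matches.
/// The whitespace check is a stand-in for a real regex match.
fn find_matches(inside: &str) -> Vec<(Offsets, bool)> {
    let mut spans = Vec::new();
    let mut start = 0;
    let mut current: Option<bool> = None;
    for (idx, ch) in inside.char_indices() {
        let is_match = ch.is_whitespace();
        match current {
            // The match state flipped: close the previous span and open a new one.
            Some(prev) if prev != is_match => {
                spans.push(((start, idx), prev));
                start = idx;
                current = Some(is_match);
            }
            // First character: remember its state.
            None => current = Some(is_match),
            // Same state as before: keep extending the current span.
            _ => {}
        }
    }
    // Close the trailing span, if the input was non-empty.
    if let Some(prev) = current {
        spans.push(((start, inside.len()), prev));
    }
    spans
}
```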
@@ -0,0 +1,4 @@
[env]
# Only takes effect when the optional `pcre2` feature is enabled and `pcre2-sys`
# is built. Keeps runtime free of dynamic libpcre2 dependency.
> and `pcre2-sys` is built

which is always?
    })
}

pub fn find_matches(&self, inside: &str) -> Result<Vec<(Offsets, bool)>> {

}

impl Pcre2Regex {
    fn compile(pattern: &str) -> Result<Self, Box<dyn Error + Send + Sync + 'static>> {
I think we already have the dependency on thiserror; I would much rather have a properly defined error type.

Or perhaps at least use crate::Result instead.
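To make the suggestion concrete, here is a rough sketch of a dedicated error type written with the standard library only. With thiserror the `Display` and `Error` impls would come from `#[derive(Error)]` and `#[error("...")]` attributes instead of being hand-written. `RegexBackendError` and this `compile` are hypothetical and not taken from the PR.

```rust
use std::error::Error;
use std::fmt;

/// Hypothetical error type for a regex backend; the names are illustrative only.
#[derive(Debug)]
enum RegexBackendError {
    /// The pattern could not be compiled.
    Compile { pattern: String, message: String },
}

impl fmt::Display for RegexBackendError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            RegexBackendError::Compile { pattern, message } => {
                write!(f, "failed to compile pattern {:?}: {}", pattern, message)
            }
        }
    }
}

impl Error for RegexBackendError {}

/// Stand-in for a real compile step: it only rejects the empty pattern.
fn compile(pattern: &str) -> Result<(), RegexBackendError> {
    if pattern.is_empty() {
        return Err(RegexBackendError::Compile {
            pattern: pattern.to_owned(),
            message: "empty pattern".to_owned(),
        });
    }
    Ok(())
}
```

Because the type only holds `String`s it is `Send + Sync`, so it can still be boxed into `Box<dyn Error + Send + Sync>` at call sites that expect the current signature.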
}

// Safety: pcre2::bytes::Regex is Send+Sync. Each Pcre2Regex instance has its
// own match context, so concurrent find_at calls on *different* instances are safe.
But compile is not thread safe I assume?
[features]
default = ["progressbar", "onig", "esaxx_fast"]
pcre2 = ["dep:pcre2", "fancy-regex"]
why is one declared with the dep: prefix and not the other?
I'd add at least one extra test to make sure the fallback logic is sane here.
ddcf798 to 14204e6
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>
Tips and tricks from a blog post describing how this library is 4-10x slower than it needs to be: https://www.crusoe.ai/resources/blog/reducing-ttft-by-cpumaxxing-tokenization
Performance: PCRE2 JIT regex backend

Benchmarked on LLaMA 3 tokenizer (`data/llama-3-tokenizer.json`) with `data/big.txt` (6.5 MB, 128K lines).

Setup

- `main` branch with default `onig` (Oniguruma) regex backend
- `--no-default-features --features pcre2`
- `cargo bench --bench llama3_benchmark`

Results

Sequential encode is the most stable measurement. Batch/offsets show improvement but have higher run-to-run variance due to thread scheduling.

Why

PCRE2 with JIT compiles regex patterns to native machine code at initialization. The GPT-2 byte-level pre-tokenization regex (`'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+`) is the hot path for every `encode()` call and benefits directly from JIT compilation vs Oniguruma's interpreted matching.

Usage

The `pcre2` feature is opt-in. When enabled, it takes priority over `onig` and `fancy-regex`. Cross-compilation works the same as `onig`: `pcre2-sys` bundles and compiles PCRE2 from C source via the `cc` crate. Not available on WASM (use `fancy-regex` there, as before).
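The priority rule can be sketched as a small selection function. This is only an illustration of the ordering, not the PR's implementation: `Backend` and `select_backend` are hypothetical names, and placing `onig` ahead of `fancy-regex` is an assumption, since the description only states that `pcre2` comes first.

```rust
/// Regex backends named in the description above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    Pcre2,
    Onig,
    FancyRegex,
}

/// Picks a backend from the set of enabled features.
/// `pcre2` winning is taken from the description; `onig` before
/// `fancy-regex` is an assumption made for this sketch.
fn select_backend(pcre2: bool, onig: bool, fancy_regex: bool) -> Option<Backend> {
    if pcre2 {
        Some(Backend::Pcre2)
    } else if onig {
        Some(Backend::Onig)
    } else if fancy_regex {
        Some(Backend::FancyRegex)
    } else {
        None
    }
}
```

In a real crate the three booleans would typically come from compile-time checks such as `cfg!(feature = "pcre2")` rather than from runtime arguments.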