5% BERT benchmark improvement: Remove unnecessary to_vec() from slice() #1964

Merged
ArthurZucker merged 1 commit into huggingface:main from jberg5:bert-perf
Mar 24, 2026

@jberg5
Contributor

@jberg5 jberg5 commented Mar 16, 2026

NormalizedString::slice was calling to_vec() unnecessarily: it allocated and copied N alignment tuples into a temporary Vec, then allocated and copied them again in collect(). Removing to_vec() should, in theory, cut the work that slice() has to do roughly in half, and it shows up as a consistent ~5% improvement in the end-to-end runtime of the BERT benchmark.
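For readers who want to see the shape of the change, here is a minimal sketch of the pattern described above. It is not the actual body of `NormalizedString::slice`: the `Alignment` type alias, the function names, and the `shift` parameter are hypothetical stand-ins chosen for illustration.

```rust
use std::ops::Range;

/// Hypothetical stand-in for one alignment entry: (start, end) offsets.
type Alignment = (usize, usize);

/// "Before" shape: `to_vec()` copies the sub-slice into a temporary Vec,
/// and `collect()` then allocates and fills a second Vec.
fn shift_with_to_vec(
    alignments: &[Alignment],
    range: Range<usize>,
    shift: usize,
) -> Option<Vec<Alignment>> {
    Some(
        alignments
            .get(range)?
            .to_vec() // redundant intermediate allocation and copy
            .iter()
            .map(|(start, end)| (start - shift, end - shift))
            .collect(),
    )
}

/// "After" shape: iterate the borrowed sub-slice directly, so only
/// `collect()` allocates.
fn shift_without_to_vec(
    alignments: &[Alignment],
    range: Range<usize>,
    shift: usize,
) -> Option<Vec<Alignment>> {
    Some(
        alignments
            .get(range)?
            .iter()
            .map(|(start, end)| (start - shift, end - shift))
            .collect(),
    )
}
```

Both functions return the same result for the same input; the second simply skips the temporary Vec, which is the kind of saving the PR is after.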

Full AI disclosure: Claude (with Opus 4.6) spotted this one when I asked it to do a large-scale audit of this repo, looking for bugs and performance improvements. The code change is small, and it seems legit to me :) I ran benchmarks both locally on my MacBook and on a GCloud c2-standard-4. Here's Claude's summary of what that looked like:

  We profiled the BERT encode pipeline on a GCloud c2-standard-4 (dedicated Intel Xeon cores) using perf record with
  Chinese text (红楼梦, 80% CJK). The perf profile showed NormalizedString::slice at 3.6% of total encode time, plus
  its allocation overhead spread across malloc (5.8%), cfree (3.0%), and Vec::from_iter (2.0%). The .to_vec() was
  creating a redundant intermediate Vec on every call: one per text segment, per split pass.

  Benchmark methodology

  Built baseline and optimized binaries from the same source, differing only in the .to_vec() line. Ran them in
  alternating ABBAAB pattern (6 runs each) to control for thermal drift and cache effects.
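As a rough illustration of the interleaving idea, the sketch below times two workloads in the order given by a pattern string. The PR does not say how the binaries were actually invoked or timed, so `run_interleaved`, the closure-based interface, and the reading of `'A'` as baseline and `'B'` as optimized are all assumptions.

```rust
use std::time::{Duration, Instant};

/// Runs two workloads in the order given by `pattern` ('A' = baseline,
/// 'B' = optimized) and returns the wall-clock time of every run,
/// grouped by variant. Hypothetical helper, not part of the PR.
fn run_interleaved<A, B>(
    pattern: &str,
    mut baseline: A,
    mut optimized: B,
) -> (Vec<Duration>, Vec<Duration>)
where
    A: FnMut(),
    B: FnMut(),
{
    let mut baseline_times = Vec::new();
    let mut optimized_times = Vec::new();
    for variant in pattern.chars() {
        let start = Instant::now();
        match variant {
            'A' => {
                baseline();
                baseline_times.push(start.elapsed());
            }
            'B' => {
                optimized();
                optimized_times.push(start.elapsed());
            }
            other => panic!("unknown variant {other:?} in pattern"),
        }
    }
    (baseline_times, optimized_times)
}
```

Interleaving the variants means that slow drift (thermal throttling, background load) is spread across both groups rather than landing on whichever variant happened to run last.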

  BERT WordPiece, ASCII English text (big.txt, 6.2MB, ARM Apple M-series):
  Baseline:  mean=9.506s  stdev=0.110s  range=[9.413, 9.713]
  Optimized: mean=8.951s  stdev=0.082s  range=[8.866, 9.071]
  Improvement: 5.8%, ranges do not overlap

  BERT WordPiece, CJK Chinese text (红楼梦, 2.5MB, GCloud c2-standard-4 Intel Xeon):
  Baseline:  mean=3.178s  stdev=0.008s  range=[3.17, 3.19]
  Optimized: mean=3.055s  stdev=0.024s  range=[3.04, 3.10]
  Improvement: 3.9%, ranges do not overlap
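The summary figures above can be rechecked with a few lines of arithmetic. This is a sketch only: the helper names are hypothetical, and whether the reported stdev uses the sample (n - 1) or population denominator is not stated, so the sample form here is an assumption.

```rust
/// Arithmetic mean of the samples.
fn mean(samples: &[f64]) -> f64 {
    samples.iter().sum::<f64>() / samples.len() as f64
}

/// Sample standard deviation (n - 1 denominator); an assumption, since the
/// summary does not say which form was used.
fn sample_stdev(samples: &[f64]) -> f64 {
    let m = mean(samples);
    let sum_sq: f64 = samples.iter().map(|x| (x - m).powi(2)).sum();
    (sum_sq / (samples.len() as f64 - 1.0)).sqrt()
}

/// Relative improvement of the optimized mean over the baseline, in percent.
fn improvement_percent(baseline_mean: f64, optimized_mean: f64) -> f64 {
    (baseline_mean - optimized_mean) / baseline_mean * 100.0
}

/// True if the closed intervals `a` and `b` (given as (min, max)) overlap.
fn ranges_overlap(a: (f64, f64), b: (f64, f64)) -> bool {
    a.0 <= b.1 && b.0 <= a.1
}
```

Plugging in the reported means, `improvement_percent(9.506, 8.951)` comes out at about 5.8 and `improvement_percent(3.178, 3.055)` at about 3.9, and `ranges_overlap` returns false for both pairs of reported ranges, consistent with the summary.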

@jberg5
Contributor Author

jberg5 commented Mar 16, 2026

I wasn't sure if it was appropriate to create a GitHub issue for this; let me know if I should, happy to do so.

Collaborator

@ArthurZucker ArthurZucker left a comment

Sounds good. If all tests are passing, merging!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker ArthurZucker merged commit cbd8cf2 into huggingface:main Mar 24, 2026
34 checks passed