ClassDistribution: store distributions as a sorted vector<Vfield> (−32% base memory) #14
Open
antalvdb wants to merge 1 commit into
The per-class distribution was a std::map<size_t, Vfield*>: a red-black-tree
node plus a separately heap-allocated Vfield for every target class, for
(potentially) every node in the instance base. On a large base this is the
dominant memory consumer.
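To make the per-class overhead concrete, here is a rough sketch that counts the bytes each layout requests from its allocator. Everything in it is an assumption for illustration: `Field` is a hypothetical stand-in for Vfield, `CountingAlloc` is a toy allocator, and the vector is reserved up front so only the final buffer is counted. It does not measure malloc's own per-block overhead, which would widen the gap further.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <new>
#include <utility>
#include <vector>

// Running total of bytes requested through CountingAlloc (single-threaded demo).
static std::size_t g_bytes = 0;

// Toy allocator that adds every request to g_bytes.
template <class T>
struct CountingAlloc {
  using value_type = T;
  CountingAlloc() = default;
  template <class U>
  CountingAlloc(const CountingAlloc<U> &) {}
  T *allocate(std::size_t n) {
    g_bytes += n * sizeof(T);
    return static_cast<T *>(::operator new(n * sizeof(T)));
  }
  void deallocate(T *p, std::size_t) { ::operator delete(p); }
};
template <class T, class U>
bool operator==(const CountingAlloc<T> &, const CountingAlloc<U> &) { return true; }
template <class T, class U>
bool operator!=(const CountingAlloc<T> &, const CountingAlloc<U> &) { return false; }

// Hypothetical stand-in for Vfield; the real class has different members.
struct Field {
  std::size_t index;
  int freq;
  double weight;
};

// Old layout: one tree node plus one separately allocated Field per class.
std::size_t bytes_map_of_pointers(std::size_t n) {
  g_bytes = 0;
  using Alloc = CountingAlloc<std::pair<const std::size_t, Field *>>;
  std::map<std::size_t, Field *, std::less<std::size_t>, Alloc> m;
  CountingAlloc<Field> field_alloc;
  for (std::size_t i = 0; i < n; ++i) {
    Field *f = field_alloc.allocate(1);
    new (f) Field{i, 1, 1.0};
    m.emplace(i, f);
  }
  const std::size_t total = g_bytes;
  // Field is trivially destructible, so releasing the storage is enough.
  for (auto &kv : m) field_alloc.deallocate(kv.second, 1);
  return total;
}

// New layout: Field values stored inline in one contiguous buffer.
std::size_t bytes_flat_vector(std::size_t n) {
  g_bytes = 0;
  std::vector<Field, CountingAlloc<Field>> v;
  v.reserve(n);  // assumption: count only the final buffer, not regrowth
  for (std::size_t i = 0; i < n; ++i) v.push_back(Field{i, 1, 1.0});
  return g_bytes;
}
```

Comparing `bytes_map_of_pointers(n)` with `bytes_flat_vector(n)` for the same `n` shows the tree-node and pointer overhead that the flat layout avoids; the exact ratio depends on the standard library's node layout.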
Store the distribution as a flat std::vector<Vfield>, kept sorted by
value->Index() (the map key was redundant -- it always equals value->Index()).
Vfield becomes a plain copyable value type stored inline. Lookups are a binary
search, with an O(1) fast path for the common sorted-append case; the sorted
invariant keeps (de)serialisation byte-for-byte identical.
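A minimal sketch of the layout described above, assuming a simplified entry type. `Entry`, `SortedDist`, `find` and `increment` are hypothetical names chosen for illustration, not the names used in Targets.h; `Entry::index` plays the role of value->Index().

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for Vfield, stored by value.
struct Entry {
  std::size_t index;  // plays the role of value->Index()
  int freq;           // occurrence count for this class
};

class SortedDist {
public:
  // Binary search; returns nullptr when the index is absent.
  const Entry *find(std::size_t index) const {
    auto it = std::lower_bound(entries_.begin(), entries_.end(), index, key_less);
    return (it != entries_.end() && it->index == index) ? &*it : nullptr;
  }

  // Add `count` occurrences of `index`, keeping entries_ sorted by index.
  void increment(std::size_t index, int count = 1) {
    // Fast path: a new largest key is appended without searching or shifting.
    if (entries_.empty() || entries_.back().index < index) {
      entries_.push_back(Entry{index, count});
      return;
    }
    // Here back().index >= index, so lower_bound never returns end().
    auto it = std::lower_bound(entries_.begin(), entries_.end(), index, key_less);
    if (it->index == index) {
      it->freq += count;
    } else {
      entries_.insert(it, Entry{index, count});  // shifts the tail right
    }
  }

  std::size_t size() const { return entries_.size(); }
  const std::vector<Entry> &entries() const { return entries_; }

private:
  static bool key_less(const Entry &e, std::size_t index) { return e.index < index; }
  std::vector<Entry> entries_;
};
```

The append check costs one comparison, so input that arrives in ascending index order never pays for a search or a shift. Out-of-order keys fall back to `std::lower_bound` plus `insert`, which is where the tail-shift cost comes from.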
Measured on a word-prediction base (~4.9M instances, ~13.5M nodes; l4r0):
- peak RSS loading a TRIBL2 base: 2.07 GB -> 1.40 GB (-32%)
- peak RSS, IGTree: 1.26 GB -> 0.87 GB (-31%)
- TRIBL2 test-phase instructions: +3.4% (in-place sorted inserts cost a tail
shift where the tree was O(log n); intrinsic to the layout, bites only on
large distributions)
Verified byte-identical test output (plain and +v db), a freshly trained
instance base byte-identical to the previous on-disk format, and unchanged
IGTree accuracy.
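The byte-identical result follows from iteration order: a std::map visits keys in ascending order, and so does a vector kept sorted by the same key. The sketch below illustrates that property with a made-up text format; `dump_map`, `dump_sorted` and `same_output` are hypothetical helpers and do not reflect the actual on-disk format.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using Pair = std::pair<std::size_t, int>;  // (class index, frequency)

// Hypothetical format: one "index freq" line per class, in iteration order.
std::string dump_map(const std::map<std::size_t, int> &m) {
  std::ostringstream os;
  for (const auto &kv : m) os << kv.first << ' ' << kv.second << '\n';
  return os.str();
}

std::string dump_sorted(const std::vector<Pair> &v) {
  std::ostringstream os;
  for (const auto &e : v) os << e.first << ' ' << e.second << '\n';
  return os.str();
}

// Feed the same arrival order into both containers and compare the dumps.
bool same_output(const std::vector<std::size_t> &arrivals) {
  std::map<std::size_t, int> m;
  std::vector<Pair> v;
  for (std::size_t idx : arrivals) {
    ++m[idx];
    auto it = std::lower_bound(
        v.begin(), v.end(), idx,
        [](const Pair &e, std::size_t key) { return e.first < key; });
    if (it != v.end() && it->first == idx) {
      ++it->second;
    } else {
      v.insert(it, std::make_pair(idx, 1));  // keeps v sorted by index
    }
  }
  return dump_map(m) == dump_sorted(v);
}
```

A check like this only covers ordering; the verification reported above compared real test output and a freshly trained instance base against the previous format.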
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Summary
Proposes changing ClassDistribution's storage from std::map<size_t, Vfield*> to a sorted std::vector<Vfield>, to cut the instance-base memory footprint.
Today each target class in a distribution costs a red-black-tree node plus a separately heap-allocated Vfield, and there is (potentially) one distribution per instance-base node, so on a large base this is the dominant memory consumer. This stores the distribution as a flat vector of Vfield values kept sorted by value->Index() (the map key was redundant: it always equals value->Index()). Vfield becomes a plain value type stored inline. Lookups are a binary search with an O(1) fast path for the common sorted-append case.
Measured impact
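Word-prediction base, ~4.9M training instances / ~13.5M nodes (l4r0 next-token data):
- peak RSS loading a TRIBL2 base: 2.07 GB -> 1.40 GB (-32%)
- peak RSS, IGTree: 1.26 GB -> 0.87 GB (-31%)
- TRIBL2 test-phase instructions: +3.4%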
The +3.4% in TRIBL2 test-phase instructions is intrinsic to in-place sorted-vector inserts (a tail shift where the tree was O(log n)) and only matters for large distributions. Wall-clock time wasn't reliably measurable (the test machine thermally throttles under repeated training runs), but contiguous storage should fault less than the pointer-chasing map. Net: a memory-for-CPU trade that's favourable for memory-bound use.
Correctness
Test output is byte-identical, both plain and with +v db (distributions printed).
Scope
ClassDistribution/Vfield (Targets.h/.cxx), the TargetDist iterations in Features.cxx, and 3 lines in MBLClass.cxx. sum_distributions and the I/O code go through the public API and are untouched.
Posting for your consideration, happy to adjust.
🤖 Generated with Claude Code