Add sortbytime option to parseToDisk for time-sorted HDB output #1
Open
victory-scott wants to merge 1 commit into
Produces `s#time output via per-batch staging and a bounded-memory k-way merge at finalize, instead of the default `p#sym layout. Opt-in via sortbytime: 1b; default is unchanged. Intended primarily for time-ordered replay workflows, with the bounded memory profile also enabling CE-fit time-sort on datasets that wouldn't fit an in-memory sort.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Why
Our team builds a replay blueprint on top of TAQ HDBs; replay needs events in time order, not sym-parted. Today we run parseToDisk with defaults and then re-sort the resulting HDB into a second time-ordered HDB, which doubles disk usage and adds a whole extra pass. This PR lets parseToDisk produce the time-ordered HDB directly.

As a side benefit, the implementation has a bounded memory profile, so CE users can time-sort full-day datasets that wouldn't fit under the working-set cap if we tried to sort in memory at the end. (We considered whether the same trick could make the default p#sym path CE-friendly, but p# isn't preserved across column upsert the way s# is, so that turns out to be a different problem.)

What
A new optional key on parseToDisk: sortbytime.

Default is 0b, so existing callers are unaffected. When 1b, each parsed batch is sorted by time in memory and written to a per-batch splayed stage (trade_stage_N, quote_stage_N). At finalize, stages are streamed into the final trade and quote tables by a small k-way merge (mergeTimeStages), producing s#time on the final output, with no p#sym.

Staging is per-batch rather than per-file because TAQ PSVs turn out to be sym-then-time internally (I spent some time learning this the hard way), so batches aren't globally time-monotone.
Keeping the diff small
I leaned on the module's existing seams to avoid restructuring:
- enumAndSave / writerWrapper / genericUpsert unchanged; the stage writers are thin wrappers around them.
- Uses the preparse / postparse dicts that parsePSVs already walks; no changes to parsePSVs, process, or batchProcess.
- Merge calls collapsed into mergeOne[...]'[trade;quote] and the two psym calls into psymOne each trade;quote, so the sym/time branch is a clean $[].

New pieces are isolated:

- mergeTimeStages (~15 lines): the k-way streaming merger.
- hdelSplayed: a small helper.
- MERGE_CHUNK constant for tuning memory vs. I/O.

Total: taq/init.q +79 / -10. Other files are docs and tests.

Tests
Added three new cases to test.q plus a testTimeSortedTables helper that asserts:

- `s# on time for both trade and quote, absence of `p#sym

New cases cover: default sortbytime: 1b, sortbytime + batchsize: 0, and sortbytime + linked: 1b. The six original persistent-DB tests still pass unchanged.

What I wasn't sure about
A few things a q-stronger reviewer should eyeball:
- mergeTimeStages uses xasc raze frames per iteration rather than a true k-way merge sort. For our K (tens, maybe low hundreds), I believe xasc as a C primitive beats anything I'd write in q, but happy to revisit if you disagree.
- MERGE_CHUNK is a module-level constant rather than a user-facing parameter. Easy to promote if it turns out people want to tune it.
- setCompr's deferred-call idiom (the {[x;]} with trailing ;) tripped me briefly during refactoring; final code preserves the original semantics.

Try it