Add sortbytime option to parseToDisk for time-sorted HDB output #1

Open
victory-scott wants to merge 1 commit into KxSystems:main from victory-scott:add-sortbytime-option
@victory-scott

Why

Our team builds a replay blueprint on top of TAQ HDBs; replay needs events in time order, not sym-parted. Today we run parseToDisk with defaults and then re-sort the resulting HDB into a second time-ordered HDB, which doubles disk usage and adds a whole extra pass. This PR lets parseToDisk produce the time-ordered HDB directly.

As a side benefit, the implementation has a bounded memory profile, so CE users can time-sort full-day datasets that wouldn't fit under the working-set cap if we tried to sort in memory at the end. (We considered whether the same trick could make the default p#sym path CE-friendly, but p# isn't preserved across column upsert the way s# is, so that turns out to be a different problem.)

What

A new optional key on parseToDisk:

parseToDisk["..."; 2025.07.01; "..."; ([sortbytime: 1b])]

Default is 0b, so existing callers are unaffected. When 1b, each parsed batch is sorted by time in memory and written to a per-batch splayed stage (trade_stage_N, quote_stage_N). At finalize, stages are streamed into the final trade and quote tables by a small k-way merge (mergeTimeStages), producing s#time on the final output — no p#sym.

Staging is per batch rather than per file because TAQ PSVs turn out to be ordered by sym and then by time internally (I learned this the hard way), so successive batches are not globally time-monotone.
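To illustrate why a merge step is needed at all, here is a minimal sketch on made-up rows. The table `t`, the batch boundaries, and the helper `isSorted` are hypothetical and are not part of the module; the rows simply mimic a sym-then-time file.

```q
/ Hypothetical toy rows in sym-then-time order; not real TAQ data
t:([] sym:`A`A`A`B`B`B; time:09:30:00 12:00:00 15:59:00 09:31:00 11:00:00 15:58:00)
isSorted:{all (1_x)>=(-1)_x}               / true when a vector is non-decreasing
stages:{`time xasc x} each 0 2 4 _ t       / three batches of two rows, each sorted by time
isSorted each {x`time} each stages         / every stage is time-ordered on its own
isSorted raze {x`time} each stages         / appending the stages is not time-ordered
```

Each stage is sorted internally, but the stages overlap in time, so simple concatenation cannot give a globally sorted table.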

Keeping the diff small

I leaned on the module's existing seams to avoid restructuring:

  • Reused enumAndSave / writerWrapper / genericUpsert unchanged — the stage writers are thin wrappers around them.
  • Dispatched via the existing preparse / postparse dicts that parsePSVs already walks; no changes to parsePSVs, process, or batchProcess.
  • Factored the two merge-finalize paths into mergeOne[...]'[trade;quote] and the two psym calls into psymOne each trade;quote so the sym/time branch is a clean $[].

New pieces are isolated:

  • mergeTimeStages (~15 lines): the k-way streaming merger.
  • hdelSplayed: a small helper.
  • MERGE_CHUNK constant for tuning memory vs. I/O.

Total: taq/init.q +79 / -10. Other files are docs and tests.

Tests

Added three new cases to test.q plus a testTimeSortedTables helper that asserts:

  • `s# on time for both trade and quote, absence of `p#sym
  • time column globally monotone per partition
  • range queries resolve correctly
  • stage directories are cleaned up
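As a rough illustration of the first three assertions, the checks might look like the following on a toy in-memory table. This is only a sketch: the table contents are invented, and the real testTimeSortedTables helper runs against the partitioned HDB on disk, where it also verifies that the stage directories are gone.

```q
/ Hypothetical in-memory stand-in for one partition of the trade table
trade:update `s#time from ([] time:09:30:00 09:30:01 09:30:01 09:31:00; sym:`B`A`B`A; price:10 20 11 21f)
`s=attr trade`time                         / sorted attribute is present on time
not `p=attr trade`sym                      / no parted attribute on sym
all {(1_x)>=(-1)_x} trade`time             / time is non-decreasing
2=count select from trade where time within 09:30:01 09:30:59   / range query finds both matching rows
```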

New cases cover: sortbytime: 1b with all other options at their defaults, sortbytime + batchsize: 0, and sortbytime + linked: 1b. The six original persistent-DB tests still pass unchanged.

What I wasn't sure about

A few things a q-stronger reviewer should eyeball:

  • mergeTimeStages re-sorts each iteration's frames with xasc raze frames rather than running a true k-way merge sort. For our K (tens, maybe low hundreds), I believe xasc as a C primitive beats anything I'd write in q, but happy to revisit if you disagree.
  • MERGE_CHUNK is a module-level constant rather than a user-facing parameter. Easy to promote if it turns out people want to tune it.
  • setCompr's deferred-call idiom (the {[x;]} with trailing ;) tripped me briefly during refactoring; final code preserves the original semantics.
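For reviewers weighing the first point, the general shape of a chunked "read, raze, xasc, emit" merge can be sketched as below. This is not the PR's mergeTimeStages: the name `mergeSketch`, its arguments, and the in-memory tables standing in for splayed stages are all assumptions made for illustration, with `chunk` playing the role of MERGE_CHUNK.

```q
/ Hypothetical sketch: stages is a list of in-memory tables, each already sorted by time
mergeSketch:{[stages;chunk]
  n:count each stages;                     / rows per stage
  pos:count[stages]#0;                     / read cursor per stage
  carry:0#first stages;                    / rows read but not yet safe to emit
  out:0#first stages;                      / stands in for the upsert to the final table
  while[any pos<n;
    take:chunk&n-pos;                      / rows to read from each stage this pass
    frames:{[s;p;t] s p+til t}'[stages;pos;take];
    pos+:take;
    pool:`time xasc carry,raze frames;     / one sort over carried rows plus new frames
    live:where pos<n;                      / stages that still have unread rows
    / only rows at or below the smallest last-read time of a live stage are safe to emit
    k:$[count live; sum pool[`time]<=min {last x`time} each frames live; count pool];
    out,:k#pool;
    carry:k _ pool];
  out}

s1:([] time:09:30:00 09:30:05 09:31:00; sym:`A`A`A)
s2:([] time:09:30:01 09:30:02 09:30:59 09:32:00; sym:`B`B`B`B)
m:mergeSketch[(s1;s2);2]
all {(1_x)>=(-1)_x} m`time                 / merged output is non-decreasing in time
```

The design trade-off in this sketch is that rows above the watermark are carried into the next pass, so the working set is roughly one chunk per stage plus the carry rather than the whole day.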

Try it

q test.q /tmp -s 4

Produces `s#time output via per-batch staging and a bounded-memory k-way
merge at finalize, instead of the default `p#sym layout. Opt-in via
sortbytime: 1b; default is unchanged. Intended primarily for time-ordered
replay workflows, with the bounded memory profile also enabling CE-fit
time-sort on datasets that wouldn't fit an in-memory sort.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2 participants