Add sortbytime option to parseToDisk for time-sorted HDB output by victory-scott · Pull Request #1 · KxSystems/taq

victory-scott · 2026-04-14T14:35:52Z

Why

Our team builds a replay blueprint on top of TAQ HDBs; replay needs events in time order, not sym-parted. Today we run parseToDisk with defaults and then re-sort the resulting HDB into a second time-ordered HDB, which doubles disk usage and adds a whole extra pass. This PR lets parseToDisk produce the time-ordered HDB directly.

As a side benefit, the implementation has a bounded memory profile, so CE users can time-sort full-day datasets that wouldn't fit under the working-set cap if we tried to sort in memory at the end. (We considered whether the same trick could make the default p#sym path CE-friendly, but p# isn't preserved across column upsert the way s# is, so that turns out to be a different problem.)

What

A new optional key on parseToDisk:

parseToDisk["..."; 2025.07.01; "..."; ([sortbytime: 1b])]

Default is 0b, so existing callers are unaffected. When 1b, each parsed batch is sorted by time in memory and written to a per-batch splayed stage (trade_stage_N, quote_stage_N). At finalize, stages are streamed into the final trade and quote tables by a small k-way merge (mergeTimeStages), producing s#time on the final output — no p#sym.
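To make the finalize step concrete, here is a rough sketch of a chunked merge over time-sorted stages. It is not the PR's mergeTimeStages: the name mergeSketch, its arguments, and the use of in-memory tables in place of splayed stage directories are all assumptions made for illustration, and the result is collected in memory only to keep the example short.

```q
/ Illustrative sketch only: mergeSketch and its arguments are made-up names.
/ stages: list of tables, each already sorted by time (stand-ins for trade_stage_N)
/ n: rows read from a stage per refill (plays the role of MERGE_CHUNK)
mergeSketch:{[stages;n]
  out:0#first stages;                        / a real merger would append to disk instead
  bufs:count[stages]#enlist 0#first stages;  / one in-memory frame per stage
  pos:count[stages]#0;                       / rows read so far from each stage
  len:count each stages;
  while[any(pos<len)or 0<count each bufs;
    need:where(pos<len)and 0=count each bufs;   / drained frames that can be refilled
    if[count need;
      bufs[need]:{[s;p;n](p;n) sublist s}'[stages need;pos need;n];
      pos[need]:len[need]&n+pos need];
    live:where pos<len;                         / stages that still have unread rows
    / rows no later than the earliest frame end among live stages cannot be overtaken
    take:$[count live;
      {[w;b]sum w>=b`time}[min{last x`time}each bufs live]each bufs;
      count each bufs];
    out,:`time xasc raze take#'bufs;            / sort and emit the safe rows
    bufs:take _'bufs];                          / keep the rest for the next round
  update `s#time from out}

a:([] time:09:30:01.000 09:30:04.000 09:30:07.000; sym:`A`A`A)
b:([] time:09:30:02.000 09:30:03.000 09:30:09.000; sym:`B`B`B)
m:mergeSketch[(a;b);2]
```

In this sketch at most one frame of n rows per stage is held at a time, which is where the bounded memory profile would come from. The final update applies s# to time and would signal an error if the merged rows were not in order.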

Staging is per-batch rather than per-file because TAQ PSVs turn out to be ordered by sym, then by time within each sym (I spent some time learning this the hard way), so batches aren't globally time-monotone.
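A toy illustration of the ordering issue, using made-up rows rather than real TAQ data. The table name batch and its contents are hypothetical.

```q
/ Toy rows, not real TAQ data: two symbols, each in time order.
batch:([] sym:`A`A`B`B; time:09:30:00.000 09:30:02.000 09:30:01.000 09:30:03.000)
/ Ordered by sym then time, so the time column on its own is out of order.
batch[`time]~asc batch`time    / 0b
/ Sorting the batch by time before staging puts s# on its time column.
stage:`time xasc batch
attr stage`time                / `s
```

Each stage is sorted on its own, but two stages can still cover overlapping time ranges, which is why a merge is needed at finalize.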

Keeping the diff small

I leaned on the module's existing seams to avoid restructuring:

Reused enumAndSave / writerWrapper / genericUpsert unchanged — the stage writers are thin wrappers around them.
Dispatched via the existing preparse / postparse dicts that parsePSVs already walks; no changes to parsePSVs, process, or batchProcess.
Factored the two merge-finalize paths into mergeOne[...]'[trade;quote] and the two psym calls into psymOne each trade;quote so the sym/time branch is a clean $[].

New pieces are isolated:

mergeTimeStages (~15 lines): the k-way streaming merger.
hdelSplayed: a small helper.
MERGE_CHUNK constant for tuning memory vs. I/O.

Total: taq/init.q +79 / -10. Other files are docs and tests.

Tests

Added three new cases to test.q plus a testTimeSortedTables helper that asserts:

`s# on time for both trade and quote, and no `p# on sym
time column globally monotone per partition
range queries resolve correctly
stage directories are cleaned up
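For reviewers who want to spot-check a table by hand, the attribute, ordering, and range checks can be approximated as below. This is a sketch, not the testTimeSortedTables helper from the PR: checkTimeSorted and rangeOk are hypothetical names, and the example runs against a small in-memory table rather than an HDB partition.

```q
/ Hypothetical helpers, not the PR's testTimeSortedTables.
checkTimeSorted:{[t]
  c:t`time;
  / s# present on time, values non-decreasing, and no p# on sym
  (`s=attr c)and(all(1 _ c)>=-1 _ c)and not `p=attr t`sym}

/ A time-window query should return exactly the rows whose time falls in the window.
rangeOk:{[t;lo;hi]
  count[select from t where time within (lo;hi)]=sum t[`time] within (lo;hi)}

good:update `s#time from ([] time:09:30:00.000 09:30:01.000 09:30:01.000; sym:`B`A`B)
checkTimeSorted good
rangeOk[good;09:30:00.500;09:30:02.000]
```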

New cases cover: sortbytime: 1b with otherwise default options, sortbytime + batchsize: 0, and sortbytime + linked: 1b. The six original persistent-DB tests still pass unchanged.

What I wasn't sure about

A few things a q-stronger reviewer should eyeball:

mergeTimeStages uses xasc raze frames per iteration rather than a true k-way merge sort. For our K (tens, maybe low hundreds), I believe xasc as a C primitive beats anything I'd write in q, but happy to revisit if you disagree.
MERGE_CHUNK is a module-level constant rather than a user-facing parameter. Easy to promote if it turns out people want to tune it.
setCompr's deferred-call idiom (the {[x;]} with trailing ;) tripped me briefly during refactoring; final code preserves the original semantics.

Try it

q test.q /tmp -s 4

Produces `s#time output via per-batch staging and a bounded-memory k-way merge at finalize, instead of the default `p#sym layout. Opt-in via sortbytime: 1b; default is unchanged. Intended primarily for time-ordered replay workflows, with the bounded memory profile also enabling CE-fit time-sort on datasets that wouldn't fit an in-memory sort.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

victory-scott wants to merge 1 commit into KxSystems:main from victory-scott:add-sortbytime-option
