
[WIP] Incremental processing storage v3 #585

Draft
rkistner wants to merge 37 commits into main from incremental-processing-storage

Conversation

@rkistner
Contributor

rkistner commented Mar 24, 2026

This rewrites the MongoDB storage for version 3, in preparation for incremental reprocessing.

Postgres storage will follow in a future PR - this one is already big enough.

On a high level:

  1. We partition bucket_data and parameter_indexes (previously bucket_parameters) by source definition instead of by replication stream (previously group_id).
  2. We partition source_records (previously current_data) by source table.
  3. The partitioning is now physical, using separate collections, instead of only a logical separation by _id or foreign keys.
  4. The specific storage format is slightly adjusted to account for the changes required for incremental reprocessing (detailed write-up to follow).
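To make the physical partitioning concrete, here is a small sketch of what collection-per-partition naming could look like. The naming scheme and identifier parameters below are hypothetical, not the PR's actual scheme; the point is only that the shared identifier moves into the collection name instead of being repeated in each document.

```typescript
// Hypothetical naming helpers illustrating physical partitioning.
// The actual v3 collection naming may differ.

// One bucket_data collection per source definition.
function bucketDataCollection(sourceDefinitionId: string): string {
  return `bucket_data_${sourceDefinitionId}`;
}

// One parameter_indexes collection per source definition.
function parameterIndexesCollection(sourceDefinitionId: string): string {
  return `parameter_indexes_${sourceDefinitionId}`;
}

// One source_records collection per source table.
function sourceRecordsCollection(sourceTableId: string): string {
  return `source_records_${sourceTableId}`;
}
```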

All of this is implemented in storage version 3 only; the storage formats for versions 1 and 2 are unchanged. However, the implementation is restructured to account for the logic now differing between versions.

The collection split has some advantages:

  1. Removing data becomes very cheap - just drop the collection.
  2. Reads and writes become faster, since common data moves into the collection name instead of being duplicated in each document.

The split is primarily for performance; it is not functionally required for incremental reprocessing. It just makes sense to make these changes together, while we're already making significant storage changes. There are caveats: some code paths now query across multiple collections, which could be slower, and we still need to optimize those cases.
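The "removing data becomes very cheap" advantage can be sketched as follows. This is an illustrative sketch only: the collection-name prefixes and the `MinimalDb` interface are hypothetical stand-ins so the example runs without MongoDB; a real implementation would use the driver's collection-drop operation.

```typescript
// Stand-in database interface, so the sketch runs without MongoDB.
interface MinimalDb {
  listCollectionNames(): string[];
  dropCollection(name: string): void;
}

// Removing all data for a source definition is just dropping its collections:
// a metadata-level operation, with no per-document deletes.
// The name prefixes here are hypothetical.
function removeSourceDefinitionData(db: MinimalDb, sourceDefinitionId: string): string[] {
  const targets = new Set([
    `bucket_data_${sourceDefinitionId}`,
    `parameter_indexes_${sourceDefinitionId}`,
  ]);
  const dropped: string[] = [];
  for (const name of db.listCollectionNames()) {
    if (targets.has(name)) {
      db.dropCollection(name);
      dropped.push(name);
    }
  }
  return dropped;
}
```

Contrast this with a logically partitioned layout, where the same removal is a `deleteMany` touching every document of that source definition.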

This PR departs from the storage structure used in the incremental reprocessing POC in #468:

  1. The POC made changes on the storage structure directly, without regard for storage versions or backwards-compatibility.
  2. The POC did not use the collection splits.

TODO:

  • Properly document.
  • Clean up code - we may need to further split the implementations between V1 and V3.
  • More tests?
  • Plan for performance improvements for (1) source_records pending_deletes, (2) parameter_index change detection between checkpoints.
  • Test performance of bucket checksum and data reads with respect to clustered collections and split collections, compared to V1.

@changeset-bot

changeset-bot bot commented Mar 24, 2026

⚠️ No Changeset found

Latest commit: 41c59b6

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

