Less quadratic half_join#706
Open
frankmcsherry wants to merge 3 commits intoTimelyDataflow:master-nextfrom
Open
Less quadratic half_join#706frankmcsherry wants to merge 3 commits intoTimelyDataflow:master-nextfrom
half_join#706frankmcsherry wants to merge 3 commits intoTimelyDataflow:master-nextfrom
Conversation
3941421 to
b3a0d48
Compare
3 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The current
half_join.rsis a dog's breakfast of code. In particular, it has accidentally quadratic behavior when there is a substantial backlog of input updates and the user chooses to yield without doing much of the work. For example, with 1B input updates and yielding after 1M units of work, there will be ~1K reactivations each of which will do work linear in the remaining updates. Not great.This new version is .. really quite different. It leans in to the total order on time that all uses of this operator have, and maintains blobs of pending updates ordered by time, and then other blobs of active updates ordered by key. We work our way through the active blobs subject to the yield logic, and when we run out we harvest more active blobs from the pending blobs.
The total order on time is meant to make it easy to reason about which updates are unlocked when. Relatively little funny frontier logic here.
There is some remaining complexity around the API which needs to provide an active session to the user closure (rather than just a container builder), which combined with the potential surprise of yielding makes it hard for us to plan work ahead of time. Changing this to be a container builder instead, which we can give without creating a session for each closure call, seems like a clear win.