feat: CBA idle re-modeling and separate scale-up / scale-down task-count boundaries#19378
Fly-Style wants to merge 7 commits into apache:master
Conversation
…of scale-up and scale-down boundaries application
| this.useTaskCountBoundaries = Configs.valueOrDefault(useTaskCountBoundaries, false);
| this.highLagThreshold = Configs.valueOrDefault(highLagThreshold, -1);
| this.minScaleUpDelay = Configs.valueOrDefault(minScaleUpDelay, Duration.millis(this.minTriggerScaleActionFrequencyMillis));
| this.useTaskCountBoundariesOnScaleUp = Configs.valueOrDefault(useTaskCountBoundariesOnScaleUp, false);
[P2] Legacy boundary setting is accepted but ignored
The constructor still accepts the legacy useTaskCountBoundaries property, but the new scale-up/down fields are initialized only from useTaskCountBoundariesOnScaleUp and useTaskCountBoundariesOnScaleDown. Existing supervisor specs with useTaskCountBoundaries: true will silently lose the scale-up boundary after upgrade, allowing unbounded jumps to any candidate task count. Map the legacy value to the new fields when the new fields are absent, or reject/document a breaking config change.
This autoscaler was in experimental mode, but I will log at warn level and document the breaking change.
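A minimal sketch of the legacy-to-new mapping being discussed, assuming the new fields default to false when absent; the class, helper method, and logger here are illustrative, not the actual patch:

```java
import java.util.logging.Logger;

// Sketch only: apply the deprecated flag to both directions when the new fields are absent,
// and warn so operators notice the behavioral change after upgrade.
final class BoundaryCompat
{
  private static final Logger LOG = Logger.getLogger(BoundaryCompat.class.getName());

  /** Returns {scaleUp, scaleDown} after applying the legacy fallback. */
  static boolean[] resolve(
      Boolean useTaskCountBoundaries,            // legacy, deprecated
      Boolean useTaskCountBoundariesOnScaleUp,   // new
      Boolean useTaskCountBoundariesOnScaleDown  // new
  )
  {
    if (useTaskCountBoundariesOnScaleUp == null
        && useTaskCountBoundariesOnScaleDown == null
        && Boolean.TRUE.equals(useTaskCountBoundaries)) {
      LOG.warning("useTaskCountBoundaries is deprecated; applying it to both scale-up and scale-down boundaries.");
      return new boolean[]{true, true};
    }
    return new boolean[]{
        Boolean.TRUE.equals(useTaskCountBoundariesOnScaleUp),
        Boolean.TRUE.equals(useTaskCountBoundariesOnScaleDown)
    };
  }
}
```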
Force-pushed from ec723ee to 5b61b6d
gianm left a comment
Something seems off with the "Scaledown scenery, visualized" plot. It says the conditions are:
Task boundaries for scale-down are enabled, taskCount = partitionCount = 128, lag = 0.
It shows that when current idle ratio is < 0.15 or so, the optimal task count becomes ~40. Scaling down 3x when current idle ratio is low will likely lead to the new set of tasks being overloaded.
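For intuition, with illustrative numbers: at taskCount = 128 and an idle ratio of roughly 0.15, about 128 × 0.85 ≈ 109 task-equivalents of work are in flight; ~40 tasks can absorb at most 40 task-equivalents even at zero idle, so a ~3x drop cannot carry the same load.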
| * Maximum number of candidate task counts to evaluate above or below the current task count
| * when scale-up or scale-down boundaries are enabled.
| * <p>
| * The misspelling is preserved to avoid unnecessary churn in this package-private constant.
I don't understand this comment. The constant is new in this patch. Please fix the spelling.
| At every evaluation interval, Druid computes the score for each candidate task count and picks the one with the lowest total cost.
|
| Note: Kinesis is not supported yet; support is in progress.
I need to verify whether anybody has a Kinesis workload with CBA working. If you want, we can remove that part.
| * during extensive testing as the most balanced multiplier for high-lag recovery.
| */
| static final double LAG_AMPLIFICATION_MULTIPLIER = 0.05;
| static final double LAG_AMPLIFICATION_MULTIPLIER = 0.4;
I will note it in the patch notes. Generally, the intention was to find a point where scale-up/scale-down decisions are 'normal' in terms of a normal distribution near 0.5/0.5 weights. 0.4 is a good amplification multiplier. The 0.4/0.6 default weights were picked from a conservative point of view.
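As a rough illustration only: a weighted-sum cost where the constants above could plug in. This is not the actual `WeightedCostFunction`, and where `LAG_AMPLIFICATION_MULTIPLIER` enters the real formula may differ; the names and shapes here are assumptions.

```java
// Illustrative only: a weighted sum of a lag term and an idle term, with the amplification
// multiplier scaling how strongly the normalized lag term pushes the decision.
final class WeightedCostSketch
{
  static final double LAG_AMPLIFICATION_MULTIPLIER = 0.4; // new default in this patch
  static final double DEFAULT_LAG_WEIGHT = 0.4;           // defaults described as conservative in the PR
  static final double DEFAULT_IDLE_WEIGHT = 0.6;

  static double totalCost(double normalizedLagCost, double idleCost)
  {
    // The multiplier and the default weights were tuned together (see discussion above);
    // this sketch only shows where such constants could appear in a weighted-sum cost.
    return DEFAULT_LAG_WEIGHT * (LAG_AMPLIFICATION_MULTIPLIER * normalizedLagCost)
           + DEFAULT_IDLE_WEIGHT * idleCost;
  }
}
```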
| .minTriggerScaleActionFrequencyMillis(1000)
| .lagWeight(0.2)
| .idleWeight(0.8)
| .lagWeight(0.8)
What will be the effect of the change to lag and idle weights?
It was passing without any problems in normal circumstances. The main idea of the change is to reduce the chance of the test not scaling before the timeout due to CI CPU pressure.
Oh, this is a lag in my head, apologies. 🤦🏻
This PR updates the seekable-stream cost-based autoscaler to make task-count decisions more stable and easier to reason about.
The main behavioral change is replacing the previous linear idle cost with a U-shaped idle cost centered around an ideal idle ratio. This penalizes both under-provisioning, where tasks have too little idle headroom, and over-provisioning, where tasks spend too much time idle. The `predictedIdleRatio` clamp at 0 made the U-shape's under-provisioning penalty saturate, so a smaller `taskCount` always won the cost race and the optimizer scaled down a busy, lag-free cluster. The goal is to keep ingestion tasks near a practical operating point instead of treating all additional idle time as uniformly bad.
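As a rough illustration of the U-shape described here (not the actual `WeightedCostFunction`; the names `idealIdleRatio`, `underPenalty`, `overPenalty` and the quadratic shape are assumptions):

```java
// Minimal sketch of a U-shaped idle cost with asymmetric penalties, centered on an ideal idle ratio.
final class IdleCostSketch
{
  static double idleCost(double predictedIdleRatio, double idealIdleRatio,
                         double underPenalty, double overPenalty)
  {
    final double delta = predictedIdleRatio - idealIdleRatio;
    // Below the ideal ratio (too little idle headroom) the penalty grows faster than above it
    // (too much idle time), so under-provisioning is punished harder than over-provisioning.
    final double weight = delta < 0 ? underPenalty : overPenalty;
    return weight * delta * delta;
  }
}
```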
This also separates task-count boundary controls for scale-up and scale-down. Scale-up remains unbounded by default so the autoscaler can react aggressively to lag, while scale-down is bounded by default to avoid large drops in task count. Candidate task counts are still generated from valid partitions-per-task ratios, but the optimizer can now limit which candidates are evaluated depending on the configured scale direction boundary.
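To make the boundary semantics concrete, here is a rough sketch of the evaluation loop: score each candidate and keep the cheapest, skipping candidates outside the enabled per-direction boundary. The method shape, `maxCandidatesPerDirection`, and `costOf` are illustrative, not the optimizer's actual API.

```java
import java.util.List;
import java.util.function.IntToDoubleFunction;

// Sketch of candidate selection: when a direction's boundary is enabled, only the nearest few
// candidates in that direction are scored; the rest keep an "infinite" cost (standing in for
// CostResult.INFINITE_COST) so they can never win.
final class CandidateSelectionSketch
{
  static final double INFINITE_COST = Double.POSITIVE_INFINITY;

  static int pickTaskCount(
      List<Integer> sortedCandidates,    // ascending, built from valid partitions-per-task ratios
      int currentTaskCount,              // assumed to be present in sortedCandidates
      boolean boundScaleUp,
      boolean boundScaleDown,
      int maxCandidatesPerDirection,
      IntToDoubleFunction costOf         // taskCount -> total weighted cost
  )
  {
    final int currentIdx = sortedCandidates.indexOf(currentTaskCount);
    int best = currentTaskCount;
    double bestCost = costOf.applyAsDouble(currentTaskCount);

    for (int i = 0; i < sortedCandidates.size(); i++) {
      final int candidate = sortedCandidates.get(i);
      final int rank = Math.abs(i - currentIdx);   // how many candidates away from the current count
      final boolean scalingUp = candidate > currentTaskCount;
      final boolean bounded = scalingUp ? boundScaleUp : boundScaleDown;
      final double cost = (bounded && rank > maxCandidatesPerDirection)
          ? INFINITE_COST
          : costOf.applyAsDouble(candidate);
      if (cost < bestCost) {
        bestCost = cost;
        best = candidate;
      }
    }
    return best;
  }
}
```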
What Changed
- Reworked the idle cost in `WeightedCostFunction`, with an ideal idle ratio and asymmetric penalties for under- and over-provisioning.
- Split the `useTaskCountBoundaries` setting into separate settings for scale-up / scale-down.
- Added `CostResult.INFINITE_COST` so skipped candidates can still be represented safely in cost tables.
- Marked the `highLagThreshold` and `useTaskCountBoundaries` parameters as obsolete, but kept them in the ctor for b/w compatibility.
- Tuned `LAG_AMPLIFICATION_MULTIPLIER` and the default weights so that scale-up/scale-down decisions stay balanced near `0.5/0.5` weights. `0.4` is a good amplification multiplier that was picked after a series of tests and calculations. `0.4/0.6` as default weights was picked from a conservative point of view. A Python script with computations is available by request.

Details
Updated scale-up scenery, visualized.
Details
Task boundaries are disabled, lag = 50k, current `taskCount` is 1. Plot contains p* as partitions count, and idle as the current `poll-idle-avg-ratio` metric.

Scaledown scenery, visualized.
Details
Task boundaries for scale-down are disabled, `taskCount` = `partitionCount` = 128, lag = 0.

This PR has: