WIP. HDDS-15014. Speed up EC container decommission #10082
jojochuang wants to merge 2 commits into apache:master
…witching and disk-aware scheduling
1. SCM Replication Manager Enhancements:
- Threshold-Based Switching: EC containers on decommissioning nodes now automatically switch from simple replication to reconstruction if the source node's replication load exceeds a configurable threshold (hdds.scm.replication.decommission.ec.reconstruction.threshold, default: 5).
- Feature Flag: Added hdds.scm.replication.decommission.ec.reconstruction.enabled (default: true) to explicitly toggle this behavior.
- Global Concurrency Limit: Implemented a cluster-wide cap (hdds.scm.replication.decommission.concurrency, default: 100) to prevent cluster-wide performance degradation during large-scale decommissions.
- Observability: Added the InflightDecommission gauge to ReplicationManagerMetrics to track simultaneous decommissioning tasks.
2. Datanode Scheduling Enhancements:
- Disk-Aware Volume Selection: The ReplicationSupervisor now tracks in-flight tasks per physical disk volume.
- Non-Busy Disk Prioritization: The ContainerImporter and DownloadAndImportReplicator were updated to prioritize target volumes with zero in-flight tasks, ensuring better I/O distribution and avoiding disk bottlenecks.
3. Stability and Infrastructure:
- Bug Fix: Fixed a series of Hugo documentation build errors caused by invalid date formats in markdown front matter.
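The threshold-based switching and global concurrency limit from item 1 can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the class `DecommissionEcPolicySketch`, its methods, and the `Action` enum are hypothetical, and it assumes that "replication load" means the number of replication commands already queued on the source datanode and that "exceeds" means strictly greater than the threshold.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch; class and method names are not taken from the PR. */
final class DecommissionEcPolicySketch {
  enum Action { REPLICATE, RECONSTRUCT, DEFER }

  // Mirrors hdds.scm.replication.decommission.ec.reconstruction.enabled
  private final boolean reconstructionEnabled;
  // Mirrors hdds.scm.replication.decommission.ec.reconstruction.threshold
  private final int reconstructionThreshold;
  // Mirrors hdds.scm.replication.decommission.concurrency
  private final int concurrencyLimit;
  // Stand-in for the value an InflightDecommission gauge would report
  private final AtomicInteger inflightDecommission = new AtomicInteger();

  DecommissionEcPolicySketch(boolean enabled, int threshold, int limit) {
    this.reconstructionEnabled = enabled;
    this.reconstructionThreshold = threshold;
    this.concurrencyLimit = limit;
  }

  /** Decides how to handle one EC replica on a decommissioning node. */
  Action choose(int sourceReplicationLoad) {
    if (!tryAcquireSlot()) {
      return Action.DEFER; // cluster-wide cap reached, retry later
    }
    if (reconstructionEnabled
        && sourceReplicationLoad > reconstructionThreshold) {
      return Action.RECONSTRUCT; // source is busy, rebuild from other replicas
    }
    return Action.REPLICATE; // simple copy from the decommissioning node
  }

  /** Reserves one in-flight slot without ever exceeding the limit. */
  private boolean tryAcquireSlot() {
    while (true) {
      int current = inflightDecommission.get();
      if (current >= concurrencyLimit) {
        return false;
      }
      if (inflightDecommission.compareAndSet(current, current + 1)) {
        return true;
      }
    }
  }

  /** Releases the slot taken by a finished task. */
  void taskCompleted() {
    inflightDecommission.decrementAndGet();
  }

  int inflight() {
    return inflightDecommission.get();
  }

  public static void main(String[] args) {
    DecommissionEcPolicySketch policy =
        new DecommissionEcPolicySketch(true, 5, 100);
    System.out.println(policy.choose(3));
    System.out.println(policy.choose(8));
  }
}
```

In this sketch a deferred task does not consume a slot, so the in-flight count only reflects work that was actually scheduled.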
Change-Id: I426c6aa8cc5f1321468db13f15591a0a3bf06ac1
Change-Id: I1ac31515d9035f338ca83c07c309fdfedf911677
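The disk-aware volume selection from item 2 might look roughly like the sketch below. It is an illustration under stated assumptions, not the PR's implementation: `DiskAwareVolumeChooserSketch` and its methods are hypothetical names, volumes are identified by plain strings, and falling back to the least-loaded volume when no volume is idle is an assumption (the description only says that volumes with zero in-flight tasks are prioritized).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch; not the ReplicationSupervisor or ContainerImporter API. */
final class DiskAwareVolumeChooserSketch {
  // In-flight replication tasks per physical volume, keyed by volume path
  private final Map<String, AtomicInteger> inflightByVolume =
      new ConcurrentHashMap<>();

  void taskStarted(String volume) {
    inflightByVolume
        .computeIfAbsent(volume, v -> new AtomicInteger())
        .incrementAndGet();
  }

  void taskFinished(String volume) {
    AtomicInteger counter = inflightByVolume.get(volume);
    if (counter != null) {
      // Never drop below zero, even if finish is reported twice
      counter.updateAndGet(n -> Math.max(0, n - 1));
    }
  }

  int inflight(String volume) {
    AtomicInteger counter = inflightByVolume.get(volume);
    return counter == null ? 0 : counter.get();
  }

  /**
   * Picks the candidate with the fewest in-flight tasks, so an idle volume
   * always wins. Ties go to the earlier candidate in the list.
   */
  String choose(List<String> candidates) {
    if (candidates.isEmpty()) {
      throw new IllegalArgumentException("no candidate volumes");
    }
    String best = candidates.get(0);
    for (String volume : candidates) {
      if (inflight(volume) < inflight(best)) {
        best = volume;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    DiskAwareVolumeChooserSketch chooser = new DiskAwareVolumeChooserSketch();
    chooser.taskStarted("/data1");
    System.out.println(chooser.choose(Arrays.asList("/data1", "/data2")));
  }
}
```

A real chooser would also have to check free space on each volume; that is left out here to keep the focus on the in-flight counters.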
Thanks @jojochuang for working on this.
/* If we get here, the scenario is:
1. Under replicated.
2. Not over replicated.
Why are all these comments getting removed?
There are a few things to think about in this change.

First, for a 3-2 container with 3 nodes being decommissioned, reconstruction cannot recover all of the decommissioning copies. Ideally, one task would be scheduled to reconstruct all 3 at once, but with 3 out that won't work: it will have to pick just 2 to recover and leave the other one for now. Or it could schedule two tasks at the same time for the container.

I know that within Cloudera we had a case where a cluster was showing slow decommission progress on EC containers, but I am not sure whether we investigated if the datanode was being kept sufficiently busy. I.e., did it have enough replications scheduled so that its queue did not run empty between heartbeats? Did it have enough worker threads to keep all the disks busy, etc.? I think that is important to understand before adding this complexity. Also, if the DNs are not being kept busy enough, then reconstruction commands will have the same problem.

It is also important from an overall cluster load perspective, especially with large EC schemes (6-3, 10-4), that as much as possible the replications are simple copies from the decommissioning host. The overhead of doing a reconstruction on a container versus a simple copy is 3x the data pulled over the network for 3-2. For 6-3 it's 6x, for 10-4 it's 10x, and all that traffic will be cross-rack, as we try to spread the replicas across as many racks as possible.

At the moment, the logic in the ECUnderReplicationHandler deals with missing indexes, then decommissioning indexes. If it gets a "node over utilised" exception while dealing with the decommissioning indexes, the simplest change is probably to mask off the decommissioning indexes up to the parity number, add them to an excluded list and then call …

There should already be a global concurrency limit in RM, so a new one should not be needed.

I also agree with @adoroszlai that this change should be at least 2 or 3 separate changes.
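The arithmetic in this comment can be checked with a small sketch. It is only an illustration of the reviewer's figures: `EcReconstructionCostSketch` and its methods are hypothetical, and it assumes that one reconstruction reads one replica per data index and can rebuild at most as many indexes as the scheme has parity blocks.

```java
/** Hypothetical helper illustrating the review comment's numbers. */
final class EcReconstructionCostSketch {

  /**
   * Network read amplification of reconstruction relative to a simple copy.
   * A copy pulls one replica; reconstruction pulls one replica per data index.
   */
  static int readAmplification(int dataBlocks) {
    return dataBlocks;
  }

  /** Indexes that a single reconstruction task can rebuild. */
  static int recoverableInOneTask(int decommissioningIndexes, int parityBlocks) {
    return Math.min(decommissioningIndexes, parityBlocks);
  }

  /** Indexes left over for a later task or for a simple copy. */
  static int leftForLater(int decommissioningIndexes, int parityBlocks) {
    return decommissioningIndexes
        - recoverableInOneTask(decommissioningIndexes, parityBlocks);
  }

  public static void main(String[] args) {
    int[][] schemes = {{3, 2}, {6, 3}, {10, 4}};
    for (int[] scheme : schemes) {
      System.out.println(scheme[0] + "-" + scheme[1]
          + " read amplification: " + readAmplification(scheme[0]) + "x");
    }
    // The 3-2 container with 3 decommissioning nodes from the comment
    System.out.println("recoverable now: " + recoverableInOneTask(3, 2)
        + ", left for later: " + leftForLater(3, 2));
  }
}
```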
Should we review the design doc in #10086 before returning to this change, or are they meant to be reviewed in parallel?
This change cannot be reviewed due to a severely outdated dev branch, regardless of the design doc.
What changes were proposed in this pull request?
HDDS-15014. Speed up EC container decommission
Please describe your PR in detail:
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-15014
How was this patch tested?