WIP. HDDS-15014. Speed up EC container decommission#10082

Draft
jojochuang wants to merge 2 commits into apache:master from jojochuang:high-throughput-ec-decom

Conversation

@jojochuang
Contributor

What changes were proposed in this pull request?

HDDS-15014. Speed up EC container decommission

Please describe your PR in detail:

1. SCM Replication Manager Enhancements:
    - Threshold-Based Switching: EC containers on decommissioning nodes now automatically switch from simple replication to reconstruction if the source node's replication load exceeds a configurable threshold (hdds.scm.replication.decommission.ec.reconstruction.threshold, default: 5).
    - Feature Flag: Added hdds.scm.replication.decommission.ec.reconstruction.enabled (default: true) to explicitly toggle this behavior.
    - Global Concurrency Limit: Implemented a cluster-wide cap (hdds.scm.replication.decommission.concurrency, default: 100) to prevent cluster-wide performance degradation during large-scale decommissions.
    - Observability: Added the InflightDecommission gauge to ReplicationManagerMetrics to track simultaneous decommissioning tasks.

2. Datanode Scheduling Enhancements:
    - Disk-Aware Volume Selection: The ReplicationSupervisor now tracks in-flight tasks per physical disk volume.
    - Non-Busy Disk Prioritization: The ContainerImporter and DownloadAndImportReplicator were updated to prioritize target volumes with zero in-flight tasks, ensuring better I/O distribution and avoiding disk bottlenecks.

3. Stability and Infrastructure:
    - Bug Fix: Fixed a series of Hugo documentation build errors caused by invalid date formats in markdown front matter.
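The threshold-based switching in (1) can be sketched as below. This is an illustrative sketch only: the class, enum, and method names are hypothetical, not the PR's actual code; only the two configuration keys and their defaults come from the description above.

```java
// Hypothetical sketch of threshold-based switching; names are illustrative.
public class DecommissionStrategySketch {
  // hdds.scm.replication.decommission.ec.reconstruction.threshold (default: 5)
  static final int RECONSTRUCTION_THRESHOLD = 5;
  // hdds.scm.replication.decommission.ec.reconstruction.enabled (default: true)
  static final boolean RECONSTRUCTION_ENABLED = true;

  enum Strategy { SIMPLE_COPY, RECONSTRUCTION }

  /** Decide how to re-replicate an EC replica held on a decommissioning node. */
  static Strategy chooseStrategy(int inflightReplicationsOnSource) {
    if (RECONSTRUCTION_ENABLED
        && inflightReplicationsOnSource > RECONSTRUCTION_THRESHOLD) {
      // Source node's replication load exceeds the threshold:
      // rebuild the index from the other replicas instead.
      return Strategy.RECONSTRUCTION;
    }
    // Cheap path: copy the replica straight off the decommissioning node.
    return Strategy.SIMPLE_COPY;
  }

  public static void main(String[] args) {
    System.out.println(chooseStrategy(2)); // SIMPLE_COPY
    System.out.println(chooseStrategy(9)); // RECONSTRUCTION
  }
}
```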
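The "non-busy disk prioritization" in (2) amounts to a least-loaded volume choice. A minimal sketch, assuming the supervisor exposes an in-flight-task count per volume as described above (the method and class names here are hypothetical):

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of disk-aware volume selection; not the actual PR code.
public class VolumeChoiceSketch {
  /** Prefer a target volume with zero in-flight tasks, else the least busy. */
  static String pickVolume(List<String> volumes, Map<String, Integer> inflight) {
    return volumes.stream()
        .min(Comparator.comparingInt((String v) -> inflight.getOrDefault(v, 0)))
        .orElseThrow(() -> new IllegalArgumentException("no volumes"));
  }

  public static void main(String[] args) {
    Map<String, Integer> load = Map.of("/data1", 3, "/data2", 0, "/data3", 1);
    // "/data2" wins because it has zero in-flight tasks.
    System.out.println(pickVolume(List.of("/data1", "/data2", "/data3"), load));
  }
}
```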

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-15014

How was this patch tested?

@adoroszlai
Contributor

Thanks @jojochuang for working on this.

  • The branch is outdated (2786 commits behind master) and there are conflicts.
  • Based on PR description, I think the set of changes should be split, at least (1) and (2).
  • (3) does not seem to be valid (was already fixed in HDDS-11570).


/* If we get here, the scenario is:
1. Under replicated.
2. Not over replicated.
Contributor


Why are all these comments getting removed?

@jojochuang jojochuang changed the title HDDS-15014. Speed up EC container decommission WIP. HDDS-15014. Speed up EC container decommission Apr 17, 2026
@sodonnel
Contributor

sodonnel commented Apr 17, 2026

There are a few things to think about in this change.

First, for a 3-2 container, if there are 3 nodes being decommissioned, reconstruction cannot recover all the decommissioning copies in a single task. Ideally there would be one task scheduled to reconstruct all 3 at once, but with 3 out that won't work, so it will have to pick just 2 to recover and leave the other one for now. Or it could schedule two tasks at the same time for the container.
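On one reading of the constraint above, a single EC reconstruction task can rebuild at most `parity` indexes, so recovering more decommissioning replicas than that requires multiple tasks (or a later pass). A small sketch of that arithmetic (the class and method names are hypothetical):

```java
// Sketch of the scheduling limit above: at most `parity` indexes per
// reconstruction task, so extra indexes need extra tasks or a later pass.
public class ReconstructionBatchSketch {
  static int tasksNeeded(int decommissioningIndexes, int parity) {
    return (decommissioningIndexes + parity - 1) / parity; // ceiling division
  }

  public static void main(String[] args) {
    // RS(3,2): three decommissioning indexes, at most two per task -> 2 tasks.
    System.out.println(tasksNeeded(3, 2));
  }
}
```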

I know that within Cloudera we had a case where a cluster was showing slow decommission progress on EC containers, but I am not sure whether we investigated if the datanode was being kept sufficiently busy. I.e., did it have enough replications scheduled that its queue did not run empty between heartbeats, and enough worker threads to keep all the disks busy? I think that is important to understand before adding this complexity; and if the DNs are not being kept busy enough, reconstruction commands will hit the same problem.

It is also important from an overall cluster load perspective, especially with large EC schemes (6-3, 10-4), that as much as possible the replications are simple copies from the decommissioning host. The overhead of reconstructing a container vs. a simple copy is 3x the data pulled over the network for 3-2; for 6-3 it's 6x, and for 10-4 it's 10x, and all that traffic will be cross-rack as we try to spread the replicas across as many racks as possible.
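The multipliers above follow directly from the scheme: a plain copy moves one block's worth of data, while reconstructing one index must pull `data` blocks from peers. Worked numbers, as a trivial sketch:

```java
// Worked numbers for the network-cost comparison above.
public class NetworkCostSketch {
  static int reconstructionOverheadFactor(int dataBlocks) {
    // Blocks read over the network per rebuilt index, vs. 1 for a plain copy.
    return dataBlocks;
  }

  public static void main(String[] args) {
    System.out.println(reconstructionOverheadFactor(3));  // 3-2  -> 3x
    System.out.println(reconstructionOverheadFactor(6));  // 6-3  -> 6x
    System.out.println(reconstructionOverheadFactor(10)); // 10-4 -> 10x
  }
}
```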

At the moment, the logic in the ECUnderReplicationHandler deals with missing indexes, then decommissioning indexes. If it gets a "node over utilised" exception while dealing with the decommissioning indexes, the simplest change is probably to mask off the decommissioning indexes up to the parity number, add them to an excluded list, and then call processingMissingIndexes again up to some limit.
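The masking step suggested above could look roughly like this; a minimal sketch, assuming the handler has the list of decommissioning indexes in hand (the class and method names here are hypothetical, not the real handler API):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of "mask off decommissioning indexes up to parity".
public class MaskDecommissionSketch {
  /**
   * Treat up to `parity` decommissioning indexes as missing so the existing
   * missing-index reconstruction path can handle them; the rest are left
   * for a later pass.
   */
  static List<Integer> indexesToMask(List<Integer> decommissioningIndexes,
      int parity) {
    int n = Math.min(parity, decommissioningIndexes.size());
    return new ArrayList<>(decommissioningIndexes.subList(0, n));
  }

  public static void main(String[] args) {
    // RS(3,2): three decommissioning indexes, mask only two (the parity count).
    System.out.println(indexesToMask(List.of(1, 3, 4), 2)); // [1, 3]
  }
}
```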

There should already be a global concurrency limit in RM, so a new one should not be needed.

I also agree with @adoroszlai that this change should be at least 2 or 3 separate changes.

@errose28
Contributor

Should we review the design doc in #10086 before returning to this change, or are they meant to be reviewed in parallel?

@adoroszlai
Contributor

Should we review the design doc in #10086 before returning to this change, or are they meant to be reviewed in parallel?

This change cannot be reviewed due to severely outdated dev branch, regardless of the design doc.

