WIP. HDDS-15014. Speed up EC container decommission#10082

Draft
jojochuang wants to merge 2 commits into apache:master from jojochuang:high-throughput-ec-decom

Conversation

@jojochuang
Contributor

What changes were proposed in this pull request?

HDDS-15014. Speed up EC container decommission

Please describe your PR in detail:

1. SCM Replication Manager Enhancements:
    - Threshold-Based Switching: EC containers on decommissioning nodes now automatically switch from simple replication to reconstruction if the source node's replication load exceeds a configurable threshold (hdds.scm.replication.decommission.ec.reconstruction.threshold, default: 5).
    - Feature Flag: Added hdds.scm.replication.decommission.ec.reconstruction.enabled (default: true) to explicitly toggle this behavior.
    - Global Concurrency Limit: Implemented a cluster-wide cap (hdds.scm.replication.decommission.concurrency, default: 100) to prevent cluster-wide performance degradation during large-scale decommissions.
    - Observability: Added the InflightDecommission gauge to ReplicationManagerMetrics to track simultaneous decommissioning tasks.

2. Datanode Scheduling Enhancements:
    - Disk-Aware Volume Selection: The ReplicationSupervisor now tracks in-flight tasks per physical disk volume.
    - Non-Busy Disk Prioritization: The ContainerImporter and DownloadAndImportReplicator were updated to prioritize target volumes with zero in-flight tasks, ensuring better I/O distribution and avoiding disk bottlenecks.

3. Stability and Infrastructure:
    - Bug Fix: Fixed a series of Hugo documentation build errors caused by invalid date formats in markdown front matter.
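The threshold-based switching in (1) can be sketched as below. This is an illustrative sketch only: the class, enum, and method names are hypothetical, not the PR's actual code; only the two configuration keys and their defaults come from the description above.

```java
// Hypothetical sketch of threshold-based switching; names are illustrative.
public class DecommissionStrategySketch {
  // hdds.scm.replication.decommission.ec.reconstruction.threshold (default: 5)
  static final int RECONSTRUCTION_THRESHOLD = 5;
  // hdds.scm.replication.decommission.ec.reconstruction.enabled (default: true)
  static final boolean RECONSTRUCTION_ENABLED = true;

  enum Strategy { SIMPLE_COPY, RECONSTRUCTION }

  /** Decide how to re-replicate an EC replica held on a decommissioning node. */
  static Strategy chooseStrategy(int inflightReplicationsOnSource) {
    if (RECONSTRUCTION_ENABLED
        && inflightReplicationsOnSource > RECONSTRUCTION_THRESHOLD) {
      // Source node's replication load exceeds the threshold:
      // rebuild the index from the other replicas instead.
      return Strategy.RECONSTRUCTION;
    }
    // Cheap path: copy the replica straight off the decommissioning node.
    return Strategy.SIMPLE_COPY;
  }

  public static void main(String[] args) {
    System.out.println(chooseStrategy(2)); // SIMPLE_COPY
    System.out.println(chooseStrategy(9)); // RECONSTRUCTION
  }
}
```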
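The "non-busy disk prioritization" in (2) amounts to a least-loaded volume choice. A minimal sketch, assuming the supervisor exposes an in-flight-task count per volume as described above (the method and class names here are hypothetical):

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of disk-aware volume selection; not the actual PR code.
public class VolumeChoiceSketch {
  /** Prefer a target volume with zero in-flight tasks, else the least busy. */
  static String pickVolume(List<String> volumes, Map<String, Integer> inflight) {
    return volumes.stream()
        .min(Comparator.comparingInt((String v) -> inflight.getOrDefault(v, 0)))
        .orElseThrow(() -> new IllegalArgumentException("no volumes"));
  }

  public static void main(String[] args) {
    Map<String, Integer> load = Map.of("/data1", 3, "/data2", 0, "/data3", 1);
    // "/data2" wins because it has zero in-flight tasks.
    System.out.println(pickVolume(List.of("/data1", "/data2", "/data3"), load));
  }
}
```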

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-15014

How was this patch tested?

@adoroszlai
Contributor

Thanks @jojochuang for working on this.

  • The branch is outdated (2786 commits behind master) and there are conflicts.
  • Based on PR description, I think the set of changes should be split, at least (1) and (2).
  • (3) does not seem to be valid (was already fixed in HDDS-11570).


/* If we get here, the scenario is:
1. Under replicated.
2. Not over replicated.
Contributor


Why are all these comments getting removed?

@jojochuang jojochuang changed the title HDDS-15014. Speed up EC container decommission WIP. HDDS-15014. Speed up EC container decommission Apr 17, 2026
@sodonnel
Contributor

sodonnel commented Apr 17, 2026

There are a few things to think about in this change.

First, for a 3-2 container, if there are 3 nodes being decommissioned, reconstruction cannot recover all the decommissioning copies in a single task. Ideally there would be one task scheduled to reconstruct all 3 at once, but with 3 out that won't work, so it will have to pick just 2 to recover and leave the other one for now. Or it could schedule two tasks at the same time for the container.
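On one reading of the constraint above, a single EC reconstruction task can rebuild at most `parity` indexes, so recovering more decommissioning replicas than that requires multiple tasks (or a later pass). A small sketch of that arithmetic (the class and method names are hypothetical):

```java
// Sketch of the scheduling limit above: at most `parity` indexes per
// reconstruction task, so extra indexes need extra tasks or a later pass.
public class ReconstructionBatchSketch {
  static int tasksNeeded(int decommissioningIndexes, int parity) {
    return (decommissioningIndexes + parity - 1) / parity; // ceiling division
  }

  public static void main(String[] args) {
    // RS(3,2): three decommissioning indexes, at most two per task -> 2 tasks.
    System.out.println(tasksNeeded(3, 2));
  }
}
```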

I know that within Cloudera we had a case where a cluster was showing slow decommission progress on EC containers, but I am not sure whether we investigated if the datanode was being kept sufficiently busy. I.e., did it have enough replications scheduled that its queue did not run empty between heartbeats, and enough worker threads to keep all the disks busy? I think that is important to understand before adding this complexity; and if the DNs are not being kept busy enough, reconstruction commands will hit the same problem.

It is also important from an overall cluster load perspective, especially with large EC schemes (6-3, 10-4), that as much as possible the replications are simple copies from the decommissioning host. The overhead of reconstructing a container vs. a simple copy is 3x the data pulled over the network for 3-2; for 6-3 it's 6x, and for 10-4 it's 10x, and all that traffic will be cross-rack as we try to spread the replicas across as many racks as possible.
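The multipliers above follow directly from the scheme: a plain copy moves one block's worth of data, while reconstructing one index must pull `data` blocks from peers. Worked numbers, as a trivial sketch:

```java
// Worked numbers for the network-cost comparison above.
public class NetworkCostSketch {
  static int reconstructionOverheadFactor(int dataBlocks) {
    // Blocks read over the network per rebuilt index, vs. 1 for a plain copy.
    return dataBlocks;
  }

  public static void main(String[] args) {
    System.out.println(reconstructionOverheadFactor(3));  // 3-2  -> 3x
    System.out.println(reconstructionOverheadFactor(6));  // 6-3  -> 6x
    System.out.println(reconstructionOverheadFactor(10)); // 10-4 -> 10x
  }
}
```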

At the moment, the logic in the ECUnderReplicationHandler deals with missing indexes, then decommissioning indexes. If it gets a "node over utilised" exception while dealing with the decommissioning indexes, the simplest change is probably to mask off the decommissioning indexes up to the parity number, add them to an excluded list, and then call processingMissingIndexes again up to some limit.
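The masking step suggested above could look roughly like this; a minimal sketch, assuming the handler has the list of decommissioning indexes in hand (the class and method names here are hypothetical, not the real handler API):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of "mask off decommissioning indexes up to parity".
public class MaskDecommissionSketch {
  /**
   * Treat up to `parity` decommissioning indexes as missing so the existing
   * missing-index reconstruction path can handle them; the rest are left
   * for a later pass.
   */
  static List<Integer> indexesToMask(List<Integer> decommissioningIndexes,
      int parity) {
    int n = Math.min(parity, decommissioningIndexes.size());
    return new ArrayList<>(decommissioningIndexes.subList(0, n));
  }

  public static void main(String[] args) {
    // RS(3,2): three decommissioning indexes, mask only two (the parity count).
    System.out.println(indexesToMask(List.of(1, 3, 4), 2)); // [1, 3]
  }
}
```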

There should already be a global concurrency limit in RM, so a new one should not be needed.

I also agree with @adoroszlai that this change should be at least 2 or 3 separate changes.

@errose28
Contributor

Should we review the design doc in #10086 before returning to this change, or are they meant to be reviewed in parallel?

@adoroszlai
Contributor

Should we review the design doc in #10086 before returning to this change, or are they meant to be reviewed in parallel?

This change cannot be reviewed due to severely outdated dev branch, regardless of the design doc.

