Skip to content

HDDS-15034. Query SCM status for ozone admin upgrade status command#10084

Open
dombizita wants to merge 3 commits intoapache:HDDS-14496-zdufrom
dombizita:HDDS-15034
Open

HDDS-15034. Query SCM status for ozone admin upgrade status command#10084
dombizita wants to merge 3 commits intoapache:HDDS-14496-zdufrom
dombizita:HDDS-15034

Conversation

@dombizita
Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

After #10011 is merged the hardcoded placeholder responses can be removed and connect it to SCM for real values. Based on @errose28's suggestion I used HDDSLayoutVersionManager to check the finalization status of SCM and added a new counter to SCMNodeManager to keep track of the number of DNs finalized and used that for the ozone admin upgrade status output.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-15034

How was this patch tested?

Added tests, green CI on my fork: https://github.com/dombizita/ozone/actions/runs/24517013218

@dombizita dombizita requested review from errose28 and sodonnel April 17, 2026 09:13
}
}

private static boolean isDatanodeFinalized(LayoutVersionProto layoutVersion) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could make this public static and then re-use it in the NodeManager.java class too.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good, changed it accordingly.

@github-actions github-actions bot added the zdu Pull requests for Zero Downtime Upgrade (ZDU) https://issues.apache.org/jira/browse/HDDS-14496 label Apr 17, 2026
|| UpgradeFinalization.isDone(scmUpgradeStatus);
}

static boolean shouldFinalize(UpgradeFinalization.Status scmUpgradeStatus) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure about this - the shouldFinalize boolean is to instruct other components to finalize when SCM AND the datanodes are done finalizing. So this should be true only if SCM is finalized and all DNs are finalized. The idea is that OM will poll this when OM is unfinalized and it will automatically finalize when this goes true. So I think we need to check num_dns == finalizated_dns && scm == finalized

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohh, this makes sense, that if OM will poll this both of these conditions should be fulfilled. I added this check and modified the tests.

testProcessLayoutVersionReportHigherMlv();
}

@Test
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need a test for registration when a DN is finalized on registration, and not finalized on registration and when the remove_node is called too. This test covers the heartbeat part I think

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, added tests for this scenario.

@dombizita
Copy link
Copy Markdown
Contributor Author

Thank you for the review @sodonnel, addressed your comments in the latest commit.

.build();
UpgradeFinalization.Status scmUpgradeStatus =
scm.getLayoutVersionManager().getUpgradeState();
int totalDatanodes = scm.getScmNodeManager().getAllNodeCount();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realised that nodeManager probably remembers any dead or decommissioned nodes and they will never finalize. We actually want the count of all IN_SERVICE and HEALTHY nodes as those are the ones that should finalize or end up going dead as they are not responding. So I think we need to use this API:

public int getNodeCount(NodeStatus status) {
    return countNodes(matching(status));
  }

Where the NodeStatus is IN_SERVICE + HEALTHY.

* metadata/software comparison.
*/
default int getNumDatanodesFinalized() {
return (int) getAllNodes().stream()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be the IN_SERVICE + HEALTHY node, rather than all. I guess it might be OK iterating all nodes, but there is a chance that there is some DEAD node that was finalized, and then in the other place where we get the total datanode count it could make the counts match when they should not. Eg we have a finalized decommissioned node, and an unfinalized in_service node and the decommissioned one counts +1 when it should not.

Really what we want to ensure is that all nodes that can finalize, are finalized. Nodes unable to finalize, by being dead should not block finalization.

I think a dead node has to be registered when it comes back alive, at least if the DN process is restarted. If the DN process is just hung for some time preventing the heart beats, the node can be moved back to IN_SERVICE.

I wonder if such a node should be forced to finalize before it rejoins - certainly on registration we can make a node finalize. I think this is out of scope for this PR, but we need to think about it as another change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

zdu Pull requests for Zero Downtime Upgrade (ZDU) https://issues.apache.org/jira/browse/HDDS-14496

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants