HDDS-15034. Query SCM status for ozone admin upgrade status command#10084
HDDS-15034. Query SCM status for ozone admin upgrade status command#10084dombizita wants to merge 3 commits intoapache:HDDS-14496-zdufrom
Conversation
| } | ||
| } | ||
|
|
||
| private static boolean isDatanodeFinalized(LayoutVersionProto layoutVersion) { |
There was a problem hiding this comment.
You could make this public static and then re-use it in the NodeManager.java class too.
There was a problem hiding this comment.
Sounds good, changed it accordingly.
| || UpgradeFinalization.isDone(scmUpgradeStatus); | ||
| } | ||
|
|
||
| static boolean shouldFinalize(UpgradeFinalization.Status scmUpgradeStatus) { |
There was a problem hiding this comment.
I am not sure about this - the shouldFinalize boolean is to instruct other components to finalize when SCM AND the datanodes are done finalizing. So this should be true only if SCM is finalized and all DNs are finalized. The idea is that OM will poll this when OM is unfinalized and it will automatically finalize when this goes true. So I think we need to check num_dns == finalizated_dns && scm == finalized
There was a problem hiding this comment.
Ohh, this makes sense, that if OM will poll this both of these conditions should be fulfilled. I added this check and modified the tests.
| testProcessLayoutVersionReportHigherMlv(); | ||
| } | ||
|
|
||
| @Test |
There was a problem hiding this comment.
I think we need a test for registration when a DN is finalized on registration, and not finalized on registration and when the remove_node is called too. This test covers the heartbeat part I think
There was a problem hiding this comment.
Thanks, added tests for this scenario.
|
Thank you for the review @sodonnel, addressed your comments in the latest commit. |
| .build(); | ||
| UpgradeFinalization.Status scmUpgradeStatus = | ||
| scm.getLayoutVersionManager().getUpgradeState(); | ||
| int totalDatanodes = scm.getScmNodeManager().getAllNodeCount(); |
There was a problem hiding this comment.
I just realised that nodeManager probably remembers any dead or decommissioned nodes and they will never finalize. We actually want the count of all IN_SERVICE and HEALTHY nodes as those are the ones that should finalize or end up going dead as they are not responding. So I think we need to use this API:
public int getNodeCount(NodeStatus status) {
return countNodes(matching(status));
}
Where the NodeStatus is IN_SERVICE + HEALTHY.
| * metadata/software comparison. | ||
| */ | ||
| default int getNumDatanodesFinalized() { | ||
| return (int) getAllNodes().stream() |
There was a problem hiding this comment.
I think this should be the IN_SERVICE + HEALTHY node, rather than all. I guess it might be OK iterating all nodes, but there is a chance that there is some DEAD node that was finalized, and then in the other place where we get the total datanode count it could make the counts match when they should not. Eg we have a finalized decommissioned node, and an unfinalized in_service node and the decommissioned one counts +1 when it should not.
Really what we want to ensure is that all nodes that can finalize, are finalized. Nodes unable to finalize, by being dead should not block finalization.
I think a dead node has to be registered when it comes back alive, at least if the DN process is restarted. If the DN process is just hung for some time preventing the heart beats, the node can be moved back to IN_SERVICE.
I wonder if such a node should be forced to finalize before it rejoins - certainly on registration we can make a node finalize. I think this is out of scope for this PR, but we need to think about it as another change.
What changes were proposed in this pull request?
After #10011 is merged the hardcoded placeholder responses can be removed and connect it to SCM for real values. Based on @errose28's suggestion I used HDDSLayoutVersionManager to check the finalization status of SCM and added a new counter to SCMNodeManager to keep track of the number of DNs finalized and used that for the
ozone admin upgrade statusoutput.What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-15034
How was this patch tested?
Added tests, green CI on my fork: https://github.com/dombizita/ozone/actions/runs/24517013218