DAOS-19008 control: Erase formatting after failed format --replace#18446
DAOS-19008 control: Erase formatting after failed format --replace#18446tanabarr wants to merge 6 commits into
Conversation
|
Ticket title is 'Aurora daos_user: PMEM Device should Unmount and revert the --replace operation fully if it fails' |
|
Test stage NLT completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18446/1/testReport/ |
|
Test stage Functional on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18446/1/execution/node/983/log |
| cmd.Debugf("Invoking SystemErase to clean up after failed format operation") | ||
|
|
||
| eraseReq := &control.SystemEraseReq{} | ||
| eraseResp, err := control.SystemErase(ctx, cmd.ctlInvoker, eraseReq) |
There was a problem hiding this comment.
I don't think this will work... SystemErase doesn't allow you to choose ranks or nodes.
I think you'll need to handle this from the daos_server that owns the engine. If the engine fails to join, and it's a replace operation, blow the storage away. The failure that triggered this request was happening at the join stage.
If the format itself fails, I don't think there's any risk of the engine coming up. If there's a partial failure, it's not a bad idea to clean up, but I think that would have to happen from the server side, too.
There was a problem hiding this comment.
right , this needs a rework
d8cf409 to
6785032
Compare
|
Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18446/4/testReport/ |
|
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18446/4/execution/node/1394/log |
6785032 to
309615f
Compare
|
Test stage NLT completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18446/5/testReport/ |
309615f to
a593fcb
Compare
|
Test stage NLT completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18446/6/testReport/ |
21e8fb7 to
862f9d5
Compare
|
Test stage Unit Test completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18446/8/testReport/ |
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
862f9d5 to
550d853
Compare
| } | ||
| if len(storageCfg.Tiers) != 0 { | ||
| if err := ei.WriteSuperblock(); err != nil { | ||
| //storage.FormatControlMetadata([]uint{0}); err != nil { |
There was a problem hiding this comment.
commented code removed in follow-up PR #18127
There was a problem hiding this comment.
removed in this PR because of repush
mjmac
left a comment
There was a problem hiding this comment.
Looks generally fine to me... Only potential blocker is the removal of the guard against allowing an AdminExcluded rank to join. Not sure if that was intentional as a convenience or an oversight, but IMO it should be a requirement for the admin to explicitly unexclude a rank in order to avoid accidents.
| t.Error("unexpected unmount call") | ||
| } | ||
|
|
||
| if !tc.expSBCreated { |
There was a problem hiding this comment.
There is no test case that sets this to be true, so the remainder of the test is dead code. Maybe copy/paste from another test?
There was a problem hiding this comment.
removed the test cases that used it
| if cm.State == MemberStateAdminExcluded { | ||
| return nil, ErrJoinAdminExcluded(cm.UUID, cm.Rank) | ||
| } |
There was a problem hiding this comment.
Was this intentional? An admin-excluded rank should have to be explicitly admin-unexcluded before it can rejoin, no?
There was a problem hiding this comment.
This is only allowing during dmg storage format --replace (in joinReplace()), this change is the missing piece from https://daosio.atlassian.net/browse/DAOS-18472 which unfortunately only contained a partial fix because I had the rest (the bit you are referring to) in a working directory which didn't get committed! The change will not affect the normal join case.
There was a problem hiding this comment.
Hmm. I read over the various tickets. I understand the intent behind this change, which is to interpret the admin's action of running a storage format --replace as an implicit clearing of the AdminExcluded state on a rank. I agree that usually this will be correct, and that it seems like a nice convenience.
However, I am concerned that this policy change weakens AdminExcluded from being a state with clearly-defined entry and exit criteria to being a bit more wishy-washy. It sets a precedent that the state can be automatically cleared based on criteria that may not be obvious without reading code, and in my mind this erodes the value of having an AdminExcluded state in addition to the programmatically-toggled Excluded state.
The correct response here, IMO, is not to weaken AdminExcluded in order to work around an admin error or UX deficiency, it's to figure out how to make the correct order of operations (admin excludes rank, repairs rank, unexcludes rank, rejoins rank) clearer to the admin.
If there is consensus that there is not strong value in having an explicit AdminExcluded state, then IMO it would be better to just remove it.
There was a problem hiding this comment.
I don't agree that this weakens the position of the state but given I'm unlikely to get a consensus I will revert this change. You are correct that this was a workaround for admin error scenarios but in my mind it strengthens the UX rather than weakening it. Check replaced.
| storageProv := ei.GetStorage() | ||
|
|
||
| // Get SCM config to access mount point and class | ||
| scmCfg, err := storageProv.GetScmConfig() |
There was a problem hiding this comment.
scmCfg isn't really used for anything here, as far as I can tell. GetScmConfig() will return a non-nil error if there's no SCM config (or at least it should), so checking for a nil scmCfg seems like an anti pattern.
There was a problem hiding this comment.
cruft removed from stale implementation
| "github.com/daos-stack/daos/src/control/lib/ranklist" | ||
| "github.com/daos-stack/daos/src/control/logging" | ||
| "github.com/daos-stack/daos/src/control/provider/system" | ||
| sysprov "github.com/daos-stack/daos/src/control/provider/system" |
There was a problem hiding this comment.
This is a duplicate import of the provider/system package.
There was a problem hiding this comment.
removed "system" import because of ambiguity with the package of the same name
Features: control Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
…treplace-rank-erase Features: control Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Features: control Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
There was a problem hiding this comment.
LGTM. I'm open to the idea of having a discussion about implicit clear of AdminExcluded, or the removal of the state entirely if we don't think it's actually useful. But I think that should be a separate conversation and patch rather than sneaking it in as part of something else.
Prevent a scenario where an engine may rejoin with an unintended new
rank following a failed
dmg storage format --replaceoperation.If the join-replace fails, the storage remains formatted and the
engine is left in a state where it is treated as a fresh instance and
will join the system with a new rank ID when restarted. Avoid this
inadvertent creation of a new rank by removing the engine's superblock
immediately after join-replace failure. Then the engine will require
an explicit format command before being able to join the system.
Also includes cleanup and test improvements:
AdminExcluded state
Steps for the author:
After all prior steps are complete: