Skip to content

DAOS-19008 control: Erase formatting after failed format --replace#18446

Open
tanabarr wants to merge 6 commits into
masterfrom
tanabarr/control-fmtreplace-rank-erase
Open

DAOS-19008 control: Erase formatting after failed format --replace#18446
tanabarr wants to merge 6 commits into
masterfrom
tanabarr/control-fmtreplace-rank-erase

Conversation

@tanabarr

@tanabarr tanabarr commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Prevent a scenario where an engine may rejoin with an unintended new
rank following a failed dmg storage format --replace operation.

If the join-replace fails, the storage remains formatted and the
engine is left in a state where it is treated as a fresh instance and
will join the system with a new rank ID when restarted. Avoid this
inadvertent creation of a new rank by removing the engine's superblock
immediately after join-replace failure. Then the engine will require
an explicit format command before being able to join the system.

Also includes cleanup and test improvements:

  • Simplify and deduplicate server unit test setup code
  • Remove restriction blocking join-replace when rank is in
    AdminExcluded state

Steps for the author:

  • Commit message follows the guidelines.
  • Appropriate Features or Test-tag pragmas were used.
  • Appropriate Functional Test Stages were run.
  • At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • Gatekeeper requested (daos-gatekeeper added as a reviewer).

@tanabarr tanabarr requested review from a team as code owners June 5, 2026 14:11
@github-actions

github-actions Bot commented Jun 5, 2026

Copy link
Copy Markdown

Ticket title is 'Aurora daos_user: PMEM Device should Unmount and revert the --replace operation fully if it fails'
Status is 'In Review'
Labels: 'ALCF'
https://daosio.atlassian.net/browse/DAOS-19008

@daosbuild3

Copy link
Copy Markdown
Collaborator

@daosbuild3

Copy link
Copy Markdown
Collaborator

Comment thread src/control/cmd/dmg/storage.go Outdated
cmd.Debugf("Invoking SystemErase to clean up after failed format operation")

eraseReq := &control.SystemEraseReq{}
eraseResp, err := control.SystemErase(ctx, cmd.ctlInvoker, eraseReq)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this will work... SystemErase doesn't allow you to choose ranks or nodes.

I think you'll need to handle this from the daos_server that owns the engine. If the engine fails to join, and it's a replace operation, blow the storage away. The failure that triggered this request was happening at the join stage.

If the format itself fails, I don't think there's any risk of the engine coming up. If there's a partial failure, it's not a bad idea to clean up, but I think that would have to happen from the server side, too.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right , this needs a rework

@tanabarr tanabarr force-pushed the tanabarr/control-fmtreplace-rank-erase branch 3 times, most recently from d8cf409 to 6785032 Compare June 8, 2026 20:46
@daosbuild3

Copy link
Copy Markdown
Collaborator

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-18446/4/testReport/

@daosbuild3

Copy link
Copy Markdown
Collaborator

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-18446/4/execution/node/1394/log

@tanabarr tanabarr force-pushed the tanabarr/control-fmtreplace-rank-erase branch from 6785032 to 309615f Compare June 9, 2026 11:34
@daosbuild3

Copy link
Copy Markdown
Collaborator

@tanabarr tanabarr force-pushed the tanabarr/control-fmtreplace-rank-erase branch from 309615f to a593fcb Compare June 9, 2026 13:55
@daosbuild3

Copy link
Copy Markdown
Collaborator

@tanabarr tanabarr changed the title Erase formatting after failed format --replace DAOS-19008 control: Erase formatting after failed format --replace Jun 9, 2026
@tanabarr tanabarr force-pushed the tanabarr/control-fmtreplace-rank-erase branch 2 times, most recently from 21e8fb7 to 862f9d5 Compare June 9, 2026 15:57
@daosbuild3

Copy link
Copy Markdown
Collaborator

tanabarr added 3 commits June 10, 2026 12:25
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@tanabarr tanabarr force-pushed the tanabarr/control-fmtreplace-rank-erase branch from 862f9d5 to 550d853 Compare June 12, 2026 20:06
}
if len(storageCfg.Tiers) != 0 {
if err := ei.WriteSuperblock(); err != nil {
//storage.FormatControlMetadata([]uint{0}); err != nil {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

commented code removed in follow-up PR #18127

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed in this PR because of repush

@tanabarr tanabarr requested review from kjacque, knard38 and mjmac June 15, 2026 11:15

@mjmac mjmac left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks generally fine to me... Only potential blocker is the removal of the guard against allowing an AdminExcluded rank to join. Not sure if that was intentional as a convenience or an oversight, but IMO it should be a requirement for the admin to explicitly unexclude a rank in order to avoid accidents.

t.Error("unexpected unmount call")
}

if !tc.expSBCreated {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no test case that sets this to be true, so the remainder of the test is dead code. Maybe copy/paste from another test?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed the test cases that used it

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines -201 to -203
if cm.State == MemberStateAdminExcluded {
return nil, ErrJoinAdminExcluded(cm.UUID, cm.Rank)
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this intentional? An admin-excluded rank should have to be explicitly admin-unexcluded before it can rejoin, no?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only allowing during dmg storage format --replace (in joinReplace()), this change is the missing piece from https://daosio.atlassian.net/browse/DAOS-18472 which unfortunately only contained a partial fix because I had the rest (the bit you are referring to) in a working directory which didn't get committed! The change will not affect the normal join case.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. I read over the various tickets. I understand the intent behind this change, which is to interpret the admin's action of running a storage format --replace as an implicit clearing of the AdminExcluded state on a rank. I agree that usually this will be correct, and that it seems like a nice convenience.

However, I am concerned that this policy change weakens AdminExcluded from being a state with clearly-defined entry and exit criteria to being a bit more wishy-washy. It sets a precedent that the state can be automatically cleared based on criteria that may not be obvious without reading code, and in my mind this erodes the value of having an AdminExcluded state in addition to the programmatically-toggled Excluded state.

The correct response here, IMO, is not to weaken AdminExcluded in order to work around an admin error or UX deficiency, it's to figure out how to make the correct order of operations (admin excludes rank, repairs rank, unexcludes rank, rejoins rank) clearer to the admin.

If there is consensus that there is not strong value in having an explicit AdminExcluded state, then IMO it would be better to just remove it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't agree that this weakens the position of the state but given I'm unlikely to get a consensus I will revert this change. You are correct that this was a workaround for admin error scenarios but in my mind it strengthens the UX rather than weakening it. Check replaced.

Comment thread src/control/server/instance_storage.go Outdated
storageProv := ei.GetStorage()

// Get SCM config to access mount point and class
scmCfg, err := storageProv.GetScmConfig()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scmCfg isn't really used for anything here, as far as I can tell. GetScmConfig() will return a non-nil error if there's no SCM config (or at least it should), so checking for a nil scmCfg seems like an anti pattern.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cruft removed from stale implementation

"github.com/daos-stack/daos/src/control/lib/ranklist"
"github.com/daos-stack/daos/src/control/logging"
"github.com/daos-stack/daos/src/control/provider/system"
sysprov "github.com/daos-stack/daos/src/control/provider/system"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a duplicate import of the provider/system package.

@tanabarr tanabarr Jun 16, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed "system" import because of ambiguity with the package of the same name

tanabarr added 2 commits June 16, 2026 14:07
Features: control
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
…treplace-rank-erase

Features: control
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
Features: control
Signed-off-by: Tom Nabarro <thomas.nabarro@hpe.com>
@tanabarr tanabarr requested a review from mjmac June 16, 2026 15:07

@mjmac mjmac left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I'm open to the idea of having a discussion about implicit clear of AdminExcluded, or the removal of the state entirely if we don't think it's actually useful. But I think that should be a separate conversation and patch rather than sneaking it in as part of something else.

@tanabarr tanabarr self-assigned this Jun 17, 2026
@tanabarr tanabarr added the control-plane work on the management infrastructure of the DAOS Control Plane label Jun 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

control-plane work on the management infrastructure of the DAOS Control Plane

Development

Successfully merging this pull request may close these issues.

4 participants