Skip to content

HDDS-14665. Add upgrade handling to multipart requests#10062

Open
spacemonkd wants to merge 11 commits into
apache:masterfrom
spacemonkd:HDDS-14665
Open

HDDS-14665. Add upgrade handling to multipart requests#10062
spacemonkd wants to merge 11 commits into
apache:masterfrom
spacemonkd:HDDS-14665

Conversation

@spacemonkd

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

HDDS-14665. Add upgrade handling to multipart requests

Please describe your PR in detail:

  • This PR adds a new layout version to define the split table behaviour for Multipart Uploads
  • It also adds new checks which fails early in case the finalization is not done and the writes try to use the new split table

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-14665

How was this patch tested?

Patch was tested using unit tests

@spacemonkd spacemonkd self-assigned this Apr 9, 2026
@spacemonkd spacemonkd added the s3 S3 Gateway label Apr 9, 2026
@spacemonkd spacemonkd marked this pull request as ready for review April 17, 2026 06:04
@spacemonkd spacemonkd marked this pull request as draft April 28, 2026 08:41
@kerneltime

Copy link
Copy Markdown
Contributor

Review

Thanks for landing the upgrade gate ahead of the parts-table-split write path — gating first is the right sequencing. The layout-feature wiring looks correct (no upgrade action / MLV constant needed for a behavior-only feature, and TestOMVersionManager's contiguity check still passes), and the gate's pre-finalization determinism holds: MLV is advanced only by the Ratis-logged OMFinalizeUpgradeRequest, so isAllowed resolves identically across replicas at a given apply index.

I think there's one blocker, and it's not in the gate itself — it's in the CommitPart null-check reorder that came along with it.

Blocker — CommitPart: moving the multipartKeyInfo == null check above getOmKeyInfo() crashes the OM on the commit-after-abort path

To read multipartKeyInfo.getSchemaVersion() safely, the multipartKeyInfo == null -> NO_SUCH_MULTIPART_UPLOAD_ERROR check was moved up to right after getMultipartInfoTable().get(), ahead of omKeyInfo = getOmKeyInfo(...). But the NO_SUCH path is special: S3MultipartUploadCommitPartResponse.checkAndUpdateDB() overrides the flush to GC the orphaned open part and unconditionally dereferences openPartKeyInfoToBeDeleted (the omKeyInfo handed in by the request) — S3MultipartUploadCommitPartResponse.java:88-97 (.getUpdateID(), .getObjectID()).

  • On master, omKeyInfo is loaded and null-checked (-> KEY_NOT_FOUND) before the NO_SUCH throw, so a NO_SUCH response always carried a non-null part to delete.
  • With this PR, NO_SUCH is thrown while omKeyInfo is still null; the catch passes that null into the response. Error responses are enqueued unconditionally (RequestHandler.handleWriteRequest), and OzoneManagerDoubleBuffer calls checkAndUpdateDB for them and terminate()s the OM on any Throwable. So the flush NPEs -> ExitUtils.terminate -> OM exits, and followers re-applying the same entry exit identically.

This is exactly the scenario the GC branch exists for — a part commit arriving after the upload was aborted. testValidateAndUpdateCacheMultipartNotFound already sets up the precondition (random uploadId + addKeyToOpenKeyTable); it stays green only because it asserts the response status and never flushes the double buffer, so checkAndUpdateDB is never run in the unit test. It also affects legacy (schemaVersion 0) uploads, so it's independent of the new feature.

Suggested fix: leave the multipartKeyInfo == null check at its original position (after omKeyInfo is resolved) and put the new gate immediately after it — the gate only needs multipartKeyInfo non-null, which the original ordering already guaranteed there. A regression test that actually flushes the double buffer (or a MiniOzoneCluster commit-after-abort case) would lock it down.

Concerns

  • The allowed (post-finalization) branch is never tested. The base test harness mocks OMLayoutVersionManager, so isAllowed(MPU_PARTS_TABLE_SPLIT) is always false (and MPU_PARTS_TABLE_SPLIT appears in no test). The predicate !isAllowed && schemaVersion != 0 is only ever driven from the left operand — a future inversion (dropped !, wrong feature) would still pass CI. Please add an isAllowed -> true case per request (Abort can assert full success; for Commit/Complete at least assertNotEquals(NOT_SUPPORTED_OPERATION_PRIOR_FINALIZATION, ...)).
  • Unrelated change bundled in. The existingKeyInfo swap to getOmKeyInfoFromKeyTable(...) in S3MultipartUploadCompleteRequest is behaviorally inert today (base impl equals the old getKeyTable(...).get(dbOzoneKey); the FSO override only adds setKeyName, which validateAtomicRewrite/validateIfMatchETag don't read) and is unrelated to finalization — and line 333 still uses the raw lookup for the same key. Suggest dropping it here or splitting into its own JIRA.
  • In-apply isAllowed read departs from the established pattern. Every other layout gate — including the EC CLUSTER_NEEDS_FINALIZATION validators in these same two files — runs as a leader-only PRE_PROCESS @RequestFeatureValidator. This is the first to read isAllowed inside the replicated validateAndUpdateCache path, which the method's Javadoc cautions against. It's safe today (MLV advances synchronously via the logged finalize), but that relies on the finalization executor staying synchronous (it's pluggable). Worth considering gating on the request-stamped LayoutVersion from preExecute, or at least a comment documenting the determinism assumption.
  • Dropped FSO coverage. Rewriting testValidateAndUpdateCacheKeyNotFound to initiate a real MPU removes the FSO DIRECTORY_NOT_FOUND branch the old assertion covered; that path is now unasserted in the commit-part suite. Could be restored as a small separate test.

Probably for the write-side PR (flagging so they aren't silent regressions later)

Not reachable until a sibling PR writes schemaVersion != 0, so not blockers here:

  • KeyManagerImpl.listParts reads the inline getPartKeyInfoMap() (empty for split entries) -> would return 200 with zero parts.
  • The background S3ExpiredMultipartUploadsAbortRequest isn't gated and derives quota from the empty inline map -> on a downgraded cluster it would delete the row releasing 0 quota while user-facing Abort is gated off.
  • NOT_SUPPORTED_OPERATION_PRIOR_FINALIZATION has no case in S3ErrorTable.translateResultCode -> falls to default -> HTTP 500 to S3 clients.

Nits

  • Complete gate message reads "...for commit part request." — should be the complete path.
  • Double blank line after the gate block in S3MultipartUploadAbortRequest.
  • schemaVersion is uint32 on the wire but narrowed to byte on load; gate tests != 0 while the model special-cases == 1. Not reachable today, but a multiple-of-256 value would truncate to 0 (fail-open) — worth validating to {0,1} at the boundary.

@spacemonkd spacemonkd marked this pull request as ready for review June 5, 2026 10:22
@spacemonkd

Copy link
Copy Markdown
Contributor Author

@kerneltime @ivandika3 it would be great if you could take a look

@spacemonkd spacemonkd requested a review from kerneltime June 5, 2026 20:07

@errose28 errose28 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this. Not sure if you've had a chance to address the AI comments above as well.

Looks like nothing is setting schema version right now which makes it difficult to review this end to end. Can we add that to this PR?

Comment on lines +171 to +177
if (!ozoneManager.getVersionManager()
.isAllowed(OMLayoutFeature.MPU_PARTS_TABLE_SPLIT)
&& multipartKeyInfo.getSchemaVersion() != 0) {
throw new OMException("MPU parts-table split behavior is not allowed " +
"before cluster finalization.",
OMException.ResultCodes.NOT_SUPPORTED_OPERATION_PRIOR_FINALIZATION);
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like it should be a request pre-processor like this. We will be revamping how these pre and post processors are constructed as part of ZDU but we will handle that migration on our feature branch.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check depends on multipartKeyInfo.getSchemaVersion() which is fetched from OM metadata for the upload row.
The validator context right now doesn't have this information exposed in PRE_PROCESS, so the pre-processor isn't implemented as you had provided in the example.

I did change it to use the layout version stamped by the leader in preExecute, so that way it will be more akin to the behaviour of a pre-processor i.e. more deterministic

throw new IllegalArgumentException("Unsupported schemaVersion: "
+ schemaVersion + ". Expected one of [0, 1].");
}
return (byte) schemaVersion;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason why we still keep SchemaVersion as a byte in JVM? In its proto definition schemaVersion is optional uint32 schemaVersion = 10;.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I have reverted to int.
Initially I kept as byte since we only support the values as 0 or 1 so we do not need to allocate memory for an integer, but I guess it is easier to read with it being int.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:-) for only 0 or 1 we can go for one bit.

"before cluster finalization for commit part request.",
OMException.ResultCodes.NOT_SUPPORTED_OPERATION_PRIOR_FINALIZATION);
}
if (multipartKeyInfo.getSchemaVersion() == 1) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying to learn the context:

Do we need this to be

if (ozoneManager.getVersionManager()
.isAllowed(OMLayoutFeature.MPU_PARTS_TABLE_SPLIT) && 
multipartKeyInfo.getSchemaVersion() == 1)

?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I had missed the extra check. However I have modified this now.
Please refer to this for more context.

@spacemonkd spacemonkd requested review from amaliujia and errose28 June 9, 2026 09:18
@spacemonkd

Copy link
Copy Markdown
Contributor Author

Thanks for the reviews @errose28, @kerneltime and @amaliujia.

Looks like nothing is setting schema version right now which makes it difficult to review this end to end.

I can add the implementation to this PR itself, but then the code changes on this would increase.
I plan on adding the last PR to this chain after this one which will introduce the implementation and usage of the schemaVersion.
Would that be okay? If not I can merge those changes as part of this PR @errose28

Could you take another look at the new changes? CI is green.

@errose28

errose28 commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

My understanding is that schema version to use will be assigned by the leader when the request is received based on whether or not the feature is finalized. Since finalization goes through ratis we would not need the request blocking parts of this change. The request processing would just follow whatever schema version it is given.

I do think it would be helpful to have this whole flow in one PR, even if it increases the size.

@spacemonkd

spacemonkd commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

@errose28 I have added the schemaVersion wiring as well.
Could you take another look?

Also:

if (splitPartsFeatureAllowed
          && multipartKeyInfo.getSchemaVersion() == 1) {
        throw new OMException("MPU parts-table split commit path is not " +
            "supported in this write flow.",
            OMException.ResultCodes.NOT_SUPPORTED_OPERATION);
      }

added this block so that the current builds for Ozone doesn't give error as we do not have the implementation for the actual split writing of the table. I will remove this in the follow up PR to this which will introduce the wiring.

I have pinned the schemaVersion to 0 otherwise it was giving errors with implementation missing, but the overall helpers and checks are in place to give a better idea of the end to end flow

@amaliujia could you please take a look as well?

@amaliujia amaliujia left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM

@errose28

Copy link
Copy Markdown
Contributor

Thanks for the update @spacemonkd. We should only be checking MLV to set the schema when the request is first initiated and before it is submitted to Ratis. All subsequent requests should then use the schema version in the proto metadata as the source of truth. This can be done in a pre-processor because it only needs to mutate the incoming proto's schema version based on the current MLV, then all future processing will obey that schema version.

Finalization goes through Ratis so we don't need to worry about MLV diverging between OMs.

@spacemonkd

Copy link
Copy Markdown
Contributor Author

@errose28 are the latest changes now in line with what you meant?
I changed it to have a new check for MPU_SPLIT_TABLE allowed.
Also note that in the new pre-processor check setSchemaVersionOnInitiateMultipartUpload runs on both before and after finalize.

This constant 0 will later on be switched based on the condition whether split table is allowed or not - once we have the implementation ready in the next PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

s3 S3 Gateway

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants