Skip to content

fix: correct stale unix_timestamp NTZ and date_format codegen-default doc text (#4502)#4645

Open
andygrove wants to merge 2 commits into
apache:mainfrom
andygrove:fix/datetime-audit-highprio
Open

fix: correct stale unix_timestamp NTZ and date_format codegen-default doc text (#4502)#4645
andygrove wants to merge 2 commits into
apache:mainfrom
andygrove:fix/datetime-audit-highprio

Conversation

@andygrove

@andygrove andygrove commented Jun 12, 2026

Copy link
Copy Markdown
Member

Which issue does this PR close?

Closes #4502 (date/time expression audit follow-ups). The remaining open work, gating non-default StringTypeWithCollation inputs on Spark 4.0 datetime expressions (item 5), is carved out into #4646.

Rationale for this change

The high-priority items in #4502 break down as follows after the codegen-dispatch migration and spark.comet.exec.scalaUDF.codegen.enabled defaulting to true:

  1. CometUnixTimestamp.getUnsupportedReasons claimed TimestampNTZType is unsupported "because Comet incorrectly applies timezone conversion to TimestampNTZ values". The native unix_timestamp path has a dedicated Timestamp(Microsecond, None) branch that divides raw microseconds by 1_000_000 with no timezone conversion, matching Spark (ToTimestamp.eval does raw micros / MICROS_PER_SECOND for both TimestampType and TimestampNTZType). The predicate (isSupportedInputType) and getSupportLevel already treat TimestampNTZType as supported; only the reason text was out of date.

  2. CometDateFormat.getCompatibleNotes described the codegen dispatcher as "disabled (default)". The flag now defaults to true, so the note was inaccurate.

  3. The native date_diff subtracted Date32 day values with a plain i32 -, which panics in debug builds when the result overflows. Spark computes the difference with JVM int subtraction, which wraps. This is practically unreachable for valid DateType inputs but is an unmasked panic path.

Items 2, 3, and 4 are obsolete after the codegen-dispatch migration and are intentionally not changed (details below).

What changes are included in this PR?

  • CometUnixTimestamp.getUnsupportedReasons now reads "Only DateType, TimestampType, and TimestampNTZType inputs are supported.", consistent with isSupportedInputType and getSupportLevel.
  • CometDateFormat.getCompatibleNotes now states that spark.comet.exec.scalaUDF.codegen.enabled=true is the default.
  • Native date_diff uses wrapping_sub to match Spark's wrapping int subtraction and remove the debug-only panic path, with regression tests for basic and overflow inputs.

Intentionally not changed (obsolete after the codegen-dispatch migration and the scalaUDF.codegen.enabled default flipping to true):

  • Item 2 (move CometDateFormat gating into getSupportLevel): marking non-UTC + whitelisted formats Incompatible would now route them to full Spark fallback under the default config instead of through the dispatcher, which produces Spark-identical results. That is a regression, not a fix.
  • Item 3 (mark Group A expressions Incompatible / surface the dispatcher flag in getCompatibleNotes): CometCodegenDispatch deliberately reports Compatible() and omits the flag note, which is accurate now that the dispatcher is on by default. When it is disabled, convert returns None with a withFallbackReason that shows up in EXPLAIN. The only stale artifact was the CometDateFormat note fixed above.
  • Item 4 (CometMakeTimestamp ANSI): CometMakeTimestamp is now CometCodegenDispatch[MakeTimestamp], so Spark's own doGenCode (which honors failOnError) runs in the kernel and throws under ANSI mode; when the dispatcher is disabled it falls back to Spark. ANSI behavior is correct in both configurations.

How are these changes tested?

  • The getUnsupportedReasons / getCompatibleNotes corrections are EXPLAIN/compatibility-doc text with no behavior change. The native unix_timestamp TimestampNTZ path the corrected text describes is covered by existing Rust unit tests in native/spark-expr/src/datetime_funcs/unix_timestamp.rs. mvnw compile is clean for the spark module.
  • The date_diff change adds test_date_diff_basic and test_date_diff_wraps_on_overflow in native/spark-expr/src/datetime_funcs/date_diff.rs, both passing. cargo build and cargo clippy -D warnings are clean for the datafusion-comet-spark-expr crate.

…default note

CometUnixTimestamp.getUnsupportedReasons claimed TimestampNTZType is
unsupported because Comet incorrectly applies timezone conversion, but the
native unix_timestamp path has a dedicated TimestampNTZ branch that divides
raw microseconds without any timezone conversion, matching Spark. Align the
reason text with isSupportedInputType and getSupportLevel, both of which
already treat TimestampNTZType as supported.

CometDateFormat.getCompatibleNotes described the codegen dispatcher as
disabled by default; spark.comet.exec.scalaUDF.codegen.enabled now defaults
to true. Correct the note.
…bug panic

The native date_diff subtracted Date32 day values with a plain i32 -, which
panics in debug builds when the result overflows i32. Spark computes the day
difference with JVM int subtraction, which wraps. Switch to wrapping_sub to
match Spark and remove the debug-only panic path, and add regression tests
covering basic and overflow inputs.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

date/time expression audit follow-ups (from #4448)

1 participant