Fix PyArrow stack-overflow segfault in PyIceberg's multi-column upsert match filter#3509
Open
steven-winfield-quantohm wants to merge 3 commits into
Open
Fix PyArrow stack-overflow segfault in PyIceberg's multi-column upsert match filter#3509steven-winfield-quantohm wants to merge 3 commits into
steven-winfield-quantohm wants to merge 3 commits into
Conversation
rambleraptor
left a comment
Collaborator
There was a problem hiding this comment.
Can we get some additional tests? I want to make sure that we're not breaking anything. Here's a couple test cases I can think of, but I'd love to get your thoughts:
- Multiple columns, multiple prefix groups
- Multiple columns, single prefix group
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #3508
Rationale for this change
The problem
When upserting into an Iceberg table, PyIceberg first scans the target table to
find which existing rows match the source rows' key columns. It builds that
"matching" predicate in
pyiceberg.table.upsert_util.create_match_filter:For a single join column it emits one flat
In(col, [v1, v2, ...]).PyArrow lowers this to a single
is_incompute node, no matter how manyvalues it contains — so single-column upserts of huge tables are fine.
For a multi-column key it instead emits one disjunct per distinct key
tuple::
PyIceberg builds that
Oras a balanced tree, so the Python side copes.But when the expression is handed to PyArrow's dataset scanner as a filter, the
C++ expression engine canonicalises it:
Dataset::GetFragmentscallsSimplifyWithGuarantee→Canonicalize, which flattens the associativeor_kleenechain and then recurses over it. With tens of thousands ofdisjuncts that recursion overflows the C++ call stack and the process
segfaults (SIGSEGV) — typically after several minutes of work, with a
backtrace full of
arrow::compute::Canonicalize/ModifyExpressionframes.
Reference: #3272
Note that #3448 addresses a different upsert segfault (a
per-batch Acero re-filter in
_task_to_record_batches, mostly observed onApple Silicon). It does not touch the
GetFragmentscanonicalisation pathexercised here, so it does not help with this crash.
The fix
Produce a predicate that matches exactly the same rows, but with far fewer
disjuncts. Group the key tuples and emit a single
Inover whichever columncollapses to the fewest distinct "prefix" combinations (choosing that column
makes the result independent of the caller's column ordering)::
The disjunct count drops from "number of rows" to "number of distinct prefix
values". In the synthetic data below there are 50 000 unique ids spread over
just 50 group values, so the predicate shrinks from 50 000 disjuncts to 50 —
shallow enough that PyArrow's canonicaliser no longer overflows.
Caveat
This helps whenever at least one key column is low-cardinality (or, equivalently,
one column is near-unique and can be folded into the
In). A genuinelyhigh-cardinality composite key — where every column is near-unique and all of
them are needed to identify a row — still produces roughly one disjunct per row
even after grouping, and can still overflow. For that pathological case the
only robust option is to upsert in smaller batches.
Are these changes tested?
Yes - one test changed to expect
And(op1, op2)orAnd(op2, op1)where previouslythe operand order mattered.
Are there any user-facing changes?
No