Skip to content

refactor: make join projection pushdown schema-aware via ColumnIndex/…#23185

Open
Phoenix500526 wants to merge 3 commits into
apache:mainfrom
Phoenix500526:refactor/23010
Open

refactor: make join projection pushdown schema-aware via ColumnIndex/…#23185
Phoenix500526 wants to merge 3 commits into
apache:mainfrom
Phoenix500526:refactor/23010

Conversation

@Phoenix500526

Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Rationale for this change

try_pushdown_through_join (datafusion/physical-plan/src/projection.rs)
decided whether each projected join-output column belonged to the left or right
child by comparing its output index against the left child's field count
(join_table_borders) — i.e. it assumed the join output is a plain
left ++ right concatenation.

That assumption does not hold for all join types. A LeftMark / RightMark
join appends a synthetic mark boolean column that has JoinSide::None and
originates from neither child, so reasoning from output position can route the
mark column to the wrong child. This is the panic that #22902 worked around by
disabling projection pushdown for LeftMark / RightMark in HashJoinExec
and NestedLoopJoinExec — correct, but it left the helper reasoning from output
position rather than from the join's actual output-origin contract.

What changes are included in this PR?

  • Thread the join's output-origin metadata (&[ColumnIndex], already computed by
    build_join_schema) into try_pushdown_through_join.
  • Replace the output-position split (join_table_borders +
    index < left_field_count) with grouping by ColumnIndex.side:
    • Left / Right columns are collected per child using the child-relative
      ColumnIndex.index;
    • a JoinSide::None (synthetic, e.g. mark) column makes the pushdown decline
      (Ok(None)), so the projection is embedded into the join instead — no panic,
      no mis-routing.
  • Remove the LeftMark | RightMark bypass from
    HashJoinExec::try_swapping_with_projection and
    NestedLoopJoinExec::try_swapping_with_projection; they now call
    try_pushdown_through_join uniformly.
  • The shared helpers used by cross / symmetric-hash / sort-merge join pushdown
    (new_join_children, join_table_borders, join_allows_pushdown,
    update_join_on, update_join_filter) keep their signatures. The side-grouped
    path uses a new private new_join_children_from_groups and reuses
    update_join_on / update_join_filter with a zero column-index offset (the
    groups already carry child-relative indices).

Are these changes tested?

Yes.

  • New subquery.slt cases place a subset/reordering projection over a mark join
    — hash LeftMark (IN under OR), negated mark (NOT EXISTS under OR),
    and nested-loop mark (non-equi correlated) — asserting both the query result
    and the EXPLAIN plan shape.
  • Existing coverage is preserved: cargo test -p datafusion-physical-plan projection (incl. the filter-pushdown and join_table_borders unit tests),
    the join *.slt suite, and subquery.slt all pass.
  • cargo clippy -p datafusion-physical-plan -- -D warnings and cargo fmt --all
    are clean.

Are there any user-facing changes?

No behavior change: the produced physical plans are unchanged for the supported
cases (mark joins keep falling back to the same embedded-projection plan as
before, regular joins push down as before).

One signature change to a pub helper: try_pushdown_through_join gains a
column_indices: &[ColumnIndex] parameter (all in-tree callers are updated). If
the project treats this helper as part of the public API surface, please add the
api change label.

@github-actions github-actions Bot added sqllogictest SQL Logic Tests (.slt) physical-plan Changes to the physical-plan crate labels Jun 25, 2026
…JoinSide

try_pushdown_through_join decided whether each projected join-output column
belonged to the left or right child by comparing its output index against the
left child's field count, assuming a `left ++ right` output schema. That is
wrong for joins whose output is not a plain concatenation -- mark joins append a
synthetic `mark` column (JoinSide::None) belonging to neither child -- which is
why apache#22902 had to bypass projection pushdown for LeftMark / RightMark.

Group the projected columns by the join's output-origin metadata
(ColumnIndex.side) instead of output position: Left/Right columns push to the
matching child via the child-relative ColumnIndex.index, and a synthetic
(JoinSide::None) column makes the pushdown decline so the projection is embedded
into the join instead -- no panic, no mis-routing. This lets the LeftMark /
RightMark bypass be removed from HashJoinExec and NestedLoopJoinExec.

The shared helpers used by cross / symmetric / sort-merge joins
(new_join_children, join_table_borders, update_join_on/filter) keep their
signatures; the side-grouped path uses a new new_join_children_from_groups and
reuses update_join_on/filter with a zero column-index offset. Mark joins still
fall back to embedding (pushing their child columns down is a follow-up).

Closes apache#23010

Signed-off-by: Jiawei Zhao <Phoenix500526@163.com>
@github-actions

github-actions Bot commented Jun 25, 2026

Copy link
Copy Markdown

Thank you for opening this pull request!

Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).

Details
     Cloning apache/main
    Building datafusion-physical-plan v54.0.0 (current)
       Built [  29.610s] (current)
     Parsing datafusion-physical-plan v54.0.0 (current)
      Parsed [   0.120s] (current)
    Building datafusion-physical-plan v54.0.0 (baseline)
       Built [  28.718s] (baseline)
     Parsing datafusion-physical-plan v54.0.0 (baseline)
      Parsed [   0.111s] (baseline)
    Checking datafusion-physical-plan v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.743s] 223 checks: 222 pass, 1 fail, 0 warn, 30 skip

--- failure function_missing: pub fn removed or renamed ---

Description:
A publicly-visible function cannot be imported by its prior path. A `pub use` may have been removed, or the function itself may have been renamed or removed entirely.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/function_missing.ron

Failed in:
  function datafusion_physical_plan::projection::try_pushdown_through_join, previously in file /home/runner/work/datafusion/datafusion/target/semver-checks/git-apache_main/3540c85e2032555168d2dabf4e2501175021bab8/datafusion/physical-plan/src/projection.rs:645

     Summary semver requires new major version: 1 major and 0 minor checks failed
    Finished [  61.268s] datafusion-physical-plan
    Building datafusion-sqllogictest v54.0.0 (current)
       Built [ 145.273s] (current)
     Parsing datafusion-sqllogictest v54.0.0 (current)
      Parsed [   0.019s] (current)
    Building datafusion-sqllogictest v54.0.0 (baseline)
       Built [ 142.493s] (baseline)
     Parsing datafusion-sqllogictest v54.0.0 (baseline)
      Parsed [   0.019s] (baseline)
    Checking datafusion-sqllogictest v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.093s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [ 291.158s] datafusion-sqllogictest

@github-actions github-actions Bot added the auto detected api change Auto detected API change label Jun 25, 2026
It is a crate-internal helper, called only by HashJoinExec and
NestedLoopJoinExec within datafusion-physical-plan, is not re-exported,
and has no external consumers. Reducing its visibility keeps it off the
public API surface so future signature changes no longer trip
cargo-semver-checks.

Signed-off-by: Jiawei Zhao <Phoenix500526@163.com>

@kosiew kosiew left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Phoenix500526
Thanks for working on this!
I think the projection pushdown approach makes sense overall, but I have a couple of API surface concerns that I'd like to address before merging.

}

/// column indices
pub fn column_indices(&self) -> &Vec<ColumnIndex> {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should add a new public HashJoinExec::column_indices(&self) -> &Vec<ColumnIndex> API just so projection pushdown can pass this internal join metadata to a crate-local helper.

The follow up commit makes try_pushdown_through_join pub(crate) specifically to avoid expanding the public API surface, so this accessor feels a bit at odds with that goal. It also exposes the internal Vec representation, which becomes part of the semver contract.

Since the call sites are within the same impl, could we just pass &self.column_indices or self.column_indices.as_slice() directly? If an accessor is still preferred, I think it should be pub(crate) and return &[ColumnIndex] instead.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}

/// column indices
pub fn column_indices(&self) -> &Vec<ColumnIndex> {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same concern here. This NestedLoopJoinExec::column_indices(&self) -> &Vec<ColumnIndex> accessor is only needed by the internal projection pushdown code, so I'd rather not expose it as a public &Vec API.

Could we either access the field directly from the impl or make the accessor pub(crate) and have it return &[ColumnIndex] instead?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Remove public column_indices accessors from hash and nested-loop join
execution nodes. Projection pushdown now passes the internal
column_indices slice from within each impl, avoiding new public API
surface and keeping the Vec representation private.

Signed-off-by: Jiawei Zhao <Phoenix500526@163.com>
@Phoenix500526 Phoenix500526 requested a review from kosiew July 1, 2026 07:53

@kosiew kosiew left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Phoenix500526
Thanks for the follow-up. The accessor comments look addressed now. I think there is still one blocking semver issue to fix.

}

pub fn try_pushdown_through_join(
pub(crate) fn try_pushdown_through_join(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we still need to preserve the existing public try_pushdown_through_join API here. cargo-semver-checks is flagging this as function_missing, and downstream users that import datafusion_physical_plan::projection::try_pushdown_through_join would break without a major version bump.

Could we keep the old public function path and signature as a compatibility wrapper, then move the new ColumnIndex aware implementation into a crate-private helper for the join call sites? For example, HashJoinExec and NestedLoopJoinExec could call a renamed pub(crate) helper that accepts &[ColumnIndex], while the existing public API remains available.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto detected api change Auto detected API change physical-plan Changes to the physical-plan crate sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Refactor: Make join projection pushdown schema-aware via ColumnIndex / JoinSide

2 participants