[improvement](eager-agg) Push down distinct aggregation using statistics#65099
[improvement](eager-agg) Push down distinct aggregation using statistics#65099feiniaofeiafei wants to merge 1 commit into
Conversation
### What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary: The distinct aggregation pushdown used a separate cost-based rewrite and supported only one distinct aggregate. Move pure distinct aggregation into eager aggregation so statistics decide whether to deduplicate below joins. Support multiple SUM/COUNT distinct aggregates with identical keys and multi-column COUNT DISTINCT while keeping the upper distinct aggregate unchanged. Queries containing non-distinct aggregates or different distinct keys are not rewritten.
### Release note
Use statistics-driven eager aggregation for pure same-key distinct aggregates across joins.
### Check List (For Author)
- Test: Unit Test
- ./run-fe-ut.sh --run org.apache.doris.nereids.rules.rewrite.eageraggregation.EagerAggRewriterTest
- Behavior changed: Yes. Eligible distinct aggregations can now be deduplicated below joins according to eager aggregation statistics.
- Does this need documentation: No
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
|
/review |
There was a problem hiding this comment.
Automated review completed. I found one blocking compatibility issue around the removed Nereids rule enum.
Critical checkpoints:
- Goal: the PR moves distinct aggregate pushdown into eager aggregation and broadens same-key distinct handling, but it does not preserve the old user-addressable rule name.
- Scope/focus: the code change is otherwise focused on FE/Nereids rewrite wiring, eager aggregation, and the associated unit test.
- Concurrency/lifecycle: no new concurrent state, locks, persistence, or lifecycle-sensitive objects were introduced.
- Configuration/compatibility: blocking issue found. Existing
disable_nereids_rules/SET_VAR/ rule-hint usage can still name the removed enum and now fails throughRuleType.valueOf. - Parallel paths: reviewed rewrite order, the removed cost-based distinct rule, eager aggregation, distinct aggregate rewrite, and same-key/multi-column distinct paths.
- Tests: new FE unit coverage was added, but I could not run
./run-fe-ut.sh --run org.apache.doris.nereids.rules.rewrite.eageraggregation.EagerAggRewriterTestin this checkout becausethirdparty/installedandthirdparty/installed/bin/protocare missing. - User focus: no additional user-provided focus points were supplied.
Subagent conclusions: optimizer-rewrite reported NO_NEW_VALUABLE_FINDINGS. tests-session-config reported TEST-1, which was merged as a duplicate of MAIN-1 and became the single inline comment. Convergence round 1 ended with both live subagents reporting NO_NEW_VALUABLE_FINDINGS for the same ledger/comment set.
| ELIMINATE_SORT(RuleTypeClass.REWRITE), | ||
|
|
||
| PUSH_DOWN_AGG_THROUGH_JOIN_ONE_SIDE(RuleTypeClass.REWRITE), | ||
| PUSH_DOWN_AGG_WITH_DISTINCT_THROUGH_JOIN_ONE_SIDE(RuleTypeClass.REWRITE), |
There was a problem hiding this comment.
Removing this enum makes existing rule-control settings fail during parsing. disable_nereids_rules is a forwarded session variable, and SessionVariable#getDisableNereidsRules() still calls RuleType.valueOf(ruleName) for every configured name, so a deployment or query hint that currently has PUSH_DOWN_AGG_WITH_DISTINCT_THROUGH_JOIN_ONE_SIDE saved will now throw IllegalArgumentException instead of just disabling a now-removed implementation. Since this rule name was user-addressable through disable_nereids_rules/CBO rule hints, please keep the enum value as a no-op compatibility alias, or add explicit migration handling for the old name.
morrySnow
left a comment
There was a problem hiding this comment.
Left one finding from the code review — using order-sensitive List.equals for multi-column distinct key comparison could miss optimization opportunities.
| } | ||
| if (commonDistinctKeys == null) { | ||
| commonDistinctKeys = distinctKeys; | ||
| } else if (!commonDistinctKeys.equals(distinctKeys)) { |
There was a problem hiding this comment.
PLAUSIBLE: getCommonDistinctKeys uses List.equals() to compare distinct keys, which is order-sensitive. For multi-column distinct aggregates like COUNT(DISTINCT a, b) and SUM(DISTINCT b, a) in the same query, the two functions share the same distinct tuple {a, b} but the argument lists are in different orders. [a, b].equals([b, a]) returns false, so the method returns Optional.empty() and the optimization is skipped entirely.
Impact: No wrong results — the fallback is to leave the original plan unchanged. But the dedup-pushdown optimization is missed for this (admittedly unusual) edge case.
Suggested fix: Sort the distinct key list before comparison (e.g., by ExprId), or convert to a Set<SlotReference> for order-independent comparison.
| } else if (!commonDistinctKeys.equals(distinctKeys)) { | |
| } else if (distinctKeys.size() != commonDistinctKeys.size() | |
| || !new HashSet<>(commonDistinctKeys).containsAll(distinctKeys)) { |
TPC-H: Total hot run time: 29928 ms |
TPC-DS: Total hot run time: 174976 ms |
ClickBench: Total hot run time: 25.34 s |
FE UT Coverage ReportIncrement line coverage |
FE Regression Coverage ReportIncrement line coverage |
What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary: The distinct aggregation pushdown used a separate cost-based rewrite and supported only one distinct aggregate. Move pure distinct aggregation into eager aggregation so statistics decide whether to deduplicate below joins. Support multiple SUM/COUNT distinct aggregates with identical keys and multi-column COUNT DISTINCT while keeping the upper distinct aggregate unchanged. Queries containing non-distinct aggregates or different distinct keys are not rewritten.
Release note
Use statistics-driven eager aggregation for pure same-key distinct aggregates across joins.
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)