[feat](olap) Support lazy reading mode for pruned complex columns#59263
[feat](olap) Support lazy reading mode for pruned complex columns#59263mrhhsg wants to merge 11 commits into
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
5ff30aa to
b39fda3
Compare
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
FE UT Coverage ReportIncrement line coverage |
FE Regression Coverage ReportIncrement line coverage |
b39fda3 to
c734a83
Compare
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
c734a83 to
ca5a57c
Compare
|
run buildall |
ca5a57c to
1734d3f
Compare
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
FE UT Coverage ReportIncrement line coverage |
TPC-H: Total hot run time: 34393 ms |
TPC-DS: Total hot run time: 174201 ms |
ClickBench: Total hot run time: 28 s |
FE Regression Coverage ReportIncrement line coverage |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
1734d3f to
91706c6
Compare
|
run buildall |
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
TPC-H: Total hot run time: 35535 ms |
TPC-DS: Total hot run time: 175439 ms |
ClickBench: Total hot run time: 27.99 s |
Issue Number: None Related PR: apache#59263 Problem Summary: Pruned complex columns can separate predicate subcolumns from non-predicate subcolumns. Predicate branches must be read before filter evaluation, while non-predicate branches can be deferred until after filtering to avoid unnecessary IO and materialization. This change adds the scan and iterator state needed to keep predicate, lazy, meta-only, and skipped branches distinct for struct/map/array columns, recovers placeholder columns after lazy materialization, and keeps nested iterator reading flags explicit. It also avoids generating TopN runtime filters for unsupported complex order key types so BE does not receive invalid runtime predicate keys. Support lazy reading of non-predicate fields for pruned complex columns, and avoid unsupported complex TopN runtime filter keys. - Test: Unit Test / Build / Format - `./run-fe-ut.sh --run org.apache.doris.nereids.postprocess.TopNRuntimeFilterTest#testNotUseTopNRfForUnsupportedComplexOrderKey` - `ninja -C be/ut_build_ASAN src/exec/CMakeFiles/Exec.dir/operator/olap_scan_operator.cpp.o src/exec/CMakeFiles/Exec.dir/scan/olap_scanner.cpp.o src/storage/CMakeFiles/Storage.dir/segment/column_reader.cpp.o src/storage/CMakeFiles/Storage.dir/segment/segment_iterator.cpp.o test/CMakeFiles/doris_be_test.dir/storage/segment/column_reader_test.cpp.o` - `build-support/clang-format.sh be/src/exec/operator/olap_scan_operator.cpp be/src/exec/operator/olap_scan_operator.h be/src/exec/scan/olap_scanner.cpp be/src/runtime/runtime_state.h be/src/storage/olap_common.h be/src/storage/segment/column_reader.cpp be/src/storage/segment/column_reader.h be/src/storage/segment/segment_iterator.cpp be/src/storage/segment/segment_iterator.h be/test/storage/segment/column_reader_test.cpp` - `git diff --check && git diff --cached --check` - Behavior changed: Yes. Non-predicate subcolumns of pruned complex columns can be read lazily, and unsupported complex TopN order keys no longer create runtime filters. - Does this need documentation: No
Issue Number: None Related PR: apache#59263 Problem Summary: Nested column pruning stripped predicate metadata paths when a non-predicate data path covered the same nested container for lazy materialization. A query that projected a nested map-array-struct field but filtered with cardinality(map['key']) could evaluate the predicate without value-array offsets and filter out matching rows. Keep predicate metadata paths covered by predicate-phase paths, preserve final predicate metadata paths even when lazy materialization covers the corresponding data path, and avoid forwarding current-level array metadata predicate paths to item iterators. Fix nested complex column pruning for predicates that need array/map metadata. - Test: - Build: `~/.codex/skills/doris-local-regression/scripts/doris-local-regression.sh --network 10.26.20.3/24 all -d nereids_rules_p0/column_pruning -s string_length_column_pruning` built FE/BE successfully. - Regression test: `~/.codex/skills/doris-local-regression/scripts/doris-local-regression.sh --network 10.26.20.3/24 start && ~/.codex/skills/doris-local-regression/scripts/doris-local-regression.sh --network 10.26.20.3/24 run -d nereids_rules_p0/column_pruning -s string_length_column_pruning`. - Code style: `build-support/clang-format.sh be/src/storage/segment/column_reader.cpp`, `build-support/check-format.sh be/src/storage/segment/column_reader.cpp`, `git diff --check`, and `git diff --cached --check`. - Static analysis: attempted `build-support/run-clang-tidy.sh --build-dir be/build_Release`, but it failed on existing analysis diagnostics outside the changed lines and a missing system `stddef.h` resource include. - Behavior changed: Yes. Nested predicate metadata paths are preserved so filtered lazy reads evaluate predicates with the required array/map metadata. - Does this need documentation: No
### What problem does this PR solve? Issue Number: None Related PR: None Problem Summary: Clarify the NestedColumnPruning phase 1.5 comment so it matches the current predicate metadata path stripping logic. Predicate access paths are stripped only by predicate-phase paths, while all access paths are stripped by self-covering paths. ### Release note None ### Check List (For Author) - Test: No need to test (comment-only change); ran git diff --check and git diff --cached --check - Behavior changed: No - Does this need documentation: No
### What problem does this PR solve? Issue Number: None Related PR: apache#59263 Problem Summary: Nested column pruning could treat metadata paths used by predicate evaluation as redundant when a covering data path also existed for final lazy materialization. This is unsafe because predicate evaluation still needs current-level metadata, such as OFFSET for cardinality()/length() and NULL for IS NULL, before row filtering. Removing these paths could make BE enter metadata-only modes incorrectly or forward current-level metadata paths to child iterators. This change keeps predicate metadata paths in the FE plan, removes the obsolete MetaPathStriper logic, and updates BE nested column iterators to consume current-level metadata paths at the correct iterator level without letting redundant metadata paths switch mixed data reads into meta-only mode. Regression expectations are updated to assert the retained metadata-path contract. ### Release note None ### Check List (For Author) - Test: - Regression test: `doris-local-regression --network 10.26.20.3/24 run -d nereids_rules_p0/column_pruning` - Manual test: `build-support/clang-format.sh` - Manual test: `git diff --check` - Behavior changed: No - Does this need documentation: No
Issue Number: None
Related PR: None
Problem Summary: Predicate evaluation can require nested access paths even when final materialization does not need any data access path at the current complex iterator level. The complex column iterators previously returned early when all access paths were empty, which ignored predicate-only metadata or child access paths. This could leave predicate-only nested columns unread in lazy read mode and produce incorrect predicate results for cases such as array element IS NULL predicates. This change only skips access-path setup when both final access paths and predicate access paths are empty.
None
- Test: Unit Test and Regression test
- Unit Test: ./run-be-ut.sh --run --filter=ColumnReaderTest.*
- Regression test: doris-local-regression --network 10.26.20.3/24 run -d nereids_rules_p0/column_pruning -s string_length_column_pruning
- Regression test: doris-local-regression --network 10.26.20.3/24 run -d datatype_p0/complex_types -s test_pruned_columns
- Regression test: doris-local-regression --network 10.26.20.3/24 run -d inverted_index_p0/array_contains -s test_index_compaction_null_arr
- Manual test: build-support/clang-format.sh
- Manual test: git diff --check
- Behavior changed: No
- Does this need documentation: No
Issue Number: None Related PR: apache#59263 Problem Summary: Struct lazy-read access-path routing could skip a child that only appeared in predicate access paths. For example, with projection path `s.a` and predicate path `s.b`, `StructFileColumnIterator::set_access_paths()` decided whether to route a child only from ordinary projection access paths, so child `b` was marked as skipped before its predicate path was forwarded. This change collects both projection and predicate subpaths for each child first, and routes the child when either side requires it. None - Test: Unit Test - `./run-be-ut.sh --run --filter=ColumnReaderTest.StructPredicateOnlyChildPathStillRoutesToChild:ColumnReaderTest.MapFullProjectionStillRoutesPredicateSubPaths:ColumnReaderTest.AccessPathsPropagatePredicateToChildren` - `./run-be-ut.sh --run --filter=ColumnReaderTest.*` - Behavior changed: No - Does this need documentation: No
### What problem does this PR solve?
Issue Number: None
Related PR: None
Problem Summary: Add focused coverage for ColumnReader nested iterator routing, including access-path modes, prefetcher collection, next_batch behavior, and read_by_rowids behavior for STRUCT/ARRAY/MAP nullable and meta-only cases. Extend the nested container offset pruning regression case to cover mixed empty/non-empty containers and predicate/output access-path combinations.
### Release note
None
### Check List (For Author)
- Test: Regression test / Unit Test
- DORIS_HOME=/mnt/disk7/hushenggang/doris-fix-spill ninja -C be/ut_build_ASAN test/CMakeFiles/doris_be_test.dir/storage/segment/column_reader_test.cpp.o
- DORIS_HOME=/mnt/disk7/hushenggang/doris-fix-spill ./run-be-ut.sh --run --filter=ColumnReaderTest.*
- doris-local-regression.sh --network 10.26.20.3/24 run -d nereids_rules_p0/column_pruning -s nested_container_offset_pruning -forceGenOut
- doris-local-regression.sh --network 10.26.20.3/24 run -d nereids_rules_p0/column_pruning -s nested_container_offset_pruning
- build-support/clang-format.sh be/test/storage/segment/column_reader_test.cpp
- build-support/check-format.sh
- git diff --check
- Behavior changed: No
- Does this need documentation: No
### What problem does this PR solve? Issue Number: None Related PR: apache#59263 Problem Summary: Extract lazy-pruned column recovery in SegmentIterator into a private helper so the internal rowid mapping, lazy read phase, read_by_rowids invocation, and finalize_lazy_phase behavior can be tested directly. Add white-box UT coverage for non-empty filtered selection and empty selection finalize path. Fix ColumnReader NULL_MAP_ONLY read_by_rowids so a no-null page only fills the sparse rowids that actually belong to the current page. Keep predicate NULL/OFFSET paths out of final allAccessPaths when regular data paths also exist, while preserving them in predicateAccessPaths, so mixed-version BEs do not switch a mixed data/meta read to current-level metadata-only mode. Also remove the racy regression-test profile cleanup and cover lazy-pruned output recovery through a profile-token SQL shape. ### Release note None ### Check List (For Author) - Test: Unit Test / Regression test - ./run-be-ut.sh --run --filter=SegmentIteratorLazyPrunedTest.* - ./run-be-ut.sh --run --filter=ColumnReaderTest.* - ./run-fe-ut.sh --run org.apache.doris.nereids.rules.rewrite.PruneNestedColumnTest - /mnt/disk7/hushenggang/.codex/skills/doris-local-regression/scripts/doris-local-regression.sh --network 10.26.20.3/24 all -d datatype_p0/complex_types -s test_pruned_columns - Behavior changed: No - Does this need documentation: No
### What problem does this PR solve? Issue Number: None Related PR: apache#59263 Problem Summary: Refresh nereids column pruning regression explain expectations to match the current separation between data access paths and predicate metadata access paths. Predicate-only NULL/OFFSET paths now stay in predicate access paths, while all access paths keep the data path needed by projection. ### Release note None ### Check List (For Author) - Test: Regression test - ./run-regression-test.sh --conf output/local-regression/regression-conf-46001.groovy --run -d nereids_rules_p0/column_pruning -s null_column_pruning - ./run-regression-test.sh --conf output/local-regression/regression-conf-46001.groovy --run -d nereids_rules_p0/column_pruning - Behavior changed: No - Does this need documentation: No
|
/review |
There was a problem hiding this comment.
Reviewed live head 4c14891. I did not find additional non-duplicate issues to raise as inline comments.
Scope checked: nested column pruning access-path preservation, lazy-pruned BE iterator read phases, session option forwarding, TopN runtime-filter type gate, and updated regression/unit-test expectations. Duplicate suppression: the remaining DorisVector include problem in segment_iterator.cpp is already covered by an existing inline thread, so I did not repeat it.
Validation was static only in this runner: .worktree_initialized, thirdparty/installed, and thirdparty/installed/bin/protoc are missing, so I did not run Doris builds or tests here.
|
Codex automated review failed and did not complete. Error: Codex completed, but no new pull request review was submitted for the current head SHA. Please inspect the workflow logs and rerun the review after the underlying issue is resolved. |
Issue Number: None Related PR: apache#59263 Problem Summary: Predicate-only NULL/OFFSET metadata paths were normalized to their stripped data prefix when the same slot also had a regular data path. For map element paths such as m.*.OFFSET, this turned the path into broad m.* before map-star expansion, causing BE to read full map values instead of the exact keys and value metadata needed by the predicate plus the projected value field. Preserve map-star metadata paths until expansion so they become explicit KEYS and VALUES.NULL/OFFSET paths, while ordinary current-level metadata paths still use the stripped form for compatibility. None - Test: - Unit Test: ./run-fe-ut.sh --run org.apache.doris.nereids.rules.rewrite.PruneNestedColumnTest,org.apache.doris.nereids.postprocess.TopNRuntimeFilterTest - Regression test: doris-local-regression --network 10.26.20.3/24 all -d nereids_rules_p0/column_pruning -s string_length_column_pruning - Regression test: doris-local-regression --network 10.26.20.3/24 start && doris-local-regression --network 10.26.20.3/24 run -d nereids_rules_p0/column_pruning -s null_column_pruning - Regression test: doris-local-regression --network 10.26.20.3/24 start && doris-local-regression --network 10.26.20.3/24 run -d nereids_rules_p0/column_pruning -s nested_container_offset_pruning - Regression test: doris-local-regression --network 10.26.20.3/24 start && doris-local-regression --network 10.26.20.3/24 run -d nereids_rules_p0/column_pruning - Behavior changed: No - Does this need documentation: No
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
|
/review |
There was a problem hiding this comment.
Automated review completed for current head 1d6b983336d76d39ff2796f2d501eadfb394f4aa.
I did not find any new non-duplicate inline issue to submit. The remaining mixed metadata/data access-path compatibility concern found by the subagents is already covered by the existing review thread at NestedColumnPruning.java (discussion_r3487344198), so I suppressed it as a duplicate instead of re-commenting.
Checkpoint conclusions:
- Goal and proof: the PR implements lazy recovery for pruned complex common-expression columns, preserves predicate metadata paths separately, forwards
enable_prune_nested_columnto BE, and adds BE/FE/regression coverage for the changed access-path shapes. - Scope: the implementation is large but concentrated in FE access-path construction, BE complex column iterators, lazy read phase handling, and tests.
- Concurrency/lifecycle: no new shared concurrent mutable state found. Iterator read phase is scoped and restored; placeholder recovery and the variant owner wrapper avoid the main lifecycle hazards I checked.
- Config/protocol compatibility: the new query option is forwarded from
SessionVariable.toThrift()to BERuntimeState. Mixed-version allAccessPaths behavior remains the already-raised thread, not a new distinct comment. - Parallel paths: array/map/struct/string, predicate-only paths, map-star expansion, nullable metadata, variant wrapper forwarding, TopN runtime filter type gating, and common-expression lazy recovery were reviewed.
- Tests/outputs: added BE unit tests, FE unit tests, and regression expected outputs cover the main changed paths. I only ran static validation (
git diff --check) because this runner lacks.worktree_initialized,thirdparty/installed, andthirdparty/installed/bin/protoc. - Observability:
LazyReadPrunedTimeis added and regression coverage polls for a completed profile before asserting it. - Transactions/persistence/data writes: not applicable to this PR.
- User focus:
.code-review.VmpkJW/review_focus.txthad no additional focus points.
Subagent conclusions:
optimizer-rewrite: reportedOPT-1, duplicate of existingdiscussion_r3487344198; no new optimizer/rewrite issue after convergence.tests-session-config: reportedTSC-1, duplicate ofOPT-1/existing thread; no new test, session, config, compatibility, or basic CI issue after convergence.- Convergence round 1 ended with both live subagents replying
NO_NEW_VALUABLE_FINDINGSfor the same current ledger and empty proposed final inline comment set.
Cloud UT Coverage ReportIncrement line coverage Increment coverage report
|
FE UT Coverage ReportIncrement line coverage |
TPC-H: Total hot run time: 29993 ms |
TPC-DS: Total hot run time: 174017 ms |
ClickBench: Total hot run time: 25.96 s |
|
PR approved by at least one committer and no changes requested. |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
FE Regression Coverage ReportIncrement line coverage |
What problem does this PR solve?
The subcolumns of a pruned complex-type column can fall into two categories:
For non-predicate columns, Doris can defer reading them until after predicate evaluation, reducing unnecessary I/O for nested columns.
This update also preserves predicate metadata paths separately from final lazy-materialization data paths. Predicate evaluation may still need current-level metadata, such as
OFFSETforcardinality()/length()andNULLforIS NULL, even when a covering data path is also needed later. The BE nested column iterators consume those current-level metadata paths at the correct iterator level without forwarding them to child iterators or incorrectly switching mixed data reads into metadata-only mode.This PR also handles predicate-only nested access paths. A complex iterator should skip access-path setup only when both final access paths and predicate access paths are empty. Otherwise predicate-only child or metadata paths may be ignored in lazy read mode, causing predicates such as array element
IS NULLto evaluate on unread placeholder data.This PR also removes references to
olap/rowset/segment_v2/column_reader.hfrom many header files, avoiding large-scale recompilation of source files caused by changes toColumnReader/ColumnIterator.Issue Number: None
Related PR: None
Release note
For non-predicate columns, Doris can defer reading them until after predicate evaluation, which may significantly reduce the amount of data read.
Check List (For Author)
Test
doris-local-regression --network 10.26.20.3/24 run -d nereids_rules_p0/column_pruning -s string_length_column_pruningdoris-local-regression --network 10.26.20.3/24 run -d datatype_p0/complex_types -s test_pruned_columnsdoris-local-regression --network 10.26.20.3/24 run -d inverted_index_p0/array_contains -s test_index_compaction_null_arr./run-be-ut.sh --run --filter=ColumnReaderTest.*build-support/clang-format.shgit diff --checkBehavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)