Skip to content

[core] introducing row-range-keyed DeletionVectors for DataEvolution tables#8344

Closed
steFaiz wants to merge 6 commits into
apache:masterfrom
steFaiz:dv_for_de_core_support
Closed

[core] introducing row-range-keyed DeletionVectors for DataEvolution tables#8344
steFaiz wants to merge 6 commits into
apache:masterfrom
steFaiz:dv_for_de_core_support

Conversation

@steFaiz

@steFaiz steFaiz commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Purpose

The first part of #8322

This is a very BIG pr, I tried to split it into several sub PRs, but found it very hard.

The PR only focuses on core module, leaving Spark/Flink module untouched, mainly includes:

  1. RowRange DeletionFiles are fully decoupled with data files, they are just to split one big deletion vector into many small files.
  2. Introduce a DeletionFileKey to replace current filename as the unique identifier of each DeletionFile
  3. Introduce a DataEvolutionApplyDvReader to filter out deleted row ids. This class refers to IndexedSplitRecordReader: First enrich readType by _ROW_ID, then filter rows by RowRange DeletionVectors.
  4. Slightly changed SnapshotReaderImpl, assigning overlapping row range DeletionVectors to each DataSplit.

Besides, most changes are about changing many String fields to DeletionFileKey.

Tests

Unit Tests only.

@steFaiz steFaiz marked this pull request as draft June 24, 2026 09:15
@steFaiz steFaiz marked this pull request as ready for review June 25, 2026 03:07
return false;
}

return currentDv.isDeleted(rowId);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reader only evaluates the first row-range DV whose end has not passed the current row id. If two row-range keyed DVs overlap, for example [0, 10] and [5, 15], rows 5-10 are never checked against the second DV, so deletions recorded only there will be returned. I don't see the writer/manifest combiner rejecting overlapping RowIdRangeKey values; they only deduplicate exact keys. Please either enforce non-overlapping row-range DVs when writing/combining metadata, or make this reader evaluate all DVs that may contain the row.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your remind! I've added the check in GlobalCombiner and added a test

? null
: IndexFileMetaSerializer.dvMetasToRowArrayData(dvMetas.values()),
: IndexFileMetaSerializer.metasToRowArrayData(
dvMetas, DeletionFileKey.Type.FILE_NAME),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For row-range keyed DVs this renders the compatibility marker row instead of the real row-id ranges, because metasToRowArrayData(..., FILE_NAME) intentionally fast-fails old readers. That makes table_indexes.dv_ranges misleading for data-evolution tables. Can we expose the row-range DV schema here as well, or render the ranges in a separate column, instead of showing the legacy marker?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very helpful! I've unified FileNameKey and RowRangeKey:

  1. Reusing the same field
  2. For FileNames, display the file name
  3. For RowRanges, display the formated range. e.g. [0, 100]

@steFaiz

steFaiz commented Jun 26, 2026

Copy link
Copy Markdown
Contributor Author

I thought about the design and I think that making row ranged-DV aligned with normal file ranges is much more simple and easy to manage, compared to fully decoupled mode. Just like filename-based implementation.
The main logic:

  1. At first DV generation, generate DVs aligned with existing normal data file ranges
  2. During compaction, if some data files triggered row-level compaction and are merge into larger data file range, we merge related DVs too.

This is because it's complicated to update DVs if we let DVs can be fully decouple, consider this scenario:
Current DV ranges:

// Current DV:
[0, 100] , [101, 200], [230, 300]

Update a new DV range [50, 250]
This will involves:

  1. among all DVs, binary search to find all overlapping DVs. here we find: [0, 100] , [101, 200] , [230, 300]
  2. check each DV:
    a. For [0, 100], it's only partial overlapping with [50, 250], so we may have to split the new DV into [50, 100], [101, 250]
    b. For [101, 200] and [230, 300], we need to merge it with current splitted new DV: [101, 250], but how to merge them? Merge into a single big range [101, 300]? or split it into some selected size?

If we don't design sophiscated merge/split mechanism, there might be some DVs' size too big or too small.

I think we could:

  1. At read path: capable of reading any decoupled DV ranges (already implemented in this PR)
  2. At write/compaction path: force DV ranges be aligned with existing normal file ranges.

@JingsongLi Could you please review this proposal and share your thoughts? Any feedback would be greatly appreciated!

@JingsongLi

Copy link
Copy Markdown
Contributor

@steFaiz If we move in this direction, do we maintain another simple design:

Set the DV according to the oldest data file, still using the file name, which will not introduce any new mechanism and maintain the old DV mechanism.

@steFaiz

steFaiz commented Jun 29, 2026

Copy link
Copy Markdown
Contributor Author

@JingsongLi Thanks for your reply! This design: "Reference to the oldest file" is also in my original design docs.
I choose current range-based because:

  1. range-based approach is more natural and easy to search
  2. The oldest-filename approach can reuse the existing DV key, but it introduces an implicit anchor-file
    requirement. Today DataEvolutionFileStoreScan can prune files by writeCols: for SELECT c, it only needs files
    containing column c. If DVs are attached to the oldest file, the scan still needs to keep or lookup that
    oldest file only for finding DVs. That makes projection planning depend on an unrelated physical file. For example:
    a. Data files could be filtered by writeCols, if full schema is [a, b, c, d] and some files only contains [c]
    b. For query like SELECT c from t, current implementation only need all files with writeCols = [c]
    c. Now we have to keep the oldest file in plan (or we cannot find related DVs)
    This logic seems weird. Like "if delation-vector is enabled, we need to plan all column files as well as the oldest file".
  3. In theory, for compaction without materialize DVs (materialize DVs need to rebuild global indices), DVs do not need to rewrite for range-based approach (Rewriting DV is low-cost, I think it's trivial)

I agree that the oldest filename based approach will reduce code changes.

Which one do the community prefer? range based or the oldest filename based? I'll modify this PR!

@JingsongLi

Copy link
Copy Markdown
Contributor

@steFaiz Thank you very much for your reply, it is very valuable. Indeed, both options will bring different levels of complexity, and we need to choose between them. I am also very conflicted because the complexity they bring is almost of the same magnitude.

I prefer the older filename based approach because it preserves the original DV mechanism and eliminates the need to worry about maintaining the row id ranges. We can consider retaining the original file information in DV mode until the DV is generated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants