Add Store.get_many bulk fetch API #4112
Draft
TomNicholas wants to merge 1 commit into
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## main #4112 +/- ##
==========================================
+ Coverage 93.50% 93.53% +0.02%
==========================================
Files 90 90
Lines 11981 12013 +32
==========================================
+ Hits 11203 11236 +33
+ Misses 778 777 -1
Add a public, overridable `Store.get_many` that retrieves many values at once, each request being a whole key or a `(key, byte_range)` pair. It generalizes `Store.get_ranges` (many ranges of one key) to many keys, and yields `(request_index, Buffer | None)` batches in completion order so a store can coalesce reads that land in the same underlying object. The ABC default fetches requests concurrently with `get`, so every store works out of the box; stores with a bulk backend override it (`FsspecStore` coalesces via fsspec's `cat_ranges`). Coalescing tuning is left to each store rather than exposed on the interface. This restores and generalizes the batched-fetch capability of the v2 `getitems` Store API (see #1806).
24890a9 to 9e6eeae
TomNicholas
added a commit
to TomNicholas/VirtualiZarr
that referenced
this pull request
Jul 1, 2026
Implement Store.get_many on ManifestStore: resolve the requested chunk keys through the manifests to (source file, byte range), group the requests by source file, and fetch each group with obstore's range-coalescing reader so that virtual references lying within `coalesce_max_gap_bytes` of each other in the same file are served by a single, larger request instead of one request per chunk. Keys that are not plain manifest-backed chunks (metadata, inlined, or missing chunks) are served individually via `get`.

This is the same technique object_store / async-tiff use to read many tiles efficiently, applied to virtual chunk references, and derives an "effective shard index" from the manifests at read time. It requires no file-format assumptions and no spec changes. The coalescing gap is configurable via a new `coalesce_max_gap_bytes` constructor argument (default 1 MiB, 0 disables).

Depends on the cross-key `Store.get_many` API in zarr-python (zarr-developers/zarr-python#4112, #4113); until that is released the method is a dormant override. See zarr-developers/VirtualiZarr#947.
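The resolve-and-group step described in this commit message can be sketched as follows. This is a simplified illustration, not VirtualiZarr's implementation: the `manifest` dict, its `(path, offset, length)` layout, the example paths, and the `group_by_source` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical manifest: chunk key -> (source file, offset, length).
manifest = {
    "temp/c/0/0": ("s3://bucket/a.nc", 0, 100),
    "temp/c/0/1": ("s3://bucket/a.nc", 100, 100),
    "temp/c/1/0": ("s3://bucket/b.nc", 4096, 50),
}


def group_by_source(keys, manifest):
    """Split requested keys into per-file byte ranges plus a fallback list.

    Returns (groups, fallback), where groups maps a source file to a list of
    (request_index, start, end) tuples and fallback holds the indices of keys
    with no manifest entry (e.g. metadata), to be served one at a time.
    """
    groups = defaultdict(list)
    fallback = []
    for index, key in enumerate(keys):
        ref = manifest.get(key)
        if ref is None:
            fallback.append(index)
            continue
        path, offset, length = ref
        groups[path].append((index, offset, offset + length))
    return dict(groups), fallback


keys = ["temp/c/1/0", "temp/zarr.json", "temp/c/0/1", "temp/c/0/0"]
groups, fallback = group_by_source(keys, manifest)
```

Each per-file group can then be handed to a range-coalescing reader, while the fallback indices go through the ordinary `get` path.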
TomNicholas
added a commit
to TomNicholas/VirtualiZarr
that referenced
this pull request
Jul 1, 2026
Implement Store.get_many on ManifestStore: resolve the requested chunk keys through the manifests to (source file, byte range), group the requests by source file, coalesce each group into runs, and serve each run with a single ranged read that is sliced back into per-chunk buffers. Keys that are not plain manifest-backed chunks (metadata, inlined, or missing chunks) are served individually via `get`.

Coalescing uses two knobs (the object_store / async-tiff model): a run merges references whose gap is <= `coalesce_max_gap_bytes` as long as the resulting read stays <= `coalesce_max_bytes`. The gap defaults to 0 (merge only adjacent references, a pure win with no wasted bytes) because benchmarking a 2D map-tile query against a remote Met Office file showed that bridging larger gaps pulls in the chunks that sit between rows (a 2D box is contiguous along one axis but strided along the other), reading ~3x the needed bytes and running slower than no coalescing at all. Merging only adjacent references was ~2.8x faster than the per-chunk baseline with zero over-read. `coalesce_max_bytes` (default 8 MiB) bounds the size of any single read.

Depends on the cross-key `Store.get_many` API in zarr-python (zarr-developers/zarr-python#4112, #4113); until that is released the method is a dormant override. See zarr-developers/VirtualiZarr#947.
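The two-knob rule above can be sketched in a few lines. This is a minimal illustration of the idea, assuming byte ranges are half-open `(start, end)` pairs tagged with a request index; the `coalesce` and `slice_run` helpers and their parameter names are hypothetical and are not taken from VirtualiZarr, obstore, or object_store.

```python
def coalesce(ranges, max_gap=0, max_bytes=8 * 1024 * 1024):
    """Merge (index, start, end) ranges from one file into runs.

    A range joins the current run when its gap to the run's end is <= max_gap
    and the merged read would stay <= max_bytes. Returns a list of
    (run_start, run_end, members) tuples.
    """
    runs = []
    for index, start, end in sorted(ranges, key=lambda r: r[1]):
        if runs:
            run_start, run_end, members = runs[-1]
            gap = start - run_end  # negative when ranges overlap
            new_end = max(run_end, end)
            if gap <= max_gap and new_end - run_start <= max_bytes:
                runs[-1] = (run_start, new_end, members + [(index, start, end)])
                continue
        runs.append((start, end, [(index, start, end)]))
    return runs


def slice_run(run, payload):
    """Cut the bytes of one ranged read back into per-request buffers."""
    run_start, _, members = run
    return {
        index: payload[start - run_start : end - run_start]
        for index, start, end in members
    }


ranges = [(0, 0, 100), (1, 100, 200), (2, 300, 400), (3, 1000, 1100)]

adjacent_only = coalesce(ranges, max_gap=0)          # merges only 0 and 1
bridged = coalesce(ranges, max_gap=100)              # also bridges the 100-byte gap
capped = coalesce(ranges, max_gap=100, max_bytes=250)  # size cap blocks the bridge

# Simulate one ranged read for the first bridged run and slice it back.
file_bytes = bytes(range(256)) * 5
first_run = bridged[0]
sliced = slice_run(first_run, file_bytes[first_run[0] : first_run[1]])
```

With `max_gap=0` the four references become three reads and no extra bytes are fetched; with `max_gap=100` they become two reads, at the cost of fetching the 100 unneeded bytes between offsets 200 and 300. That trade-off is the over-read the benchmark in the commit message describes.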
Adds a public, overridable `Store.get_many` for fetching many values, each a whole key or a `(key, byte_range)` pair, in a single call, so a store can batch/coalesce reads that land in the same underlying object instead of issuing one `get` per key.

It generalizes `Store.get_ranges` (many ranges of one key) to many keys, yielding `(request_index, Buffer | None)` batches in completion order. The ABC default is a concurrent fan-out over `get`; `FsspecStore` overrides it to coalesce via fsspec's `cat_ranges`. Coalescing tuning is left to each store rather than exposed on the interface.

Motivation: xref #1806 (batched Store API), #1758 (request coalescing), and zarr-developers/VirtualiZarr#947 (files-as-shards / consolidating small reads). A custom store such as VirtualiZarr's `ManifestStore` or icechunk's `IcechunkStore` can override `get_many` to merge many small chunk reads into fewer requests.

Draft: feedback on the signature welcome.