Use Store.get_many for whole-chunk reads in BatchedCodecPipeline #4113
TomNicholas wants to merge 2 commits into
Add a public, overridable `Store.get_many` that retrieves many values at once, each request being a whole key or a `(key, byte_range)` pair. It generalizes `Store.get_ranges` (many ranges of one key) to many keys, and yields `(request_index, Buffer | None)` batches in completion order so a store can coalesce reads that land in the same underlying object. The ABC default fetches requests concurrently with `get`, so every store works out of the box; stores with a bulk backend override it (`FsspecStore` coalesces via fsspec's `cat_ranges`). Coalescing tuning is left to each store rather than exposed on the interface. This restores and generalizes the batched-fetch capability of the v2 `getitems` Store API (see #1806).
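As a rough illustration of the default behavior described above, here is a minimal sketch; it is not zarr's actual implementation. `ToyStore`, `MAX_IN_FLIGHT`, and the `Request` shape are hypothetical stand-ins, and plain `bytes` takes the place of `Buffer`. Each request is fetched with `get`, and results are yielded as `(request_index, value)` batches in completion order.

```python
from __future__ import annotations

import asyncio
from collections.abc import AsyncIterator, Sequence
from typing import Optional, Tuple, Union

# Hypothetical request shape: a whole key, or (key, (start, end)).
Request = Union[str, Tuple[str, Tuple[int, int]]]

# Hypothetical concurrency limit for this sketch only.
MAX_IN_FLIGHT = 4


class ToyStore:
    """In-memory stand-in for a store; not zarr's Store ABC."""

    def __init__(self, data: dict[str, bytes]) -> None:
        self._data = data

    async def get(
        self, key: str, byte_range: Optional[Tuple[int, int]] = None
    ) -> Optional[bytes]:
        value = self._data.get(key)
        if value is None or byte_range is None:
            return value
        start, end = byte_range
        return value[start:end]

    async def get_many(
        self, requests: Sequence[Request]
    ) -> AsyncIterator[Sequence[Tuple[int, Optional[bytes]]]]:
        """Fallback strategy: one `get` per request, bounded concurrency."""
        semaphore = asyncio.Semaphore(MAX_IN_FLIGHT)

        async def fetch(index: int, request: Request):
            key, byte_range = (
                (request, None) if isinstance(request, str) else request
            )
            async with semaphore:
                return index, await self.get(key, byte_range)

        tasks = [
            asyncio.ensure_future(fetch(i, r)) for i, r in enumerate(requests)
        ]
        for finished in asyncio.as_completed(tasks):
            # One result per batch here; a coalescing store could yield
            # several results that came from the same underlying read.
            yield [await finished]


async def demo() -> dict[int, Optional[bytes]]:
    store = ToyStore({"c/0": b"abcdef", "c/1": b"ghij"})
    requests = ["c/0", ("c/1", (1, 3)), "missing"]
    results = {}
    async for batch in store.get_many(requests):
        for index, value in batch:
            results[index] = value
    return results
```

A store with a bulk backend would override `get_many` and could group requests that hit the same object into one read, while callers keep consuming the same `(index, value)` batches.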
`BatchedCodecPipeline.read` now fetches the encoded bytes for an entire (non-sharded) read with a single `Store.get_many` call, instead of one `Store.get` per chunk. It drives `get_many` over all chunk keys, scatters the completion-ordered `(index, buffer)` results back into position, and feeds them to the per-batch decode path. This lets a store batch or coalesce the underlying reads (e.g. `FsspecStore` via `cat_ranges`, or a custom store such as virtualizarr's `ManifestStore` / icechunk's `IcechunkStore` that overrides `get_many`) regardless of `codec_pipeline.batch_size`, which still governs only decode batching. The sharding codec's partial-decode path is untouched, and stores without a specialized `get_many` fall back to the previous concurrent per-chunk gets.
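The scatter step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline's code: `fake_get_many`, `fetch_in_order`, and `decode_in_batches` are hypothetical names, `bytes.upper` stands in for a real codec, and plain `bytes` replaces `Buffer`.

```python
from __future__ import annotations

import asyncio
from collections.abc import AsyncIterator, Sequence
from typing import Optional, Tuple


async def fake_get_many(
    keys: Sequence[str],
) -> AsyncIterator[Sequence[Tuple[int, Optional[bytes]]]]:
    """Hypothetical store call that completes out of request order."""
    yield [(2, b"c")]
    yield [(0, b"a"), (1, None)]  # index 1: key missing from the store


async def fetch_in_order(keys: Sequence[str], get_many) -> list:
    """Issue one bulk call, then put each result back at its request index."""
    buffers: list = [None] * len(keys)
    async for batch in get_many(keys):
        for index, buffer in batch:
            buffers[index] = buffer
    return buffers


def decode_in_batches(buffers: Sequence[Optional[bytes]], batch_size: int) -> list:
    """Decode in groups of `batch_size`; fetching was already done in bulk."""
    decoded: list = []
    for start in range(0, len(buffers), batch_size):
        batch = buffers[start : start + batch_size]
        # Stand-in "codec": upper-case the bytes, pass missing chunks through.
        decoded.extend(None if b is None else b.upper() for b in batch)
    return decoded


async def read_all(keys: Sequence[str], batch_size: int) -> list:
    buffers = await fetch_in_order(keys, fake_get_many)
    return decode_in_batches(buffers, batch_size)
```

The point of the split is that `batch_size` only appears in the decode step, so changing it does not alter how many store calls are made.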
Codecov Report
❌ Patch coverage is
Additional details and impacted files

Coverage diff, main vs. #4113:

| | main | #4113 | +/- |
|---|---|---|---|
| Coverage | 93.50% | 93.51% | |
| Files | 90 | 90 | |
| Lines | 11981 | 12051 | +70 |
| Hits | 11203 | 11269 | +66 |
| Misses | 778 | 782 | +4 |
Can you highlight what the relation of this to #3925 is?
Yes, good question. For my use case (fetching virtual chunks that are separate keys but happen to be part of the same netCDF file object) I need a method for bulk fetching multiple keys. Then my store (
    ) -> AsyncIterator[Sequence[tuple[int, Buffer | None]]]:
        """Retrieve many values, possibly from different keys, at once.

        This is the bulk counterpart to :meth:`get`: the whole set of requests
rst-style docstring -> mkdocs-style docstring
        """
        # Local imports to avoid an import cycle at module load time.
        from zarr.core.common import concurrent_map
        from zarr.core.config import config
what if the concurrency is a plain keyword-only parameter for this function?
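One way to read this suggestion, as a hedged sketch only: the limit becomes a keyword-only argument, with a default used when the caller passes nothing. `fetch_all` and `DEFAULT_CONCURRENCY` are hypothetical names for illustration (the constant stands in for whatever a config lookup would return), and this is not the signature proposed in the PR.

```python
from __future__ import annotations

import asyncio
from collections.abc import Awaitable, Callable, Sequence
from typing import Optional

# Hypothetical stand-in for a value that would otherwise come from config.
DEFAULT_CONCURRENCY = 10


async def fetch_all(
    keys: Sequence[str],
    get: Callable[[str], Awaitable[Optional[bytes]]],
    *,
    concurrency: Optional[int] = None,
) -> list:
    """Fetch every key with `get`, at most `concurrency` calls in flight."""
    limit = DEFAULT_CONCURRENCY if concurrency is None else concurrency
    semaphore = asyncio.Semaphore(limit)

    async def fetch_one(key: str) -> Optional[bytes]:
        async with semaphore:
            return await get(key)

    # gather preserves the order of `keys` in its result.
    return list(await asyncio.gather(*(fetch_one(k) for k in keys)))


async def demo_peak(concurrency: int):
    """Return the fetched values and the peak number of in-flight gets."""
    in_flight = 0
    peak = 0

    async def get(key: str) -> bytes:
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0)  # yield control so other fetches can start
        in_flight -= 1
        return key.encode()

    values = await fetch_all(["a", "b", "c", "d"], get, concurrency=concurrency)
    return values, peak
```

With an explicit parameter the function no longer needs a function-local config import, and a caller or test can pin the limit without touching global state.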
Ah nice! Now I get what you mean by "is handed to the store in a single call, so an implementation can fetch..."
Builds on #4112.
`BatchedCodecPipeline.read` now fetches a whole (non-sharded) request with a single `Store.get_many` call instead of one `get` per chunk, so a store can batch/coalesce the underlying reads, independently of `codec_pipeline.batch_size`, which still governs only decode batching.
The sharding codec's partial-decode path is unchanged, and stores without a specialized `get_many` fall back to the previous concurrent per-chunk behavior.
Motivation: xref #1758 (request coalescing), #1806 (batched Store API), and zarr-developers/VirtualiZarr#947 (files-as-shards / consolidating small reads).
Stacked on #4112: its commit is the first one here; review after it. Draft.