[ExecuTorch][WebGPU] 2D compute dispatch tests — prefill golden + fold unit test#20584
JulianCloudNTH wants to merge 4 commits into
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20584
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures
As of commit d5e7d33 with merge base db5f1ba.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@claude review and check for any areas or opportunities for modularization
Claude finished @JulianCloudNTH's task in 4m 15s
Review: PR #20584 (2D compute dispatch tests)
This is a clean, well-targeted PR. The fold arithmetic in
🟡 Modularization: the device-free test links the full backend unnecessarily
This is the main opportunity. The repo already establishes the right pattern for exactly this case: Two ways to capture this:
🟡 Hand-rolled PASS/FAIL harness duplicates GTest
🟢 Coverage / clarity nits (minor)
Summary
No correctness bugs; the arithmetic, mirroring, and registrations are sound. The one change worth making is the CMake modularization: this device-free test shouldn't link the backend + Dawn. The existing
…d unit test

Pull Request resolved: #20584

**Test coverage for the 2D dispatch fold, stacked above the cap-lift op.**

**Problem**: The 2D fold is load-bearing index math (a wrong `{x, y}` means out-of-bounds writes or dropped threads), and the prefill shapes that exercise it previously threw at the 1D cap, so they were untested.

**Solution**: A device-free unit test for the fold arithmetic, plus two single-shot prefill SDPA golden configs that fold each kernel family.

- **Before**: no coverage for >65535-workgroup dispatch; `llama1b_prefill_512`/`_2048` shapes threw at the cap
- **After**: `fold_workgroup_count_2d` unit-tested at the cap boundaries, and the two prefill shapes run as goldens

**Implementation**:
- `test/native/test_dispatch_2d.cpp`: device-free unit test for `utils::fold_workgroup_count_2d`, covering the 1D fast path, the 2D fold, the real Llama-1B QK counts at S=512 (`{65535, 3}`) and S=2048 (`{65535, 33}`), and the needs-3rd-dimension throw; asserts each `{x, y}` covers `[0, count)`
- `llama1b_prefill_512` + `llama1b_prefill_2048` configs appended to the byte-mirrored `CONFIGS` (`test_sdpa.py`) and `kSdpaConfigs` (`test_webgpu_native.cpp`)
- Registers `webgpu_dispatch_2d_test` in CMake + the native CI script

**Constraints**:
- The Python/C++ config entries byte-mirror each other (kept in sync)
- `add` shares the element-form path with QK, so it is covered structurally; a dedicated >16M-element `add` fold case is omitted as disproportionate

Co-authored-with: Claude Code.

ghstack-source-id: 398258612
@exported-using-ghexport

Differential Revision: [D109517683](https://our.internmc.facebook.com/intern/diff/D109517683/)
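For readers who want to check the quoted `{x, y}` pairs by hand, here is a minimal Python model of one plausible fold. It is a sketch, not the backend's C++ `utils::fold_workgroup_count_2d`: it assumes the fold pins `x` at the 65535 per-dimension cap and takes `y` as the ceiling of `count / x`, and it assumes a positive `count`. The counts 131072 and 2097152 are illustrative values picked because they yield the two pairs quoted above; the actual Llama-1B counts are not stated in this thread.

```python
from __future__ import annotations

MAX_WORKGROUPS_PER_DIM = 65535  # per-dimension cap quoted in the PR description


def fold_workgroup_count_2d(
    count: int, cap: int = MAX_WORKGROUPS_PER_DIM
) -> tuple[int, int]:
    """Hypothetical Python model of a 2D fold; not the real C++ helper."""
    if count <= cap:
        return (count, 1)  # 1D fast path: no fold needed
    x = cap
    y = -(-count // x)  # ceiling division, so x * y >= count
    if y > cap:
        # Even a full cap-by-cap grid is too small for this count.
        raise ValueError("count needs a third dispatch dimension")
    return (x, y)


# Illustrative counts (assumed, see the note above).
print(fold_workgroup_count_2d(131072))   # (65535, 3)
print(fold_workgroup_count_2d(2097152))  # (65535, 33)
```

Under this model the grid always covers `[0, count)` because `x * y >= count`, and the overhang (`x * y - count`) is less than one row of `x` workgroups.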
Stack from ghstack (oldest at bottom):
**Test coverage for the 2D dispatch fold, stacked above the cap-lift op.**

**Problem**: The 2D fold is load-bearing index math (a wrong `{x, y}` means out-of-bounds writes or dropped threads), and the prefill shapes that exercise it previously threw at the 1D cap, so they were untested.

**Solution**: A device-free unit test for the fold arithmetic, plus two single-shot prefill SDPA golden configs that fold each kernel family.

- **Before**: no coverage for >65535-workgroup dispatch; `llama1b_prefill_512`/`_2048` shapes threw at the cap
- **After**: `fold_workgroup_count_2d` unit-tested at the cap boundaries, and the two prefill shapes run as goldens

**Implementation**:
- `test/native/test_dispatch_2d.cpp`: device-free unit test for `utils::fold_workgroup_count_2d`, covering the 1D fast path, the 2D fold, the real Llama-1B QK counts at S=512 (`{65535, 3}`) and S=2048 (`{65535, 33}`), and the needs-3rd-dimension throw; asserts each `{x, y}` covers `[0, count)`
- `llama1b_prefill_512` + `llama1b_prefill_2048` configs appended to the byte-mirrored `CONFIGS` (`test_sdpa.py`) and `kSdpaConfigs` (`test_webgpu_native.cpp`)
- Registers `webgpu_dispatch_2d_test` in CMake + the native CI script

**Constraints**:
- The Python/C++ config entries byte-mirror each other (kept in sync)
- `add` shares the element-form path with QK, so it is covered structurally; a dedicated >16M-element `add` fold case is omitted as disproportionate

Co-authored-with: Claude Code.
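The two failure modes named in the Problem statement can be reproduced on a toy grid. The sketch below is an illustration only: it assumes the kernel recovers a linear workgroup index as `y * x_dim + x` and exits early when that index is at or past `count`. Neither the linearization nor the guard is shown in this PR, and `simulate_dispatch` and `classify` are hypothetical helper names.

```python
def simulate_dispatch(grid, count, guard=True):
    """Return the linear indices that the workgroups of a 2D grid would touch."""
    x_dim, y_dim = grid
    touched = []
    for y in range(y_dim):
        for x in range(x_dim):
            linear = y * x_dim + x  # assumed row-major linearization
            if guard and linear >= count:
                continue  # overhang workgroup exits without writing
            touched.append(linear)
    return touched


def classify(grid, count, guard=True):
    """Split the result into dropped indices and out-of-bounds indices."""
    touched = simulate_dispatch(grid, count, guard)
    dropped = sorted(set(range(count)) - set(touched))
    out_of_bounds = sorted(i for i in touched if i >= count)
    return dropped, out_of_bounds


# Toy case: 10 units of work on a grid whose rows hold 4 workgroups.
print(classify((4, 3), 10))               # ([], [])        correct fold, guarded
print(classify((4, 2), 10))               # ([8, 9], [])    y too small: dropped work
print(classify((4, 3), 10, guard=False))  # ([], [10, 11])  no guard: out-of-bounds
```

A grid that is one row short silently drops the tail of the work, while a covering grid without a bounds check writes past the end, which is why asserting that each `{x, y}` covers `[0, count)` is only half of the safety argument; the other half lives in the kernel.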
@exported-using-ghexport
Differential Revision: D109517683