[ExecuTorch][WebGPU] 2D compute dispatch tests — prefill golden + fold unit test by JulianCloudNTH · Pull Request #20584 · pytorch/executorch

JulianCloudNTH · 2026-06-28T16:22:56Z

Stack from ghstack (oldest at bottom):

Test coverage for the 2D dispatch fold, stacked above the cap-lift op.

Problem: The 2D fold is load-bearing index math — a wrong {x, y} means out-of-bounds writes or dropped threads — and the prefill shapes that exercise it previously threw at the 1D cap, so they were untested.

Solution: A device-free unit test for the fold arithmetic, plus two single-shot prefill SDPA golden configs that fold each kernel family.

Before: no coverage for >65535-workgroup dispatch; llama1b_prefill_512/_2048 shapes threw at the cap
After: fold_workgroup_count_2d unit-tested at the cap boundaries, and the two prefill shapes run as goldens

Implementation:

test/native/test_dispatch_2d.cpp — device-free unit test for utils::fold_workgroup_count_2d: the 1D fast path, the 2D fold, the real Llama-1B QK counts at S=512 ({65535, 3}) and S=2048 ({65535, 33}), and the needs-3rd-dimension throw; asserts each {x, y} covers [0, count)
llama1b_prefill_512 + llama1b_prefill_2048 configs appended to the byte-mirrored CONFIGS (test_sdpa.py) and kSdpaConfigs (test_webgpu_native.cpp)
Registers webgpu_dispatch_2d_test in CMake + the native CI script

Constraints:

The Python/C++ config entries byte-mirror each other (kept in sync)
add shares the element-form path with QK, so it is covered structurally; a dedicated >16M-element add fold case is omitted as disproportionate

Co-authored-with: Claude Code.
@exported-using-ghexport

Differential Revision: D109517683

[ghstack-poisoned]

pytorch-bot · 2026-06-28T16:23:00Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20584

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit d5e7d33 with merge base db5f1ba ():
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-06-28T16:23:39Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

JulianCloudNTH · 2026-06-29T20:21:09Z

@claude review and check for any areas or opportunities for modularization

claude · 2026-06-29T20:21:50Z

Claude finished @JulianCloudNTH's task in 4m 15s —— View job

Review — PR #20584 (2D compute dispatch tests)

Read PR diff & changed files
Inspect fold_workgroup_count_2d implementation
Review test correctness & coverage
Check for modularization opportunities
Post review

This is a clean, well-targeted PR. The fold arithmetic in test_dispatch_2d.cpp checks out against fold_workgroup_count_2d (WebGPUUtils.h:66), the byte-mirrored Python/C++ config entries line up field-for-field, and the CMake + CI registrations are correct. Below are findings, ordered by impact.

🟡 Modularization — the device-free test links the full backend unnecessarily

This is the main opportunity. test_dispatch_2d.cpp is device-free (its own comments say so, and it touches only the header-only utils::fold_workgroup_count_2d). But it's registered via add_webgpu_native_test (CMakeLists.txt:166), which links the whole stack — webgpu_backend, ${WEBGPU_GPU_LIB} (Dawn), extension_module_static, portable_ops_lib, Metal frameworks, etc. (CMakeLists.txt:124-142).

The repo already establishes the right pattern for exactly this case: webgpu_op_test_util_test (CMakeLists.txt:186-201) is a device-free util test that deliberately does not use add_webgpu_native_test precisely to avoid the backend/Dawn link. The new fold test fits that same category and should follow it — a bare add_executable + the $<BUILD_INTERFACE:...> include dir for the webgpu/webgpu.h header, no GPU lib.

Two ways to capture this:

Minimal: give webgpu_dispatch_2d_test its own add_executable block mirroring webgpu_op_test_util_test, dropping the backend/GPU-lib link.
Most modular: fold the case into the existing webgpu_op_test_util_test (or a shared "device-free utils" target) as another GTest case, since both are pure WebGPUUtils.h/driver-util tests with no device. That also retires the hand-rolled harness (next point) and removes a CI target line.

Fix this →

🟡 Hand-rolled PASS/FAIL harness duplicates GTest

test_dispatch_2d.cpp:20-77 reimplements assertion plumbing (g_failures, expect_fold, expect_throw, manual printf + return-code aggregation). The device-free util test next door already uses GTest::gtest_main with EXPECT_EQ/EXPECT_THROW, which gives the same {x,y} diagnostics for free and a real test runner. If you adopt the "most modular" option above, this harness disappears entirely. If you keep a standalone target, it's still worth porting to GTest for consistency with the sibling device-free test.

🟢 Coverage / clarity nits (minor)

2097152 is a synthetic value, not a real config. test_dispatch_2d.cpp:74 is labeled "deep fold (large-S stress)" but doesn't correspond to any shape in CONFIGS. The 131072 case above it is well-justified (S=2048 QK = 32*512*512/64). Either tie the deep case to a concrete shape in a comment or note explicitly that it's a synthetic boundary probe so a future reader doesn't hunt for the matching config.
The "covers [0, count)" claim is only half-checked. expect_fold asserts x*y >= count (no dropped threads) — good. But the PR description says it "asserts each {x,y} covers [0, count)", which also implies the shader must guard the over-covered tail (x*y - count extra threads, e.g. {65535,2} dispatches 131070 ≥ 131070 exactly, but {65535,3} for 131072 over-dispatches by 64533). The unit test correctly can't verify that guard — it lives in the shader's @builtin(num_workgroups) reconstruction and is exercised by the prefill goldens. Worth a one-line comment clarifying the unit test covers sufficiency (no gaps) while the goldens cover the tail guard, so the two halves of the contract are traceable.
required=true on the prefill goldens — consistent with llama1b_decode_4k/8k (test_webgpu_native.cpp:736-737), so an absent .pte is a hard CI failure rather than a skip. Intentional and correct given the point is to exercise the fold; just flagging it's the right call.

Summary

No correctness bugs — the arithmetic, mirroring, and registrations are sound. The one change worth making is the CMake modularization: this device-free test shouldn't link the backend + Dawn. The existing webgpu_op_test_util_test is the template to follow, and folding into it (or at least matching its bare-add_executable shape and GTest usage) would also retire the hand-rolled harness. Everything else is optional polish.
· branch gh/JulianCloudNTH/76/head

[ghstack-poisoned]

…d unit test Pull Request resolved: #20584 **Test coverage for the 2D dispatch fold, stacked above the cap-lift op.** **Problem**: The 2D fold is load-bearing index math — a wrong `{x, y}` means out-of-bounds writes or dropped threads — and the prefill shapes that exercise it previously threw at the 1D cap, so they were untested. **Solution**: A device-free unit test for the fold arithmetic, plus two single-shot prefill SDPA golden configs that fold each kernel family. - **Before**: no coverage for >65535-workgroup dispatch; `llama1b_prefill_512`/`_2048` shapes threw at the cap - **After**: `fold_workgroup_count_2d` unit-tested at the cap boundaries, and the two prefill shapes run as goldens **Implementation**: - `test/native/test_dispatch_2d.cpp` — device-free unit test for `utils::fold_workgroup_count_2d`: the 1D fast path, the 2D fold, the real Llama-1B QK counts at S=512 (`{65535, 3}`) and S=2048 (`{65535, 33}`), and the needs-3rd-dimension throw; asserts each `{x, y}` covers `[0, count)` - `llama1b_prefill_512` + `llama1b_prefill_2048` configs appended to the byte-mirrored `CONFIGS` (`test_sdpa.py`) and `kSdpaConfigs` (`test_webgpu_native.cpp`) - Registers `webgpu_dispatch_2d_test` in CMake + the native CI script **Constraints**: - The Python/C++ config entries byte-mirror each other (kept in sync) - `add` shares the element-form path with QK, so it is covered structurally; a dedicated >16M-element `add` fold case is omitted as disproportionate Co-authored-with: Claude Code. ghstack-source-id: 398258612 @exported-using-ghexport Differential Revision: [D109517683](https://our.internmc.facebook.com/intern/diff/D109517683/)

[ghstack-poisoned]

…d unit test Pull Request resolved: #20584 **Test coverage for the 2D dispatch fold, stacked above the cap-lift op.** **Problem**: The 2D fold is load-bearing index math — a wrong `{x, y}` means out-of-bounds writes or dropped threads — and the prefill shapes that exercise it previously threw at the 1D cap, so they were untested. **Solution**: A device-free unit test for the fold arithmetic, plus two single-shot prefill SDPA golden configs that fold each kernel family. - **Before**: no coverage for >65535-workgroup dispatch; `llama1b_prefill_512`/`_2048` shapes threw at the cap - **After**: `fold_workgroup_count_2d` unit-tested at the cap boundaries, and the two prefill shapes run as goldens **Implementation**: - `test/native/test_dispatch_2d.cpp` — device-free unit test for `utils::fold_workgroup_count_2d`: the 1D fast path, the 2D fold, the real Llama-1B QK counts at S=512 (`{65535, 3}`) and S=2048 (`{65535, 33}`), and the needs-3rd-dimension throw; asserts each `{x, y}` covers `[0, count)` - `llama1b_prefill_512` + `llama1b_prefill_2048` configs appended to the byte-mirrored `CONFIGS` (`test_sdpa.py`) and `kSdpaConfigs` (`test_webgpu_native.cpp`) - Registers `webgpu_dispatch_2d_test` in CMake + the native CI script **Constraints**: - The Python/C++ config entries byte-mirror each other (kept in sync) - `add` shares the element-form path with QK, so it is covered structurally; a dedicated >16M-element `add` fold case is omitted as disproportionate Co-authored-with: Claude Code. ghstack-source-id: 398355257 @exported-using-ghexport Differential Revision: [D109517683](https://our.internmc.facebook.com/intern/diff/D109517683/)

[ghstack-poisoned]

Update

1f4f899

[ghstack-poisoned]

JulianCloudNTH requested review from kirklandsign and larryliu0820 as code owners June 28, 2026 16:22

JulianCloudNTH temporarily deployed to cadence June 28, 2026 16:23 — with GitHub Actions Inactive

meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 28, 2026

Update

ebe9569

[ghstack-poisoned]

JulianCloudNTH temporarily deployed to cadence June 29, 2026 22:10 — with GitHub Actions Inactive

meta-codesync Bot added the meta-exported label Jun 29, 2026

Update

7842c81

[ghstack-poisoned]

JulianCloudNTH temporarily deployed to cadence June 30, 2026 02:46 — with GitHub Actions Inactive

Update

d5e7d33

[ghstack-poisoned]

JulianCloudNTH temporarily deployed to cadence June 30, 2026 21:10 — with GitHub Actions Inactive

This was referenced Jun 30, 2026

[ExecuTorch][WebGPU] 2D-fold mul + permute dispatch (lift 65535 1D cap) #20651

Open

[ExecuTorch][WebGPU] Use requiredFeatures instance API on native + emscripten Dawn #20652

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[ExecuTorch][WebGPU] 2D compute dispatch tests — prefill golden + fold unit test#20584

[ExecuTorch][WebGPU] 2D compute dispatch tests — prefill golden + fold unit test#20584
JulianCloudNTH wants to merge 4 commits into
gh/JulianCloudNTH/76/basefrom
gh/JulianCloudNTH/76/head

JulianCloudNTH commented Jun 28, 2026 •

edited

Loading

Uh oh!

pytorch-bot Bot commented Jun 28, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 28, 2026

Uh oh!

JulianCloudNTH commented Jun 29, 2026

Uh oh!

claude Bot commented Jun 29, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

JulianCloudNTH commented Jun 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pytorch-bot Bot commented Jun 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20584

✅ No Failures

Uh oh!

github-actions Bot commented Jun 28, 2026

This PR needs a release notes: label

Uh oh!

JulianCloudNTH commented Jun 29, 2026

Uh oh!

claude Bot commented Jun 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review — PR #20584 (2D compute dispatch tests)

🟡 Modularization — the device-free test links the full backend unnecessarily

🟡 Hand-rolled PASS/FAIL harness duplicates GTest

🟢 Coverage / clarity nits (minor)

Summary

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

JulianCloudNTH commented Jun 28, 2026 •

edited

Loading

pytorch-bot Bot commented Jun 28, 2026 •

edited

Loading

This PR needs a `release notes:` label

claude Bot commented Jun 29, 2026 •

edited

Loading