Skip to content

[ExecuTorch][WebGPU] 2D compute dispatch tests — prefill golden + fold unit test#20584

Open
JulianCloudNTH wants to merge 4 commits into
gh/JulianCloudNTH/76/basefrom
gh/JulianCloudNTH/76/head
Open

[ExecuTorch][WebGPU] 2D compute dispatch tests — prefill golden + fold unit test#20584
JulianCloudNTH wants to merge 4 commits into
gh/JulianCloudNTH/76/basefrom
gh/JulianCloudNTH/76/head

Conversation

@JulianCloudNTH

@JulianCloudNTH JulianCloudNTH commented Jun 28, 2026

Copy link
Copy Markdown
Contributor

Stack from ghstack (oldest at bottom):

Test coverage for the 2D dispatch fold, stacked above the cap-lift op.

Problem: The 2D fold is load-bearing index math — a wrong {x, y} means out-of-bounds writes or dropped threads — and the prefill shapes that exercise it previously threw at the 1D cap, so they were untested.

Solution: A device-free unit test for the fold arithmetic, plus two single-shot prefill SDPA golden configs that fold each kernel family.

  • Before: no coverage for >65535-workgroup dispatch; llama1b_prefill_512/_2048 shapes threw at the cap
  • After: fold_workgroup_count_2d unit-tested at the cap boundaries, and the two prefill shapes run as goldens

Implementation:

  • test/native/test_dispatch_2d.cpp — device-free unit test for utils::fold_workgroup_count_2d: the 1D fast path, the 2D fold, the real Llama-1B QK counts at S=512 ({65535, 3}) and S=2048 ({65535, 33}), and the needs-3rd-dimension throw; asserts each {x, y} covers [0, count)
  • llama1b_prefill_512 + llama1b_prefill_2048 configs appended to the byte-mirrored CONFIGS (test_sdpa.py) and kSdpaConfigs (test_webgpu_native.cpp)
  • Registers webgpu_dispatch_2d_test in CMake + the native CI script

Constraints:

  • The Python/C++ config entries byte-mirror each other (kept in sync)
  • add shares the element-form path with QK, so it is covered structurally; a dedicated >16M-element add fold case is omitted as disproportionate

Co-authored-with: Claude Code.
@exported-using-ghexport

Differential Revision: D109517683

Differential Revision: D109517683

[ghstack-poisoned]
@pytorch-bot

pytorch-bot Bot commented Jun 28, 2026

Copy link
Copy Markdown

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20584

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit d5e7d33 with merge base db5f1ba (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@github-actions

Copy link
Copy Markdown

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 28, 2026
@JulianCloudNTH

Copy link
Copy Markdown
Contributor Author

@claude review and check for any areas or opportunities for modularization

@claude

claude Bot commented Jun 29, 2026

Copy link
Copy Markdown

Claude finished @JulianCloudNTH's task in 4m 15s —— View job


Review — PR #20584 (2D compute dispatch tests)

  • Read PR diff & changed files
  • Inspect fold_workgroup_count_2d implementation
  • Review test correctness & coverage
  • Check for modularization opportunities
  • Post review

This is a clean, well-targeted PR. The fold arithmetic in test_dispatch_2d.cpp checks out against fold_workgroup_count_2d (WebGPUUtils.h:66), the byte-mirrored Python/C++ config entries line up field-for-field, and the CMake + CI registrations are correct. Below are findings, ordered by impact.


🟡 Modularization — the device-free test links the full backend unnecessarily

This is the main opportunity. test_dispatch_2d.cpp is device-free (its own comments say so, and it touches only the header-only utils::fold_workgroup_count_2d). But it's registered via add_webgpu_native_test (CMakeLists.txt:166), which links the whole stack — webgpu_backend, ${WEBGPU_GPU_LIB} (Dawn), extension_module_static, portable_ops_lib, Metal frameworks, etc. (CMakeLists.txt:124-142).

The repo already establishes the right pattern for exactly this case: webgpu_op_test_util_test (CMakeLists.txt:186-201) is a device-free util test that deliberately does not use add_webgpu_native_test precisely to avoid the backend/Dawn link. The new fold test fits that same category and should follow it — a bare add_executable + the $<BUILD_INTERFACE:...> include dir for the webgpu/webgpu.h header, no GPU lib.

Two ways to capture this:

  • Minimal: give webgpu_dispatch_2d_test its own add_executable block mirroring webgpu_op_test_util_test, dropping the backend/GPU-lib link.
  • Most modular: fold the case into the existing webgpu_op_test_util_test (or a shared "device-free utils" target) as another GTest case, since both are pure WebGPUUtils.h/driver-util tests with no device. That also retires the hand-rolled harness (next point) and removes a CI target line.

Fix this →


🟡 Hand-rolled PASS/FAIL harness duplicates GTest

test_dispatch_2d.cpp:20-77 reimplements assertion plumbing (g_failures, expect_fold, expect_throw, manual printf + return-code aggregation). The device-free util test next door already uses GTest::gtest_main with EXPECT_EQ/EXPECT_THROW, which gives the same {x,y} diagnostics for free and a real test runner. If you adopt the "most modular" option above, this harness disappears entirely. If you keep a standalone target, it's still worth porting to GTest for consistency with the sibling device-free test.


🟢 Coverage / clarity nits (minor)

  1. 2097152 is a synthetic value, not a real config. test_dispatch_2d.cpp:74 is labeled "deep fold (large-S stress)" but doesn't correspond to any shape in CONFIGS. The 131072 case above it is well-justified (S=2048 QK = 32*512*512/64). Either tie the deep case to a concrete shape in a comment or note explicitly that it's a synthetic boundary probe so a future reader doesn't hunt for the matching config.

  2. The "covers [0, count)" claim is only half-checked. expect_fold asserts x*y >= count (no dropped threads) — good. But the PR description says it "asserts each {x,y} covers [0, count)", which also implies the shader must guard the over-covered tail (x*y - count extra threads, e.g. {65535,2} dispatches 131070 ≥ 131070 exactly, but {65535,3} for 131072 over-dispatches by 64533). The unit test correctly can't verify that guard — it lives in the shader's @builtin(num_workgroups) reconstruction and is exercised by the prefill goldens. Worth a one-line comment clarifying the unit test covers sufficiency (no gaps) while the goldens cover the tail guard, so the two halves of the contract are traceable.

  3. required=true on the prefill goldens — consistent with llama1b_decode_4k/8k (test_webgpu_native.cpp:736-737), so an absent .pte is a hard CI failure rather than a skip. Intentional and correct given the point is to exercise the fold; just flagging it's the right call.


Summary

No correctness bugs — the arithmetic, mirroring, and registrations are sound. The one change worth making is the CMake modularization: this device-free test shouldn't link the backend + Dawn. The existing webgpu_op_test_util_test is the template to follow, and folding into it (or at least matching its bare-add_executable shape and GTest usage) would also retire the hand-rolled harness. Everything else is optional polish.
· branch gh/JulianCloudNTH/76/head

[ghstack-poisoned]
JulianCloudNTH added a commit that referenced this pull request Jun 29, 2026
…d unit test

Pull Request resolved: #20584

**Test coverage for the 2D dispatch fold, stacked above the cap-lift op.**

**Problem**: The 2D fold is load-bearing index math — a wrong `{x, y}` means out-of-bounds writes or dropped threads — and the prefill shapes that exercise it previously threw at the 1D cap, so they were untested.

**Solution**: A device-free unit test for the fold arithmetic, plus two single-shot prefill SDPA golden configs that fold each kernel family.
- **Before**: no coverage for >65535-workgroup dispatch; `llama1b_prefill_512`/`_2048` shapes threw at the cap
- **After**: `fold_workgroup_count_2d` unit-tested at the cap boundaries, and the two prefill shapes run as goldens

**Implementation**:
- `test/native/test_dispatch_2d.cpp` — device-free unit test for `utils::fold_workgroup_count_2d`: the 1D fast path, the 2D fold, the real Llama-1B QK counts at S=512 (`{65535, 3}`) and S=2048 (`{65535, 33}`), and the needs-3rd-dimension throw; asserts each `{x, y}` covers `[0, count)`
- `llama1b_prefill_512` + `llama1b_prefill_2048` configs appended to the byte-mirrored `CONFIGS` (`test_sdpa.py`) and `kSdpaConfigs` (`test_webgpu_native.cpp`)
- Registers `webgpu_dispatch_2d_test` in CMake + the native CI script

**Constraints**:
- The Python/C++ config entries byte-mirror each other (kept in sync)
- `add` shares the element-form path with QK, so it is covered structurally; a dedicated >16M-element `add` fold case is omitted as disproportionate

Co-authored-with: Claude Code.
ghstack-source-id: 398258612
@exported-using-ghexport

Differential Revision: [D109517683](https://our.internmc.facebook.com/intern/diff/D109517683/)
[ghstack-poisoned]
JulianCloudNTH added a commit that referenced this pull request Jun 30, 2026
…d unit test

Pull Request resolved: #20584

**Test coverage for the 2D dispatch fold, stacked above the cap-lift op.**

**Problem**: The 2D fold is load-bearing index math — a wrong `{x, y}` means out-of-bounds writes or dropped threads — and the prefill shapes that exercise it previously threw at the 1D cap, so they were untested.

**Solution**: A device-free unit test for the fold arithmetic, plus two single-shot prefill SDPA golden configs that fold each kernel family.
- **Before**: no coverage for >65535-workgroup dispatch; `llama1b_prefill_512`/`_2048` shapes threw at the cap
- **After**: `fold_workgroup_count_2d` unit-tested at the cap boundaries, and the two prefill shapes run as goldens

**Implementation**:
- `test/native/test_dispatch_2d.cpp` — device-free unit test for `utils::fold_workgroup_count_2d`: the 1D fast path, the 2D fold, the real Llama-1B QK counts at S=512 (`{65535, 3}`) and S=2048 (`{65535, 33}`), and the needs-3rd-dimension throw; asserts each `{x, y}` covers `[0, count)`
- `llama1b_prefill_512` + `llama1b_prefill_2048` configs appended to the byte-mirrored `CONFIGS` (`test_sdpa.py`) and `kSdpaConfigs` (`test_webgpu_native.cpp`)
- Registers `webgpu_dispatch_2d_test` in CMake + the native CI script

**Constraints**:
- The Python/C++ config entries byte-mirror each other (kept in sync)
- `add` shares the element-form path with QK, so it is covered structurally; a dedicated >16M-element `add` fold case is omitted as disproportionate

Co-authored-with: Claude Code.
ghstack-source-id: 398355257
@exported-using-ghexport

Differential Revision: [D109517683](https://our.internmc.facebook.com/intern/diff/D109517683/)
[ghstack-poisoned]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. meta-exported

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants