Skip to content

[ExecuTorch][WebGPU] Dynamic resize hooks for add and mul#20577

Open
JulianCloudNTH wants to merge 4 commits into
gh/JulianCloudNTH/69/basefrom
gh/JulianCloudNTH/69/head
Open

[ExecuTorch][WebGPU] Dynamic resize hooks for add and mul#20577
JulianCloudNTH wants to merge 4 commits into
gh/JulianCloudNTH/69/basefrom
gh/JulianCloudNTH/69/head

Conversation

@JulianCloudNTH

@JulianCloudNTH JulianCloudNTH commented Jun 28, 2026

Copy link
Copy Markdown
Contributor

Stack from ghstack (oldest at bottom):

Make the elementwise add and mul ops serve any live shape from one graph.

Problem: aten.add.Tensor and aten.mul.Tensor baked their element count + param UBO(s) + output shape at build() for the max shape. On a dynamic-shape graph at a smaller live shape they would over-dispatch and leave the output sized at the max.

Solution:

  • Before: one fixed dispatch sized for the build-time shape.
  • After: each registers a resize hook on BOTH operands (the dynamic one may be either operand by arg order). The hook recomputes the live element count, rewrites the param UBO(s), updates the dispatch workgroup_count_x, and sets the output cur_dims. Inert until an operand is resized.

Implementation:

  • add: out follows the larger operand (robust when one input is a static residual and the other is the dynamic-S tensor); rewrites AddParams.
  • mul: recomputes the broadcast output shape and rebuilds all three TensorMeta UBOs via fill_tensor_meta_broadcast.
  • Each keeps its uniform buffer(s) alive via own_uniform_buffer instead of releasing at build.
  • Mirrors Vulkan per-op resize_*_node (recompute sizes + dispatch each execute).

Constraints: Behavior-neutral on static graphs (the hook fires only when an operand's live shape differs from the max). No kernel/WGSL/numerics change.

Co-authored-with: Claude Code.
@exported-using-ghexport

Differential Revision: D109906093

Differential Revision: D109906093

[ghstack-poisoned]
@pytorch-bot

pytorch-bot Bot commented Jun 28, 2026

Copy link
Copy Markdown

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20577

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit 8ea5c50 with merge base db5f1ba (image):

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@github-actions

Copy link
Copy Markdown

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@JulianCloudNTH

Copy link
Copy Markdown
Contributor Author

@claude review and check for any areas or opportunities for modularization

@claude

claude Bot commented Jun 29, 2026

Copy link
Copy Markdown

Claude finished @JulianCloudNTH's task in 2m 16s —— View job


Review: Dynamic resize hooks for add and mul

  • Read both changed files in full
  • Inspect WebGPUGraph resize-hook API + shared helpers (TensorMeta.h, WebGPUUtils.h)
  • Compare add vs mul resize logic for correctness
  • Identify modularization opportunities

The change is correct and well-scoped. The hooks key off cur_dims and set_cur_dims only re-dirties on an actual change, so the static path stays inert exactly as advertised, and propagate_resize converges. Ownership handoff (own_uniform_buffer instead of wgpuBufferRelease) is right — the dtor frees them, and the bind group still holds its own ref. The pattern matches the existing rms_norm resize hook, so it reads consistently with the rest of the backend.

A few correctness notes and the modularization opportunities you asked about.

Correctness / behavior notes

  1. Both operands dirty → hook fires twice in one pass. Both add and mul register the same closure on in1_id and in2_id. In propagate_resize (WebGPUGraph.cpp:205), a single pass iterates all hooks against the processing set, so if both operands are resized in the same pass the closure runs twice (two redundant wgpuQueueWriteBuffer calls + workgroup recompute). It's idempotent so results are correct — just wasted work. In practice usually only one operand is dynamic, so this rarely triggers, but worth a one-line comment acknowledging it.

  2. add output = larger-by-numel operand is a heuristic, not broadcasting. The add kernel is pure elementwise over num_elements with no per-tensor stride/broadcast metadata, so add cannot broadcast mismatched shapes (e.g. [3,1] + [1,4]) — only the "one operand is the full shape, the other is equal or a scalar-ish subset" case works. Picking n2 > n1 ? d2 : d1 preserves the static build's behavior, so this PR doesn't regress anything, but it inherits that latent limitation. Fine to leave; flagging so it's a conscious choice.

  3. Uninitialized WebGPUTensor temporaries in the mul hook (mul/BinaryOp.cpp:187-190). ta/tb/to are default-constructed and only .dims is set; fill_tensor_meta_broadcast reads only .dims, so this is safe today — but it's an implicit coupling to that function's internals. See the overload suggestion below, which removes the temporaries entirely.

Modularization opportunities

These two *_impl functions now share three reusable primitives that don't currently exist as helpers:

(a) Broadcast output-shape computation. The right-alignment loop in mul/BinaryOp.cpp:178-184 is a standalone PyTorch-broadcast rule with no dependency on WebGPU types. It belongs next to fill_tensor_meta_broadcast in TensorMeta.h:

// TensorMeta.h
inline std::vector<int64_t> broadcast_out_dims(
    const std::vector<int64_t>& a, const std::vector<int64_t>& b) {
  const size_t r = std::max(a.size(), b.size());
  std::vector<int64_t> out(r, 1);
  for (size_t i = 0; i < r; i++) {
    const int64_t av = (i + a.size() < r) ? 1 : a[i - (r - a.size())];
    const int64_t bv = (i + b.size() < r) ? 1 : b[i - (r - b.size())];
    out[i] = av > bv ? av : bv;
  }
  return out;
}

Fix this →

(b) A fill_tensor_meta_broadcast overload taking const std::vector<int64_t>& so the mul hook stops fabricating WebGPUTensor temporaries:

inline void fill_tensor_meta_broadcast(
    const std::vector<int64_t>& dims, uint32_t out_ndim, TensorMeta* m);
// existing WebGPUTensor overload forwards to it via t.dims

This removes ta/tb/to and note (3) above. Fix this →

(c) "Register the same hook on both operands." Both ops do the identical two-line dance:

graph.add_tensor_resize_hook(in1_id, fn);
graph.add_tensor_resize_hook(in2_id, fn);

A small overload add_tensor_resize_hook(std::initializer_list<int> ids, fn) on WebGPUGraph would document intent and also let propagate_resize dedup so the hook fires once even when both operands are dirty (fixes note 1). Lower priority than (a)/(b).

On a full binary-op base class: I'd not go further than (a)–(b) right now. add (single AddParams UBO, numel-max, no broadcast metadata) and mul (three TensorMeta UBOs, full broadcast) differ enough in their bodies that a shared binary_op_impl template would need so many hooks it'd be harder to read than the two explicit functions — which cuts against the repo's "choose simpler, more concise" guidance. Extracting the genuinely shared data helpers (a/b) is the right altitude; collapsing the control flow is not.

Summary

No blocking issues — correct and behavior-neutral on static graphs. Recommended: (a) and (b) to remove the duplicated broadcast math and the fabricated tensor temporaries; (c) optional. Notes 1–2 are worth a comment but need no code change.
· branch gh/JulianCloudNTH/69/head

[ghstack-poisoned]
[ghstack-poisoned]
[ghstack-poisoned]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. meta-exported

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants