[llama4] Add weight-only int4 linear quantization for backbone export by SS-JIA · Pull Request #20644 · pytorch/executorch

SS-JIA · 2026-06-30T20:44:56Z

Stack from ghstack (oldest at bottom):

-> [llama4] Add weight-only int4 linear quantization for backbone export #20644

Adds a --weight-only-linear export option that quantizes the TISO backbone's linears to weight-only int4 (dequantized int4 weights, fp activations) instead of the default 8da4w (dynamic int8 activations + int4 weights).

Adds a WeightOnlyInt4Linear module and _replace_linear_with_linear_int4_weight_only_for_pre_quantization in examples/models/llama/source_transformation/pre_quantization.py. It uses the same buffer layout as Int8DynActInt4WeightLinear (int8 weight holding int4 values, per-group scales / zeros), so an existing QAT checkpoint loads unchanged; its forward dequantizes the weight and runs a plain F.linear with no per-token activation quantization. transform_linear_for_pre_quantization gains a weight_only flag, threaded through the llama4 backbone build (backbone/model.py) and exposed as --weight-only-linear on export_llm_backbone.py.
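
The forward path described above can be sketched as follows. This is a minimal illustration, not the code from this PR: the class name WeightOnlyInt4LinearSketch, the helper quantize_int4_per_group, the group size of 32, the buffer shapes, and the dequantization convention (q - zeros) * scales are all assumptions made for the example. It only shows the idea of keeping an int8 `weight` buffer with int4 values plus per-group `scales` / `zeros`, dequantizing it, and calling a plain `F.linear` with no activation quantization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class WeightOnlyInt4LinearSketch(nn.Module):
    """Illustrative stand-in; the real WeightOnlyInt4Linear may differ."""

    def __init__(self, in_features, out_features, group_size=32):
        super().__init__()
        assert in_features % group_size == 0
        self.in_features = in_features
        self.out_features = out_features
        self.group_size = group_size
        n_groups = in_features // group_size
        # int8 storage holding int4 values in [-8, 7] (assumed layout).
        self.register_buffer(
            "weight", torch.zeros(out_features, in_features, dtype=torch.int8)
        )
        # One scale and one zero point per (output row, input group).
        self.register_buffer("scales", torch.ones(out_features, n_groups))
        self.register_buffer("zeros", torch.zeros(out_features, n_groups))

    def dequantize_weight(self):
        q = self.weight.to(torch.float32)
        q = q.view(self.out_features, -1, self.group_size)
        # Assumed convention: w = (q - zero) * scale, applied per group.
        w = (q - self.zeros.unsqueeze(-1)) * self.scales.unsqueeze(-1)
        return w.view(self.out_features, self.in_features)

    def forward(self, x):
        # Activations stay in floating point; only the weight is dequantized.
        return F.linear(x, self.dequantize_weight().to(x.dtype))


def quantize_int4_per_group(w, group_size):
    """Hypothetical symmetric per-group int4 quantizer, used only for the demo."""
    out_f, in_f = w.shape
    g = w.view(out_f, in_f // group_size, group_size)
    scales = g.abs().amax(dim=-1).clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(g / scales.unsqueeze(-1)), -8, 7).to(torch.int8)
    return q.view(out_f, in_f), scales, torch.zeros_like(scales)


torch.manual_seed(0)
ref = nn.Linear(64, 16, bias=False)
q, s, z = quantize_int4_per_group(ref.weight.detach(), group_size=32)

layer = WeightOnlyInt4LinearSketch(64, 16, group_size=32)
# The buffers load directly from a state dict with the same three keys.
layer.load_state_dict({"weight": q, "scales": s, "zeros": z})

x = torch.randn(4, 64)
y = layer(x)
max_weight_err = (layer.dequantize_weight() - ref.weight.detach()).abs().max()
max_output_err = (y - ref(x).detach()).abs().max()
```

Because the sketch's state dict has only the keys `weight`, `scales` and `zeros`, a checkpoint saved for a module with the same buffer names and shapes would load without conversion, which is the property the description relies on.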

Motivation: the 8da4w dynamic-activation-quant path triggers a Mali-GPU-specific numerical bug in the ET-VK linear_dq8ca_q4gsw kernel (negative per-token activation zero-point), producing garbled, runaway generation on Mali-G710 / G715. Weight-only int4 lowers to et_vk.linear_q4gsw (fp activation x int4 weight), bypassing the dq8ca path entirely and running correctly on Mali. This is a working Mali path / client workaround while the dq8ca kernel bug is fixed separately.

Accuracy note: the TISO checkpoint was QAT-trained for 8da4w, so weight-only inference may be slightly less accurate than it would be with a checkpoint QAT-trained specifically for weight-only int4.

Differential Revision: D110111051

[ghstack-poisoned]

pytorch-bot · 2026-06-30T20:45:01Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20644

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 7280475 with merge base 9f31912:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-06-30T20:45:48Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Pull Request resolved: #20644

Adds a `--weight-only-linear` export option that quantizes the TISO backbone's linears to weight-only int4 (dequantized int4 weights, fp activations) instead of the default 8da4w (dynamic int8 activations + int4 weights).

Adds a `WeightOnlyInt4Linear` module and `_replace_linear_with_linear_int4_weight_only_for_pre_quantization` in `examples/models/llama/source_transformation/pre_quantization.py`. It uses the same buffer layout as `Int8DynActInt4WeightLinear` (int8 `weight` holding int4 values, per-group `scales` / `zeros`), so an existing QAT checkpoint loads unchanged; its forward dequantizes the weight and runs a plain `F.linear` with no per-token activation quantization. `transform_linear_for_pre_quantization` gains a `weight_only` flag, threaded through the llama4 backbone build (`backbone/model.py`) and exposed as `--weight-only-linear` on `export_llm_backbone.py`.

Motivation: the 8da4w dynamic-activation-quant path triggers a Mali-GPU-specific numerical bug in the ET-VK `linear_dq8ca_q4gsw` kernel (negative per-token activation zero-point), producing garbled, runaway generation on Mali-G710 / G715. Weight-only int4 lowers to `et_vk.linear_q4gsw` (fp activation x int4 weight), bypassing the dq8ca path entirely and running correctly on Mali. This is a working Mali path / client workaround while the dq8ca kernel bug is fixed separately.

Accuracy note: the TISO checkpoint was QAT-trained for 8da4w, so weight-only inference may be slightly less accurate than an accuracy-optimal weight-only-QAT model.

ghstack-source-id: 398701430
@exported-using-ghexport

Differential Revision: [D110111051](https://our.internmc.facebook.com/intern/diff/D110111051/)

Update

d92a7d3

meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 30, 2026

SS-JIA temporarily deployed to cadence June 30, 2026 20:45 — with GitHub Actions Inactive

Update

7280475

SS-JIA temporarily deployed to cadence June 30, 2026 21:32 — with GitHub Actions Inactive

meta-codesync Bot added the meta-exported label Jun 30, 2026

SS-JIA wants to merge 2 commits into gh/SS-JIA/565/base from gh/SS-JIA/565/head
