[llama4] Add weight-only int4 linear quantization for backbone export #20644

Open
SS-JIA wants to merge 2 commits into gh/SS-JIA/565/base from gh/SS-JIA/565/head

SS-JIA (Contributor) commented Jun 30, 2026

Stack from ghstack (oldest at bottom):

Adds a `--weight-only-linear` export option that quantizes the TISO backbone's linears to weight-only int4 (dequantized int4 weights, fp activations) instead of the default 8da4w (dynamic int8 activations + int4 weights).

Adds a `WeightOnlyInt4Linear` module and `_replace_linear_with_linear_int4_weight_only_for_pre_quantization` in `examples/models/llama/source_transformation/pre_quantization.py`. It uses the same buffer layout as `Int8DynActInt4WeightLinear` (int8 `weight` holding int4 values, per-group `scales` / `zeros`), so an existing QAT checkpoint loads unchanged; its forward dequantizes the weight and runs a plain `F.linear` with no per-token activation quantization. `transform_linear_for_pre_quantization` gains a `weight_only` flag, threaded through the llama4 backbone build (`backbone/model.py`) and exposed as `--weight-only-linear` on `export_llm_backbone.py`.
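
For readers unfamiliar with the layout, here is a minimal sketch of the idea described above: int8 storage holding int4 values, per-group scales and zero points, and a forward pass that dequantizes and calls `F.linear`. The class name `WeightOnlyInt4LinearSketch`, the `group_size` default, the buffer shapes, and the `(w - zero) * scale` dequantization formula are assumptions for illustration, not the code added in this PR.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class WeightOnlyInt4LinearSketch(nn.Module):
    """Hypothetical stand-in for illustration; not the module added in this PR."""

    def __init__(self, in_features: int, out_features: int, group_size: int = 32):
        super().__init__()
        assert in_features % group_size == 0, "in_features must be divisible by group_size"
        self.in_features = in_features
        self.out_features = out_features
        self.group_size = group_size
        n_groups = in_features // group_size
        # Assumed layout: int8 storage holding int4 values in [-8, 7], plus one
        # scale and one zero point per (output row, group of input columns).
        self.register_buffer("weight", torch.zeros(out_features, in_features, dtype=torch.int8))
        self.register_buffer("scales", torch.ones(out_features, n_groups))
        self.register_buffer("zeros", torch.zeros(out_features, n_groups))

    def dequantize_weight(self) -> torch.Tensor:
        # Split the input dimension into groups so each group can broadcast
        # against its own scale and zero point.
        w = self.weight.to(self.scales.dtype).reshape(self.out_features, -1, self.group_size)
        w = (w - self.zeros.unsqueeze(-1)) * self.scales.unsqueeze(-1)
        return w.reshape(self.out_features, self.in_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Activations stay in floating point; only the weight is dequantized.
        return F.linear(x, self.dequantize_weight().to(x.dtype))


torch.manual_seed(0)
layer = WeightOnlyInt4LinearSketch(in_features=64, out_features=8, group_size=32)
layer.weight.copy_(torch.randint(-8, 8, (8, 64), dtype=torch.int8))
layer.scales.fill_(0.1)
x = torch.randn(2, 64)
y = layer(x)
```

Because the sketch registers only `weight`, `scales`, and `zeros` buffers, a state dict with those keys and matching shapes would load into it directly, which is the property the description relies on for reusing a QAT checkpoint.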

Motivation: the 8da4w dynamic-activation-quant path triggers a Mali-GPU-specific numerical bug in the ET-VK `linear_dq8ca_q4gsw` kernel (negative per-token activation zero-point), producing garbled, runaway generation on Mali-G710 / G715. Weight-only int4 lowers to `et_vk.linear_q4gsw` (fp activation x int4 weight), bypassing the dq8ca path entirely and running correctly on Mali. This is a working Mali path / client workaround while the dq8ca kernel bug is fixed separately.

Accuracy note: the TISO checkpoint was QAT-trained for 8da4w, so weight-only inference may be slightly less accurate than a model QAT-trained specifically for weight-only int4.

Differential Revision: D110111051

[ghstack-poisoned]
pytorch-bot Bot commented Jun 30, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20644

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 7280475 with merge base 9f31912:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla Bot added the CLA Signed label Jun 30, 2026
@github-actions

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with `release notes:`. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

[ghstack-poisoned]
SS-JIA added a commit that referenced this pull request Jun 30, 2026
Pull Request resolved: #20644

ghstack-source-id: 398701430
@exported-using-ghexport

Differential Revision: [D110111051](https://our.internmc.facebook.com/intern/diff/D110111051/)

Labels

CLA Signed, meta-exported
