Fix EP+LoRA MoE training bugs and refine cookbooks by kevssim · Pull Request #241 · modelscope/twinkle

kevssim · 2026-07-02T06:28:01Z

PR type

Bug Fix
New Feature
Document Updates
More Models or Datasets Support

PR information

Shared expert gate: apply shared_expert_gate (sigmoid) to shared expert output in EP path; previously the gate was skipped, producing incorrect shared-expert contributions for Qwen3.5 MoE.
Trainable parameter counting under EP: _get_nb_trainable_parameters now compensates for the ep_size undercount on _ep_patched expert subtrees (after _shard_tensor_experts, the DTensor logical shape is already the post-EP shape), so the reported trainable/total counts are correct instead of being too small by a factor of ep_size.
target_parameters no longer auto-filled by strategy: NativeFSDPStrategy no longer injects ['mlp.experts.gate_up_proj', 'mlp.experts.down_proj']; the cookbooks now set them explicitly in LoraConfig, which removes a hidden side effect.
Cookbook updates (ep_fsdp2_lora_qwen3_5_moe.*, ep_fsdp2_lora_deepseek_v4.*): simplify _build_lora_config (always target expert params), pass max_length to set_template, switch to CosineWarmupScheduler, sync --log-interval/--fsdp-size/--ep-size via shared shell vars, expose full hyperparameters as inline CLI flags, fix DSV4 multinode world-size arithmetic and script path, drop stale single-node DSV4 script.
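For readers unfamiliar with the shared-expert gate fix listed above, here is a minimal sketch of the intended arithmetic in plain Python. It is not the code from this PR: the function names, the per-token list representation, and the assumption that the gate is a single logit computed from the token's hidden state are illustrative only.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function used to squash the gate logit into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def combine_moe_output(hidden, routed_out, shared_out, gate_weight):
    """Combine routed and shared expert outputs for one token.

    hidden, routed_out, shared_out: lists of floats (one token's vectors).
    gate_weight: hypothetical weight vector projecting `hidden` to one logit,
    standing in for shared_expert_gate.
    """
    gate_logit = sum(h * w for h, w in zip(hidden, gate_weight))
    gate = sigmoid(gate_logit)
    # The shared expert output is scaled by the gate before being added.
    # Skipping this scaling (gate == 1.0) is the kind of bug described above.
    return [r + gate * s for r, s in zip(routed_out, shared_out)]


# With a zero hidden state the gate is sigmoid(0) = 0.5.
print(combine_moe_output([0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [1.0, 1.0]))
```

Without the gate, the same inputs would yield [3.0, 6.0] rather than [2.0, 4.0], which illustrates how the shared-expert contribution is overstated when the gate is skipped.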

gemini-code-assist

Code Review

This pull request simplifies the LoRA configuration building process for DeepSeek-V4 and Qwen3.5-MoE, updates training scripts with new configurations, supports shared expert gating, and corrects the trainable parameter count calculation under Expert Parallel (EP) mode. Feedback highlights a critical issue in ep_fsdp2_lora_deepseek_v4_multinode.sh where a backslash is incorrectly used as a path separator, and suggests adding defensive checks when accessing the 'ep' dimension from the device mesh to prevent potential runtime errors.
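The parameter-count correction and the suggested defensive lookup of the 'ep' mesh dimension can be illustrated together. The sketch below is an assumption-laden stand-in, not the implementation in transformers.py: the mesh is modeled as plain tuples of dimension names and sizes, and each parameter as a hypothetical (numel, requires_grad, ep_patched) triple.

```python
def ep_size_from_mesh(dim_names, dim_sizes):
    """Return the EP size, falling back to 1 when the mesh has no 'ep' dim.

    dim_names / dim_sizes are hypothetical stand-ins for a device mesh's
    dimension names and sizes.
    """
    if not dim_names or "ep" not in dim_names:
        return 1
    return dim_sizes[dim_names.index("ep")]


def count_parameters(params, ep_size):
    """Return (trainable, total) parameter counts.

    params: iterable of (numel, requires_grad, ep_patched) triples, where
    numel is the post-EP (per-rank logical) element count. Expert subtrees
    marked ep_patched are scaled back up by ep_size.
    """
    trainable = 0
    total = 0
    for numel, requires_grad, ep_patched in params:
        n = numel * ep_size if ep_patched else numel
        total += n
        if requires_grad:
            trainable += n
    return trainable, total


params = [
    (100, True, True),   # LoRA weights on an EP-sharded expert subtree
    (50, False, False),  # frozen dense weights
    (10, True, False),   # trainable non-expert weights
]
ep_size = ep_size_from_mesh(("fsdp", "ep"), (2, 4))
print(count_parameters(params, ep_size))
```

When the mesh has no 'ep' dimension the lookup returns 1 and the counts are left unscaled, so the same function covers non-EP runs without raising.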


kevssim added 2 commits June 30, 2026 17:12

WIP (b5be3f8)

cookbook (c37885c)

gemini-code-assist Bot reviewed Jul 2, 2026

Comment thread cookbook/transformers/ep_fsdp2_lora_deepseek_v4_multinode.sh Outdated

Comment thread src/twinkle/model/transformers/transformers.py

fix (7513ef5)

kevssim wants to merge 3 commits into modelscope:main from kevssim:fix_qwen3_6_ep
