Fix EP+LoRA MoE training bugs and refine cookbooks#241

Open
kevssim wants to merge 3 commits into modelscope:main from kevssim:fix_qwen3_6_ep

kevssim commented Jul 2, 2026

Collaborator

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

PR information

  • Shared expert gate: apply shared_expert_gate (sigmoid) to shared expert output in EP path; previously the gate was skipped, producing incorrect shared-expert contributions for Qwen3.5 MoE.
  • Trainable parameter counting under EP: _get_nb_trainable_parameters now compensates for the ep_size undercount on _ep_patched expert subtrees (DTensor logical shape is post-EP after _shard_tensor_experts), so reported trainable/total counts are correct instead of being off by ep_size.
  • target_parameters no longer auto-filled by strategy: NativeFSDPStrategy no longer injects ['mlp.experts.gate_up_proj', 'mlp.experts.down_proj']; the cookbooks now set them explicitly in LoraConfig, removing a hidden side effect.
  • Cookbook updates (ep_fsdp2_lora_qwen3_5_moe.*, ep_fsdp2_lora_deepseek_v4.*): simplify _build_lora_config (always target expert params), pass max_length to set_template, switch to CosineWarmupScheduler, sync --log-interval/--fsdp-size/--ep-size via shared shell vars, expose full hyperparameters as inline CLI flags, fix DSV4 multinode world-size arithmetic and script path, drop stale single-node DSV4 script.
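
For readers unfamiliar with the shared expert gate, the first item can be illustrated with a small standalone sketch. It is not the code from this PR: the function names are hypothetical, plain Python lists stand in for tensors, and the gate logit is passed in directly, whereas in the model it would presumably come from a learned projection of the hidden state.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function used to squash the gate logit into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def combine_with_gate(routed, shared, gate_logit):
    """Per-token MoE output: routed experts plus the sigmoid-gated shared expert.

    routed, shared: lists of floats (one token's hidden vector each).
    gate_logit: scalar logit for this token (assumed input for illustration).
    """
    g = sigmoid(gate_logit)
    return [r + g * s for r, s in zip(routed, shared)]


def combine_without_gate(routed, shared):
    """The buggy variant described above: the shared output is added ungated."""
    return [r + s for r, s in zip(routed, shared)]


routed = [1.0, 2.0]
shared = [4.0, -2.0]
gated = combine_with_gate(routed, shared, gate_logit=0.0)  # sigmoid(0) = 0.5
ungated = combine_without_gate(routed, shared)
```

With a gate logit of 0 the shared contribution is halved, so skipping the gate visibly changes the result.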
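
The parameter-count compensation in the second item comes down to simple arithmetic. The sketch below is only an illustration under stated assumptions: `ParamInfo` and `count_parameters` are hypothetical names, and the `ep_sharded` flag stands in for whatever the real code uses to detect `_ep_patched` expert subtrees.

```python
from dataclasses import dataclass


@dataclass
class ParamInfo:
    name: str
    numel: int        # element count as seen on this rank (post-EP for experts)
    trainable: bool
    ep_sharded: bool  # True if the parameter lives in an EP-sharded expert subtree


def count_parameters(params, ep_size):
    """Return (trainable, total), scaling EP-sharded entries back up by ep_size."""
    trainable = 0
    total = 0
    for p in params:
        # Expert parameters report only 1/ep_size of their global size,
        # so multiply them back; all other parameters are counted as-is.
        n = p.numel * ep_size if p.ep_sharded else p.numel
        total += n
        if p.trainable:
            trainable += n
    return trainable, total


params = [
    ParamInfo("attn.q_proj", 1000, trainable=False, ep_sharded=False),
    ParamInfo("experts.lora_A", 64, trainable=True, ep_sharded=True),
    ParamInfo("experts.gate_up_proj", 4096, trainable=False, ep_sharded=True),
]
trainable, total = count_parameters(params, ep_size=4)
```

Without the multiplication, the expert entries here would be reported as 64 and 4096, which is the "off by ep_size" undercount the fix addresses.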

gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request simplifies the LoRA configuration building process for DeepSeek-V4 and Qwen3.5-MoE, updates training scripts with new configurations, supports shared expert gating, and corrects the trainable parameter count calculation under Expert Parallel (EP) mode. Feedback highlights a critical issue in ep_fsdp2_lora_deepseek_v4_multinode.sh where a backslash is incorrectly used as a path separator, and suggests adding defensive checks when accessing the 'ep' dimension from the device mesh to prevent potential runtime errors.
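
As a rough illustration of the defensive check suggested for the 'ep' dimension, a helper could fall back to an EP size of 1 when the mesh is missing or has no 'ep' axis. This is a sketch, not the code under review: `get_ep_size` is a hypothetical name, and the two stand-in classes only mimic the parts of a PyTorch `DeviceMesh` used here (`mesh_dim_names`, indexing by dimension name, and `size()`).

```python
def get_ep_size(device_mesh) -> int:
    """Return the size of the 'ep' mesh dimension, or 1 if it is unavailable."""
    if device_mesh is None:
        return 1
    # mesh_dim_names may be None when the mesh was built without names.
    dim_names = getattr(device_mesh, "mesh_dim_names", None) or ()
    if "ep" not in dim_names:
        return 1
    return device_mesh["ep"].size()


class _FakeSubMesh:
    """Stand-in for the sub-mesh returned by indexing a mesh by name."""

    def __init__(self, n):
        self._n = n

    def size(self):
        return self._n


class _FakeMesh:
    """Stand-in for a named 2-D device mesh (illustration only)."""

    def __init__(self, dims):
        self._dims = dims
        self.mesh_dim_names = tuple(dims) if dims else None

    def __getitem__(self, name):
        return _FakeSubMesh(self._dims[name])


ep_size = get_ep_size(_FakeMesh({"fsdp": 2, "ep": 4}))
no_ep = get_ep_size(_FakeMesh({"fsdp": 8}))
unnamed = get_ep_size(_FakeMesh({}))
missing = get_ep_size(None)
```

Falling back to 1 keeps non-EP runs working; raising a clear error instead would be equally reasonable if a missing 'ep' dimension indicates a misconfiguration.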

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread cookbook/transformers/ep_fsdp2_lora_deepseek_v4_multinode.sh Outdated
Comment thread src/twinkle/model/transformers/transformers.py