Fix EP+LoRA MoE training bugs and refine cookbooks #241
Code Review
This pull request simplifies the LoRA configuration building process for DeepSeek-V4 and Qwen3.5-MoE, updates training scripts with new configurations, supports shared expert gating, and corrects the trainable parameter count calculation under Expert Parallel (EP) mode. Feedback highlights a critical issue in ep_fsdp2_lora_deepseek_v4_multinode.sh where a backslash is incorrectly used as a path separator, and suggests adding defensive checks when accessing the 'ep' dimension from the device mesh to prevent potential runtime errors.
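The defensive check suggested for the `'ep'` mesh dimension could look roughly like the sketch below. It is not the code from this PR: `get_ep_size` and `FakeMesh` are hypothetical names, and the helper only assumes the mesh object exposes `mesh_dim_names` and `size(mesh_dim)` the way `torch.distributed.device_mesh.DeviceMesh` does. A stand-in mesh is used so the snippet runs without a distributed setup.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class FakeMesh:
    """Hypothetical stand-in exposing the two DeviceMesh members used below."""
    mesh_dim_names: Optional[Tuple[str, ...]]
    shape: Tuple[int, ...]

    def size(self, mesh_dim: Optional[int] = None) -> int:
        # Whole-mesh size when no dimension is given, else that dimension's size.
        return math.prod(self.shape) if mesh_dim is None else self.shape[mesh_dim]


def get_ep_size(mesh, dim_name: str = "ep") -> int:
    """Return the size of the expert-parallel dimension, or 1 if it is absent."""
    if mesh is None:
        return 1
    names = getattr(mesh, "mesh_dim_names", None)
    if not names or dim_name not in names:
        # No named dimensions, or no 'ep' dimension: treat as EP disabled.
        return 1
    return mesh.size(names.index(dim_name))


ep_enabled = FakeMesh(mesh_dim_names=("fsdp", "ep"), shape=(2, 4))
fsdp_only = FakeMesh(mesh_dim_names=("fsdp",), shape=(8,))
print(get_ep_size(ep_enabled), get_ep_size(fsdp_only), get_ep_size(None))
```

Falling back to 1 keeps callers free of special cases when EP is disabled; raising an error instead would be the stricter choice if a missing `'ep'` dimension should never happen.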
PR type
PR information
- Apply `shared_expert_gate` (sigmoid) to the shared expert output in the EP path; previously the gate was skipped, producing incorrect shared-expert contributions for Qwen3.5 MoE.
- `_get_nb_trainable_parameters` now compensates for the `ep_size` undercount on `_ep_patched` expert subtrees (the DTensor logical shape is post-EP after `_shard_tensor_experts`), so reported trainable/total counts are correct instead of being off by a factor of `ep_size`.
- `NativeFSDPStrategy` no longer injects `['mlp.experts.gate_up_proj', 'mlp.experts.down_proj']`; cookbooks set them explicitly in `LoraConfig`, removing a hidden side effect.
- Cookbooks (`ep_fsdp2_lora_qwen3_5_moe.*`, `ep_fsdp2_lora_deepseek_v4.*`):
  - simplify `_build_lora_config` (always target expert params);
  - pass `max_length` to `set_template`;
  - switch to `CosineWarmupScheduler`;
  - sync `--log-interval` / `--fsdp-size` / `--ep-size` via shared shell vars;
  - expose full hyperparameters as inline CLI flags;
  - fix DSV4 multinode world-size arithmetic and script path;
  - drop the stale single-node DSV4 script.
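The parameter-count compensation can be illustrated with a small arithmetic sketch. This is not the body of `_get_nb_trainable_parameters`: `ParamInfo`, `count_parameters`, and the `ep_sharded` flag are hypothetical names standing in for "this parameter lives under an `_ep_patched` expert subtree", and the numbers are made up. The assumption is simply that such parameters report a per-EP-rank element count, so multiplying by `ep_size` recovers the full count.

```python
from typing import Iterable, NamedTuple, Tuple


class ParamInfo(NamedTuple):
    """Hypothetical summary of one parameter as seen on a single rank."""
    name: str
    numel: int           # element count reported on this rank
    requires_grad: bool
    ep_sharded: bool     # True if the reported shape is already post-EP


def count_parameters(params: Iterable[ParamInfo], ep_size: int) -> Tuple[int, int]:
    """Return (trainable, total), scaling EP-sharded parameters by ep_size."""
    trainable = 0
    total = 0
    for p in params:
        # Expert parameters are undercounted by ep_size; others are not.
        n = p.numel * ep_size if p.ep_sharded else p.numel
        total += n
        if p.requires_grad:
            trainable += n
    return trainable, total


params = [
    ParamInfo("mlp.experts.gate_up_proj (base)", 100_000, False, True),
    ParamInfo("mlp.experts.gate_up_proj (LoRA)", 1_000, True, True),
    ParamInfo("self_attn.q_proj (LoRA)", 500, True, False),
]
trainable, total = count_parameters(params, ep_size=4)
print(trainable, total)  # 4500 404500
```

Without the scaling, the same inputs would report 1,500 trainable out of 101,500 total, which is the kind of `ep_size`-dependent discrepancy the fix addresses.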