Support multi-LoRA training with EP + FSDP2 #236
Code Review
This pull request implements support for DeepSeek-V4 EP Multi-LoRA target parameters in MultiLoraTransformersModel, allowing multiple target-parameter LoRA adapters to reside in memory while activating only one at a time. The feedback highlights several critical issues in the target-parameter manager: a shape mismatch error in reset_slot when expert parallel is enabled due to unsharded initial weights, significant memory overhead from cloning the entire target parameter instead of just storing its ndim, and a broadcasting shape mismatch when computing delta weights for 2D parameters. Additionally, a potential NameError was identified in the new cookbook when resuming from checkpoints if the adapter list is empty.
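The last two review points (keeping only `ndim` instead of a cloned parameter, and the 2D shape mismatch) can be illustrated with a shape-only sketch. It is not the code under review: the function name `lora_delta_shape` and the factor layouts, `A` as `(rank, in)` or `(experts, rank, in)` and `B` as `(out, rank)` or `(experts, out, rank)`, are assumptions made for illustration, and the actual layouts in the PR may differ.

```python
from typing import Sequence, Tuple


def lora_delta_shape(
    a_shape: Sequence[int], b_shape: Sequence[int], target_ndim: int
) -> Tuple[int, ...]:
    """Return the shape of the LoRA delta B @ A for a 2D or packed 3D target.

    Hypothetical helper: only the target's ndim is needed, so no copy of
    the target parameter has to be kept around.
    """
    if target_ndim not in (2, 3):
        raise ValueError(f"unsupported target ndim: {target_ndim}")
    if len(a_shape) != target_ndim or len(b_shape) != target_ndim:
        raise ValueError("LoRA factors must have the same ndim as the target")

    # Split off the optional leading expert dimension (empty for 2D targets).
    *a_batch, rank_a, in_features = a_shape
    *b_batch, out_features, rank_b = b_shape

    if rank_a != rank_b:
        raise ValueError(f"rank mismatch: {rank_a} vs {rank_b}")
    if a_batch != b_batch:
        raise ValueError(f"expert dim mismatch: {a_batch} vs {b_batch}")

    return (*a_batch, out_features, in_features)


# 2D target: plain (out, in) weight.
print(lora_delta_shape((8, 64), (128, 8), target_ndim=2))
# 3D target: packed experts (experts, out, in).
print(lora_delta_shape((4, 8, 64), (4, 128, 8), target_ndim=3))
```

Checking shapes up front like this turns a late broadcasting failure into an explicit error at adapter setup, which is one possible way to address the 2D case the review mentions.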
PR type
PR information
Background
Twinkle currently supports single-adapter LoRA training under expert parallelism (EP) on packed MoE expert weights (gate_up_proj / down_proj) via PEFT's target_parameters interface. The MultiLoRA framework enables multi-tenant adapter deployment, but it only supports target_modules-based LoRA (attached at the nn.Module level), not target_parameters-based LoRA (attached to raw Parameter tensors). PEFT does not natively support multiple adapters on target_parameters, which leaves a gap for multi-tenant LoRA in EP scenarios.
This PR
This PR introduces multi-LoRA training under EP + FSDP2 by extending MultiLoRA with a target_parameters multi-slot path, enabling direct attachment of tenant adapters to packed MoE expert weights. Key changes include physical slot allocation and tenant mapping, FSDP2 sharding compatibility, and preserved single-tenant activation semantics. This unifies MultiLoRA support across both LoRA attachment paradigms, enabling efficient multi-tenant fine-tuning of MoE models under EP + FSDP2.
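The slot allocation, tenant mapping, and single-tenant activation described above can be sketched as plain bookkeeping. This is a minimal illustration, not the PR's implementation: `SlotTable` and its methods are hypothetical names, and the sketch leaves out the adapter weights, EP sharding, and FSDP2 wrapping entirely.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SlotTable:
    """Hypothetical tenant-to-slot bookkeeping (not Twinkle's actual class)."""

    num_slots: int
    tenant_to_slot: Dict[str, int] = field(default_factory=dict)
    free_slots: List[int] = field(default_factory=list)
    active_tenant: Optional[str] = None

    def __post_init__(self) -> None:
        # All physical slots start out free; lowest index is handed out first.
        self.free_slots = list(range(self.num_slots))

    def acquire(self, tenant: str) -> int:
        """Map a tenant to a physical slot, reusing its slot if it has one."""
        if tenant in self.tenant_to_slot:
            return self.tenant_to_slot[tenant]
        if not self.free_slots:
            raise RuntimeError("no free adapter slot")
        slot = self.free_slots.pop(0)
        self.tenant_to_slot[tenant] = slot
        return slot

    def release(self, tenant: str) -> None:
        """Return a tenant's slot to the free pool."""
        slot = self.tenant_to_slot.pop(tenant)
        self.free_slots.append(slot)
        self.free_slots.sort()
        if self.active_tenant == tenant:
            self.active_tenant = None

    def activate(self, tenant: str) -> int:
        """Mark exactly one tenant as active and return its slot index."""
        slot = self.tenant_to_slot[tenant]  # KeyError for unknown tenants
        self.active_tenant = tenant
        return slot


table = SlotTable(num_slots=2)
table.acquire("tenant_a")
table.acquire("tenant_b")
print(table.activate("tenant_b"), table.active_tenant)
```

Several tenants can hold slots at once, while `active_tenant` holds at most one name, which mirrors the "multiple adapters resident, one active" semantics the PR describes.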
Experiment results
[Figure: training loss curves for two tenants on DeepSeek-V4-Flash]
