Support multi-LoRA training with EP + FSDP2 #236

Draft
EvineR666 wants to merge 24 commits into modelscope:main from kevssim:ep_multilora

EvineR666 (Contributor) commented Jun 26, 2026

PR type

  • [ ] Bug Fix
  • [√] New Feature
  • [ ] Document Updates
  • [ ] More Models or Datasets Support

PR information

Background
Twinkle currently supports single-adapter EP + LoRA training on packed MoE expert weights (gate_up_proj / down_proj) via PEFT's target_parameters interface. The MultiLoRA framework enables multi-tenant adapter deployment, but it only supports target_modules-based LoRA (attached at the nn.Module level), not target_parameters (attached to raw Parameter tensors). PEFT does not natively support multiple adapters on target_parameters, which leaves a gap for multi-tenant LoRA in EP scenarios.

This PR
This PR introduces multi-LoRA training under EP + FSDP2 by extending MultiLoRA with a target_parameters multi-slot path, enabling direct attachment of tenant adapters to packed MoE expert weights. Key changes include physical slot allocation and tenant mapping, FSDP2 sharding compatibility, and preserved single-tenant activation semantics. This unifies MultiLoRA support across both LoRA attachment paradigms, enabling efficient multi-tenant fine-tuning of MoE models under EP + FSDP2.
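To make "physical slot allocation and tenant mapping" with single-tenant activation concrete, here is a minimal bookkeeping sketch. It is not the code in this PR: `SlotPool`, `acquire`, `release`, and `activate` are hypothetical names, and the sketch only tracks which tenant owns which pre-allocated slot and which single tenant is currently active. It does not touch any tensors or FSDP2 state.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SlotPool:
    """Hypothetical tenant -> physical slot bookkeeping (illustration only)."""

    num_slots: int
    tenant_to_slot: Dict[str, int] = field(default_factory=dict)
    active_tenant: Optional[str] = None

    def _free_slots(self) -> List[int]:
        # Slots not currently owned by any tenant, in ascending order.
        used = set(self.tenant_to_slot.values())
        return [i for i in range(self.num_slots) if i not in used]

    def acquire(self, tenant: str) -> int:
        # Idempotent: a tenant that already owns a slot gets the same one back.
        if tenant in self.tenant_to_slot:
            return self.tenant_to_slot[tenant]
        free = self._free_slots()
        if not free:
            raise RuntimeError("no free LoRA slot")
        self.tenant_to_slot[tenant] = free[0]
        return free[0]

    def release(self, tenant: str) -> None:
        # Free the slot; a released tenant can no longer be the active one.
        self.tenant_to_slot.pop(tenant)
        if self.active_tenant == tenant:
            self.active_tenant = None

    def activate(self, tenant: str) -> int:
        # Only one tenant is active at a time; activating replaces the previous one.
        if tenant not in self.tenant_to_slot:
            raise KeyError(f"tenant {tenant!r} has no slot")
        self.active_tenant = tenant
        return self.tenant_to_slot[tenant]
```

Keeping the mapping separate from the adapter weights means a slot can be reassigned to a new tenant without reallocating parameters, which is one plausible reason to use fixed physical slots under sharding.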

Experiment results

Training loss curves for two tenants on DeepSeek-V4-Flash:
[Figure: training loss curves]

gemini-code-assist (Bot, Contributor) left a comment

Code Review

This pull request implements support for DeepSeek-V4 EP Multi-LoRA target parameters in MultiLoraTransformersModel, allowing multiple target-parameter LoRA adapters to reside in memory while activating only one at a time. The feedback highlights several critical issues in the target-parameter manager: a shape mismatch error in reset_slot when expert parallel is enabled due to unsharded initial weights, significant memory overhead from cloning the entire target parameter instead of just storing its ndim, and a broadcasting shape mismatch when computing delta weights for 2D parameters. Additionally, a potential NameError was identified in the new cookbook when resuming from checkpoints if the adapter list is empty.
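The three manager issues above all come down to shape bookkeeping, which can be sketched without any tensors. The snippet below is an illustration under assumptions, not the PR's implementation: the helper names are hypothetical, the layouts `(out, in)` for 2D and `(experts, out, in)` for 3D parameters are assumed, and `matmul_shape` is deliberately strict (no broadcasting) so that mixing 2D and 3D factors fails loudly. It keeps only the parameter's shape tuple rather than a clone of the parameter, and it derives the per-rank expert shape that slot buffers would need under expert parallelism.

```python
from typing import Tuple

Shape = Tuple[int, ...]


def matmul_shape(lhs: Shape, rhs: Shape) -> Shape:
    """Result shape of a batched matmul with NO broadcasting of batch dims."""
    if lhs[:-2] != rhs[:-2]:
        raise ValueError(f"batch dims differ: {lhs[:-2]} vs {rhs[:-2]}")
    if lhs[-1] != rhs[-2]:
        raise ValueError(f"inner dims differ: {lhs[-1]} vs {rhs[-2]}")
    return (*lhs[:-2], lhs[-2], rhs[-1])


def lora_factor_shapes(param_shape: Shape, rank: int) -> Tuple[Shape, Shape]:
    """Shapes of LoRA A and B such that B @ A matches param_shape.

    Assumed layouts: 2D = (out, in), 3D = (experts, out, in).
    Only the shape tuple is needed; no copy of the parameter is kept.
    """
    if len(param_shape) == 2:
        out_f, in_f = param_shape
        a_shape, b_shape = (rank, in_f), (out_f, rank)
    elif len(param_shape) == 3:
        experts, out_f, in_f = param_shape
        a_shape, b_shape = (experts, rank, in_f), (experts, out_f, rank)
    else:
        raise ValueError(f"unsupported ndim: {len(param_shape)}")
    # The delta weight B @ A must have exactly the parameter's shape.
    if matmul_shape(b_shape, a_shape) != param_shape:
        raise ValueError("delta weight shape does not match parameter")
    return a_shape, b_shape


def local_expert_shape(global_shape: Shape, ep_size: int) -> Shape:
    """Per-rank shape when the leading expert dim is split across ep_size ranks."""
    experts = global_shape[0]
    if experts % ep_size != 0:
        raise ValueError("expert count must divide evenly across EP ranks")
    return (experts // ep_size, *global_shape[1:])
```

For example, sizing a slot from `local_expert_shape((8, 64, 32), 4)` rather than from the unsharded `(8, 64, 32)` is the kind of check that would avoid the shape mismatch described for `reset_slot`.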

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread src/twinkle/model/multi_lora_target_parameters.py
Comment thread src/twinkle/model/multi_lora_target_parameters.py Outdated
Comment thread src/twinkle/model/multi_lora_target_parameters.py Outdated
Comment thread cookbook/transformers/ep_fsdp2_multi_lora_deepseek_v4.py
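The cookbook thread concerns the NameError the review mentions for an empty adapter list on resume. The actual cookbook code is not shown here, so the following is only a generic sketch of how that class of bug typically arises and one way to guard against it. `resume_buggy`, `resume_safe`, and `load_step` are hypothetical names.

```python
from typing import Callable, Iterable, Optional


def resume_buggy(adapters: Iterable[str], load_step: Callable[[str], int]) -> int:
    # last_step is only bound inside the loop body.
    for name in adapters:
        last_step = load_step(name)
    # With an empty adapter list the loop never runs, so this raises
    # UnboundLocalError (a subclass of NameError).
    return last_step


def resume_safe(adapters: Iterable[str], load_step: Callable[[str], int]) -> Optional[int]:
    # Bind a default first so the empty case is well defined.
    last_step: Optional[int] = None
    for name in adapters:
        last_step = load_step(name)
    return last_step
```

Returning `None` for the empty case lets the caller decide whether to start from step 0 or to fail with a clearer message.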