Custom Models & LoRA

ACE-Step supports LoRA (Low-Rank Adaptation) fine-tuning, enabling specialized music generation for specific genres, styles, or use cases. This page covers official LoRA models, training your own, and AceSteps' roadmap for custom model support.

What is LoRA?

LoRA is a parameter-efficient fine-tuning technique that adds small trainable low-rank matrices alongside the frozen weights of a base model:

┌─────────────────────────────────────────────────────────────────────────────┐
│ LoRA ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Traditional Fine-Tuning │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ Base Model (3.5B params) ──▶ Fine-tune ALL weights │ │
│ │ │ │
│ │ Problems: │ │
│ │ • Requires 50+ GB VRAM │ │
│ │ • Training takes days/weeks │ │
│ │ • Creates full model copy (14+ GB) │ │
│ │ • Easy to overfit or catastrophic forgetting │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ LoRA Fine-Tuning │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ Base Model (3.5B - FROZEN) │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌───────────────────────────────────────────────────────────┐ │ │
│ │ │ Original Weight Matrix W (frozen) │ │ │
│ │ │ │ │ │ │
│ │ │ │ + LoRA Adaptation │ │ │
│ │ │ │ │ │ │
│ │ │ ┌─────┐ │ ┌─────┐ │ │ │
│ │ │ │ A │ ──┼──▶│ B │ (rank 8-64, trainable) │ │ │
│ │ │ └─────┘ │ └─────┘ │ │ │
│ │ │ (d×r) │ (r×d) │ │ │
│ │ │ │ │ │ │
│ │ │ Output = W·x + (A·B)·x │ │ │
│ │ └───────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ Benefits: │ │
│ │ • Only 16-24 GB VRAM needed │ │
│ │ • Training takes hours │ │
│ │ • LoRA weights are tiny (10-100 MB) │ │
│ │ • Easy to swap/combine multiple LoRAs │ │
│ │ • Base model knowledge preserved │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
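
In code, the LoRA update is just two small linear maps added to a frozen layer. Below is a minimal PyTorch sketch of the idea (not ACE-Step's actual implementation), with A as the down-projection and B as the zero-initialized up-projection:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: y = W·x + (alpha/rank) · B(A(x)), with W frozen."""

    def __init__(self, base: nn.Linear, rank: int = 32, alpha: int = 64):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                                   # freeze original weights
        self.lora_A = nn.Linear(base.in_features, rank, bias=False)   # project down to r
        self.lora_B = nn.Linear(rank, base.out_features, bias=False)  # project back up
        nn.init.zeros_(self.lora_B.weight)                            # adaptation starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_B(self.lora_A(x))

For an illustrative hidden size of d = 2048 and rank r = 32, the adapter adds 2·d·r ≈ 131K trainable parameters per wrapped layer, versus d² ≈ 4.2M in the full weight matrix, which is why LoRA checkpoints stay in the tens of megabytes.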

LoRA vs Full Fine-Tuning

| Aspect | Full Fine-Tuning | LoRA |
| --- | --- | --- |
| VRAM Required | 50+ GB | 16-24 GB |
| Training Time | Days/Weeks | Hours |
| Model Size | 14+ GB | 10-100 MB |
| Base Model | Modified | Preserved |
| Catastrophic Forgetting | High Risk | Low Risk |
| Combining Styles | Difficult | Easy (merge LoRAs) |

Official ACE-Step LoRAs

The ACE-Step team has released several specialized LoRA models:

Released Models

| LoRA | Purpose | Size | Performance | Best For |
| --- | --- | --- | --- | --- |
| Lyric2Vocal | Pure vocal generation | 45 MB | Excellent vocals | A cappella, vocal-focused |
| Text2Samples | Instrumental samples | 38 MB | Clean loops | Producers, sample packs |
| RapMachine | Rap/Hip-hop | 52 MB | Chinese rap optimized | Rap vocals, flow |

Planned Models

| LoRA | Purpose | Status | Expected |
| --- | --- | --- | --- |
| StemGen | Individual instrument tracks | In Development | Q2 2025 |
| Singing2Accompaniment | Backing tracks from vocals | In Development | Q2 2025 |
| EDMaster | EDM/Electronic focus | Planned | Q3 2025 |
| LoFiVibes | Lo-fi aesthetic | Planned | Q3 2025 |

Using Official LoRAs

import torch
from acestep import ACEStepPipeline

# Load base model with Lyric2Vocal LoRA
pipe = ACEStepPipeline.from_pretrained(
    "ACE-Step/ACE-Step-v1-3.5B",
    lora_path="ACE-Step/ACE-Step-v1-Lyric2Vocal-LoRA",
    torch_dtype=torch.bfloat16,
)

# Generate vocal-focused content
result = pipe(
    prompt="[vocals only] powerful female vocal, soul, emotional",
    duration=30,
)
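
Swapping adapters only requires pointing lora_path at a different repository. As a sketch, assuming the Text2Samples release follows the same naming convention (the exact repo id below is hypothetical):

pipe = ACEStepPipeline.from_pretrained(
    "ACE-Step/ACE-Step-v1-3.5B",
    lora_path="ACE-Step/ACE-Step-v1-Text2Samples-LoRA",  # hypothetical repo id
    torch_dtype=torch.bfloat16,
)

result = pipe(
    prompt="[instrumental] tight funk drum loop, clean and punchy",
    duration=30,
)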

Training Your Own LoRA

Prerequisites

| Requirement | Minimum | Recommended |
| --- | --- | --- |
| GPU VRAM | 16 GB | 24 GB |
| System RAM | 32 GB | 64 GB |
| Storage | 100 GB | 500 GB |
| Python | 3.10+ | 3.11 |
| CUDA | 11.8+ | 12.1 |

Dataset Preparation

┌─────────────────────────────────────────────────────────────────────────────┐
│ DATASET STRUCTURE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ training_data/ │
│ ├── audio/ │
│ │ ├── track_001.wav (44.1kHz, stereo, 30s-240s) │
│ │ ├── track_002.wav │
│ │ ├── track_003.wav │
│ │ └── ... │
│ │ │
│ ├── annotations/ │
│ │ ├── track_001.json │
│ │ │ { │
│ │ │ "prompt": "energetic rock song with distorted guitars", │
│ │ │ "lyrics": "verse 1 lyrics here...", │
│ │ │ "tags": ["rock", "energetic", "guitar"], │
│ │ │ "bpm": 140, │
│ │ │ "key": "E minor" │
│ │ │ } │
│ │ ├── track_002.json │
│ │ └── ... │
│ │ │
│ └── metadata.csv │
│ filename,prompt,duration,genre,mood │
│ track_001.wav,"energetic rock...",180,rock,energetic │
│ track_002.wav,"chill ambient...",240,ambient,calm │
│ │
│ Dataset Requirements: │
│ • Minimum: 50 tracks (1+ hours total) │
│ • Recommended: 200+ tracks (5+ hours total) │
│ • Audio: WAV format, 44.1kHz, stereo │
│ • Annotations: Detailed prompts matching audio content │
│ • Quality: Clean recordings, consistent style │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
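
Before training, it is worth sanity-checking every track against these requirements. A minimal validation sketch using only the standard library (it assumes the layout above and PCM-encoded WAV files):

import json
import wave
from pathlib import Path

root = Path("training_data")
for wav_path in sorted((root / "audio").glob("*.wav")):
    # Every track needs a matching annotation with a non-empty prompt
    ann_path = root / "annotations" / (wav_path.stem + ".json")
    if not ann_path.exists():
        print(f"MISSING ANNOTATION: {wav_path.name}")
        continue
    if not json.loads(ann_path.read_text()).get("prompt"):
        print(f"EMPTY PROMPT: {ann_path.name}")

    # Check format: 44.1kHz stereo, 30-240 seconds
    with wave.open(str(wav_path)) as wav:
        rate, channels = wav.getframerate(), wav.getnchannels()
        seconds = wav.getnframes() / rate
    if rate != 44100 or channels != 2:
        print(f"BAD FORMAT: {wav_path.name} ({rate} Hz, {channels} ch)")
    if not 30 <= seconds <= 240:
        print(f"BAD DURATION: {wav_path.name} ({seconds:.0f}s)")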

Training Configuration

# train_config.yaml

# Model settings
base_model: "ACE-Step/ACE-Step-v1-3.5B"
output_dir: "./lora_output"

# LoRA hyperparameters
lora:
  rank: 32          # Higher = more capacity, more VRAM
  alpha: 64         # Scaling factor (usually 2x rank)
  dropout: 0.1      # Regularization
  target_modules:   # Which layers to adapt
    - "q_proj"
    - "k_proj"
    - "v_proj"
    - "out_proj"
    - "fc1"
    - "fc2"

# Training settings
training:
  epochs: 50
  batch_size: 1     # Limited by VRAM
  gradient_accumulation: 8
  learning_rate: 1e-4
  lr_scheduler: "cosine"
  warmup_steps: 100
  max_grad_norm: 1.0

# Data settings
data:
  train_dir: "./training_data"
  validation_split: 0.1
  max_duration: 60  # Crop longer tracks
  augmentation:
    pitch_shift: true   # ±2 semitones
    time_stretch: true  # ±10%
    add_noise: false

# Hardware
precision: "bf16"
gradient_checkpointing: true
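
Two numbers in this config are worth working out up front: the effective batch size and the rough trainable-parameter count. A back-of-the-envelope sketch (the hidden size and depth are illustrative guesses, not ACE-Step's actual dimensions):

# Effective batch size seen by the optimizer
effective_batch = 1 * 8        # batch_size * gradient_accumulation

# Rough trainable-parameter count for this config
d, rank = 2048, 32             # assumed hidden size; rank from train_config.yaml
n_layers, n_targets = 24, 6    # assumed depth; 6 target_modules per layer
per_module = 2 * d * rank      # A (d×r) + B (r×d)
total = per_module * n_targets * n_layers
print(f"{total / 1e6:.1f}M trainable params")   # ~18.9M, roughly 0.5% of 3.5B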

Training Script

# Callbacks are assumed to live in acestep.training alongside the trainer
from acestep.training import (
    LoRATrainer,
    TrainingConfig,
    SaveCheckpointCallback,
    WandbCallback,
    EarlyStoppingCallback,
)

# Load configuration
config = TrainingConfig.from_yaml("train_config.yaml")

# Initialize trainer
trainer = LoRATrainer(
    base_model="ACE-Step/ACE-Step-v1-3.5B",
    config=config,
    device="cuda",
)

# Prepare dataset
dataset = trainer.prepare_dataset("./training_data")

# Start training
trainer.train(
    dataset=dataset,
    epochs=config.training.epochs,
    callbacks=[
        SaveCheckpointCallback(every=10),                      # save checkpoints every 10 epochs
        WandbCallback(project="acestep-lora"),                 # log to wandb
        EarlyStoppingCallback(patience=5, metric="val_loss"),  # early stopping
    ],
)

# Save final LoRA
trainer.save_lora("./my_custom_lora")

Training Metrics

Monitor these metrics during training:

| Metric | Good Range | Indicates |
| --- | --- | --- |
| train_loss | Decreasing | Model learning |
| val_loss | Stable/Decreasing | No overfitting |
| grad_norm | < 1.0 | Stable training |
| learning_rate | Following schedule | Optimizer working |
| audio_quality | Subjective | Listen to samples! |
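
The val_loss row describes a pattern that is easy to check programmatically: training loss still falling while validation loss rises. An illustrative helper (not part of the ACE-Step API):

def looks_overfit(train_losses, val_losses, window=5):
    """Flag the overfitting signature: train loss falling while
    validation loss has risen over the last `window` epochs."""
    if len(val_losses) < 2 * window:
        return False   # not enough history to judge
    train_trend = train_losses[-1] - train_losses[-window]
    val_trend = val_losses[-1] - val_losses[-window]
    return train_trend < 0 and val_trend > 0

# Example: validation loss turns upward while training loss keeps dropping
train = [0.90, 0.70, 0.50, 0.40, 0.32, 0.27, 0.23, 0.20, 0.18, 0.16]
val   = [0.90, 0.75, 0.60, 0.52, 0.50, 0.49, 0.50, 0.53, 0.57, 0.62]
print(looks_overfit(train, val))   # True -> stop and keep an earlier checkpoint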

Evaluation

from acestep import ACEStepPipeline

# Generate samples with the trained LoRA
pipe = ACEStepPipeline.from_pretrained(
    "ACE-Step/ACE-Step-v1-3.5B",
    lora_path="./my_custom_lora",
)

# Test prompts matching your training data style
test_prompts = [
    "your style prompt 1",
    "your style prompt 2",
    "edge case prompt",
]

for i, prompt in enumerate(test_prompts):
    result = pipe(prompt=prompt, duration=30, seed=42)
    result.save(f"eval_{i:02d}.mp3")   # stable filenames (str hash() varies per run)

Combining Multiple LoRAs

ACE-Step supports merging multiple LoRAs for hybrid styles:

from acestep import ACEStepPipeline, merge_loras

# Merge two LoRAs with weights
merged_lora = merge_loras(
    loras=[
        ("./lora_vocal", 0.7),     # 70% vocal style
        ("./lora_ambient", 0.3),   # 30% ambient style
    ],
    output_path="./merged_lora",
)

# Use merged LoRA
pipe = ACEStepPipeline.from_pretrained(
    "ACE-Step/ACE-Step-v1-3.5B",
    lora_path="./merged_lora",
)

Merge Strategies

| Strategy | Use Case | Example |
| --- | --- | --- |
| Weighted Average | Blend styles | 0.5 rock + 0.5 electronic |
| Task Arithmetic | Add capabilities | base + vocals + production |
| TIES Merge | Reduce conflicts | Multiple specialized LoRAs |
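
For intuition, the weighted-average strategy is a key-wise weighted sum over the adapter tensors. A minimal sketch of that idea (merge_loras handles this internally; the .pt file layout here is an assumption, and summing the low-rank A/B factors only approximates averaging the full ΔW = A·B updates):

import torch

def weighted_average_merge(loras, output_path):
    """Key-wise weighted sum of LoRA state dicts."""
    merged = {}
    for path, weight in loras:
        state = torch.load(path, map_location="cpu")   # assumed: plain state dict
        for key, tensor in state.items():
            merged[key] = merged.get(key, 0) + weight * tensor
    torch.save(merged, output_path)

weighted_average_merge(
    [("./lora_vocal/lora.pt", 0.7), ("./lora_ambient/lora.pt", 0.3)],
    "./merged_lora/lora.pt",
)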

AceSteps Custom Model Roadmap

Phase 1: Official LoRA Selection (Current)

┌─────────────────────────────────────────────────────────────────────────────┐
│ PHASE 1: OFFICIAL LORAS │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Creator Studio UI │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ Model Selection: │ │
│ │ ┌─────────────────────────────────────────────────────────────┐ │ │
│ │ │ ○ Base ACE-Step (General Purpose) │ │ │
│ │ │ ○ Lyric2Vocal (Vocal Focus) │ │ │
│ │ │ ○ Text2Samples (Instrumental) │ │ │
│ │ │ ○ RapMachine (Rap/Hip-hop) │ │ │
│ │ └─────────────────────────────────────────────────────────────┘ │ │
│ │ │ │
│ │ Status: ✓ Available now │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘

Phase 2: Community Gallery (In Development)

┌─────────────────────────────────────────────────────────────────────────────┐
│ PHASE 2: COMMUNITY GALLERY │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Features: │
│ • Upload trained LoRAs to AceSteps │
│ • Community ratings and reviews │
│ • Quality moderation by team │
│ • Revenue sharing for popular models │
│ │
│ Model Card Example: │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ 🎸 RetroSynth80s ★★★★☆ (4.2) │ │
│ │ by @synthmaster │ │
│ │ │ │
│ │ 80s synthwave and retrowave style. Perfect for │ │
│ │ nostalgic, neon-lit soundscapes. │ │
│ │ │ │
│ │ Downloads: 1,234 | Size: 48 MB | Base: v1-3.5B │ │
│ │ │ │
│ │ [Try Now] [Download] [View Samples] │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ Status: 🚧 In Development │
│ │
└─────────────────────────────────────────────────────────────────────────────┘

Phase 3: In-Platform Training (Q4 2025)

┌─────────────────────────────────────────────────────────────────────────────┐
│ PHASE 3: IN-PLATFORM TRAINING │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Features: │
│ • Upload your audio dataset directly │
│ • Automatic annotation assistance │
│ • Cloud GPU training (no local hardware needed) │
│ • Training dashboard with metrics │
│ • One-click deployment to your generations │
│ │
│ Training Dashboard: │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ My LoRA Training: "JazzFusion" │ │
│ │ │ │
│ │ Progress: ████████████░░░░░░░░ 62% │ │
│ │ Epoch: 31/50 │ │
│ │ Loss: 0.0234 ↓ │ │
│ │ ETA: 2h 15m │ │
│ │ │ │
│ │ [Pause] [Cancel] [View Samples] │ │
│ │ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
│ Status: 📋 Planned │
│ │
└─────────────────────────────────────────────────────────────────────────────┘

Best Practices

Dataset Quality

| Do | Don't |
| --- | --- |
| Use high-quality recordings | Include noisy/clipped audio |
| Keep consistent style | Mix unrelated genres |
| Write detailed prompts | Use vague descriptions |
| Include 50+ diverse tracks | Train on < 20 samples |
| Validate audio formats | Mix sample rates |

Training Tips

  1. Start with a lower rank (16-32) and increase it if the model underfits
  2. Monitor validation loss - stop if it rises while training loss keeps falling
  3. Listen to samples every 10 epochs - metrics don't tell the full story
  4. Save checkpoints - you can always roll back to an earlier version
  5. Test edge cases - prompts outside your training distribution

Common Issues

| Issue | Cause | Solution |
| --- | --- | --- |
| Overfitting | Too many epochs | Early stopping, more data |
| Underfitting | Rank too low | Increase LoRA rank |
| Mode collapse | Learning rate too high | Reduce LR, add warmup |
| Quality degradation | Bad data | Clean dataset, filter outliers |
| CUDA OOM | Batch too large | Reduce batch, use gradient accumulation |
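
The gradient-accumulation fix in the last row trades steps for memory: gradients from several size-1 micro-batches are summed before each optimizer step, matching a larger batch without holding it in VRAM. A generic PyTorch sketch of the pattern, with toy stand-ins for the model and data:

import torch
import torch.nn as nn

# Toy stand-ins; in real training these come from the trainer
model = nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
data = [(torch.randn(1, 16), torch.randn(1, 1)) for _ in range(32)]

accum_steps = 8   # matches gradient_accumulation in train_config.yaml
optimizer.zero_grad()
for step, (x, y) in enumerate(data):
    loss = nn.functional.mse_loss(model(x), y) / accum_steps   # scale so summed grads average
    loss.backward()   # gradients accumulate across micro-batches
    if (step + 1) % accum_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # max_grad_norm: 1.0
        optimizer.step()
        optimizer.zero_grad()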
