
ACE-Step AI

ACE-Step is the AI model powering music generation in AceSteps.

Overview

ACE-Step is an open-source AI music generation model that creates professional-quality audio from text descriptions.

Key Specifications

Specification     | Value
------------------|--------------------------------------
Model Size        | 3.5B parameters
License           | Apache 2.0 (full commercial use)
Generation Speed  | ~4.5 seconds for 30 seconds of audio
Max Duration      | 240 seconds
Output Quality    | Professional-grade audio

Why ACE-Step?

Open Source

  • Apache 2.0 License - Full commercial use rights
  • Self-Hostable - No vendor lock-in
  • Transparent - Model weights are publicly available

Quality

  • Professional output quality
  • Multiple genre support
  • Natural instrument sounds
  • Coherent song structure

Speed

  • ~4.5 seconds for 30 seconds of music
  • Serverless GPU inference via Modal
  • Scalable for high demand
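
That works out to roughly 30 / 4.5 ≈ 6.7× faster than real time for a 30-second clip; actual times vary with prompt complexity and requested duration (see Limitations below).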

How It Works

Generation Pipeline

User Prompt ("chill lofi")
    → Text Processing (tokenization)
    → ACE-Step Model (neural network on A10G GPU)
    → Audio Generation (waveform synthesis)
    → Output (MP3/WAV)

Backend Architecture

# Simplified generation flow
async def generate_music(prompt: str, duration: int):
    # 1. Queue job in database
    job = await create_job(prompt, duration)

    # 2. Trigger GPU inference on Modal
    audio = await modal_generate(prompt, duration)

    # 3. Upload to storage
    audio_url = await upload_to_storage(audio)

    return audio_url

Capabilities

Text-to-Music

Generate complete songs from natural language:

Input: "upbeat electronic dance music with synth leads"
Output: 30-second EDM track with synthesizers

Style Control

Specify genre, mood, and instrumentation (see the prompt-building sketch after this list):

  • Genre: Lo-fi, EDM, Hip-hop, Rock, Classical, Ambient
  • Mood: Happy, Sad, Energetic, Calm, Dark, Uplifting
  • Instruments: Piano, Guitar, Drums, Synth, Strings
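
These descriptors are combined into a single free-text prompt. As a minimal sketch (the buildPrompt helper below is hypothetical and not part of the AceSteps API), composing one might look like:

// Hypothetical helper: joins the style descriptors above into the kind of
// free-text prompt ACE-Step accepts (e.g. "calm lo-fi track with piano, guitar").
function buildPrompt(genre: string, mood: string, instruments: string[]): string {
  return `${mood} ${genre} track with ${instruments.join(', ')}`.toLowerCase();
}

const prompt = buildPrompt('Lo-fi', 'Calm', ['Piano', 'Guitar']);
// => "calm lo-fi track with piano, guitar"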

Lyrics Support (Optional)

Generate songs with AI-generated lyrics:

Input: "love song with vocals about summer nights"
Output: Song with AI-generated lyrics and melody

Integration

API Endpoint

const response = await fetch('/api/generate', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    prompt: 'chill lofi beat with rain sounds',
    duration: 30
  })
});

const { audioUrl, audioHash } = await response.json();

Response Format

{
  "id": "gen_abc123",
  "status": "completed",
  "audioUrl": "https://storage.acesteps.xyz/audio/gen_abc123.mp3",
  "audioHash": "0x1234...abcd",
  "duration": 30,
  "prompt": "chill lofi beat with rain sounds"
}
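
For typed clients, the response above can be modeled with a small TypeScript interface. The requestGeneration wrapper below is illustrative only (not part of an official SDK), and the status field is typed loosely because only "completed" appears in this documentation:

// Shape of the /api/generate response shown above.
interface GenerationResponse {
  id: string;
  status: string; // "completed" in the example above; other values are not documented here
  audioUrl: string;
  audioHash: string;
  duration: number;
  prompt: string;
}

// Illustrative wrapper (not an official SDK function): posts a prompt and
// returns the parsed, typed response, throwing on HTTP errors.
async function requestGeneration(prompt: string, duration = 30): Promise<GenerationResponse> {
  const res = await fetch('/api/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt, duration })
  });
  if (!res.ok) {
    throw new Error(`Generation request failed with status ${res.status}`);
  }
  return res.json() as Promise<GenerationResponse>;
}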

Technical Details

Model Architecture

ACE-Step uses a transformer-based architecture:

  • Multi-modal text-to-audio generation
  • Attention mechanisms for temporal coherence
  • Diffusion-based audio synthesis
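
The diffusion step can be pictured as iterative denoising: synthesis starts from random noise and is refined step by step toward a waveform, conditioned on the text prompt at every step. The toy loop below is purely conceptual; denoiseStep stands in for the learned network and bears no relation to ACE-Step's actual implementation:

// Conceptual sketch only: diffusion-style synthesis as iterative denoising.
// In the real model, the denoiser is a learned network conditioned on the
// text-prompt embedding; here it is a trivial placeholder.
function denoiseStep(sample: Float32Array, promptEmbedding: Float32Array): Float32Array {
  const out = new Float32Array(sample.length);
  for (let i = 0; i < sample.length; i++) {
    // Placeholder update: pull the noisy sample slightly toward the conditioning signal.
    out[i] = 0.9 * sample[i] + 0.1 * promptEmbedding[i % promptEmbedding.length];
  }
  return out;
}

function toyDiffusionSynthesis(promptEmbedding: Float32Array, numSamples: number, steps = 50): Float32Array {
  // 1. Start from random noise.
  let sample = new Float32Array(numSamples).map(() => Math.random() * 2 - 1);
  // 2. Repeatedly denoise, conditioning on the prompt at each step.
  for (let t = 0; t < steps; t++) {
    sample = denoiseStep(sample, promptEmbedding);
  }
  return sample; // The real pipeline decodes the result into the output audio.
}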

GPU Requirements

GPU   | Generation Time (30 s of audio)
------|--------------------------------
A10G  | ~4.5 seconds
A100  | ~3 seconds
H100  | ~2 seconds

Output Format

Property     | Value
-------------|---------------------
Format       | MP3 (default), WAV
Bitrate      | 320 kbps
Sample Rate  | 44.1 kHz
Channels     | Stereo

Limitations

Current Constraints

  • Generation time varies by prompt complexity
  • Some niche genres may produce inconsistent results
  • No direct artist/song imitation (by design)
  • Maximum duration of 240 seconds

Not Supported

  • Copying existing songs
  • Voice cloning
  • Specific artist styles
  • Copyrighted material reproduction
