ACE-Step AI

ACE-Step is a state-of-the-art open-source AI music generation model that powers all music creation on AceSteps. This page provides a comprehensive overview of the model's capabilities, architecture, and integration.

Why ACE-Step?

AceSteps chose ACE-Step as its AI backbone for several critical reasons:

| Aspect | ACE-Step | Proprietary Models |
| --- | --- | --- |
| License | Apache 2.0 | Restrictive/Proprietary |
| Commercial Use | Full rights | Often limited |
| Training Data | Disclosed | Opaque |
| Copyright Claims | None possible | Risk of infringement |
| NFT Minting | Fully legal | Legal gray area |

Why This Matters

Every song created on AceSteps is 100% copyright-free. Creators can mint, sell, and monetize their AI-generated music without any legal risk. This is foundational to our tokenization model.

Open Source Advantages

  • Transparency - Model weights publicly available
  • Self-Hostable - No vendor lock-in
  • Community - Active development and improvements
  • Customizable - LoRA fine-tuning support
  • Auditable - Training methodology documented

Model Specifications

Core Parameters

| Specification | Value |
| --- | --- |
| Model Name | ACE-Step-v1-3.5B |
| Parameters | 3.5 billion |
| Architecture | Diffusion + Linear Transformer |
| License | Apache 2.0 |
| Developers | ACE Studio & StepFun |
| Release | 2025 |

Audio Output

| Property | Value |
| --- | --- |
| Format | MP3 (default), WAV |
| Bitrate | 320 kbps |
| Sample Rate | 44.1 kHz |
| Channels | Stereo |
| Max Duration | 240 seconds (4 minutes) |
| Quality | Professional-grade |
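
As a rough sizing aid, the figures in the table translate into approximate file sizes. The sketch below assumes a constant-bitrate MP3 and, for WAV, 16-bit PCM samples; the bit depth is an assumption, since it is not stated on this page. The function names are illustrative only.

```python
def mp3_size_bytes(duration_s: float, bitrate_kbps: int = 320) -> int:
    """Approximate size of a constant-bitrate MP3 (ignores headers and tags)."""
    return int(duration_s * bitrate_kbps * 1000 / 8)


def wav_size_bytes(duration_s: float, sample_rate: int = 44_100,
                   channels: int = 2, bytes_per_sample: int = 2) -> int:
    """Approximate size of uncompressed PCM audio (16-bit depth is an assumption)."""
    return int(duration_s * sample_rate * channels * bytes_per_sample)


if __name__ == "__main__":
    max_duration = 240  # seconds, the maximum duration listed in the table
    print(f"MP3: {mp3_size_bytes(max_duration) / 1e6:.1f} MB")
    print(f"WAV: {wav_size_bytes(max_duration) / 1e6:.1f} MB")
```

Under these assumptions a maximum-length track comes to roughly 9.6 MB as MP3 and about 42.3 MB as WAV.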

Generation Speed

Real-Time Factor (RTF) is the ratio of generated audio duration to generation time: an RTF of 20x means 20 seconds of audio are produced per second of compute. Higher RTF means faster generation.

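| GPU | RTF (27 steps) | Time for 1 min (27 steps) | RTF (60 steps) | Time for 1 min (60 steps) |
| --- | --- | --- | --- | --- |
| RTX 4090 | 34.48x | 1.74s | 15.63x | 3.84s |
| A100 | 27.27x | 2.20s | 12.27x | 4.89s |
| RTX 3090 | 12.76x | 4.70s | 6.48x | 9.26s |
| A10G (AceSteps) | ~20x | ~3.0s | ~10x | ~6.0s |
| M2 Max | 2.27x | 26.43s | 1.03x | 58.25s |

AceSteps Performance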

On our Modal A10G infrastructure, generating 30 seconds of music takes approximately 4-5 seconds. This enables near-instant previews for creators.
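
The "Time for 1 min" columns follow directly from the RTF values: generation time is the audio duration divided by the RTF. A minimal sketch of that arithmetic, with a hypothetical helper name:

```python
def generation_time_s(audio_duration_s: float, rtf: float) -> float:
    """Seconds of compute needed to generate `audio_duration_s` seconds of audio."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return audio_duration_s / rtf


if __name__ == "__main__":
    # RTF values copied from the 27-step column of the table.
    for gpu, rtf in [("RTX 4090", 34.48), ("A100", 27.27), ("M2 Max", 2.27)]:
        print(f"{gpu}: {generation_time_s(60, rtf):.2f}s per minute of audio")
```

By the same formula, a 30-second clip at ~20x works out to about 1.5 seconds of inference, so the 4-5 second figure quoted above presumably includes overhead beyond raw inference; that reading is an assumption, not something this page states.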

Language Support

ACE-Step supports 19 languages for lyrics and vocal generation:

Tier 1 - Excellent Performance

| Language | Code | Vocal Quality | Lyric Alignment |
| --- | --- | --- | --- |
| English | en | Excellent | Excellent |
| Chinese | zh | Excellent | Excellent |
| Japanese | ja | Excellent | Very Good |
| Korean | ko | Very Good | Very Good |

Tier 2 - Good Performance

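| Language | Code | Vocal Quality | Lyric Alignment |
| --- | --- | --- | --- |
| Spanish | es | Very Good | Good |
| German | de | Very Good | Good |
| French | fr | Good | Good |
| Portuguese | pt | Good | Good |
| Italian | it | Good | Good |
| Russian | ru | Good | Good |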

Tier 3 - Experimental

Other supported languages may have reduced quality due to training data imbalance. Performance varies by genre and complexity.
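
A client could use the tiers above to warn creators before they generate vocals in a weaker language. The lookup below is a sketch: the codes come from the two tables, and treating every unlisted code as Tier 3 is an assumption, because this page does not enumerate the remaining supported languages.

```python
# Tiers as listed in the tables above. Codes not listed here fall into the
# experimental tier (or may be unsupported; the page does not enumerate them).
LANGUAGE_TIERS = {
    "en": 1, "zh": 1, "ja": 1, "ko": 1,
    "es": 2, "de": 2, "fr": 2, "pt": 2, "it": 2, "ru": 2,
}


def language_tier(code: str) -> int:
    """Return 1 or 2 for listed languages, 3 (experimental) otherwise."""
    return LANGUAGE_TIERS.get(code.strip().lower(), 3)


if __name__ == "__main__":
    for code in ("en", "fr", "sv"):
        print(code, "-> tier", language_tier(code))
```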

Generation Capabilities

Text-to-Music

Generate complete songs from natural language descriptions:

Input: "upbeat electronic dance music with energetic synth leads
and a driving four-on-the-floor beat, festival anthem style"

Output: 30-second EDM track with synthesizers, bass drops, and builds

Style Control

Genres Supported

| Category | Genres |
| --- | --- |
| Electronic | EDM, House, Techno, Trance, Dubstep, Ambient |
| Hip-Hop | Trap, Boom Bap, Lo-fi Hip-hop, Drill |
| Rock | Alternative, Indie, Metal, Punk, Classic Rock |
| Pop | Synth-pop, K-pop, J-pop, Dance Pop |
| Classical | Orchestral, Piano, Chamber, Cinematic |
| World | Latin, Afrobeat, Reggae, Folk |
| Other | Jazz, R&B, Soul, Country, Blues |

Mood Control

Energetic / Calm / Happy / Sad / Dark / Uplifting / Mysterious / Aggressive
Romantic / Nostalgic / Triumphant / Melancholic / Peaceful / Intense

Instrument Specification

Piano, Guitar (acoustic/electric), Drums, Bass, Synthesizer, Strings,
Brass, Woodwinds, Percussion, Violin, Cello, Saxophone, Trumpet, Flute

Lyrics Generation with Llama-Song-Stream-3B

AceSteps uses two open-source AI models working together:

┌──────────────────────────────────────────────────────────────────┐
│  DUAL-MODEL LYRICS PIPELINE                                      │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  User Prompt: "love song about summer nights by the ocean"       │
│        │                                                         │
│        ▼                                                         │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │ LLAMA-SONG-STREAM-3B (Lyrics Generation)                   │  │
│  │                                                            │  │
│  │ • Fine-tuned Llama 3.2 3B model                            │  │
│  │ • 57.7k lyrical training examples                          │  │
│  │ • Maintains rhyme, meter, thematic consistency             │  │
│  │ • Apache 2.0 license                                       │  │
│  │                                                            │  │
│  │ Output: "Verse 1: Walking down the sandy shore..."         │  │
│  └────────────────────────────────────────────────────────────┘  │
│        │                                                         │
│        │  Generated lyrics + original prompt                     │
│        ▼                                                         │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │ ACE-STEP 3.5B (Music + Vocal Synthesis)                    │  │
│  │                                                            │  │
│  │ • Combines lyrics with musical generation                  │  │
│  │ • Synthesizes vocals matching melody                       │  │
│  │ • Aligns lyrics to beat and rhythm                         │  │
│  │                                                            │  │
│  │ Output: Complete song with vocals                          │  │
│  └────────────────────────────────────────────────────────────┘  │
│        │                                                         │
│        ▼                                                         │
│  Final Audio: Love song with AI vocals and lyrics                │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
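
The hand-off between the two stages can be sketched as plain function composition. Everything here is hypothetical: `generate_song`, `Song`, and the two callables are illustrative names, and the stubs stand in for real calls to the lyrics model and ACE-Step, whose APIs are not described on this page.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Song:
    prompt: str
    lyrics: str
    audio: bytes


def generate_song(
    prompt: str,
    write_lyrics: Callable[[str], str],
    synthesize: Callable[[str, str], bytes],
) -> Song:
    """Run the two stages in order: lyrics first, then music and vocals."""
    lyrics = write_lyrics(prompt)
    # The second stage receives the generated lyrics plus the original prompt.
    audio = synthesize(prompt, lyrics)
    return Song(prompt=prompt, lyrics=lyrics, audio=audio)


if __name__ == "__main__":
    # Stand-in stubs; real code would call the lyrics model and ACE-Step here.
    def fake_lyrics(prompt: str) -> str:
        return f"[Verse 1]\nA song about {prompt}"

    def fake_audio(prompt: str, lyrics: str) -> bytes:
        return b"\x00" * 16

    song = generate_song("summer nights by the ocean", fake_lyrics, fake_audio)
    print(song.lyrics)
```

Passing the stages in as callables keeps the orchestration testable without a GPU.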

Llama-Song-Stream-3B Specifications

| Specification | Value |
| --- | --- |
| Model Name | Llama-Song-Stream-3B-Instruct |
| Base Model | Meta Llama 3.2 3B |
| Parameters | 3 billion |
| Training Data | 57.7k lyrical examples |
| License | Apache 2.0 |
| Developer | prithivMLmods |

Lyrics Generation Capabilities

| Feature | Description |
| --- | --- |
| Genre Awareness | Pop, rock, rap, R&B, country, classical, etc. |
| Rhyme Schemes | ABAB, AABB, free verse, and more |
| Song Structure | Verse, chorus, bridge, pre-chorus, outro |
| Thematic Control | Love, heartbreak, party, motivation, storytelling |
| Multilingual | Best in English, supports other languages |

Example Usage

Input Theme: "motivational workout anthem"
Genre: "EDM / electronic"

Generated Lyrics:
[Verse 1]
Push through the fire, feel the burn ignite
Every rep is power, every step is right
No more excuses, leave them all behind
Champions are built one rep at a time

[Chorus]
Rise up, rise up, we're unstoppable tonight
Rise up, rise up, reaching for the light...

Best Results

For optimal lyrics generation:

  • Specify genre and mood clearly
  • Mention desired song structure (verse/chorus)
  • Include thematic keywords
  • Use Tier 1 languages (English, Chinese, Japanese, Korean)
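
Because the generated lyrics in the example carry bracketed section tags such as [Verse 1] and [Chorus], a small parser can check whether the requested structure actually came back. This is a sketch that assumes every tag sits alone on its own line; `split_sections` is a hypothetical helper, not part of either model.

```python
import re

SECTION_TAG = re.compile(r"^\[(?P<name>[^\]]+)\]\s*$")


def split_sections(lyrics: str) -> list[tuple[str, list[str]]]:
    """Split tagged lyrics into (section name, lines) pairs, in order."""
    sections: list[tuple[str, list[str]]] = []
    for raw in lyrics.splitlines():
        line = raw.strip()
        if not line:
            continue
        tag = SECTION_TAG.match(line)
        if tag:
            sections.append((tag.group("name"), []))
        elif sections:
            sections[-1][1].append(line)
        else:
            # Lines before the first tag go into an untitled section.
            sections.append(("", [line]))
    return sections


if __name__ == "__main__":
    sample = "[Verse 1]\nPush through the fire\n\n[Chorus]\nRise up, rise up"
    for name, lines in split_sections(sample):
        print(name, len(lines))
```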

Advanced Generation Modes

| Mode | Description | Use Case |
| --- | --- | --- |
| Variations | Generate alternatives from same prompt | Explore different interpretations |
| Repainting | Regenerate specific sections | Fix parts you don't like |
| Lyric Editing | Modify lyrics while keeping melody | Adjust words post-generation |
| Extend | Continue an existing generation | Create longer compositions |

Architecture Overview

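ACE-Step pairs a diffusion-style generator, trained with flow matching, with a linear transformer backbone. A text encoder conditions a stack of DiT blocks, and a DCAE decoder turns the resulting latents into audio: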

┌──────────────────────────────────────────────────────────────────┐
│  ACE-STEP GENERATION PIPELINE                                    │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  USER PROMPT                                                     │
│  "chill lofi beat with rain sounds and soft piano"               │
│        │                                                         │
│        ▼                                                         │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │ TEXT ENCODER (T5)                                          │  │
│  │ Converts natural language to semantic embeddings           │  │
│  └────────────────────────────────────────────────────────────┘  │
│        │                                                         │
│        │  Text Embeddings                                        │
│        ▼                                                         │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │ LINEAR TRANSFORMER + DiT BLOCKS                            │  │
│  │                                                            │  │
│  │  ┌───────────┐    ┌───────────┐    ┌───────────┐           │  │
│  │  │    DiT    │───▶│    DiT    │───▶│    DiT    │──▶ ...    │  │
│  │  │  Block 1  │    │  Block 2  │    │  Block 3  │           │  │
│  │  └───────────┘    └───────────┘    └───────────┘           │  │
│  │        ▲                ▲                ▲                 │  │
│  │        │                │                │                 │  │
│  │  ┌─────┴────────────────┴────────────────┴─────┐           │  │
│  │  │             TIMESTEP EMBEDDINGS             │           │  │
│  │  │        (Diffusion step conditioning)        │           │  │
│  │  └─────────────────────────────────────────────┘           │  │
│  └────────────────────────────────────────────────────────────┘  │
│        │                                                         │
│        │  Latent Representations                                 │
│        ▼                                                         │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │ DCAE DECODER                                               │  │
│  │ Deep Compression AutoEncoder (from Sana)                   │  │
│  │ Converts latent space → high-fidelity audio waveform       │  │
│  └────────────────────────────────────────────────────────────┘  │
│        │                                                         │
│        ▼                                                         │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │ AUDIO OUTPUT                                               │  │
│  │ 44.1kHz • Stereo • 320kbps MP3                             │  │
│  └────────────────────────────────────────────────────────────┘  │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘

Key Components

| Component | Purpose | Innovation |
| --- | --- | --- |
| DCAE | Audio encoding/decoding | Deep compression preserves acoustic details |
| Linear Transformer | Sequence modeling | Lightweight, efficient attention |
| REPA | Semantic alignment | MERT + m-hubert for faster convergence |
| Flow-Matching | Generation process | Faster than pure diffusion |

For detailed architecture information, see AI Architecture Deep-Dive.
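
To make the step counts in the speed table (27 vs. 60) more concrete, here is a toy illustration of how a flow-matching sampler turns noise into a sample: it integrates a velocity field from t=0 to t=1 in a fixed number of Euler steps, calling the model once per step. This is a generic textbook sketch, not ACE-Step's actual sampler; the "oracle" velocity below replaces the trained network, and all names are hypothetical.

```python
from typing import Callable, List

Vector = List[float]


def euler_sample(x0: Vector,
                 velocity: Callable[[Vector, float], Vector],
                 num_steps: int) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        v = velocity(x, t)  # one model evaluation per step
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    noise = [0.5, -1.0, 2.0]   # starting point (stands in for random noise)
    target = [1.0, 0.0, -1.0]  # stands in for a clean latent

    # Oracle velocity for a straight noise-to-target path; a trained model
    # would predict this from (x, t) and the text conditioning instead.
    def straight(x: Vector, t: float) -> Vector:
        return [b - a for a, b in zip(noise, target)]

    for steps in (27, 60):
        print(steps, euler_sample(noise, straight, steps))
```

Since each step costs one model evaluation, compute time grows roughly linearly with the step count, which is consistent with the 60-step RTF being about half the 27-step RTF in the table.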

AceSteps Integration

Generation Flow

┌────────────────────────────────────────────────────────────────────┐
│  ACESTEPS GENERATION FLOW                                          │
├────────────────────────────────────────────────────────────────────┤
│                                                                    │
│  CREATOR                   ACESTEPS                    BLOCKCHAIN  │
│     │                          │                            │      │
│     │  1. Enter prompt         │                            │      │
│     │─────────────────────────▶│                            │      │
│     │                          │                            │      │
│     │                 2. Validate & queue                   │      │
│     │                          │                            │      │
│     │                  3. GPU inference                     │      │
│     │                    (Modal A10G)                       │      │
│     │                          │                            │      │
│     │  4. Stream preview       │                            │      │
│     │◀─────────────────────────│                            │      │
│     │                          │                            │      │
│     │  5. Click "Save"         │                            │      │
│     │─────────────────────────▶│                            │      │
│     │                          │                            │      │
│     │                6. Generate signature                  │      │
│     │                    (ECDSA sign)                       │      │
│     │                          │                            │      │
│     │                          │  7. mint(signature)        │      │
│     │                          │───────────────────────────▶│      │
│     │                          │                            │      │
│     │                          │  8. Verify & mint NFT      │      │
│     │                          │◀───────────────────────────│      │
│     │                          │                            │      │
│     │  9. NFT in wallet!       │                            │      │
│     │◀─────────────────────────│                            │      │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘

Signature Verification

Only AI-generated music can be minted on AceSteps. This is enforced through cryptographic signatures:

  1. Generation - Backend runs ACE-Step, stores audioHash
  2. Signing - Backend signs hash(userAddress + metadataURI + audioHash)
  3. Minting - Smart contract verifies signature via ECDSA
  4. Security - Prevents uploading copyrighted or non-AI content
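
The digest in step 2 can be sketched with the standard library alone. Several things here are assumptions rather than the AceSteps implementation: SHA-256 is a stand-in (on-chain verification typically relies on keccak256 with a specific encoding, which this page does not specify), the length prefixes are an addition to keep concatenated fields unambiguous, and the function names are hypothetical. The ECDSA signing and contract-side check are omitted because they need a private key and a third-party library.

```python
import hashlib


def _field(data: bytes) -> bytes:
    """Length-prefix a field so adjacent fields cannot blur into each other."""
    return len(data).to_bytes(4, "big") + data


def audio_hash(audio: bytes) -> str:
    """Hex digest identifying the generated audio (SHA-256 as a stand-in)."""
    return hashlib.sha256(audio).hexdigest()


def mint_digest(user_address: str, metadata_uri: str, audio_hash_hex: str) -> bytes:
    """Digest over (userAddress, metadataURI, audioHash) that the backend would sign."""
    payload = b"".join(
        _field(part.encode("utf-8"))
        for part in (user_address, metadata_uri, audio_hash_hex)
    )
    return hashlib.sha256(payload).digest()


if __name__ == "__main__":
    # Placeholder values for illustration only.
    digest = mint_digest("0xCreatorAddress", "ipfs://metadata", audio_hash(b"audio bytes"))
    print(digest.hex())
```

Binding the creator address into the digest means a signature issued for one wallet cannot be reused by another, and binding the audio hash ties it to one specific generation.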

For more details, see Backend Integration.

Prompt Engineering Tips

Effective Prompts

| Element | Good Example | Poor Example |
| --- | --- | --- |
| Genre | "lo-fi hip-hop beat" | "good music" |
| Mood | "melancholic, nostalgic" | "sad" |
| Instruments | "soft piano, vinyl crackle, muted drums" | "instruments" |
| Tempo | "slow, 70 BPM" | (not specified) |
| Style Reference | "Nujabes-inspired jazz hop" | "like that one song" |

Prompt Structure

[genre] + [mood/atmosphere] + [instruments] + [additional details]

Example:
"ambient electronic music with ethereal pads, gentle arpeggios,
and atmospheric textures, peaceful and meditative, space-themed"

For comprehensive prompt guidance, see Prompts Guide.
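
The prompt structure above can be assembled programmatically, for example when a UI collects each element in a separate field. The helper below is a hypothetical sketch, and joining the elements with commas is an assumption; this page does not prescribe a separator.

```python
def build_prompt(genre: str, mood: str = "",
                 instruments: tuple[str, ...] = (), details: str = "") -> str:
    """Join the non-empty elements in the order genre, mood, instruments, details."""
    if not genre.strip():
        raise ValueError("genre is required")
    instrument_text = ", ".join(i.strip() for i in instruments if i.strip())
    parts = [genre.strip(), mood.strip(), instrument_text, details.strip()]
    return ", ".join(p for p in parts if p)


if __name__ == "__main__":
    print(build_prompt(
        "lo-fi hip-hop beat",
        mood="melancholic, nostalgic",
        instruments=("soft piano", "vinyl crackle", "muted drums"),
        details="slow, 70 BPM",
    ))
```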

Limitations

Current Constraints

| Limitation | Details | Workaround |
| --- | --- | --- |
| Duration | Max 240 seconds | Use "extend" for longer pieces |
| Seed Sensitivity | Same prompt can yield different results | Save seeds you like |
| Niche Genres | Some genres underperform | Combine with well-supported genres |
| Vocal Nuance | Vocals can sound artificial | Focus on instrumental or simple vocals |
| Long Coherence | Structure may drift >3 min | Keep generations under 2 minutes |
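
For pieces longer than the duration limit, the "extend" workaround implies planning several passes. The sketch below splits a target length into equal segments; it assumes each extend pass can add up to the same maximum as a single generation, which this page does not confirm, and `plan_segments` is a hypothetical helper.

```python
import math

MAX_SEGMENT_S = 240  # maximum duration of a single generation, per the table


def plan_segments(total_s: float, max_segment_s: float = MAX_SEGMENT_S) -> list[float]:
    """Split a target duration into an initial generation plus extend passes."""
    if total_s <= 0 or max_segment_s <= 0:
        raise ValueError("durations must be positive")
    count = math.ceil(total_s / max_segment_s)
    # Equal-length segments avoid a very short final extend pass.
    return [total_s / count] * count


if __name__ == "__main__":
    print(plan_segments(600))                     # default 240-second cap
    print(plan_segments(600, max_segment_s=120))  # shorter cap to limit drift
```

Passing a smaller `max_segment_s`, such as 120, follows the table's advice to keep individual generations under 2 minutes.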

Not Supported

Intentionally Excluded

These features are intentionally unsupported, to keep generated music clear of copyright, likeness, and consent issues:

| Feature | Reason |
| --- | --- |
| Artist Imitation | Copyright and likeness rights |
| Song Recreation | Direct copyright infringement |
| Voice Cloning | Privacy and consent concerns |
| Cover Songs | Requires licensing |

Comparison with Other Models

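| Feature | ACE-Step | Suno | Udio | MusicGen |
| --- | --- | --- | --- | --- |
| Open Source | Yes | No | No | Yes |
| Commercial License | Apache 2.0 | Proprietary | Proprietary | CC-BY-NC |
| NFT Minting Legal | Yes | Unclear | Unclear | No (NC) |
| Max Duration | 4 min | 4 min | 2 min | 30s |
| Speed (1 min) | ~3s | ~30s | ~60s | ~10s |
| Self-Hostable | Yes | No | No | Yes |
| Fine-tuning | LoRA | No | No | Limited |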

Resources

Citation

@misc{gong2025acestep,
  title={ACE-Step: A Step Towards Music Generation Foundation Model},
  author={Junmin Gong and Wenxiao Zhao and Sen Wang and Shengyuan Xu and Jing Guo},
  howpublished={\url{https://github.com/ace-step/ACE-Step}},
  year={2025}
}