ACE-Step 1.5 is an open-source music model similar to Suno, Udio, Mureka, Lyria, etc.
Workflows are embedded in the videos
Just drag and drop a video into ComfyUI to import its workflow (each workflow includes an optional group of nodes that loads a static image and turns it into a video along with the generated song).
Features
Lightweight: The model runs locally with less than 4GB of VRAM.
Fast: a 2-minute song can be generated in about a minute on a low-end GPU (<= 8GB VRAM).
Uncensored: you can prompt any lyrics you want.
Multiple languages: over 50 languages are supported officially.
Models
SFT 1.7B AIO checkpoint
I've merged the following models into a single checkpoint file:
acestep-v15-sft.safetensors: the SFT model, which sounds better than the Turbo version, while requiring only a few more steps.
qwen_0.6b_ace15.safetensors: mandatory text encoder.
qwen_1.7b_ace15.safetensors: additional 1.7B text encoder, which is many times faster than the 4B alternative while being almost as good in my own experiments.
ace_1.5_vae.safetensors: the VAE.
Simply download it to your ComfyUI/models/checkpoints folder.
Tips
Prompts
The model expects 2 optional prompts: caption/tags and lyrics/structure.
Caption/tags
Guide the model toward the genre and instruments you want in your music, as well as vocals, mood, era, mixing style, etc.
Genres:
jazz, rap, rock, metal, hip hop, bossa nova, electronic, synthwave, blues, reggae, etc. A potential list of all 178k genres used during training can be found here.
Instruments:
acoustic guitar, piano, drums, bass, synths, electric guitar, violin, etc.
Vocals:
raspy male vocal, young female vocal, duet harmonies, whispering child, etc.
See more tags in the official tutorial.
The caption can be written either as comma-separated tags or in natural language.
Multiple genres may be provided, but conflicts will likely harm the quality.
Even small changes will impact the result substantially.
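As a rough sketch of the comma-separated form, the helper below joins tag groups into a single caption and drops duplicates. The function name `build_caption` and its parameters are my own illustration, not part of ACE-Step or ComfyUI; the resulting string is simply what you would paste into the caption/tags prompt.

```python
def build_caption(genres, instruments=(), vocals=(), extras=()):
    """Join tag groups into one comma-separated caption.

    Hypothetical helper: duplicates are dropped (case-insensitive)
    and the original order of the tags is preserved.
    """
    seen = set()
    tags = []
    for tag in (*genres, *instruments, *vocals, *extras):
        cleaned = tag.strip()
        key = cleaned.lower()
        if key and key not in seen:
            seen.add(key)
            tags.append(cleaned)
    return ", ".join(tags)


caption = build_caption(
    genres=["synthwave", "electronic"],
    instruments=["synths", "drums", "Synths"],
    vocals=["young female vocal"],
)
print(caption)
```

Since even small changes affect the result, building the caption from lists like this makes it easier to change one tag at a time and compare outputs.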
Lyrics/structure
The model will sing better if each line contains between 6 and 10 syllables.
It's recommended to provide structure tags to organise your lyrics. Examples:
[Intro],[Verse],[Chorus],[Bridge],[Instrumental],[End], etc.
Add an empty line between structure blocks.
Inside a structure tag, you may add other hints. Examples:
[Intro - Dreamy],[Chorus - Layered vocals],[Instrumental - Guitar solo],[Bridge - Whispering], etc.
Some singing techniques/effects are recognised. Examples:
ACE-Step is heeere: hold the note for longer.
For your mind (your mind): backing vocals.
Stand up and SHOUT: sing with more power.
[pt] Obrigado, amigo: switches to a different language.
[whispering] Don't be afraid: attempt to add said effect.
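To illustrate the layout described above (one structure tag per block, optional hint inside the tag, an empty line between blocks), here is a minimal sketch that assembles a lyrics prompt. `build_lyrics` is a hypothetical helper of mine, not something shipped with the model; it only produces the text for the lyrics/structure prompt.

```python
def build_lyrics(blocks):
    """Turn (tag, hint, lines) tuples into a lyrics prompt.

    Each block becomes "[Tag]" or "[Tag - Hint]" followed by its
    lines; blocks are separated by one empty line.
    """
    parts = []
    for tag, hint, lines in blocks:
        header = f"[{tag} - {hint}]" if hint else f"[{tag}]"
        parts.append("\n".join([header, *lines]))
    return "\n\n".join(parts)


lyrics = build_lyrics([
    ("Intro", "Dreamy", []),
    ("Verse", None, ["ACE-Step is heeere", "For your mind (your mind)"]),
    ("Chorus", "Layered vocals", ["Stand up and SHOUT"]),
    ("End", None, []),
])
print(lyrics)
```

The helper does not check syllable counts, so keeping each line at 6 to 10 syllables is still up to you.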
Metadata
Metadata fields are hints that guide the music's attributes, but they might be overridden by the prompts.
The most relevant are listed below:
bpm: beats per minute, determines the tempo. Common distribution: slow songs 60–80, mid-tempo 90–120, fast songs 130–190.
duration: target duration in seconds. The model officially supports 10s–600s. Short songs (30–60s) and medium length (2–4min) are stable; very long generation may have repetition or structure issues.
timesignature: 4/4 (most common), 3/4 (waltz), 6/8 (swing feel).
language: choose one of the many supported languages for the lyrics.
keyscale: affects overall pitch and emotional color. Usually Minor = darker mood; Major = brighter mood.
generate_audio_codes: when enabled (recommended), spends much more time on the text encoder conditioning to improve song quality substantially.
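The ranges above can be sanity-checked before you queue a generation. This is only a sketch: `tempo_class` and `check_duration` are hypothetical helpers, and the dictionary values are example choices of mine. Only the key names and the numeric ranges come from the list above.

```python
def tempo_class(bpm):
    """Map a BPM value to the rough tempo ranges quoted above."""
    if 60 <= bpm <= 80:
        return "slow"
    if 90 <= bpm <= 120:
        return "mid-tempo"
    if 130 <= bpm <= 190:
        return "fast"
    return "unclassified"  # in a gap or outside the quoted ranges


def check_duration(seconds):
    """Reject durations outside the officially supported 10s-600s."""
    if not 10 <= seconds <= 600:
        raise ValueError(f"duration {seconds}s is outside 10-600s")
    return seconds


# Example values (assumptions); key names mirror the metadata fields.
metadata = {
    "bpm": 100,
    "duration": check_duration(120),
    "timesignature": "4/4",
    "generate_audio_codes": True,
}
print(tempo_class(metadata["bpm"]))
```

Remember that these are hints only: the prompts can still override them.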
Sampling
Most of the music (melody, harmony, cadence, etc) comes from the conditioning, so tweak sampling parameters to explore variations.
steps: the SFT model requires at least 20, but I recommend 30-50 for good results. Sometimes requires 50-100+ to improve failed parts.
cfg: 1.0 is good enough, even for the SFT model. Increasing it to 2.0 seems to improve vocals while reducing presence of instruments. Over 2.0 starts to harm the output (do your own experiments though).
sampler: my favourites: sa_solver_pece, heun, dpmpp_sde, uni_pc_bh2, euler, etc.
scheduler: my favourites: beta, simple, kl_optimal, etc.
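If you want to explore variations systematically, one option is to enumerate combinations of the settings above and try them one by one with a fixed prompt. The sketch below only builds the list of combinations; `sampling_grid` is a hypothetical helper, and feeding each entry into your ComfyUI workflow is left to you.

```python
from itertools import product

# Sampler and scheduler names taken from the favourites listed above.
SAMPLERS = ["sa_solver_pece", "heun", "dpmpp_sde", "uni_pc_bh2", "euler"]
SCHEDULERS = ["beta", "simple", "kl_optimal"]


def sampling_grid(steps=(30, 50), cfgs=(1.0, 2.0),
                  samplers=SAMPLERS, schedulers=SCHEDULERS):
    """Yield one settings dict per combination of the given values."""
    for n_steps, cfg, sampler, scheduler in product(
        steps, cfgs, samplers, schedulers
    ):
        yield {
            "steps": n_steps,
            "cfg": cfg,
            "sampler": sampler,
            "scheduler": scheduler,
        }


grid = list(sampling_grid())
print(len(grid))  # 2 steps x 2 cfg x 5 samplers x 3 schedulers = 60
```

Sixty runs is a lot, so in practice you may prefer to narrow the lists to two or three candidates each.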
Description
Merge of the following ACE-Step 1.5 models into a single checkpoint file:
acestep-v15-sft.safetensors: the SFT model.
qwen_0.6b_ace15.safetensors: mandatory text encoder.
qwen_1.7b_ace15.safetensors: additional 1.7B text encoder for better audio codes.
ace_1.5_vae.safetensors: the VAE.