ACE-Step 1.5 - Music generation for low VRAM - SFT 1.7B AIO

NSFW

ACE-Step 1.5 is an open-source music model similar to Suno, Udio, Mureka, Lyria, etc.

Workflows are embedded in the videos

Just drag and drop a video into ComfyUI to import its workflow (the workflow includes an optional group of nodes that loads a static image and turns it into a video along with the generated song).

Features

Lightweight: The model runs locally with less than 4GB of VRAM.
Fast: a 2-minute song can be generated in about a minute on a low-end GPU (<= 8GB VRAM).
Uncensored: you can prompt any lyrics you want.
Multiple languages: over 50 languages are supported officially.

Models

SFT 1.7B AIO checkpoint

I've merged the following models into a single checkpoint file:

acestep-v15-sft.safetensors: the SFT model, which sounds better than the Turbo version, while requiring only a few more steps.
qwen_0.6b_ace15.safetensors: mandatory text encoder.
qwen_1.7b_ace15.safetensors: additional 1.7B text encoder, which is many times faster than the 4B alternative while being almost as good (in my own experiments).
ace_1.5_vae.safetensors: the VAE.

Simply download it to your ComfyUI/models/checkpoints folder.
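Since the checkpoint is a multi-gigabyte download, it can be worth confirming the file arrived intact. The sketch below compares the file against the SHA256 listed for aceStep15Music_sft17BAIO.safetensors in the Files section of this page. The install path and the helper names (sha256_of, matches_expected) are assumptions for illustration, not part of ComfyUI.

```python
import hashlib
from pathlib import Path

# SHA256 listed for aceStep15Music_sft17BAIO.safetensors in the Files section.
EXPECTED_SHA256 = "2c580648c93228cb0a655c7fe15739aa9a90fa3e4c49aac2a322e6d643df7d1c"


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so a large checkpoint never has to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Return True when the file's SHA256 equals the expected hex digest."""
    return sha256_of(path) == expected.lower()


if __name__ == "__main__":
    # Assumed install location; adjust to wherever your ComfyUI lives.
    checkpoint = Path("ComfyUI/models/checkpoints/aceStep15Music_sft17BAIO.safetensors")
    if checkpoint.exists():
        print("checksum OK" if matches_expected(checkpoint) else "checksum MISMATCH")
    else:
        print(f"not found: {checkpoint}")
```

A mismatch usually means an interrupted or corrupted download, so fetching the file again is the first thing to try.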

Tips

Prompts

The model accepts two prompts, both optional: caption/tags and lyrics/structure.

Caption/tags

Guide the model towards the genre and instruments you want in your music, as well as vocals, mood, era, mixing style, etc.
- Genres: jazz, rap, rock, metal, hip hop, bossa nova, electronic, synthwave, blues, reggae, etc. A potential list of all 178k genres used during training can be found here.
- Instruments: acoustic guitar, piano, drums, bass, synths, electric guitar, violin, etc.
- Vocals: raspy male vocal, young female vocal, duet harmonies, whispering child, etc.
- See more tags in the official tutorial.
You can write either comma-separated tags or natural language.
Multiple genres may be provided, but conflicting ones will likely harm the quality.
Even small changes will impact the result substantially.
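To keep captions consistent while experimenting, the comma-separated form can be assembled from separate tag groups. This is only a sketch: build_caption is a hypothetical helper, and the "mellow" and "warm mix" tags are illustrative mood/mixing values rather than tags confirmed by the official tutorial.

```python
def build_caption(genres, instruments=(), vocals=(), extras=()):
    """Join tag groups into one comma-separated caption, dropping blanks and duplicates."""
    seen = set()
    tags = []
    for tag in (*genres, *instruments, *vocals, *extras):
        cleaned = tag.strip()
        key = cleaned.lower()
        if cleaned and key not in seen:
            seen.add(key)
            tags.append(cleaned)
    return ", ".join(tags)


caption = build_caption(
    genres=["bossa nova"],
    instruments=["acoustic guitar", "piano"],
    vocals=["young female vocal"],
    extras=["mellow", "warm mix"],  # illustrative mood / mixing tags
)
print(caption)
# bossa nova, acoustic guitar, piano, young female vocal, mellow, warm mix
```

Because small caption changes shift the result a lot, changing one group at a time makes it easier to tell which tag caused a difference.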

Lyrics/structure

The model will sing better if each line contains between 6 and 10 syllables.
It's recommended to provide structure tags to organise your lyrics. Examples:
- [Intro], [Verse], [Chorus], [Bridge], [Instrumental], [End], etc.
Add an empty line between structure blocks.
Inside a structure tag, you may add other hints. Examples:
- [Intro - Dreamy], [Chorus - Layered vocals], [Instrumental - Guitar solo], [Bridge - Whispering], etc.
Some singing techniques/effects are recognised. Examples:
- ACE-Step is heeere: hold the note for longer.
- For your mind (your mind): backing vocals.
- Stand up and SHOUT: sing with more power.
- [pt] Obrigado, amigo: switches to a different language
- [whispering] Don't be afraid: attempt to add said effect.
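The layout rules above (one structure tag per block, an empty line between blocks, roughly 6 to 10 syllables per line) can be sketched as a small formatter. All function names here are hypothetical, and the syllable count is a crude vowel-run heuristic for English only, so treat its output as a hint rather than a measurement.

```python
import re


def format_lyrics(sections):
    """Render (tag, lines) pairs as tagged blocks separated by one empty line."""
    blocks = []
    for tag, lines in sections:
        blocks.append("\n".join([f"[{tag}]", *lines]))
    return "\n\n".join(blocks)


def estimate_syllables(line):
    """Very rough English-only estimate: count runs of consecutive vowels."""
    return len(re.findall(r"[aeiouy]+", line.lower()))


def lines_outside_range(sections, low=6, high=10):
    """Return lyric lines whose estimated syllable count falls outside low..high."""
    return [
        line
        for _, lines in sections
        for line in lines
        if not low <= estimate_syllables(line) <= high
    ]


song = [
    ("Intro - Dreamy", []),
    ("Verse", ["Don't be afraid of the dark tonight", "Stand up and SHOUT"]),
    ("Chorus - Layered vocals", ["For your mind (your mind)"]),
]
print(format_lyrics(song))
print(lines_outside_range(song))
```

In this example the two short lines are flagged, which suggests extending them or merging them with a neighbouring line before generating.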

Metadata

All metadata fields are hints that steer musical attributes, but the prompts may override them.

The most relevant are listed below:

bpm: beats per minute, determines the tempo. Common distribution: slow songs 60–80, mid-tempo 90–120, fast songs 130–190.
duration: target duration in seconds. The model officially supports 10s-600s. Short songs (30–60s) and medium-length songs (2–4min) are stable; very long generations may have repetition or structure issues.
timesignature: 4/4 (most common), 3/4 (waltz), 6/8 (swing feel).
language: choose one of the many supported languages for the lyrics.
keyscale: affects overall pitch and emotional color. Usually minor = darker mood; major = brighter mood.
generate_audio_codes: when enabled (recommended), spends much more time on the text encoder conditioning to improve song quality substantially.
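The ranges quoted above can be turned into a quick sanity check before queuing a generation. This is a hypothetical container, not a ComfyUI class: the field names follow the list above, while the default values, the "en" and "C major" value formats, and the 240-second threshold for "long" are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class SongMetadata:
    """Hypothetical container mirroring the metadata fields described above."""
    bpm: int = 120
    duration: int = 120            # seconds
    timesignature: str = "4/4"
    language: str = "en"           # assumed value format
    keyscale: str = "C major"      # assumed value format
    generate_audio_codes: bool = True

    def warnings(self):
        """Return notes for values outside the ranges quoted in the text."""
        notes = []
        if not 10 <= self.duration <= 600:
            notes.append("duration is outside the officially supported 10-600 s")
        elif self.duration > 240:
            # Assumption: anything past the 2-4 min "medium" range counts as long.
            notes.append("long durations may show repetition or structure issues")
        if not 60 <= self.bpm <= 190:
            notes.append("bpm is outside the common 60-190 range")
        return notes


print(SongMetadata(bpm=72, duration=45, keyscale="A minor").warnings())  # []
print(SongMetadata(duration=500).warnings())
```

Since the prompts can override metadata, a clean check here only means the values are in range, not that the model will follow them.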

Sampling

Most of the music (melody, harmony, cadence, etc) comes from the conditioning, so tweak sampling parameters to explore variations.

steps: the SFT model requires at least 20, but I recommend 30-50 for good results. Fixing failed parts sometimes takes 50-100+ steps.
cfg: 1.0 is good enough, even for the SFT model. Increasing it to 2.0 seems to improve vocals while reducing presence of instruments. Over 2.0 starts to harm the output (do your own experiments though).
sampler: my favourites: sa_solver_pece, heun, dpmpp_sde, uni_pc_bh2, euler, etc.
scheduler: my favourites: beta, simple, kl_optimal, etc.
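The recommendations above can be captured as a small checker for a settings dictionary. The thresholds come from the text (at least 20 steps, 30-50 recommended, cfg above 2.0 risky); check_sampling and the dictionary layout are hypothetical and do not correspond to any ComfyUI API.

```python
def check_sampling(steps, cfg):
    """Return notes based on the SFT recommendations quoted above."""
    notes = []
    if steps < 20:
        notes.append("steps below 20: the SFT model needs at least 20")
    elif steps < 30:
        notes.append("steps below the recommended 30-50 range")
    if cfg > 2.0:
        notes.append("cfg above 2.0 may start to harm the output")
    return notes


# Illustrative starting point using sampler and scheduler names from the text.
settings = {"steps": 40, "cfg": 1.0, "sampler": "euler", "scheduler": "simple"}
print(check_sampling(settings["steps"], settings["cfg"]))  # []
```

Because most of the music comes from the conditioning, keeping the prompts fixed and varying only these values is a cheap way to explore variations of the same song.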

Description

Merge of the following ACE-Step 1.5 models into a single checkpoint file:

acestep-v15-sft.safetensors: the SFT model.
qwen_0.6b_ace15.safetensors: mandatory text encoder.
qwen_1.7b_ace15.safetensors: additional 1.7B text encoder for better audio codes.
ace_1.5_vae.safetensors: the VAE.

Comments (10)

lanceshocker

Feb 27, 2026

CivitAI

This... is a little interesting. Is this only music generation audio or is it like video with audio?

Is there a workflow for this?

SimplesmenteIA

Author

Feb 27, 2026

ACE-Step 1.5 is for music generation only, but the demo videos are just a text2music workflow that I join with a static image to create the video. (The workflow is embedded in the demo videos, just drop it into ComfyUI to import it)

Bottomless_Master

Feb 28, 2026 · 2 reactions

CivitAI

Very good! I follow your YouTube channel and it has nothing but excellent content. I'm waiting for a video from you on how to create a LoRA.

Thanks a lot, bro!

SimplesmenteIA

Author

Feb 28, 2026 · 1 reaction

Thanks for the support! The LoRA video will still come out this year, haha. Cheers!

fromshahnahnahb957

Mar 2, 2026 · 2 reactions

CivitAI

Wow. The quality is incredible. Goodbye music industry...lol Thank you! This is amazing to have.

SimplesmenteIA

Author

Mar 3, 2026

Thank the folks from ACE Studio and Step Fun, who implemented the model, I'm just the messenger ;)

burnera679889

Apr 10, 2026

CivitAI

I tried loading the workflows into SwarmUI's Comfy setup but it couldn't read them. Is it possible to have a JSON of one of them? I'm trying to figure out how to get audio-to-audio working with SFT and haven't found a good workflow.

SimplesmenteIA

Author

Apr 20, 2026

After battling with Civitai, I finally managed to upload the workflow under "Training Data".

Tomary68

Apr 16, 2026

CivitAI

Does it also work on AMD GPUs? I tried the workflow from the reggae image; it runs correctly, but the output MP3 has no sound.

SimplesmenteIA

Author

Apr 20, 2026

Good question. I'm afraid I don't have an AMD GPU to test it. Perhaps replace the "Save Audio (MP3)" with another node, like "Preview Audio" or even "Save Audio (FLAC)".

Checkpoint

ACE Audio

by SimplesmenteIA


Details

Downloads: 866
Platform: CivitAI
Platform Status: Available
Created: 2/27/2026
Updated: 6/11/2026

Deleted

Files

aceStep15Music_sft17BAIO_trainingData.zip

Size: 4.21 KB
SHA256: e1275fa1373d28294b23bba34cdd98d41f90db46400e1fc83b647be017d075d6
Mirrors: CivitAI (2 mirrors)

aceStep15Music_sft17BAIO.safetensors

Size: 9.34 GB
SHA256: 2c580648c93228cb0a655c7fe15739aa9a90fa3e4c49aac2a322e6d643df7d1c
Mirrors: HuggingFace (1 mirror, acestep_v1.5_sft_lm_1.7b_aio.safetensors), CivitAI (1 mirror, aceStep15Music_sft17BAIO.safetensors)
