Ace-Step For Your Ear Holes - CivArchive (CivitAI Archive)

Ace-Step For Your Ear Holes - v2 Ace-Step 1.5 Turbo

Recent observations:

The model has an annoying propensity for modal madness. If you're a normal person you may not notice it, but if you're a musician it will stick out like a sore thumb. It's odd, because the new encoder has a key widget, and you'd think that would keep it locked to a key. But it still wanders like crazy, even with euler. ~~So I'm doing experiments with samplers.~~ I'm uploading some test results for inpainting. Results are really variable, the usual infinite-permutation problem. Pay attention to the AuraFlow shift values and play with samplers; a lot depends on what you prefer. If you can't tell when the key changes every five seconds, go wild. I'm jealous. But it's not that bad, you just have to dial it in.
It's now much beefier than it used to be, so you may get an OOM message on decode, but it handles this with aplomb: it just lets you know and switches to tiled decode by itself. Isn't that delightful? There are also dedicated tiled audio decode nodes you can add.
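
For anyone curious what "falls back to tiled decode" looks like as a pattern, here is a toy sketch. It is not ComfyUI's code: `decode_full`, `decode_tiled`, the `budget` limit, and the use of `MemoryError` are all hypothetical stand-ins, and a real tiled decoder would typically overlap and blend tiles rather than just concatenating them.

```python
def decode_full(latent, budget=8):
    """Hypothetical one-shot decoder that 'runs out of memory' on long inputs."""
    if len(latent) > budget:
        raise MemoryError("toy OOM: latent too long for one pass")
    return [x * 2 for x in latent]  # stand-in for the real decode math


def decode_tiled(latent, tile=4):
    """Decode in fixed-size chunks so each pass stays under the budget."""
    out = []
    for start in range(0, len(latent), tile):
        out.extend(decode_full(latent[start:start + tile]))
    return out


def decode(latent):
    """Try the full decode first, then fall back to tiles on OOM."""
    try:
        return decode_full(latent)
    except MemoryError:
        print("OOM on full decode, retrying with tiled decode")
        return decode_tiled(latent)


result = decode(list(range(10)))
```

The point is only that the caller gets the same output either way; the fallback costs time, not correctness.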

So the new model is here, 1.5 Turbo. It's pretty great. It has its idiosyncrasies, like the previous version; they're just different ones. The very latest ComfyUI update has native Ace-Step nodes, so all you really need to do is update Comfy. I've got a bunch of custom nodes in the new workflow, but none of them are required for the actual encoding/sampling. They're all logic. I could not switch the latents in the old version, but I can with these, so there is a switch for toggling between a new empty latent and inpainting (remixing). There is also a switch if you want to compare models. The AIO is a checkpoint and goes just where you think it does. The other way to do it is with the diffusion model, two text encoders (dual CLIP loader), and a VAE. If you just want one method, delete the loaders you don't need and the switch.

The new encoding node has many more features. Since it now has a BPM widget, figuring out the appropriate word count can be tricky, so I automated all of that annoying stuff. Just enter your duration and your BPM, and after running you will get calculations (based on your lyrics and the widget values you set) that tell you the total number of bars in the song, the optimal word count, your actual word count, and the difference between the two. This can help you dial it in. A lot of the artifacts people get come from a mismatch between their duration and their prompt. It's not the model, it's you. You suck, stop it.

I included a slider in the logic section to adjust the vocal speed as part of the calculated values. For example, a really fast rapper vs a melismatic diva.
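
The arithmetic behind those numbers can be sketched roughly like this. It is a guess at the general shape, not the workflow's actual node logic: `words_per_beat`, `vocal_speed`, and the choice to skip bracketed section tags such as `[verse]` when counting words are all assumptions made for illustration.

```python
import re


def lyric_budget(lyrics, duration_s, bpm, beats_per_bar=4,
                 words_per_beat=0.5, vocal_speed=1.0):
    """Estimate bars and word counts for a song of the given length.

    words_per_beat and vocal_speed are hypothetical tuning values:
    raise vocal_speed for a fast rapper, lower it for a melismatic diva.
    """
    total_beats = duration_s * bpm / 60.0
    total_bars = total_beats / beats_per_bar
    optimal_words = round(total_beats * words_per_beat * vocal_speed)

    # Assumption: tags like [verse] or [chorus] are structure, not sung words.
    sung_text = re.sub(r"\[[^\]]*\]", " ", lyrics)
    actual_words = len(sung_text.split())

    return {
        "bars": total_bars,
        "optimal_words": optimal_words,
        "actual_words": actual_words,
        "difference": actual_words - optimal_words,
    }


report = lyric_budget("[verse]\none two three four", duration_s=60, bpm=120)
print(report)
```

A large negative difference means the lyrics are too short for the duration, and a large positive one means the model has to cram. Either way, that is the mismatch that tends to show up as artifacts.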

A note on the time signature - because the input to the signature is a combo, I had difficulty connecting it to the logic (and it most definitely needs to be) so there is a separate INT node for the logic. If you change the time signature on the encoder, make sure to change it on the INT node. It's right next to it. This will ensure that your calculations are correct.
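
To see why the two values have to match, here is a small sketch. The combo's actual option strings aren't reproduced here, so the parser accepts either a bare number like "3" or a form like "3/4"; `beats_per_bar`, `total_bars`, and `in_sync` are hypothetical helper names, not nodes in the workflow.

```python
def beats_per_bar(signature: str) -> int:
    """Turn a combo-style string such as "3" or "3/4" into a beat count."""
    return int(signature.strip().split("/")[0])


def total_bars(duration_s: float, bpm: float, beats: int) -> float:
    """Number of bars in a song of the given duration and tempo."""
    return duration_s * bpm / 60.0 / beats


def in_sync(encoder_signature: str, logic_int: int) -> bool:
    """True when the INT node agrees with the encoder's time signature."""
    return beats_per_bar(encoder_signature) == logic_int


# Same two-minute song at 120 BPM, counted in 4 and then in 3.
print(total_bars(120, 120, beats_per_bar("4")))
print(total_bars(120, 120, beats_per_bar("3")))
print(in_sync("3", 4))
```

If the encoder is set to 3 while the INT node still says 4, the bar count (and everything derived from it) is computed for a different song than the one being generated.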

If you don't have any experience with TTS and have no idea what temperature, top-p, and top-k values are, don't mess with them too much. They're a bit like CFG: they set ranges and limits on how much freedom the model has. Your best bet is to leave them alone, CFG too, and focus mainly on the prompting. The model is now better at natural-language prompting in the style section. Don't go by my default text; that's just carried over from the previous workflow. The style can now be written out descriptively instead of being parsed into simple tokens.
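
If you do want a feel for what those three knobs do, here is a generic sketch of how temperature, top-k, and top-p are commonly applied to a set of token scores. It is the textbook version on a toy four-token example, not Ace-Step's implementation, and `filter_probs` is a made-up name.

```python
import math


def filter_probs(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return sampling probabilities after temperature, top-k, and top-p."""
    # Temperature: below 1 sharpens the distribution, above 1 flattens it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank candidates from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Top-k: keep only the k most likely candidates (0 means no limit).
    if top_k > 0:
        order = order[:top_k]

    # Top-p: keep the smallest leading set whose total mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize so the surviving candidates sum to 1.
    mass = sum(probs[i] for i in kept)
    out = [0.0] * len(probs)
    for i in kept:
        out[i] = probs[i] / mass
    return out


logits = [2.0, 1.0, 0.0, -1.0]
print(filter_probs(logits))                   # all four candidates survive
print(filter_probs(logits, top_k=2))          # only the top two survive
print(filter_probs(logits, temperature=0.1))  # nearly all mass on the first
```

All three narrow or widen the pool the model picks from, which is why cranking them tends to change how adventurous the output is rather than steer it anywhere specific.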

Euler works pretty well as a base. Others work as well; dpmpp_2m gives interesting results. Just be warned that it has no idea what a key is. Your output will be a multimodal wet dream. It's pretty cool if that effect is what you want.

Original v1 description:

This is pretty much Ace-Step right out of the box. It's there in the templates. I'm posting this to remind you that it exists. Silent movies are boring. And none of you even have the decency to include intertitles. Admittedly, they'd have to be mostly onomatopoeia, but you could try. Here is a friendly reminder that you can make awful music just as easily as you make awful pictures. Easier, actually. Way, way easier. Just use it. You don't even need to download this workflow. It's right there. End of PSA.

Description

Big changes. All new. Both options included- checkpoint and separate loaders, switched. Added lyrics logic to help you optimize word count based on duration, BPM, style. Automated inpaint switching. Fancy.

Comments (3)

Partisano

Feb 9, 2026

This model might actually unlock a whole new level of possibilities thanks to LoRA training.. something virtually no one is paying attention to at the moment

Ponder_Stibbons

Author

Feb 9, 2026

Oh crap. I totally forgot to add a LoRA loader. I was so pleased that the new version doesn't bug out when you switch latents like the old one does, and so caught up with the other new features, that I wanted to update the wf right away. Guess I was too hasty. Not that it takes a rocket surgeon to add a loader, but it should be in there. I completely agree with you. Actually, it might turn out to need the help in that respect. Whereas consistent quality was the main struggle before, now it seems to be variation that needs help. It's nice that there is now much more TTS-like tuning, but it seems like direct tuning of the temp, p, and k values isn't that effective in directing the model.

I do definitely need to fire up gradio and do some training. God knows nobody else is going to... there are what, two or three on this site for the old model? Not much.

It sucks that music models are so underrepresented here. I've got a billion awesome generations and no good way to share them without just posting a still-image video, which is boring, or a compilation, which takes time to assemble. And Ace-Step isn't listed as a resource anywhere. I really only posted the original wf because there was nothing to cite; v1 is pretty much the template, as-is.

Partisano

Feb 10, 2026

@Ponder_Stibbons So I've been trying to train LoRA through Pinokio + Ace - would happily drop results here, but training is broken at the moment. Dataset setup completes, preprocessed tensors done and.. it stops.

Think about training instrument-specific LoRAs and actually being able to inject that style control into your regular pipeline. Would change the game. 100% sure it's being done already on sooo many levels.. somewhere.. lol.

Workflows

Other

by Ponder_Stibbons