CivArchive
    Ace-Step For Your Ear Holes - v2 Ace-Step 1.5 Turbo

    Recent observations:

    • The model has an annoying propensity for modal madness: it drifts between keys and modes. If you're a normal person you may not notice this; if you're a musician it will stick out like a sore thumb. It's odd, because the new encoder has a key widget, and you'd think that would keep the model locked to a key. But it still wanders like crazy, even with euler, so I'm experimenting with samplers and uploading some test results for inpainting. Results are really variable; it's the usual infinite permutation problem. Pay attention to the aura flow shift value and play with samplers, since a lot depends on what you prefer. If you can't tell when the key changes every five seconds, go wild. I'm jealous. But it's not that bad; you just have to dial it in.

    • It's now much beefier than it used to be. You may get an OOM (out-of-memory) message on decode, but it handles this with aplomb: it just lets you know and switches to tiled decode by itself. Isn't that delightful? There are also dedicated audio tiled-decode nodes you can add.
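
    The fallback described above follows a common pattern: try the full decode, and if memory runs out, warn and retry in tiles. Here is a minimal sketch of that pattern. It is not ComfyUI's actual code; `decode_full`, `decode_tiled`, `budget`, and `tile_size` are hypothetical names made up for illustration, and a plain list stands in for the latent.

```python
def decode_full(latent, budget):
    """Hypothetical decoder that fails when the latent exceeds a memory budget."""
    if len(latent) > budget:
        raise MemoryError("out of memory during decode")
    return [x * 2.0 for x in latent]  # stand-in for real VAE decoding


def decode_tiled(latent, budget, tile_size):
    """Decode fixed-size tiles one at a time and join the results."""
    out = []
    for start in range(0, len(latent), tile_size):
        out.extend(decode_full(latent[start:start + tile_size], budget))
    return out


def decode_with_fallback(latent, budget, tile_size=4):
    """Try a full decode first; on OOM, warn and switch to tiled decode."""
    try:
        return decode_full(latent, budget), "full"
    except MemoryError as err:
        print(f"Warning: {err}; retrying with tiled decode")
        return decode_tiled(latent, budget, tile_size), "tiled"
```

    Real tiled decoders usually overlap neighbouring tiles and blend the seams, which this sketch leaves out.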


    So the new model is here, 1.5 Turbo. It's pretty great. It has its idiosyncrasies, like the previous version; they're just different. The very latest ComfyUI update has native Ace-Step nodes, so all you really need to do is update ComfyUI. I've got a bunch of custom nodes in the new workflow, but none of them are required for the actual encoding/sampling. They're all logic. I could not switch the latents in the old version, but I can with these nodes, so there is a switch for toggling between a new empty latent and inpainting (remixing). There is also a switch if you want to compare models. The AIO (all-in-one) is a checkpoint, and goes just where you think it does. The other way to do it is with the diffusion model, two text encoders (dual-CLIP loader) and a VAE. If you just want one method, delete the loaders you don't need and the switch.

    The new encoding node has many more features. Since it now has a BPM widget, figuring out the appropriate word count can be tricky, so I automated all of that annoying stuff. Just enter your duration and your BPM. After running, you will get calculations (based on your lyrics and the widget values you set) that tell you the total number of bars in the song, the optimal word count, your actual word count, and the difference between the two. This can help you dial it in. A lot of the artifacts people get come from a mismatch between their duration and their prompt. It's not the model, it's you. You suck, stop it.

    I included a slider in the logic section so the calculated values can account for vocal speed: a really fast rapper, for example, versus a melismatic diva.

    A note on the time signature: because the time signature input on the encoder is a combo, I had difficulty connecting it to the logic (and it most definitely needs to be connected), so there is a separate INT node for the logic. If you change the time signature on the encoder, make sure to change it on the INT node too. It's right next to it. This will ensure that your calculations are correct.
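
    For anyone curious what that kind of calculation looks like, here is a rough sketch in Python. It is not the workflow's actual node logic. The function name, the `words_per_beat` stand-in for the vocal speed slider, and the rule that bracketed section tags don't count as sung words are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class LyricBudget:
    total_bars: float
    optimal_words: int
    actual_words: int
    difference: int  # actual minus optimal; positive means too many words


def lyric_budget(lyrics: str, duration_s: float, bpm: float,
                 beats_per_bar: int = 4, words_per_beat: float = 1.0) -> LyricBudget:
    """Estimate how many words fit a song of the given length and tempo.

    words_per_beat is an assumed stand-in for the vocal speed slider:
    higher for a fast rapper, lower for a melismatic singer.
    """
    total_beats = duration_s * bpm / 60.0
    total_bars = total_beats / beats_per_bar
    optimal = round(total_beats * words_per_beat)
    # Assumption: bracketed section tags such as [verse] are not sung.
    sung_text = re.sub(r"\[[^\]]*\]", " ", lyrics)
    actual = len(sung_text.split())
    return LyricBudget(total_bars, optimal, actual, actual - optimal)
```

    With these assumptions, a 120-second song at 120 BPM in 4/4 has 240 beats, or 60 bars. Changing `beats_per_bar` changes the bar count, which is why the INT node has to match the encoder's time signature.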

    If you don't have any experience with TTS and have no idea what temperature, top-p and top-k values are, don't mess with them too much. They're a bit like CFG: they set ranges and limits on how much freedom the model has. Your best bet is to leave them alone, CFG too, and focus mainly on the prompting. The model is now better at natural-language prompting in the style section. Don't go by my default text; that's just carried over from the previous workflow. The style can now be written out descriptively instead of being parsed into simple tokens.
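
    If you want a feel for what temperature, top-p and top-k do in general, here is a textbook-style sketch of how they reshape a token distribution before sampling. It is not taken from Ace-Step or ComfyUI; the function names are made up, and real implementations may apply the steps in a different order.

```python
import math
import random


def filter_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_index: probability} after temperature, top-k and top-p."""
    # Temperature rescales the logits: below 1 sharpens, above 1 flattens.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    ranked = sorted(((w / total, i) for i, w in enumerate(weights)), reverse=True)

    # Top-k keeps only the k most likely tokens (0 means no limit).
    if top_k > 0:
        ranked = ranked[:top_k]

    # Top-p keeps the smallest prefix whose cumulative probability reaches p.
    kept, cumulative = [], 0.0
    for prob, index in ranked:
        kept.append((prob, index))
        cumulative += prob
        if cumulative >= top_p:
            break

    # Renormalise so the surviving probabilities sum to 1.
    norm = sum(prob for prob, _ in kept)
    return {index: prob / norm for prob, index in kept}


def sample(logits, rng=random, **kwargs):
    """Draw one token index from the filtered distribution."""
    dist = filter_distribution(logits, **kwargs)
    indices = list(dist)
    return rng.choices(indices, weights=[dist[i] for i in indices], k=1)[0]
```

    Lower temperature, smaller top-k and smaller top-p all narrow the set of likely choices, which is why they feel a bit like turning down the model's freedom.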

    Euler works pretty well as a base. Others work as well; dpmpp_2m gives interesting results. Just be warned that dpmpp_2m has no idea what a key is. Your output will be a multimodal wet dream. It's pretty cool if that effect is what you want.


    Original v1 description:

    This is pretty much Ace-Step right out of the box. It's there in the templates. I'm posting this to remind you that it exists. Silent movies are boring. And none of you even have the decency to include intertitles. Admittedly, they'd have to be mostly onomatopoeia, but you could try. Here is a friendly reminder that you can make awful music just as easily as you make awful pictures. Easier, actually. Way, way easier. Just use it. You don't even need to download this workflow. It's right there. End of PSA.

    Description

    Big changes. All new. Both options included: checkpoint and separate loaders, switched. Added lyrics logic to help you optimize word count based on duration, BPM, style. Automated inpaint switching. Fancy.

    Workflows
    Other

    Details

    Downloads: 209
    Platform: CivitAI
    Platform Status: Available
    Created: 2/8/2026
    Updated: 4/27/2026
    Deleted: -

    Files

    aceStepForYourEar_v2AceStep15Turbo.zip