    LTX 2.3 Audio + Image to Video For Semi-Creative Slop - v1.1
    NSFW

    v1 is set up for audio + image inputs

    v1.1 is for just audio input (but you can toggle back to i2v easily)

    One step Audio+Image to Video for LTX-2.3. Based on this great starting point from PixelMuseAI. I've made modifications to suit my particular setup and cleaned it up and locked it down for consistent output. Took me a week to realize that I don't really need the upscale stage with LTX; I'm getting awesome quality with just one stage.

    So this is a single stage workflow. LTX models are pretty chunky, so I don't know how well this works for dinky GPUs. I run this on my Cray X-MP 4090 with 24GB of VRAM and 128GB of V-less RAM. With all of the default settings here (704x1080, 10 seconds @24fps) the total run time is about 2 minutes. Which is insane. If you have less horsepower you may need to do the gguffy stuff. This is what works for me. Just make sure you put in a flag to reserve some VRAM so you don't get OOM. Whatever that means. I reserve 1 GB. The margin is razor-thin, but enough to let me fly through without getting bogged down in swap purgatory.
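If you're wondering what that flag looks like: ComfyUI has a --reserve-vram launch argument that takes a value in gigabytes. Something like this (the 1.0 matches my setup; adjust for your card):

```shell
# keep 1 GB of VRAM free so the big LTX loads don't OOM mid-run
python main.py --reserve-vram 1.0
```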

    Anyway, the point of this workflow is to generate your slop such that it matches your input audio. If the slop is sexual, I'd recommend the heretic text encoder and an NSFW LoRA. You may get a huge list of CLIP errors in your terminal if you use an abliterated encoder, but you can ignore them; they should not cause any halting errors. I have put in some sliders to control the weights of the optional LoRA loader. They are very important for this process.

    v1.1 addresses direct T2V with no starting image

    If you want to do straight T2V with the input audio, toggle the bypass on the LTXVImgToVideoInplace node to ON. Put a dummy image into the loader; it can be anything, a tiny white square, whatever. Just set your desired size on the Resize Image v2 node. That's it. With the bypass on, the input image will be ignored and you will be full T2V. You can also play with the strength of the inplace node (bypass OFF, of course) if you want to use the input image as a rough guide. Just remember with T2V you must be much more specific and detailed with your prompt. If the output is too blurry, you can throw in the auto LTXV scheduler (the one with a step widget) to replace the manual sigmas on the sampler. Four clicks. No problem. You can do it.
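If you'd rather script the dummy image than hunt for one, here's a stdlib-only sketch that writes a tiny white square as a binary PPM (Pillow, which Comfy's image loader uses under the hood, reads PPM; convert to PNG once with any image tool if your loader is picky):

```python
# write an 8x8 all-white image as a binary PPM (P6):
# ASCII header "P6 <width> <height> <maxval>", then 3 bytes (R, G, B) per pixel
with open("dummy.ppm", "wb") as f:
    f.write(b"P6 8 8 255\n" + b"\xff" * (8 * 8 * 3))
```

With the inplace node bypassed, only the Resize Image v2 dimensions matter anyway, so the content of this file never touches your output.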

    There are plenty of notes in the WF, but mostly it's plug-and-play. The duration you enter into the audio loader node is automatically multiplied by your framerate, plus 1, to make it conform with LTX rules. You can go much longer, and much higher resolution, of course. LTX is just friggin awesome. WAN is feeling kind of dead. And MMAudio is obviated as well. Pair this up with TTS Audio Suite and/or Ace-Step and you're in business.
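The frame-count arithmetic, spelled out (the "frames must be 8k + 1" constraint is the commonly cited LTX rule, so treat that part as my assumption):

```python
def ltx_frame_count(duration_s: float, fps: int = 24) -> int:
    # duration * framerate + 1, as the audio loader node computes it;
    # at 24 fps this lands on the 8*k + 1 counts LTX expects
    return int(duration_s * fps) + 1

# default settings: 10 s @ 24 fps -> 241 frames, and 241 % 8 == 1
```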

    Final frame saved for non-conditioned continuation. LTX does a remarkable job even without the continuation guidance, if you're feeling lazy. But it is definitely better if you take the time to set it up.

    If you're a custom node whiner, go ahead and replace what you don't have with some generic crap. But nothing in here has crazy dependencies or is hard to install.

    *I have two video combines here because something is broken in my VLC install that I can't fix - anything that saves with audio has broken comfy metadata. So I only use VHS for the looping preview, and the create + save is for my output. Either one can be deleted if it is redundant for you.

    Model links in the WF. The usual LTX stuff. Dev + 304 distill at 0.7 works great for me. The schedule is manually set up for it, if you want to run full, substitute some sexy sigmas.

    I have found using heretic is necessary to force seriously naughty prompts through. With regular abliterated I tend to get slow motion, or slightly tamed-down adherence. Use fp8 if your main model is fp8.

    https://huggingface.co/DreamFast/gemma-3-12b-it-heretic/tree/main/comfyui

    My Ace-Step Workflow:

    https://civarchive.com/models/2351803/ace-step-for-your-ear-holes

    Description

    1.1 is set up for T2V, if you're into that sort of thing. I prefer a nice ultra-detailed image. But go for it, really.

    I left the image loader as-is, so it's easy to switch between I2V and T2V. Any image is fine - with the inplace node in bypass mode, the values in the Resize Image node are your empty latent size. So just throw in a dummy image of anything.

    LTXV scheduler replaces manual sigmas; it might be easier to tune your schedule for T2V with this. The manual sigma string is still there if you want to use it. Just remember that scheduling T2V is different from I2V. (Don't connect the latent input.)

    NAG node added, again to help prompt adherence for T2V.

    Take note of the decode settings. The default here is optimized for 1080p. Use the other one for 720. If you don't get your tile size right, awful things will happen to you.
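To get a feel for why the tile size matters, a hypothetical back-of-envelope helper (the function name and the 64 px overlap are illustrative, not the node's actual parameters): the smaller the tile relative to the frame, the more overlapping tiles the decoder stitches, and every extra seam is a chance for artifacts.

```python
from math import ceil

def tile_grid(width, height, tile, overlap=64):
    # how many decode tiles cover a frame, given tile size and overlap;
    # each tile advances by (tile - overlap) pixels
    stride = tile - overlap
    return ceil((width - overlap) / stride), ceil((height - overlap) / stride)

# at this workflow's default 704x1080, a 512 px tile needs a 2x3 grid
```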

    If you want to switch back to I2V just toggle the LTXVImgToVideoInplace bypass to OFF and adjust sigmas.


    Workflows
    LTXV 2.3

    Details

    Downloads
    61
    Platform
    CivitAI
    Platform Status
    Available
    Created
    3/31/2026
    Updated
    4/27/2026
    Deleted
    -

    Files

    ltx23AudioImageToVideo_v11.zip

    Mirrors

    Huggingface (1 mirror)