Base workflow for audio + image to video (A+I2V) with the Dev model, tuned to use as little VRAM as possible.
It can also generate text to video with an audio reference (switch the red boolean node to TRUE).
I suggest leaving the prompt alone unless you want to prompt for a specific motion or action.
prompt:
" Transform this static image into a high-quality video with realistic facial expressions and realistic motion.
Perfect lip-sync to the attached audio. "
FILES:
OPTIONAL: Kijai's fp8 scaled model (requires a load diffusion model node instead of the unet loader node, and replaces the gguf entirely)
https://huggingface.co/Kijai/LTX2.3_comfy/tree/main/diffusion_models
DEV gguf (distilled ggufs are in the repo as well)
https://huggingface.co/unsloth/LTX-2.3-GGUF/tree/main
Gemma 3_12B FP4 text encoder
Audio VAE
https://huggingface.co/Kijai/LTX2.3_comfy/blob/main/vae/LTX23_audio_vae_bf16.safetensors
Video VAE
https://huggingface.co/Kijai/LTX2.3_comfy/blob/main/vae/LTX23_video_vae_bf16.safetensors
Text Projection text encoder
https://huggingface.co/Kijai/LTX2.3_comfy/tree/main/text_encoders
Distill Lora
https://huggingface.co/Lightricks/LTX-2.3/blob/main/ltx-2.3-22b-distilled-lora-384.safetensors
Upscaler
https://huggingface.co/Lightricks/LTX-2.3/blob/main/ltx-2.3-spatial-upscaler-x2-1.1.safetensors
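The single-file links above point at Hugging Face "blob" pages, which open a web page rather than the file itself. Swapping `blob` for `resolve` in the path gives a direct download link. The sketch below does that for the four single-file links and pairs each with a destination path. The folder names in `ASSUMED_FOLDERS` are assumptions about a typical ComfyUI layout and are not stated by the workflow author, so adjust them to your install. The `tree` links (GGUF, fp8, text encoders) are folders where you still pick a file by hand, so they are left out.

```python
from urllib.parse import urlparse

# ASSUMPTION: typical ComfyUI subfolders; not confirmed by the workflow author.
ASSUMED_FOLDERS = {
    "audio_vae": "models/vae",
    "video_vae": "models/vae",
    "distill_lora": "models/loras",
    "upscaler": "models/latent_upscale_models",
}

# Single-file links copied from the FILES list above.
BLOB_LINKS = {
    "audio_vae": "https://huggingface.co/Kijai/LTX2.3_comfy/blob/main/vae/LTX23_audio_vae_bf16.safetensors",
    "video_vae": "https://huggingface.co/Kijai/LTX2.3_comfy/blob/main/vae/LTX23_video_vae_bf16.safetensors",
    "distill_lora": "https://huggingface.co/Lightricks/LTX-2.3/blob/main/ltx-2.3-22b-distilled-lora-384.safetensors",
    "upscaler": "https://huggingface.co/Lightricks/LTX-2.3/blob/main/ltx-2.3-spatial-upscaler-x2-1.1.safetensors",
}


def to_direct_url(blob_url):
    """Turn a Hugging Face 'blob' page link into a direct 'resolve' link."""
    parsed = urlparse(blob_url)
    parts = parsed.path.strip("/").split("/")
    # Expected path shape: <owner>/<repo>/blob/<revision>/<file path...>
    if len(parts) < 5 or parts[2] != "blob":
        raise ValueError(f"not a single-file blob link: {blob_url}")
    parts[2] = "resolve"
    return f"{parsed.scheme}://{parsed.netloc}/" + "/".join(parts)


def plan_downloads(comfy_root="ComfyUI"):
    """Return (direct_url, destination_path) pairs without downloading anything."""
    plan = []
    for key, url in BLOB_LINKS.items():
        filename = url.rsplit("/", 1)[-1]
        destination = f"{comfy_root}/{ASSUMED_FOLDERS[key]}/{filename}"
        plan.append((to_direct_url(url), destination))
    return plan
```

To see the plan, loop over `plan_downloads("path/to/ComfyUI")` and print each pair. Each direct URL can then be fetched with a browser, a download manager, or `urllib.request.urlretrieve(url, destination)`. The files are large, so check free disk space first.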
Description
A+I2V
Comments (6)
Very cool wf, had to modify it so it would take my tensor file and added more lora nodes, but other than that, quite simple and clear to work with, thanks!
And what about first + last frame?
I don't know. It's not working right. See the posted video. It should have the workflow in it. The only difference is that I used Q8_0 gguf and gemma_3_12b_it text encoder. Oh, and I used resolution 720x1024. Everything else is the same as in the sample workflow.
Funny part is that I tried the sample image (the guy in a baseball hat) and sound clip and it worked. Was using the same Q8_0 gguf and gemma_3_12b_it text encoder and changed the resolution to 768x768. But my own audio and images do not work even when using the same lowered resolution. What gives?
IDK, I am also having huge problems, even with the official workflow. It either throws errors, or it speaks an alien language, or everything looks bloomed, or there are subtitles everywhere, or the movement is generally bad, and i2v is a joke.
@chrisbraeuer41172035 Well, with straight i2v, I managed to get some decent clips with various workflows, including the default ComfyUI one. Lots of duds, but some clips are pretty decent. But I've tried several ai2v workflows and none of them works halfway decently.
@creatorjulie743 I am really not sure. I also got some decent clips. A woman in portrait mode speaking works great, and so do speaking portraits in general. But as soon as I try to do something different, it falls off a cliff. It's driving me nuts. Just trying to have someone walk up some stairs is not possible at all.