This workflow takes an image and an audio track as input and generates a video.
Important Notice
Update ComfyUI and KJ Nodes. A lot of the code has been updated in the last few days.
Include --reserve-vram 1 in your launch options (for example, python main.py --reserve-vram 1) to avoid out-of-memory (OOM) errors.
If you get no lipsync, make sure your audio track is in stereo format. Fix suggested by @thomasdimitri563.
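For the stereo tip, here is a minimal sketch of turning a mono track into stereo using only the Python standard library. It assumes the audio is an uncompressed PCM WAV file; the function name to_stereo and the file paths are hypothetical, and this is not part of the workflow itself. An audio editor or ffmpeg (-ac 2) would do the same job for other formats.

```python
import wave


def to_stereo(src_path: str, dst_path: str) -> bool:
    """Write a stereo copy of a PCM WAV file.

    Returns True if the source was mono and had to be converted,
    False if it was already stereo (the data is copied unchanged).
    """
    with wave.open(src_path, "rb") as src:
        channels = src.getnchannels()
        width = src.getsampwidth()  # bytes per sample
        rate = src.getframerate()
        frames = src.readframes(src.getnframes())

    if channels == 2:
        stereo = frames
    elif channels == 1:
        # Duplicate every mono sample into the left and right channels.
        stereo = b"".join(
            frames[i:i + width] * 2 for i in range(0, len(frames), width)
        )
    else:
        raise ValueError(f"unsupported channel count: {channels}")

    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(2)
        dst.setsampwidth(width)
        dst.setframerate(rate)
        dst.writeframes(stereo)

    return channels == 1
```

Calling to_stereo("voice_mono.wav", "voice_stereo.wav") and loading the second file into the workflow is the intended use; both file names are placeholders.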
Models to download (LTX2.3)
Place in models/diffusion_models
Place in models/loras
https://huggingface.co/Lightricks/LTX-2.3/blob/main/ltx-2.3-22b-distilled-lora-384.safetensors
Place in models/text_encoders
Place in models/vae
https://huggingface.co/Kijai/LTX2.3_comfy/blob/main/vae/LTX23_audio_vae_bf16.safetensors
https://huggingface.co/Kijai/LTX2.3_comfy/blob/main/vae/LTX23_video_vae_bf16.safetensors
Models to download (V3)
Place in models/diffusion_models
https://huggingface.co/Lightricks/LTX-2/resolve/main/ltx-2-19b-distilled-fp8.safetensors
Place in models/text_encoders
Place in models/loras
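A note on the links above: Hugging Face links containing /blob/ open the file's web page, while the same link with /resolve/ downloads the file itself. The sketch below, which is only an illustration, rewrites the links from the lists above and works out where each file should go. The ComfyUI root folder name and the helper names are assumptions; only links that actually appear in the lists are included.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Folder -> links, copied from the download lists above.
DOWNLOADS = {
    "models/loras": [
        "https://huggingface.co/Lightricks/LTX-2.3/blob/main/ltx-2.3-22b-distilled-lora-384.safetensors",
    ],
    "models/vae": [
        "https://huggingface.co/Kijai/LTX2.3_comfy/blob/main/vae/LTX23_audio_vae_bf16.safetensors",
        "https://huggingface.co/Kijai/LTX2.3_comfy/blob/main/vae/LTX23_video_vae_bf16.safetensors",
    ],
}


def direct_download_url(url: str) -> str:
    """Turn a Hugging Face /blob/ page link into a /resolve/ file link."""
    return url.replace("/blob/", "/resolve/", 1)


def filename_of(url: str) -> str:
    """Return the last path component of a URL."""
    return PurePosixPath(urlparse(url).path).name


def plan(downloads: dict, comfy_root: str = "ComfyUI") -> list:
    """List (direct_url, destination_path) pairs; comfy_root is an assumed install folder."""
    pairs = []
    for folder, urls in downloads.items():
        for url in urls:
            dest = PurePosixPath(comfy_root) / folder / filename_of(url)
            pairs.append((direct_download_url(url), str(dest)))
    return pairs


if __name__ == "__main__":
    for url, dest in plan(DOWNLOADS):
        print(f"{url}\n  -> {dest}")
```

The script only prints a plan and does not download anything, so it is safe to run before deciding how to fetch the files.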
Description
Initial Release
Comments (40)
black screen with music going on hmmm...another one that needs tinkering...
Update ComfyUI, KJ nodes and ComfyUI-gguf. Then try V2 of the workflow. I think the ComfyUI native loaders are more stable.
Strange, mine is the same: only music and a black screen.
@404710173603 you can try V3 of the workflow. I changed it to use the fp8 distilled models from LTX-2.
is this using the audio/video mask?
that node works, but it's heavy: it requires the vae to be constantly active while sampling. trying to find a way around that, since the extra 2.5gb of vae limits res or length for me. i need to buy a damn 5090, 2 4090's aren't enough anymore.
nevermind, noticed you mentioned update to kj, so updated mine, there is now a new audio/video mask without required vae injections.
Small tip:
If your RAM is not sufficient (like mine, 32 GB) and you have an SSD, enable virtual memory. Keep in mind that you’ll need to allocate (give up) some disk space for this. With this setup, I’m able to generate a 1280×720 video up to 20 seconds long.
hahahaha I am going to save you €200, bro.
If you use your SSD as a swap file and it's large - you are going to wreck that drive quite quickly if you generate a lot. Potentially, it's trashed in as little as a month.
I would run a health check on it. Also stop giving this advice please.
@Lady_Valeria In fact, a small amount is enough for this, there's no need to allocate 50-100GB of space, 4-8GB will be enough. If I produce 20-40 videos a day, I'll probably fill up the 4000TB data write-read lifespan in about 70-80 years, but thank you for considering me.
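The disagreement above comes down to arithmetic, so here is a rough sketch of it. The 4000 TB endurance figure and the 4-8 GB swap size are the commenter's own numbers, and the amount written per video is an assumption: real swap traffic can be many times the swap file size because pages are rewritten repeatedly. The function name is hypothetical.

```python
def years_of_endurance(rated_tb_written: float,
                       gb_written_per_video: float,
                       videos_per_day: float) -> float:
    """Years until the rated write endurance is used up at a steady daily load."""
    tb_per_day = gb_written_per_video * videos_per_day / 1000  # GB -> TB
    return rated_tb_written / tb_per_day / 365


if __name__ == "__main__":
    # Assumption: each video writes the whole swap file once.
    light = years_of_endurance(4000, gb_written_per_video=4, videos_per_day=20)
    heavy = years_of_endurance(4000, gb_written_per_video=8, videos_per_day=40)
    # Assumption: a large swap file that is heavily churned (100 GB per video).
    churn = years_of_endurance(4000, gb_written_per_video=100, videos_per_day=40)
    print(f"light: {light:.0f} years, heavy: {heavy:.0f} years, churn: {churn:.1f} years")
```

Under these assumptions the small-swap cases land at roughly 137 and 34 years, which brackets the 70-80 year estimate, while the heavy-churn case drops to under 3 years. Plugging in the rated endurance from your own drive's spec sheet gives a more useful answer than either comment.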
little confused by your workflow and description.
was looking at it, and why do you say to download the embeddings connector when you aren't using it? i actually looked at this because i still ain't figured out how to use gemma without loading the full model with it.
V1 of the workflow uses the embeddings connector with the dual clip loader. But I realised that the native loaders are more stable and changed to the native loaders in V2. Hence the embedding connector is no longer required. Let me change that on the description.
@PixelMuseAI ok, i've still never been able to get that embedding to work, not sure if it needs special node or what, but i can't get that to load at all.
@MrReclusive666 you use it with the dual clip loader.
see image on the model card: https://huggingface.co/Kijai/LTXV2_comfy
@PixelMuseAI yeah, i tried that, kept getting errors, probably cuz i'm not using the 12 billion parameter fp8 scaled gemma 3. i'm running 270m unsloth gemma 3, works fine normally, and it's only 400mb, not 12gb
also what's this MelBandRoformer_fp32? i never needed this file in any ltx workflows
@p_p i looked at that, seems to separate music and vocals, not sure its needed though, ltx2 seems good enough at it.
@PixelMuseAI tried it exactly as described, with the gemma 3 fp8.
ValueError: Missing weight for layer gemma3_12b.transformer.model.layers.0.self_attn.q_proj
shrug probably because my unet and clip aren't on same gpu.
@MrReclusive666 apologies, I'm not too familiar with the way KJ intended the files to be used, so I'm not much help here. You might want to reach out to KJ himself on Reddit / GitHub to get a better picture of what is going on.
@MrReclusive666 yes, this is correct. This is what the model does: it separates vocals from music. If your audio track has heavy music, separating might give better lipsync.
@PixelMuseAI I was able to get it running today, but found my 400mb unsloth gemma 3 model worked better, so sticking with that.
@MrReclusive666 thanks for your input, I'll test the unsloth model as well
Educate me please. Why are you using the full 27gb fp8 model as your VAEs instead of the actual VAE?
The model released by Lightricks has the vae baked into the model. I'm replacing the diffusion model with the GGUF version because q8 gives better quality than FP8.
@PixelMuseAI thanks for the answer. I'm going to add two posts with a video each, one using a more realistic character and one using a more artistic one. But every video I do, whether realistic, artistic, or anime, has distortion in it, like watery artifacting. Wonder if you have any idea why?
@hot79770473 I might need to do some testing with your audio track, image and prompt. If you want my help to debug, then DM me with the input files. No guarantees I can get to the root of the problem. But my plan is to change out one variable at a time to see what solves it. I would try to play around with seed, sampler, maybe try increasing the diffusion steps. Or try the non distilled model.
@PixelMuseAI At this point I was semi able to resolve it, but I got lucky with seed 18 at specifically 480p resolution, followed by a second pass upscaled with temporal at 3 steps to double the resolution, and that fixed a lot of the artifacting. I think the whole two-pass system from the LTX group's workflow might be more necessary than I initially expected. Don't get me wrong, the closeups do fine with 1 pass, but it seems to falter with faster motion or anything more than upper body. I'll still DM you the song and image if you wanna play with it.
Any idea on how long your audio clip has to be? I am trying 30 sec clip right now. Not sure it will work. But it is at the Sampler now.
Not sure the VAE is correct in your WF. But mine keeps dying when it gets to the VAE part.
@darkwaterramen if you're having problems with the VAE I used, you can try the version by Kijai.
https://huggingface.co/Kijai/LTXV2_comfy/tree/main/VAE
Hope this helps.
LTX-2 has a limit of 20s. But I've not tried such long clips due to hardware limitations on my end. Interested to know what your long generations are like. It's still early days with LTX-2 so the community is still figuring out what it does well and does not do well.
@PixelMuseAI nice, these are pretty small.
It's crazy how good the quality is when zoomed in. Sad that it falls apart a bit for 2/3 shots or full body shots.
That's what I'm struggling with. You can see the posts I made with my vids: with fast motion or at a distance it just becomes an artifact mess. But I've tried other workflows and get the same results. Any thoughts? If you find a fix please share it with me.
BTW the only way i found to mitigate it is to add a 2nd sampler system at the end of the workflow, an upscale at 3 steps using the LTX provided temporal. Basically doubles the gen time per video so i only do it once i got a good seed where the animation is right. I'll upload a 2nd video of the purple hair girl below so you can see it. No hand distortion or distortion as the camera zooms out.
Thanks for the comments, I'll do more testing when I get to my PC. I realised that the teeth get bad and the mouth area gets blurry when the resolution is low, which is why I decided to try high-res single pass.
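I have a question! LTX2 normally generates the audio together with the video, right? But this workflow has you supply the audio yourself. What's the reason for that? Is it just doing what you would otherwise do in something like CapCut? Or is the video generated to match the audio I put in? I'm curious what the intent is!
I don't speak Korean, and this is Google Translate. The purpose of this workflow is to let you generate the audio with a service like Suno or AI text-to-speech, so that you have more control over the audio. By supplying an image (the first frame) and the audio, you can steer the result and keep the character consistent.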
@PixelMuseAI thank you very much!
Kind of a noob question but can I somehow use this workflow to do straight audio to video without an image starting frame?
Yes, you can. Disable the LTXV Image To Video Inplace node.
@PixelMuseAI OMG perfect! Thank you!