CivArchive
    LTX-2 Image Audio to Video - v2.0

    This workflow takes an Image and an audio track as input to generate a video.
    Important Notice

    Update ComfyUI and KJ Nodes. A lot of the code has been updated in the last few days.

    Add --reserve-vram 1 to your launch options to avoid out-of-memory (OOM) errors.

    If you get no lipsync, make sure your audio track is in stereo format. Fix suggested by @thomasdimitri563.
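
    For a mono track, one way to get a stereo file is to copy the single channel into both left and right. The following is a minimal sketch using only the Python standard library, assuming an uncompressed PCM WAV input; the function name and file names are illustrative and not part of the workflow.

```python
import wave


def mono_to_stereo(src_path: str, dst_path: str) -> None:
    """Duplicate the single channel of a mono PCM WAV file into left and right."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        if params.nchannels != 1:
            raise ValueError(f"expected a mono file, got {params.nchannels} channels")
        frames = src.readframes(params.nframes)

    width = params.sampwidth  # bytes per sample
    stereo = bytearray()
    for i in range(0, len(frames), width):
        sample = frames[i:i + width]
        stereo += sample + sample  # same sample for the left and right channel

    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(2)
        dst.setsampwidth(width)
        dst.setframerate(params.framerate)
        dst.writeframes(bytes(stereo))


# Example call with hypothetical file names:
# mono_to_stereo("voice_mono.wav", "voice_stereo.wav")
```

    Compressed formats such as MP3 would need a different tool; this sketch only handles WAV.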

    Models to download (LTX2.3)

    Place in models/diffusion_models

    https://huggingface.co/Kijai/LTX2.3_comfy/blob/main/diffusion_models/ltx-2.3-22b-dev_transformer_only_fp8_scaled.safetensors

    Place in models/loras

    https://huggingface.co/Lightricks/LTX-2.3/blob/main/ltx-2.3-22b-distilled-lora-384.safetensors

    Place in models/text_encoders

    https://huggingface.co/Comfy-Org/ltx-2/resolve/main/split_files/text_encoders/gemma_3_12B_it_fp4_mixed.safetensors

    https://huggingface.co/Kijai/LTX2.3_comfy/blob/main/text_encoders/ltx-2.3_text_projection_bf16.safetensors

    Place in models/vae

    https://huggingface.co/Kijai/LTX2.3_comfy/blob/main/vae/LTX23_audio_vae_bf16.safetensors

    https://huggingface.co/Kijai/LTX2.3_comfy/blob/main/vae/LTX23_video_vae_bf16.safetensors
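
    The folder layout above can be summarised as a small lookup. This is only a sketch: the helper names are made up for illustration, and the paths are assumed to be relative to the ComfyUI install folder. Hugging Face links containing /blob/ open the file's web page, while the /resolve/ form of the same link points at the file itself, so the sketch rewrites them before listing.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Target folder -> download links, copied from the LTX2.3 list above.
LTX23_MODELS = {
    "models/diffusion_models": [
        "https://huggingface.co/Kijai/LTX2.3_comfy/blob/main/diffusion_models/ltx-2.3-22b-dev_transformer_only_fp8_scaled.safetensors",
    ],
    "models/loras": [
        "https://huggingface.co/Lightricks/LTX-2.3/blob/main/ltx-2.3-22b-distilled-lora-384.safetensors",
    ],
    "models/text_encoders": [
        "https://huggingface.co/Comfy-Org/ltx-2/resolve/main/split_files/text_encoders/gemma_3_12B_it_fp4_mixed.safetensors",
        "https://huggingface.co/Kijai/LTX2.3_comfy/blob/main/text_encoders/ltx-2.3_text_projection_bf16.safetensors",
    ],
    "models/vae": [
        "https://huggingface.co/Kijai/LTX2.3_comfy/blob/main/vae/LTX23_audio_vae_bf16.safetensors",
        "https://huggingface.co/Kijai/LTX2.3_comfy/blob/main/vae/LTX23_video_vae_bf16.safetensors",
    ],
}


def direct_url(url: str) -> str:
    """Turn a Hugging Face '/blob/' page link into a '/resolve/' file link."""
    return url.replace("/blob/", "/resolve/", 1)


def destination(folder: str, url: str) -> str:
    """Path (relative to the ComfyUI folder) where the file should be saved."""
    filename = PurePosixPath(urlparse(url).path).name
    return f"{folder}/{filename}"


def download_plan(models: dict) -> list:
    """List of (direct download link, destination path) pairs."""
    return [
        (direct_url(url), destination(folder, url))
        for folder, urls in models.items()
        for url in urls
    ]


if __name__ == "__main__":
    for url, path in download_plan(LTX23_MODELS):
        print(f"{url}\n  -> {path}")
```

    The script only prints the plan; the actual download can be done with a browser or any download tool.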

    Models to download (V3)

    Place in models/diffusion_models

    https://huggingface.co/Lightricks/LTX-2/resolve/main/ltx-2-19b-distilled-fp8.safetensors

    Place in models/text_encoders

    https://huggingface.co/Comfy-Org/ltx-2/resolve/main/split_files/text_encoders/gemma_3_12B_it_fp4_mixed.safetensors

    Place in models/loras

    https://huggingface.co/Lightricks/LTX-2-19b-IC-LoRA-Detailer/resolve/main/ltx-2-19b-ic-lora-detailer.safetensors

    https://huggingface.co/Lightricks/LTX-2-19b-LoRA-Camera-Control-Static/resolve/main/ltx-2-19b-lora-camera-control-static.safetensors

    Description

    Changed to use the native ComfyUI loaders.

    Changed to allow loading of an audio file for input.

    Comments (40)

    erdelman73267 Jan 14, 2026
    CivitAI

    black screen with music going on hmmm...another one that needs tinkering...

    PixelMuseAI
    Author
    Jan 14, 2026

    Update ComfyUI, KJ nodes and ComfyUI-gguf. Then try V2 of the workflow. I think the ComfyUI native loaders are more stable.

    404710173603 Jan 17, 2026

    Strange, mine also has only music and a black screen.

    PixelMuseAI
    Author
    Jan 18, 2026· 1 reaction

    @404710173603 You can try v3 of the workflow. I changed it to use the fp8 distilled models from LTX-2.

    MrReclusive666 Jan 14, 2026
    CivitAI

    is this using the audio/video mask?
    that node works, but it's heavy: it requires the VAE to stay active while sampling. I'm trying to find a way around that, because the extra 2.5 GB of VAE limits resolution or length for me. I need to buy a damn 5090; two 4090s aren't enough anymore.

    MrReclusive666 Jan 14, 2026

    nevermind, noticed you mentioned update to kj, so updated mine, there is now a new audio/video mask without required vae injections.

    goldennyks76 Jan 14, 2026
    CivitAI

    Small tip:
    If your RAM is not sufficient (like mine, 32 GB) and you have an SSD, enable virtual memory. Keep in mind that you’ll need to allocate (give up) some disk space for this. With this setup, I’m able to generate a 1280×720 video up to 20 seconds long.

    Lady_Valeria Jan 14, 2026

    hahahaha I am going to save you €200 bro.
    If you use your SSD as a swap file and it's large - you are going to wreck that drive quite quickly if you generate a lot. Potentially, it's trashed in as little as a month.

    I would run a health check on it. Also stop giving this advice please.

    goldennyks76 Jan 14, 2026· 2 reactions

    @Lady_Valeria In fact, a small amount is enough for this. There's no need to allocate 50-100 GB of space; 4-8 GB will be enough. If I produce 20-40 videos a day, I'll probably use up the 4000 TB data write-read lifespan in about 70-80 years, but thank you for your concern.
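
    The estimate in the comment above can be sanity-checked with a quick calculation. This is a rough sketch only: it assumes, hypothetically, that each video rewrites the whole swap allocation exactly once, which real paging traffic may exceed or undershoot. The 4000 TB rating, the 4-8 GB allocation and the 20-40 videos a day are the commenter's figures; the function name is made up for illustration.

```python
def years_until_rated_writes(tbw_tb: float, gb_per_video: float, videos_per_day: float) -> float:
    """Years until the drive's rated write total is reached at a steady daily load."""
    gb_per_day = gb_per_video * videos_per_day
    rated_gb = tbw_tb * 1000  # decimal terabytes to gigabytes
    return rated_gb / gb_per_day / 365


# Figures quoted in the comment: 4000 TB rating, 4-8 GB swap, 20-40 videos a day.
# Assumption (not from the comment): every video writes the whole swap file once.
light_use = years_until_rated_writes(4000, gb_per_video=4, videos_per_day=20)
heavy_use = years_until_rated_writes(4000, gb_per_video=8, videos_per_day=40)
print(f"light use: {light_use:.0f} years, heavy use: {heavy_use:.0f} years")
```

    Under that assumption the result ranges from a few decades to over a century, so the answer depends heavily on how much is actually written per video.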

    MrReclusive666 Jan 14, 2026
    CivitAI

    A little confused by your workflow and description.
    I was looking at it: why do you say to download the embeddings connector when you aren't using it? I actually looked at this because I still haven't figured out how to use gemma without loading the full model with it.

    PixelMuseAI
    Author
    Jan 14, 2026· 1 reaction

    V1 of the workflow uses the embeddings connector with the dual clip loader. But I realised that the native loaders are more stable and changed to the native loaders in V2. Hence the embeddings connector is no longer required. Let me change that in the description.

    MrReclusive666 Jan 14, 2026

    @PixelMuseAI ok, I've still never been able to get that embedding to work. Not sure if it needs a special node or what, but I can't get it to load at all.

    PixelMuseAI
    Author
    Jan 14, 2026

    @MrReclusive666 you use it with the dual clip loader.

    see image on the model card: https://huggingface.co/Kijai/LTXV2_comfy

    MrReclusive666 Jan 14, 2026

    @PixelMuseAI yeah, I tried that and kept getting errors, probably because I'm not using the 12 billion parameter fp8 scaled gemma 3. I'm running the 270m unsloth gemma 3, which normally works fine and is only 400 MB, not 12 GB.

    Agent_Smth Jan 14, 2026

    Also, what's this MelBandRoformer_fp32? I never needed this file in any LTX workflow.

    MrReclusive666 Jan 14, 2026

    @p_p I looked at that. It seems to separate music and vocals. Not sure it's needed though; ltx2 seems good enough at it.

    MrReclusive666 Jan 14, 2026

    @PixelMuseAI tried it exactly as described, with the gemma 3 fp8.
    ValueError: Missing weight for layer gemma3_12b.transformer.model.layers.0.self_attn.q_proj
    shrug probably because my unet and clip aren't on same gpu.

    PixelMuseAI
    Author
    Jan 14, 2026

    @MrReclusive666 apologies, I'm not too familiar with the way KJ intended the files to be used, so I'm not much help here. You might want to reach out to KJ himself on Reddit / GitHub to get a better picture of what is going on.

    PixelMuseAI
    Author
    Jan 14, 2026

    @MrReclusive666 yes, this is correct. This is what the model does. It separates vocals from music. If your audio track has heavy music, separating might give better lipsync.

    MrReclusive666 Jan 14, 2026

    @PixelMuseAI I was able to get it running today, but found my 400mb unsloth gemma 3 model worked better, so sticking with that.

    PixelMuseAI
    Author
    Jan 15, 2026

    @MrReclusive666 thanks for your input, I'll test the unsloth model as well

    hot79770473 Jan 14, 2026
    CivitAI

    Educate me please. Why are you using the full 27 GB fp8 model as your VAEs instead of the actual VAE?

    PixelMuseAI
    Author
    Jan 14, 2026· 1 reaction

    The model released by Lightricks has the vae baked into the model. I'm replacing the diffusion model with the GGUF version because q8 gives better quality than FP8.

    hot79770473 Jan 15, 2026

    @PixelMuseAI thanks for the answer. I'm going to add two posts with a video each, one using a more realistic character and one with a more artistic one. But every video I make, whether realistic, artistic, or anime, has distortion in it, like water-like artifacting. Do you have any idea why?

    PixelMuseAI
    Author
    Jan 15, 2026

    @hot79770473 I might need to do some testing with your audio track, image and prompt. If you want my help to debug, then DM me with the input files. No guarantees I can get to the root of the problem, but my plan is to change one variable at a time to see what solves it. I would try playing around with the seed and sampler, maybe try increasing the diffusion steps, or try the non-distilled model.

    hot79770473 Jan 15, 2026

    @PixelMuseAI At this point I was partly able to resolve it, but I got lucky with seed 18 at specifically 480p resolution, followed by a second pass upscaled with temporal at 3 steps to double the resolution, and that fixed a lot of the artifacting. I think the whole two-pass system from the LTX group's workflow might be more necessary than I initially expected. Don't get me wrong, the closeups do fine with 1 pass, but it seems to falter with faster motion or more than the upper body. I'll still DM you the song and image if you wanna play with it.

    darkwaterramen Jan 15, 2026
    CivitAI

    Any idea how long your audio clip can be? I am trying a 30 sec clip right now. Not sure it will work. But it is at the Sampler now.

    darkwaterramen Jan 15, 2026

    Not sure the VAE is correct in your WF. But mine keeps dying when it gets to the VAE part.

    PixelMuseAI
    Author
    Jan 15, 2026

    @darkwaterramen if you're having problems with the VAE I used, you can try the version by Kijai.

    https://huggingface.co/Kijai/LTXV2_comfy/tree/main/VAE

    Hope this helps.

    PixelMuseAI
    Author
    Jan 15, 2026

    LTX-2 has a limit of 20s. But I've not tried such long clips due to hardware limitations on my end. Interested to know what your long generations are like. It's still early days with LTX-2 so the community is still figuring out what it does well and does not do well.

    darkwaterramen Jan 15, 2026

    @PixelMuseAI nice, these are pretty small.

    Lady_Valeria Jan 15, 2026
    CivitAI

    It's crazy how good the quality is when zoomed in. Sad that it falls apart a bit for 2/3 shots or full body shots.

    hot79770473 Jan 15, 2026

    That's what I'm struggling with. You can see the posts I made with my vids: with fast motion or at a distance it just becomes an artifact mess. But I've tried other workflows and get the same results. Any thoughts? If you find a fix, please share it with me.

    BTW the only way I found to mitigate it is to add a 2nd sampler system at the end of the workflow, an upscale at 3 steps using the LTX provided temporal. It basically doubles the gen time per video, so I only do it once I've got a good seed where the animation is right. I'll upload a 2nd video of the purple hair girl below so you can see it. No hand distortion or distortion as the camera zooms out.

    PixelMuseAI
    Author
    Jan 15, 2026

    Thanks for the comments, I'll do more testing when I get to my PC. I realised that the teeth get bad and the mouth area gets blurry when the resolution is low; that's why I decided to try a high-res single pass.

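    NiceKriss Jan 16, 2026
    CivitAI

    I have a question!! LTX2 normally generates the audio along with the video, right? But this workflow has you supply the audio yourself. What's the reason for that? Does it just do what you'd otherwise do in something like CapCut? Or is the video generated to match the audio I provide? I'm curious what the intent is!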

    PixelMuseAI
    Author
    Jan 16, 2026

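    I don't speak Korean, and this is a Google translation. The purpose of this workflow is to let users generate audio with a service like Suno or with AI text-to-speech, giving them more freedom to control the audio. By supplying an image (the first frame) and audio, you get enough control to create a consistent character.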

    NiceKriss Jan 18, 2026

    @PixelMuseAI thank you very much!

    tommytom123406123 Jan 18, 2026
    CivitAI

    Kind of a noob question but can I somehow use this workflow to do straight audio to video without an image starting frame?

    PixelMuseAI
    Author
    Jan 18, 2026· 1 reaction

    Yes, you can. Disable the LTXV Image To Video Inplace node.

    tommytom123406123 Jan 20, 2026

    @PixelMuseAI OMG perfect! Thank you!

    Workflows
    LTXV2

    Details

    Downloads: 641
    Platform: CivitAI
    Platform Status: Available
    Created: 1/14/2026
    Updated: 6/24/2026
    Deleted: -

    Files

    ltx2ImageAudioTo_v20.zip
