
    NEW LTX-2 Workflows here: https://civarchive.com/models/2318870

    Workflow: Image -> Autocaption (Prompt) -> LTX Image to Video

    LTX Prompt Enhancer (LTXPE) might have issues with the latest ComfyUI and Lightricks updates


    Update July 20th 2025: GGUF Models for LTX 0.9.8:

    Distilled model, works with V9.5: https://huggingface.co/QuantStack/LTXV-13B-0.9.8-distilled-GGUF/tree/main

    Dev model, works with V9.0: https://huggingface.co/QuantStack/LTXV-13B-0.9.8-dev-GGUF/tree/main

    (see the "Model Card" in the links above for LTX 0.9.8 VAE and text encoder downloads)


    V9.5: LTX 0.9.7 Distilled Workflow supporting LTX 0.9.7 Distilled GGUF Model.

    There is a workflow with Florence and another one with LTX Prompt Enhancer (LTXPE)

    GGUF Model can be downloaded here:

    https://huggingface.co/wsbagnsv1/ltxv-13b-0.9.7-distilled-GGUF/tree/main

    VAE and text encoder are identical to the previous LTX 0.9.6 model (see V8.0 below)

    LTX 0.9.7 Distilled uses only 8 steps and is very fast.


    V9.0: LTX 0.9.7 Workflow supporting LTX 0.9.7 GGUF Model.

    There is a workflow with Florence and another one with LTX Prompt Enhancer (LTXPE)

    GGUF Model can be downloaded here:

    https://huggingface.co/wsbagnsv1/ltxv-13b-0.9.7-dev-GGUF/tree/main

    VAE and text encoder are identical to the previous LTX 0.9.6 model (see V8.0 below)

    LTX 0.9.7 is a 13-billion-parameter model; previous versions had only 2B parameters, so it is heavier on VRAM and takes longer to process. Try V8.0 below with model 0.9.6, or V9.5, for very fast rendering.
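
    For a rough sense of why the jump from 2B to 13B hurts, the weight footprint alone can be estimated from parameter count and quantization. A back-of-envelope sketch (the Q8_0/Q4_0 bits-per-weight figures are approximate GGUF values, and activations, VAE and text encoder all add more on top):

    ```python
    # Approximate weight-only memory footprint. GGUF quants store per-block
    # scales, so Q8_0 is roughly 8.5 bits/weight and Q4_0 roughly 4.5.
    def weight_gib(params_billion: float, bits_per_weight: float) -> float:
        return params_billion * 1e9 * bits_per_weight / 8 / 1024**3

    for params in (2, 13):
        for name, bits in [("fp16", 16.0), ("Q8_0", 8.5), ("Q4_0", 4.5)]:
            print(f"{params}B @ {name}: {weight_gib(params, bits):5.1f} GiB")
    ```

    At Q4_0 the 13B weights shrink to roughly a quarter of their fp16 size, which is why the GGUF builds are the practical option on consumer cards.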


    V8.0: LTX 0.9.6 Workflow (dev and distilled GGUF model in same workflow)

    There is a version with Florence2 Caption and a version with LTX Prompt Enhancer (LTXPE)

    GGUF Models (Dev & Distilled) can be downloaded here:

    https://huggingface.co/calcuis/ltxv0.9.6-gguf/tree/main

    VAE: pig_video_enhanced_vae_fp32-f16.gguf

    Text encoder: t5xxl_fp32-q4_0.gguf


    V7.0: LTX 0.9.5 Model Version GGUF with WaveSpeed/TeaCache.

    LTX 0.9.5 GGUF Model and VAE: https://huggingface.co/calcuis/ltxv-gguf/tree/main

    (vae_ltxv0.9.5_fp8_e4m3fn.safetensors)

    Clip text encoder: https://huggingface.co/city96/t5-v1_1-xxl-encoder-gguf/tree/main

    There are 2 workflows: a main workflow with Florence caption only, and an additional one with Florence and LTX Prompt Enhancer. Setup with WaveSpeed included (bypassed by default, Ctrl+B to activate)

    The workflow works with all GGUF models: 0.9 / 0.9.1 / 0.9.5

    Uncensored LLM for the prompt enhancer: https://huggingface.co/skshmjn/unsloth_llama-3.2-3B-instruct-uncenssored


    -Outdated (March 2025)- V6.0: GGUF/TiledVAE Version & Masked Motion Blur Version

    Updated the workflow with GGUF models, which save VRAM and run faster.

    There is a Standard Version, which uses just the GGUF models, and a GGUF+TiledVAE+ClearVRAM Version, which reduces VRAM requirements even further. Tested with the larger GGUF model (Q8) at a resolution of 1024, 161 frames and 32 steps: the GGUF Version peaked at 14 GB VRAM, while the TiledVAE+ClearVRAM Version peaked at 7 GB. Smaller GGUF models might reduce requirements further.
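
    The ClearVRAM part boils down to the standard PyTorch cache-release pattern. A minimal sketch of the general idea (the actual node's internals may differ):

    ```python
    import gc
    import torch

    def clear_vram() -> None:
        """Release cached CUDA memory before a heavy step such as VAE decode."""
        gc.collect()                  # drop unreferenced Python objects first
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # hand cached allocator blocks back to the driver
            torch.cuda.ipc_collect()  # clean up leftover CUDA IPC handles

    clear_vram()
    ```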

    GGUF Model, VAE and text encoder can be downloaded here:

    (Model & VAE): https://huggingface.co/calcuis/ltxv-gguf/tree/main

    (anti-checkerboard VAE): https://huggingface.co/spacepxl/ltx-video-0.9-vae-finetune/tree/main

    (Clip text encoder): https://huggingface.co/city96/t5-v1_1-xxl-encoder-gguf/tree/main

    Go for the GGUF Version with 16 GB+ VRAM, and the TiledVAE+ClearVRAM Version with less than 16 GB VRAM.

    Masked Motion Blur Version: Since LTX is prone to motion blur, added an extra group to the workflow which lets you set a mask on the input image and apply motion blur to the masked area, to trigger specific motion (sounds better than it actually works, but useful in some cases; a sketch of the idea follows). GGUF and GGUF+TiledVAE+ClearVRAM versions included.
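
    Outside the workflow, the masked-blur trick looks roughly like this. A sketch with assumed file names ("input.png", "mask.png", same size); PIL has no directional motion blur, so a Gaussian blur stands in:

    ```python
    import numpy as np
    from PIL import Image, ImageFilter

    # White mask pixels get blurred (reads as "motion" there), black stay sharp.
    img = np.asarray(Image.open("input.png").convert("RGB"), dtype=np.float32)
    mask = np.asarray(Image.open("mask.png").convert("L"), dtype=np.float32) / 255.0

    blurred = np.asarray(
        Image.open("input.png").convert("RGB").filter(ImageFilter.GaussianBlur(6)),
        dtype=np.float32,
    )

    # Blend sharp and blurred images using the mask as per-pixel weight.
    out = img * (1.0 - mask[..., None]) + blurred * mask[..., None]
    Image.fromarray(out.astype(np.uint8)).save("input_masked_blur.png")
    ```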


    V5.0: Support for new LTX Model 0.9.1.

    • included an additional workflow for LowVram (clears VRAM before the VAE step)

    • added a workflow to compare LTX Model 0.9.1 vs LTX Model 0.9

    (V4 did not work with 0.9.1 when that model was released, hence V5 was created. This has changed as Comfy & nodes were updated in the meantime, so now you can use both models (0.9 & 0.9.1) with V4 as well as with V5. The two versions use different custom nodes to manage the model; other than that, they are the same. If you run into memory issues or long process times, see the tips at the end.)


    -Outdated (March 2025)- V4.0: Introducing Video/Clip Extension:

    Extend a clip based on the last frame of the previous clip. You can extend a clip about 2-3 times before quality starts to degrade; see more details in the notes of the workflow.

    Added a feature to use your own prompt and bypass the Florence caption.


    V3.0: Introducing STG (Spatiotemporal Skip Guidance for Enhanced Video Diffusion Sampling).

    Included a SIMPLE and an ENHANCED workflow. The Enhanced Version has additional features to upscale the input image, which can help in some cases. I recommend using the SIMPLE Version.

    • Replaced the height/width node with a "Dimension" node that drives the video size (default = 768; increasing to 1024 improves resolution but might reduce motion, and uses more VRAM and time). Unlike previous versions, the image will not be cropped.

    • Included the new node "LTX Apply Perturbed Attention", which holds the STG settings (for more details on values/limits see the note within the workflow).

    • The Enhanced Version has an additional switch to upscale the input image (true) or not (false), plus a scale value (use 1 or 2) that defines the size of the image before it is injected, which can work a bit like supersampling. As said, not required in most cases.

    Pro Tip: Besides setting the CRF value to around 24 to drive movement, increase the frame rate in the yellow Video Combine node from 1 to 4+ to trigger further motion when the outcome is too static.

    The node "Modify LTX Model" changes the model within a session; if you switch to another workflow, make sure to hit "Free model and node cache" in ComfyUI to avoid interference. If you bypass this node (Ctrl+B), you can do Text2Video.


    V2.0: ComfyUI Workflow for Image-to-Video with Florence2 Autocaption

    This updated workflow integrates Florence2 for autocaptioning, replacing BLIP from version 1.0, and includes improved controls for tailoring prompts towards video-specific outputs.

    New Features in v2.0

    1. Florence2 Node Integration

    2. Caption Customization

      • A new text node allows replacing terms like "photo" or "image" in captions with "video" to align prompts more closely with video generation.
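
    In plain Python the replacement step amounts to no more than this (a trivial sketch; the workflow does it in a text node):

    ```python
    # Swap image-flavored words in the Florence caption for video-flavored
    # ones before the caption is passed on as the LTX prompt.
    caption = "A photo of a woman standing on a beach at sunset."
    for old in ("photo", "image", "picture"):
        caption = caption.replace(old, "video")
    print(caption)  # -> "A video of a woman standing on a beach at sunset."
    ```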


    V1.0: Enhanced Motion with Compression

    To mitigate "no-motion" artifacts in the LTX Video model:

    • Pass input images through FFmpeg using H.264 compression with a CRF of 20–30 (a minimal sketch follows this list).

      • This step introduces subtle artifacts, helping the model latch onto the input as video-like content.

      • CRF values can be adjusted in the yellow "Video Combine" node (lower-left GUI).

      • Higher values (25–30) increase motion effects; lower values (~20) retain more visual fidelity.
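
    The same round-trip can be reproduced outside ComfyUI (a sketch assuming ffmpeg is on PATH and the image has even dimensions; in the workflow the yellow Video Combine node handles this for you):

    ```python
    import subprocess

    def compress_like_video(src: str, dst: str, crf: int = 24) -> None:
        """Round-trip a still image through H.264 so it picks up the mild
        compression artifacts that help LTX treat it as video-like content."""
        tmp = "tmp_crf_clip.mp4"
        # Encode the single image as a 1-second H.264 clip at the given CRF.
        subprocess.run(["ffmpeg", "-y", "-loop", "1", "-i", src, "-t", "1",
                        "-c:v", "libx264", "-crf", str(crf),
                        "-pix_fmt", "yuv420p", tmp], check=True)
        # Pull the first frame back out as the degraded input image.
        subprocess.run(["ffmpeg", "-y", "-i", tmp, "-frames:v", "1", dst],
                       check=True)

    compress_like_video("input.png", "input_compressed.png", crf=24)
    ```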

    Autocaption Enhancement

    • Text nodes for Pre-Text and After-Text allow manual additions to captions.

      • Use these to describe desired effects, such as camera movements.

    Adjustable Input Settings

    • Width/Height & Scale: Define the image resolution for the sampler (e.g., 768×512). A scale factor of 2 enables supersampling for higher-quality outputs; use a scale value of 1 or 2. (Replaced by the Dimension node in V3.)
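
    The scale step just resizes the image before it is injected (a trivial sketch with assumed file names; scale 1 = off):

    ```python
    from PIL import Image

    # Upscale the input by the scale factor so the sampler effectively
    # works with more detail, similar to supersampling.
    width, height, scale = 768, 512, 2
    img = Image.open("input.png").resize((width * scale, height * scale),
                                         Image.LANCZOS)
    img.save("input_scaled.png")
    ```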


    Pro Tips

    • Motion Optimization: If outputs feel static, incrementally increase the CRF and frame-rate values, or adjust the Pre-/After-Text nodes to emphasize motion-related prompts.

    • Fine-Tuning Captions: Experiment with Florence2’s caption detail levels for nuanced video prompts.

    • If you run into memory issues (OOM or extreme process time), try the following:

      • use the LowVram version of V5

      • use a GGUF Version

      • press "Free model and node cache" in ComfyUI

      • set ComfyUI's startup arguments to --lowvram --disable-smart-memory

        • see the file "run_nvidia_gpu.bat" in your ComfyUI folder and edit the line to: python.exe -s ComfyUI\main.py --lowvram --disable-smart-memory

      • switch off hardware acceleration in your browser


    Credits go to Lightricks for their incredible model and nodes:

    https://www.lightricks.com/

    https://github.com/Lightricks/ComfyUI-LTXVideo

    https://github.com/logtd/ComfyUI-LTXTricks

    Description

    V7.0: Support for LTX Model 0.9.5 GGUF with Florence prompting or LTX Prompt Enhancer


    Comments (31)

    purplerude643 · Mar 9, 2025

    I get an error like this when using the prompt enhancer:

    LTXVPromptEnhancer

    Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)

    GK_Artist · Mar 10, 2025

    I still have the same issue. I see it is a known issue, and some manual fixes are available: https://github.com/Lightricks/ComfyUI-LTXVideo/issues/119

    It is solved for me by applying the manual fix.
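
    (For reference: the error itself is generic PyTorch, not specific to this node. A minimal sketch of the failure pattern and the usual fix; this is an illustration, not the actual patch from the issue above:)

    ```python
    import torch
    import torch.nn as nn

    # Indexing CUDA weights with CPU indices is exactly the failing
    # index_select call named in the error message.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    embed = nn.Embedding(100, 16).to(device)  # model weights on the GPU
    ids = torch.tensor([1, 2, 3])             # token ids created on the CPU

    # embed(ids) would raise "Expected all tensors to be on the same device"
    # on a CUDA machine; moving the inputs to the model's device fixes it.
    out = embed(ids.to(device))
    print(out.shape)  # torch.Size([3, 16])
    ```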

    JerryOver · Mar 11, 2025

    You could bypass the node, or delete it and use something like Florence instead.

    tremolo28 (Author) · Mar 11, 2025 · 1 reaction

    LTX Prompt Enhancer tips:

    - Use the blue node "Prompt for LTX Prompt Enhancer" to describe general info about the clip. E.g., if your clip is supposed to show the inside of a room with some camera movement, you can use something like "a video showing a room tour with subtle camera movement" or "an action video with heavy camera movement", etc.

    - You can shorten the prompt by setting "max_resulting_token" to a lower value (default = 256)
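
    (What a lower token cap effectively does, sketched with a T5 tokenizer since LTX prompts go through a T5 text encoder; the node's internals are an assumption here:)

    ```python
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("google/t5-v1_1-xxl")

    def cap_tokens(prompt: str, max_tokens: int = 256) -> str:
        """Hard-truncate an enhanced prompt to at most max_tokens T5 tokens."""
        ids = tok(prompt, truncation=True, max_length=max_tokens).input_ids
        return tok.decode(ids, skip_special_tokens=True)
    ```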

    GK_Artist · Mar 13, 2025 · 1 reaction

    Created a nice video clip using this workflow: https://www.youtube.com/watch?v=lD-NICPYePM

    [Folk Ballad and Spirituality] Sacred earth endless sky | Native American | GK Artist

    Cavernust · Mar 15, 2025 · 1 reaction

    Best LTXV workflow so far. Light, clean, and does the job. Thanks, keep it up!

    negsol834 · Mar 15, 2025 · 1 reaction

    Thank you for the workflow!! I used version 7.0 and 0.9.5 to generate this music video on YT https://youtu.be/2LEDX-DtLNQ?si=scrdBi0QF8x1kqFZ

    GK_Artist · Mar 17, 2025

    Alien Attack | a short AI movie | created with FREE local comfyui LTX and capcut and this WORKFLOW

    https://youtube.com/shorts/mhDJyGrR2Y0

    Just a short LTX video experiment running locally in ComfyUI on an RTX 3080 10GB, edited with CapCut.

    puchauke · Mar 18, 2025

    Hey!
    Can you please tell me, is there a setting responsible for the strength of movement? I had really nice results with the previous version of your workflow, but now the movement is awful, random, and there is far too much of it.

    tremolo28 (Author) · Mar 18, 2025

    There is not really a parameter for strength of movement. The new model (0.9.5) and the new nodes are much more trigger-happy on motion. The following parameters/nodes mainly drive motion:

    LTXVPreprocess

    Input image

    prompt positive and negative

    resolution (the higher, the less motion, as a tendency)

    JustcheckingModels · Mar 21, 2025

    What have you tried so far to spur stronger movement?

    tremolo28 (Author) · Mar 21, 2025 · 1 reaction

    @JustcheckingModels I use LTX Prompt Enhancer, as it tends to give more motion. I use pre- and after-text as well. But I was not able to really find a good method to gain more control over what is happening by prompting. I mainly just spin the wheel until I get something that works for me.

    Cavernust · Mar 19, 2025

    I'm not an expert, but there's something I still don't understand: since Florence analyzes the photo and generates a prompt, which is then adjusted by replacing words like "image" with "video," wouldn't it be more efficient to program it to describe images as if they were videos from the start? Even after replacing those words, it doesn't generate much motion and still describes the scene more like a video-photo, using many verbs that imply static conditions rather than dynamic ones. If Florence is designed to analyze the photo and generate a prompt, couldn't it be instructed to describe those photos as if they were videos?

    tremolo28 (Author) · Mar 20, 2025 · 1 reaction

    I think Florence is trained to caption images, so it sometimes has issues predicting motion and other things relevant for a video clip.

    The LTX Prompt Enhancer does a better job in that respect, but it is censored and has its issues. You might need to stick to writing your own prompt. However, the LTX model is probably driven more by the input image than by the prompt.

    Cavernust · Mar 20, 2025 · 1 reaction

    @tremolo28 Yes, I tried to make prompts with GPT too, but it usually generates a lot of smeared motion and deformities. There should be a custom prompter AI designed for LTXV, but I haven't found any good ones. Anyway, if you get the prompt right you can get some very nice things. Thanks again for this workflow!

    JustcheckingModels · Mar 21, 2025 · 1 reaction

    Hmm, that gives me an idea: LTX uses a T5 encoder, and T5 is a transfer-learning model designed to be fine-tuned. Theoretically someone could take all the prompts in this GitHub list -> https://github.com/MushroomFleet/LLM-Base-Prompts/tree/main/LTXV-PromptGen/LTXV-PromptLists and train a fine-tuned T5 decoder that natively generates T5-friendly prompts.

    T5 has been integrated into image captioners too, but how you would daisy-chain an image captioner to a prompt-maker is way over my head:
    "This paper presents a model for image captioning that employs a vision Transformer model with re-attention as the encoder and a T5-based model as the decoder."
    https://www.semanticscholar.org/paper/Deep-Vision-Transformer-and-T5-Based-for-Image-Lam-Nguyen/b64db58f383797526e5d9c490baff9d901cd8f04
    "Captioner model is composed of CLIP image encoder and Flan-T5 decoder. Each input image is injected via cross-attention to the decoder, after a linear projection to fit the size of embedding."
    https://github.com/DaehanKim/GCC-caption-lora

    Just a thought.
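
    (A minimal sketch of that fine-tune, with made-up training pairs and no claim the recipe is tuned; it just shows the mechanics of teaching a small T5 to expand short actions into LTX-style prompts:)

    ```python
    import torch
    from transformers import T5ForConditionalGeneration, T5TokenizerFast

    tok = T5TokenizerFast.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small")
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # Hypothetical (short action -> enhanced prompt) pairs, e.g. scraped
    # from the LTXV prompt lists linked above.
    pairs = [
        ("a car comes around the corner",
         "A car sweeps around the corner, headlights raking the wet asphalt."),
    ]

    model.train()
    for src, tgt in pairs:
        batch = tok(src, return_tensors="pt")
        labels = tok(tgt, return_tensors="pt").input_ids
        loss = model(**batch, labels=labels).loss  # teacher-forced seq2seq loss
        loss.backward()
        opt.step()
        opt.zero_grad()
    ```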

    JustcheckingModels · Mar 21, 2025 · 1 reaction

    @Cavernust I find the best approach is to start with a single-line prompt that simply describes who is doing what to whom, specifically the action you want: "a boy eats soup with a spoon", "a car comes around the corner". And render it out with only ~18 frames for quick iteration. The other day I had great difficulty getting a barber to cut hair; maybe I should have said 'hairdresser'?

    Because if LTX can't get the basic action right, then it doesn't matter how lyrically GPT waxes about how "the focus is on the car creating a sensation of both discovery and a trepidation of venturing into the unknown" you're gonna get smeary or glitchy motion. Once you can see it understands the movement/action then you can start adding on all the visual sugar, describing camera movement, haircuts, and furniture, and lighting.

    Cavernust · Mar 21, 2025

    @JustcheckingModels Thanks! Very interesting, I'll take a look as soon as I can. I think that in the near future all these steps will be streamlined and automated for a more user-friendly experience, like on sites such as Kling, Minimax, etc. I understand that we're still in the early stages. I can still achieve great results with LTXV, even with Florence automatic after several rerolls. I made a video, and many clips were created using this workflow. https://youtu.be/YeDek0fSrSE?si=LzgHkeDFk-DtV3KU

    Cavernust · Mar 21, 2025 · 2 reactions

    many clips of my new video were made with this workflow <3 watch it here: https://youtu.be/YeDek0fSrSE?si=LzgHkeDFk-DtV3KU

    GK_Artist · Mar 21, 2025 · 1 reaction

    A new video clip for my music, created with this workflow again.

    https://www.youtube.com/watch?v=QLNwDB2AZe4

    PeterCastler · Mar 28, 2025 · 4 reactions

    Can someone please help? :( I can't find these two nodes:
    STGGuider
    LTXVApplySTG

    Screenshot:

    https://i.imgur.com/KgoXvXs.png

    tremolo28 (Author) · Mar 28, 2025 · 3 reactions

    Both nodes are from this repo: https://github.com/Lightricks/ComfyUI-LTXVideo

    PeterCastler · Mar 28, 2025

    @tremolo28 Aw cheers man, thanks for the reply!

    EliteLensCraft · Apr 17, 2025 · 3 reactions

    LTXVideo 0.9.6 is out - yay!

    tremolo28 (Author) · Apr 18, 2025 · 1 reaction

    Thanks, will look into it.

    tremolo28 (Author) · Apr 18, 2025

    Uploaded a workflow to the Experimental tab with LTX 0.9.6 support.

    tremolo28 (Author) · Apr 18, 2025 · 1 reaction

    Observations so far with model 0.9.6:

    - framerate changed from 25fps to 30fps

    - the LTX preprocess node reacts strangely: previously it added compression to trigger motion; now, the higher the value, the further it moves away from the input image. At around 40 it behaves more like text-to-video. Maybe better to bypass it...

    - optimal resolution seems to be 1216 × 704

    - CFG is managed in the "STG Guider Advanced" node; the node in the GUI is obsolete

    - more details under News: https://github.com/Lightricks/LTX-Video

    tremolo28 (Author) · Apr 18, 2025

    Set the "LTXVPreprocess" node to zero or bypass it to avoid text artifacts and odd outcomes. The node is in the GUI in the blue "make it move" section.

    loneillustrator · Apr 18, 2025

    @tremolo28 the ComfyUI update froze it

    yallapapi · Apr 20, 2025

    Nodes don't install from ComfyUI Manager; running a 5090.

    tremolo28 (Author) · Apr 21, 2025

    Updated the workflow to include a setup for the LTX 0.9.6 distilled model.

    Workflows
    LTXV

    Details

    Downloads
    1,756
    Platform
    CivitAI
    Platform Status
    Available
    Created
    3/9/2025
    Updated
    4/30/2026
    Deleted
    -

    Files

    ltxIMAGEToVIDEOWithSTG_v70LTX095.zip
