LTX-2
This model card focuses on the LTX-2 model, as presented in the paper LTX-2: Efficient Joint Audio-Visual Foundation Model. The codebase is available here.
LTX-2 is a DiT-based audio-video foundation model designed to generate synchronized video and audio within a single model. It brings together the core building blocks of modern video generation, with open weights and a focus on practical, local execution.
Direct use license
You can use the models (full, distilled, upscalers, and any derivatives of the models) for purposes permitted under the license.
ComfyUI
We recommend you use the built-in LTXVideo nodes that can be found in the ComfyUI Manager. For manual installation information, please refer to our documentation site.
General tips:
Width and height must be divisible by 32. Frame count must be a multiple of 8 plus 1, that is, of the form 8k + 1 (for example 97 or 121).
If the resolution is not divisible by 32 or the frame count is not of the form 8k + 1, the input should be padded with -1 to a valid size, and the result then cropped to the desired resolution and number of frames.
For tips on writing effective prompts, please visit our Prompting guide
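The size constraints in the tips above can be checked with a few lines of arithmetic. The following is a minimal sketch in plain Python, not part of the LTX-2 codebase: the helper names `round_up_to_multiple` and `padded_shape` are hypothetical, and rounding up to the nearest valid size is an assumption about how the padded size would be chosen.

```python
def round_up_to_multiple(value: int, multiple: int) -> int:
    """Round value up to the nearest multiple of `multiple` (ceiling division)."""
    return -(-value // multiple) * multiple


def padded_shape(width: int, height: int, num_frames: int) -> tuple[int, int, int]:
    """Return the smallest valid (width, height, num_frames) that contains the request.

    Assumption: width and height must be divisible by 32, and the frame
    count must be of the form 8k + 1, as stated in the tips above.
    """
    padded_width = round_up_to_multiple(width, 32)
    padded_height = round_up_to_multiple(height, 32)
    # Subtract the extra frame, round the remainder up to a multiple of 8, add it back.
    padded_frames = round_up_to_multiple(max(num_frames - 1, 0), 8) + 1
    return padded_width, padded_height, padded_frames


# A 1280x720 request with 120 frames is not valid as-is:
# 720 is not divisible by 32, and 120 is not of the form 8k + 1.
print(padded_shape(1280, 720, 120))  # (1280, 736, 121)
```

Under these assumptions, you would generate at the padded size and crop the output back to the requested 1280x720 and 120 frames.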
Limitations
This model is not intended or able to provide factual information.
As a statistical model, this checkpoint might amplify existing societal biases.
The model may fail to generate videos that match the prompts perfectly.
Prompt following is heavily influenced by the prompting style.
The model may generate content that is inappropriate or offensive.
When generating audio without speech, the audio may be of lower quality.
HuggingFace Repository
Description
The full model, flexible and trainable in bf16
