A production-grade pipeline for the LTX-2 model. It generates high-fidelity video from a source image and an audio track, producing synchronized content such as music videos with lip-syncing and dance.
Key Features & Architecture
The workflow is organized into distinct logical stages using subgraphs to manage complexity and optimize hardware resources:
Multimodal Input Processing:
Image Handling: Uses ImageResizeKJv2 to prepare a source image, which acts as the visual foundation for the video.
Audio Integration: Employs a VHS_LoadAudioUpload node to bring in external audio files, which guide the timing and motion of the generation.
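Because the audio track guides timing, the clip length and frame size are typically derived from the inputs before sampling. The following is a minimal sketch of that arithmetic, not the workflow's actual node logic. It assumes a frame rate of 25 fps, a frame count of the form 8n + 1, and dimensions that are multiples of 32, as in earlier LTX-Video releases; none of these values is stated by this workflow, so check the node defaults. Both function names are hypothetical.

```python
def frames_for_audio(duration_s: float, fps: float = 25.0) -> int:
    """Frame count that covers the audio, rounded up to the nearest 8n + 1.

    The 8n + 1 rule and the 25 fps default are assumptions for illustration.
    """
    raw = max(1, round(duration_s * fps))
    n = -(-(raw - 1) // 8)  # ceiling division of (raw - 1) by 8
    return 8 * n + 1


def snap_to_multiple(value: int, multiple: int = 32) -> int:
    """Round a pixel dimension down to a multiple (assumed here to be 32)."""
    return max(multiple, (value // multiple) * multiple)


if __name__ == "__main__":
    # A 4-second audio clip at the assumed 25 fps.
    print(frames_for_audio(4.0))
    # A 1280x720 source image snapped to the assumed grid.
    print(snap_to_multiple(1280), snap_to_multiple(720))
```

Rounding the frame count up rather than down keeps the video at least as long as the audio, so the end of the track is not cut off.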
Dual-Stage Sampling Pipeline:
Stage 1 (Initial Generation): Focuses on establishing the core motion and structure.
Stage 2 (Refinement): A secondary pass that refines the video and audio latents for higher quality.
VRAM Optimization:
Gemma API Text Encode: Instead of loading the massive Gemma-3 12B model locally, this workflow uses an API-based text encoder. This significantly reduces local VRAM requirements, allowing the workflow to run on GPUs with as little as 12 GB to 16 GB of VRAM.
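A rough back-of-the-envelope calculation shows why offloading the text encoder matters. The sketch below estimates the memory needed for the weights alone of a model with a nominal 12 billion parameters; it ignores activations, caches, and the video model itself, and the function name is hypothetical.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Memory for model weights only, in GiB (no activations or caches)."""
    return num_params * bytes_per_param / 1024**3


if __name__ == "__main__":
    params = 12e9  # nominal parameter count implied by the "12B" name
    for label, nbytes in [("16-bit (bf16/fp16)", 2), ("8-bit (fp8/int8)", 1)]:
        print(f"{label}: {weight_memory_gib(params, nbytes):.1f} GiB")
```

At 16-bit precision the weights alone come to roughly 22 GiB, which already exceeds a 12 GB to 16 GB card before the video model is loaded.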
Creative Controls:
Camera LoRAs: Includes dedicated slots for LTX-2 Camera Control LoRAs (e.g., Dolly Left), allowing for precise cinematic movement.
Latent Upscaling: Incorporates a spatial upscaler to enhance the resolution of the final output.
