LTX-2.3 Image-to-Video Workflow — QwenVL Auto-Prompt, No Drift

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
🎬 **LTX-2.3 Image-to-Video Workflow**
QwenVL Auto-Prompt · No Drift · ComfyUI
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

A pure LTX 2.3 22B image-to-video pipeline for ComfyUI: drop in an image and get a motion clip. A QwenVL vision model analyzes the input image and writes a motion-aware prompt, so no manual description is needed. The workflow prompts for a locked, static camera (anti-drift), scales any input resolution to a fixed 0.52 MP budget, and upscales the output to 1920×1088 at 24 FPS. Intended for stock footage, ambient loops, and commercial video generation.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

✨ **Features**
✅ QwenVL Auto Motion Director — Vision model reads input image → auto-generates motion prompt with camera lock and object tracking hints
✅ Locked Static Camera — Zero pan, zoom, or drift; all motion in-frame only
✅ Pure LTX 2.3 22B — No LoRA needed; GGUF quantization for 16GB VRAM
✅ Dynamic Pixel Scaling — Auto-scales any input size to optimal 0.52MP for 8-step inference
✅ Dual-Stage Upscale — 960×544 base → 2× spatial upscaler → 1920×1088 output
✅ Audio + Video VAE — Multi-modal encoding; ready for synced audio pipelines
✅ 24 FPS Native — Smooth playback; 168 frames per generation
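
The Dynamic Pixel Scaling feature can be pictured with a small sketch. This is not the node's actual implementation: it assumes 1 MP = 1,000,000 pixels and snaps each side to a multiple of 32 (both are assumptions chosen because they reproduce the 960×544 base size listed above), and `scale_to_budget` is a hypothetical helper name.

```python
import math


def scale_to_budget(width, height, megapixels=0.52, multiple=32):
    """Return (w, h) near the pixel budget while roughly keeping the aspect ratio.

    Assumptions (not taken from the workflow): 1 MP = 1,000,000 pixels,
    and each side is snapped to the nearest multiple of 32.
    """
    target_pixels = megapixels * 1_000_000
    scale = math.sqrt(target_pixels / (width * height))
    new_w = max(multiple, round(width * scale / multiple) * multiple)
    new_h = max(multiple, round(height * scale / multiple) * multiple)
    return new_w, new_h


print(scale_to_budget(1920, 1080))  # (960, 544)
print(scale_to_budget(4000, 3000))  # (832, 640)
```

Snapping to a multiple means the result only approximates the budget and the original aspect ratio; a 4:3 input lands at 832×640 rather than an exact 4:3 frame.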

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

📦 **Required Models** (6 files, ~32 GB)

• LTX-2.3-22B-distilled-1.1-Q4_K_M.gguf (17.8 GB) — Main UNet diffusion model (GGUF Q4 quantized)
• gemma_3_12B_it_fp4_mixed.safetensors (9.45 GB) — Text encoder for LTX prompt understanding
• ltx-2.3_text_projection_bf16.safetensors (2.31 GB) — Text-to-latent projection layer
• LTX23_video_vae_bf16.safetensors (1.45 GB) — Video VAE codec (encode/decode video frames)
• LTX23_audio_vae_bf16.safetensors (365 MB) — Audio VAE codec (dual-modal support)
• ltx-2.3-spatial-upscaler-x2-1.1.safetensors (996 MB) — 2× spatial upscaler for final quality pass

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

⬇️ **Download Links** (verified HuggingFace)

📁 **ComfyUI/models/unet/**
• LTX-2.3-22B-distilled-1.1-Q4_K_M.gguf (17.8 GB) — https://huggingface.co/QuantStack/LTX-2.3-GGUF

📁 **ComfyUI/models/text_encoders/**
• gemma_3_12B_it_fp4_mixed.safetensors (9.45 GB) — https://huggingface.co/Comfy-Org/ltx-2
• ltx-2.3_text_projection_bf16.safetensors (2.31 GB) — https://huggingface.co/Kijai/LTX2.3_comfy

📁 **ComfyUI/models/vae/**
• LTX23_video_vae_bf16.safetensors (1.45 GB) — https://huggingface.co/Kijai/LTX2.3_comfy
• LTX23_audio_vae_bf16.safetensors (365 MB) — https://huggingface.co/Kijai/LTX2.3_comfy

📁 **ComfyUI/models/upscale_models/**
• ltx-2.3-spatial-upscaler-x2-1.1.safetensors (996 MB) — https://huggingface.co/Lightricks/LTX-2.3

⚠️ *VAE files are NOT in the official Lightricks repo — get them from Kijai/LTX2.3_comfy. Gemma fp4 encoder hosted by Comfy-Org. Filenames use v1.1 (current stable hotfix release).*
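
Before queuing, it can help to confirm that every file sits in the folder listed above. The following is a minimal sketch using only the Python standard library; `missing_models` and the `EXPECTED` table are hypothetical names, and the ComfyUI root path you pass in is whatever your install uses.

```python
from pathlib import Path

# Folder -> filenames, copied from the Download Links list above.
EXPECTED = {
    "unet": ["LTX-2.3-22B-distilled-1.1-Q4_K_M.gguf"],
    "text_encoders": [
        "gemma_3_12B_it_fp4_mixed.safetensors",
        "ltx-2.3_text_projection_bf16.safetensors",
    ],
    "vae": [
        "LTX23_video_vae_bf16.safetensors",
        "LTX23_audio_vae_bf16.safetensors",
    ],
    "upscale_models": ["ltx-2.3-spatial-upscaler-x2-1.1.safetensors"],
}


def missing_models(comfy_root):
    """Return 'folder/filename' entries that are not present under <comfy_root>/models."""
    models_dir = Path(comfy_root) / "models"
    return [
        f"{folder}/{name}"
        for folder, names in EXPECTED.items()
        for name in names
        if not (models_dir / folder / name).is_file()
    ]
```

Calling `missing_models("ComfyUI")` from the directory that contains your install returns an empty list once all six files are in place.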

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

🧩 **Required Custom Nodes**

• LTXV — Lightricks LTX-Video extension (sampling, encoding, projection)
• AILab_QwenVL_Advanced — QwenVL vision model integration for image-to-text
• ComfyUI-GGUF — UnetLoaderGGUF for quantized model loading
• VideoHelperSuite — VHS_VideoCombine, frame batching, video output export
• rgthree-comfy — Fast Groups Bypasser (optional; used for workflow flexibility)
• ImageIterator — Batch image loader for multi-image workflows
• ImageScaleToTotalPixels — Dynamic resolution scaling to pixel budget (ships with ComfyUI core; no separate install)
• GetImageSize+ — Image dimension detection for auto-scaling pipeline (part of ComfyUI_essentials)

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

🚀 **How to Use**

1. Place your input image(s) in the ComfyUI ./input directory
2. Load this workflow into ComfyUI
3. (Optional) Review the auto-generated motion prompt in the QwenVL output text node
4. Queue and generate
5. VideoHelperSuite (VHS) saves the output video to the ./output directory

The motion-prompt generation and scaling pipeline runs automatically: queue once and get your result.
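
If you prefer to queue from a script instead of the UI, ComfyUI exposes a `POST /prompt` HTTP endpoint. The sketch below assumes a local server on the default port 8188 and a workflow exported in API format; the shared workflow JSON is most likely in UI format, so you would need to re-export it first. The function names and the `workflow_api.json` filename are hypothetical.

```python
import json
import urllib.request


def build_queue_request(workflow, host="127.0.0.1:8188"):
    """Build a POST request for ComfyUI's /prompt endpoint (API-format workflow dict)."""
    payload = json.dumps({"prompt": workflow}).encode("utf-8")
    return urllib.request.Request(
        f"http://{host}/prompt",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def queue_workflow(path, host="127.0.0.1:8188"):
    """Load an API-format workflow file and queue it; returns the server's JSON reply."""
    with open(path, encoding="utf-8") as fh:
        workflow = json.load(fh)
    with urllib.request.urlopen(build_queue_request(workflow, host)) as resp:
        return json.load(resp)


# Example (requires a running ComfyUI server and a hypothetical API-format export):
# queue_workflow("workflow_api.json")
```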

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

⚙️ **Settings & Parameters**

• FPS — 24 (Standard frame rate; 168 total frames per generation)
• Pixel Budget — 0.52 MP (Optimal for 8-step sampling on 16GB VRAM)
• Sampler — er_sde (Low-drift SDE solver for stable motion)
• Base Steps — 8 (Main diffusion sampling passes)
• Refine Steps — 3 (Quality refinement after upscale)
• CFG Scale — 1.0 (Classifier-free guidance; 1.0 = no guidance, stable output)
• Output Resolution — 1920×1088 (After 2× spatial upscale)
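
The numbers above fit together as simple arithmetic. This short sketch only restates values from the settings list and derives clip length, base pixel count, and output size from them; the variable names are illustrative.

```python
FPS = 24
FRAMES = 168
BASE_W, BASE_H = 960, 544   # base resolution before the spatial upscaler
UPSCALE = 2                 # 2x spatial upscaler

duration_s = FRAMES / FPS                   # clip length in seconds
base_megapixels = BASE_W * BASE_H / 1e6     # pixel budget actually used
out_w, out_h = BASE_W * UPSCALE, BASE_H * UPSCALE

print(duration_s)        # 7.0
print(base_megapixels)   # 0.52224
print(out_w, out_h)      # 1920 1088
```

So each generation is a 7-second clip, and the 960×544 base frame is where the 0.52 MP figure comes from.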

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

💡 **Performance Tips**

• Batch Multiple Images — Queue 5–10 images in one session to amortize model load time
• Input Image Quality — Sharp, well-lit images yield sharper motion; low-contrast images may produce soft motion
• Motion Prompt Tuning — Edit the QwenVL text output node before queuing if you want specific motion direction (e.g., remove camera keywords to force static)
• Speed vs. Quality — The dual-stage upscale adds ~20 seconds per clip. Bypass the Spatial Upscaler node if speed is critical (output at 960×544)
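
For the Motion Prompt Tuning tip, removing camera keywords by hand gets tedious across a batch. A rough sketch of one way to do it: drop any sentence that mentions a camera move. The keyword list and the `strip_camera_moves` helper are assumptions for illustration, not part of the workflow, and the match is deliberately simple (it would also drop a sentence about a frying pan).

```python
import re

# Hypothetical keyword list; extend it to match what QwenVL tends to write.
CAMERA_MOVES = re.compile(
    r"\b(pan(s|ning)?|zoom(s|ing)?|doll(y|ies)|tilt(s|ing)?|orbit(s|ing)?)\b",
    re.IGNORECASE,
)


def strip_camera_moves(prompt):
    """Remove whole sentences that contain a camera-move keyword."""
    sentences = re.split(r"(?<=[.!?])\s+", prompt.strip())
    kept = [s for s in sentences if s and not CAMERA_MOVES.search(s)]
    return " ".join(kept)


text = "Leaves sway in the breeze. The camera slowly pans left. Static locked camera."
print(strip_camera_moves(text))
# Leaves sway in the breeze. Static locked camera.
```

Paste the cleaned text back into the QwenVL output node before queuing.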

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

📝 **Notes & AI Disclosure**

• AI-Generated Content — All example outputs are AI-generated by LTX 2.3. Suitable for stock footage, ambient loops, and creative projects.
• Model Downloads — See the "Download Links" section above for exact HuggingFace repos and target folders.
• Hardware Tested — RTX 5080 16GB VRAM (Blackwell, CUDA compute capability 12.0)
• VRAM Usage — ~14 GB peak during sampling; requires fast SSD for frame buffering
• No Commercial Guarantees — Use at your own discretion. Respect local AI disclosure laws when publishing outputs.

Enjoy clean, drift-free motion generation. Questions? Test the workflow locally first; the Civitai comments section is for feedback, not troubleshooting.

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

⚖️ **Model Attribution & Licensing**

**LTX-Video 2.3** (Lightricks)
• License: LTX-2 Community License — https://huggingface.co/Lightricks/LTX-2.3/blob/main/LICENSE
• Free for commercial use by entities under $10M USD annual revenue
• AI-generated content disclosure required

**Gemma 3 12B IT** (Google DeepMind)
• License: Gemma Terms of Use — https://ai.google.dev/gemma/terms
• Subject to Google's Prohibited Use Policy

**Custom Nodes**
• LTXV (Lightricks), VideoHelperSuite (MIT), AILab QwenVL, rgthree-comfy (MIT), ComfyUI-GGUF

All example outputs are AI-generated. This workflow (JSON configuration) is shared as original work; model weights must be downloaded separately from the official sources above.

Workflow file: ltx23ImageToVideoWorkflow_v10.json