    LTX-2.3 Image-to-Video Workflow — QwenVL Auto-Prompt, No Drift - v1.0
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    🎬 **LTX-2.3 Image-to-Video Workflow**
    QwenVL Auto-Prompt · No Drift · ComfyUI
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    
    A pure LTX 2.3 22B image-to-video pipeline for ComfyUI: drop in an image, get professional motion. A QwenVL vision model analyzes your input image and writes a motion-aware prompt, so no manual description is needed. The workflow enforces a locked, static camera (anti-drift), rescales any input resolution to a fixed pixel budget, and upscales the output to 1920×1088 at 24 FPS. Production-ready for stock footage, ambient loops, and commercial video generation.
    
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    
    ✨ **Features**
    ✅ QwenVL Auto Motion Director — Vision model reads input image → auto-generates motion prompt with camera lock and object tracking hints
    ✅ Locked Static Camera — Zero pan, zoom, or drift; all motion in-frame only
    ✅ Pure LTX 2.3 22B — No LoRA needed; GGUF quantization for 16GB VRAM
    ✅ Dynamic Pixel Scaling — Auto-scales any input size to optimal 0.52MP for 8-step inference
    ✅ Dual-Stage Upscale — 960×544 base → 2× spatial upscaler → 1920×1088 output
    ✅ Audio + Video VAE — Multi-modal encoding; ready for synced audio pipelines
    ✅ 24 FPS Native — Smooth playback; 168 frames (7 seconds) per generation
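
The Dynamic Pixel Scaling feature can be pictured with a little arithmetic. The sketch below is not the workflow's actual node code: it assumes the scaler preserves aspect ratio and snaps each side to a multiple of 32 (an assumption that happens to reproduce the 960×544 base size listed above). The function name and the `multiple` parameter are hypothetical.

```python
import math


def scale_to_pixel_budget(width, height, budget_mp=0.52, multiple=32):
    """Return (w, h) close to budget_mp megapixels, keeping the aspect ratio.

    The snap-to-multiple step is an assumption, not taken from the workflow.
    """
    # Uniform scale factor that brings the pixel count to the budget.
    scale = math.sqrt(budget_mp * 1_000_000 / (width * height))
    # Snap each side to the nearest multiple, never below one multiple.
    new_w = max(multiple, round(width * scale / multiple) * multiple)
    new_h = max(multiple, round(height * scale / multiple) * multiple)
    return new_w, new_h


print(scale_to_pixel_budget(1920, 1080))  # (960, 544)
```

A 1920×1080 input lands on 960×544 under these assumptions, which the 2× upscaler then doubles to 1920×1088.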
    
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    
    📦 **Required Models** (6 files, ~32 GB)
    
    • LTX-2.3-22B-distilled-1.1-Q4_K_M.gguf (17.8 GB) — Main diffusion model (GGUF Q4_K_M quantized)
    • gemma_3_12B_it_fp4_mixed.safetensors (9.45 GB) — Text encoder for LTX prompt understanding
    • ltx-2.3_text_projection_bf16.safetensors (2.31 GB) — Text-to-latent projection layer
    • LTX23_video_vae_bf16.safetensors (1.45 GB) — Video VAE codec (encode/decode video frames)
    • LTX23_audio_vae_bf16.safetensors (365 MB) — Audio VAE codec (dual-modal support)
    • ltx-2.3-spatial-upscaler-x2-1.1.safetensors (996 MB) — 2× spatial upscaler for final quality pass
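
As a quick check on the "~32 GB" figure, the listed sizes can be summed. This is only arithmetic on the numbers above, treating 1 GB as 1000 MB; the dictionary keys are descriptive labels, not filenames.

```python
# Sizes in GB as listed above (365 MB and 996 MB converted at 1000 MB = 1 GB).
model_sizes_gb = {
    "main diffusion model (GGUF)": 17.8,
    "Gemma 3 12B text encoder": 9.45,
    "text projection": 2.31,
    "video VAE": 1.45,
    "audio VAE": 0.365,
    "spatial upscaler": 0.996,
}

total_gb = sum(model_sizes_gb.values())
print(f"{len(model_sizes_gb)} files, {total_gb:.2f} GB")  # 6 files, 32.37 GB
```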
    
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    
    ⬇️ **Download Links** (verified on Hugging Face)
    
    📁 **ComfyUI/models/unet/**
    • LTX-2.3-22B-distilled-1.1-Q4_K_M.gguf (17.8 GB) — https://huggingface.co/QuantStack/LTX-2.3-GGUF
    
    📁 **ComfyUI/models/text_encoders/**
    • gemma_3_12B_it_fp4_mixed.safetensors (9.45 GB) — https://huggingface.co/Comfy-Org/ltx-2
    • ltx-2.3_text_projection_bf16.safetensors (2.31 GB) — https://huggingface.co/Kijai/LTX2.3_comfy
    
    📁 **ComfyUI/models/vae/**
    • LTX23_video_vae_bf16.safetensors (1.45 GB) — https://huggingface.co/Kijai/LTX2.3_comfy
    • LTX23_audio_vae_bf16.safetensors (365 MB) — https://huggingface.co/Kijai/LTX2.3_comfy
    
    📁 **ComfyUI/models/upscale_models/**
    • ltx-2.3-spatial-upscaler-x2-1.1.safetensors (996 MB) — https://huggingface.co/Lightricks/LTX-2.3
    
    ⚠️ *The VAE files are NOT in the official Lightricks repo; get them from Kijai/LTX2.3_comfy. The Gemma fp4 encoder is hosted by Comfy-Org. Filenames use v1.1 (the current stable hotfix release).*
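
Before loading the workflow, it can help to confirm that every file sits in the folder listed above. The following is a minimal sketch, assuming a standard ComfyUI folder layout and the filenames exactly as written in this section; `missing_models` is a hypothetical helper, not part of ComfyUI.

```python
from pathlib import Path

# Folder -> filenames, copied from the download list above.
EXPECTED = {
    "models/unet": ["LTX-2.3-22B-distilled-1.1-Q4_K_M.gguf"],
    "models/text_encoders": [
        "gemma_3_12B_it_fp4_mixed.safetensors",
        "ltx-2.3_text_projection_bf16.safetensors",
    ],
    "models/vae": [
        "LTX23_video_vae_bf16.safetensors",
        "LTX23_audio_vae_bf16.safetensors",
    ],
    "models/upscale_models": ["ltx-2.3-spatial-upscaler-x2-1.1.safetensors"],
}


def missing_models(comfy_root):
    """List expected model files that are absent under comfy_root."""
    root = Path(comfy_root)
    return [
        f"{folder}/{name}"
        for folder, names in EXPECTED.items()
        for name in names
        if not (root / folder / name).is_file()
    ]


if __name__ == "__main__":
    # "ComfyUI" is a placeholder for the path to your own install.
    for path in missing_models("ComfyUI"):
        print("missing:", path)
```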
    
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    
    🧩 **Required Custom Nodes**
    
    • LTXV — Lightricks LTX-Video extension (sampling, encoding, projection)
    • AILab_QwenVL_Advanced — QwenVL vision model integration for image-to-text
    • ComfyUI-GGUF — UnetLoaderGGUF for quantized model loading
    • VideoHelperSuite — VHS_VideoCombine, frame batching, video output export
    • rgthree-comfy — Fast Groups Bypasser (optional; used for workflow flexibility)
    • ImageIterator — Batch image loader for multi-image workflows
    • ImageScaleToTotalPixels — Dynamic resolution scaling to pixel budget
    • GetImageSize+ — Image dimension detection for auto-scaling pipeline
    
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    
    🚀 **How to Use**
    
    1. Place your input image(s) in the ComfyUI ./input directory
    2. Load this workflow into ComfyUI
    3. (Optional) Review the auto-generated motion prompt in the QwenVL output text node
    4. Queue and generate
    5. The output video is saved by VHS_VideoCombine to the ./output directory
    
    Motion-prompt generation and scaling run automatically: queue once and get your result.
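
For unattended runs, the queue step can also be driven over ComfyUI's HTTP API instead of the browser. This is a sketch under a few assumptions: the server runs at the default `127.0.0.1:8188`, and the workflow has been re-exported in ComfyUI's API format (the UI-format JSON shipped here will not queue as-is). The filename `ltx23_i2v_api.json` and both function names are hypothetical.

```python
import json
import urllib.request


def build_queue_request(workflow, server="http://127.0.0.1:8188"):
    """Build a POST request for ComfyUI's /prompt endpoint."""
    payload = json.dumps({"prompt": workflow}).encode("utf-8")
    return urllib.request.Request(
        f"{server}/prompt",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def queue_workflow(path, server="http://127.0.0.1:8188"):
    """Load an API-format workflow file and queue it on a running server."""
    with open(path, encoding="utf-8") as f:
        workflow = json.load(f)
    with urllib.request.urlopen(build_queue_request(workflow, server)) as resp:
        return json.load(resp)


# Usage (requires a running ComfyUI server):
# queue_workflow("ltx23_i2v_api.json")
```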
    
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    
    ⚙️ **Settings & Parameters**
    
    • FPS — 24 (Standard frame rate; 168 total frames per generation)
    • Pixel Budget — 0.52 MP (Optimal for 8-step sampling on 16GB VRAM)
    • Sampler — er_sde (Low-drift SDE solver for stable motion)
    • Base Steps — 8 (Main diffusion sampling passes)
    • Refine Steps — 3 (Quality refinement after upscale)
    • CFG Scale — 1.0 (Classifier-free guidance; 1.0 = no guidance, stable output)
    • Output Resolution — 1920×1088 (After 2× spatial upscale)
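
The settings above are interrelated, and a small sketch makes the derived values explicit. The class and its field names are illustrative only and do not correspond to any node in the workflow; the 960×544 base size is taken from the feature list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClipSettings:
    fps: int = 24
    frames: int = 168
    base_width: int = 960
    base_height: int = 544
    upscale_factor: int = 2
    base_steps: int = 8
    refine_steps: int = 3

    @property
    def duration_s(self):
        """Clip length in seconds."""
        return self.frames / self.fps

    @property
    def output_size(self):
        """Resolution after the spatial upscaler."""
        return (self.base_width * self.upscale_factor,
                self.base_height * self.upscale_factor)

    @property
    def base_megapixels(self):
        """Pixel count of the base pass, in megapixels."""
        return self.base_width * self.base_height / 1_000_000


s = ClipSettings()
print(s.duration_s, s.output_size, s.base_megapixels)
# 7.0 (1920, 1088) 0.52224
```

So each generation is a 7-second clip, the base pass runs at about 0.52 MP, and the two passes total 11 sampling steps.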
    
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    
    💡 **Performance Tips**
    
    • Batch Multiple Images — Queue 5–10 images in one session to amortize model load time
    • Input Image Quality — Sharp, well-lit images yield sharper motion; low-contrast images may produce soft motion
    • Motion Prompt Tuning — Edit the QwenVL text output node before queuing if you want a specific motion direction (e.g., remove camera keywords to force a static shot)
    • Speed vs. Quality — The dual-stage upscale adds ~20 seconds per clip. Bypass the Spatial Upscaler node if speed is critical (output stays at 960×544)
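
The prompt-tuning tip can be approximated in code when many prompts need the same cleanup. This is one possible approach, not the workflow's own logic: it drops any sentence that mentions a camera move and appends an explicit lock sentence. The keyword list, the lock wording, and the function name are all assumptions for illustration.

```python
import re

# Hypothetical keyword list; extend it to match the prompts you actually see.
CAMERA_WORDS = (
    "pan", "pans", "panning", "zoom", "zooms", "zooming", "dolly",
    "tilt", "tilts", "tilting", "orbit", "orbits", "orbiting",
)
CAMERA_PATTERN = re.compile(
    r"\b(?:" + "|".join(CAMERA_WORDS) + r")\b", re.IGNORECASE
)
LOCK_SENTENCE = "The camera is locked off and completely static."


def force_static_camera(prompt):
    """Remove sentences that describe camera moves, then add a lock sentence."""
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", prompt.strip())
    kept = [s for s in sentences if s and not CAMERA_PATTERN.search(s)]
    return " ".join(kept + [LOCK_SENTENCE])


print(force_static_camera(
    "Leaves sway gently in the wind. The camera slowly pans left and "
    "zooms in. Steam rises from the cup."
))
# Leaves sway gently in the wind. Steam rises from the cup. The camera is locked off and completely static.
```

Dropping whole sentences is blunt, since a sentence that mixes subject motion with a camera move is lost entirely, so review the result before queuing.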
    
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    
    📝 **Notes & AI Disclosure**
    
    • AI-Generated Content — All example outputs are AI-generated by LTX 2.3. Suitable for stock footage, ambient loops, and creative projects.
    • Model Downloads — See the "Download Links" section above for exact HuggingFace repos and target folders.
    • Hardware Tested — RTX 5080 16GB VRAM (CUDA compute capability 12.0)
    • VRAM Usage — ~14 GB peak during sampling; requires fast SSD for frame buffering
    • No Commercial Guarantees — Use at your own discretion. Respect local AI disclosure laws when publishing outputs.
    
    Enjoy clean, drift-free motion generation. Questions? Test the workflow locally first; the Civitai comments section is for feedback, not troubleshooting.
    
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    
    ⚖️ **Model Attribution & Licensing**
    
    **LTX-Video 2.3** (Lightricks)
    • License: LTX-2 Community License — https://huggingface.co/Lightricks/LTX-2.3/blob/main/LICENSE
    • Free for commercial use by entities under $10M USD annual revenue
    • AI-generated content disclosure required
    
    **Gemma 3 12B IT** (Google DeepMind)
    • License: Gemma Terms of Use — https://ai.google.dev/gemma/terms
    • Subject to Google's Prohibited Use Policy
    
    **Custom Nodes**
    • LTXV (Lightricks), VideoHelperSuite (MIT), AILab QwenVL, rgthree-comfy (MIT), ComfyUI-GGUF
    
    All example outputs are AI-generated. This workflow (JSON configuration) is shared as original work; model weights must be downloaded separately from the official sources above.
    

    Description

    Initial release. Pure LTX 2.3 22B i2v, QwenVL auto-prompt, locked camera, dual-stage upscale to 1920×1088.

    Workflows
    LTXV 2.3

    Details

    Downloads: 71
    Platform: CivitAI
    Platform Status: Available
    Created: 6/25/2026
    Updated: 6/26/2026
    Deleted: -

    Files

    ltx23ImageToVideoWorkflow_v10.json
