I improved this as best I could, switched it to video-to-video, and have had amazing results. On my RTX 5090 with 64 GB of RAM, I can make these in under a minute.
🧠 Model Configuration Overview
🔹 Base UNet
• Model: ltx-2-3-22b-dev-Q4_K_M.gguf
• Type: UNet (quantized GGUF)
⸻
🔹 Distilled LoRA
• LoRA: ltx-2.3-22b-distilled-lora-dynamic_fro09_avg_rank_105_bf16.safetensors
• Strength: 0.60
• Type: Distilled LoRA (bf16)
⸻
🔹 Text Encoders (Dual CLIP)
• CLIP 1: gemma_3_12B_it_fp4_mixed.safetensors
• Type: Text Encoder (Gemma, FP4 mixed)
• CLIP 2: ltx-2.3_text_projection_bf16.safetensors
• Type: Text Projection (bf16)
• Mode: ltxv
⸻
🔹 Audio VAE
• Model: LTX23_audio_vae_bf16.safetensors
• Device: main_device
• Precision: bf16
• Type: Audio VAE
⸻
🔹 Video VAE
• Model: LTX23_video_vae_bf16.safetensors
• Device: main_device
• Precision: bf16
• Type: Video VAE
⸻
🔹 Upscaler
• Model: ltx-2.3-spatial-upscaler-x2-1.1.safetensors
• Type: Spatial Upscaler (x2)
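
If you want to check that you have every file before loading the workflow, here is a rough Python sketch. The filenames and the LoRA strength are copied from the list above; the role names, the find_missing helper, and searching one root folder recursively are my own assumptions for illustration and not part of the workflow itself.

```python
from pathlib import Path

# Filenames copied from the list above; the role keys are illustrative labels.
MODEL_FILES = {
    "base_unet": "ltx-2-3-22b-dev-Q4_K_M.gguf",
    "distilled_lora": "ltx-2.3-22b-distilled-lora-dynamic_fro09_avg_rank_105_bf16.safetensors",
    "text_encoder": "gemma_3_12B_it_fp4_mixed.safetensors",
    "text_projection": "ltx-2.3_text_projection_bf16.safetensors",
    "audio_vae": "LTX23_audio_vae_bf16.safetensors",
    "video_vae": "LTX23_video_vae_bf16.safetensors",
    "spatial_upscaler": "ltx-2.3-spatial-upscaler-x2-1.1.safetensors",
}

# LoRA strength from the list above (kept here for reference only).
LORA_STRENGTH = 0.60


def find_missing(models_root):
    """Return the roles whose file is not found anywhere under models_root.

    Hypothetical helper: it searches recursively, so it does not assume
    any particular subfolder layout.
    """
    root = Path(models_root)
    present = {p.name for p in root.rglob("*") if p.is_file()}
    return sorted(role for role, name in MODEL_FILES.items() if name not in present)


if __name__ == "__main__":
    import sys

    # Pass your models folder as the first argument; defaults to the current folder.
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    missing = find_missing(root)
    if missing:
        for role in missing:
            print(f"missing {role}: {MODEL_FILES[role]}")
    else:
        print("all model files found")
```

Run it with the path to your models folder as the only argument. It only compares filenames, so it will not catch a partial or corrupted download.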