LTX2.3 FP4 Models
Because no one uploads FP4 here, and they should (FP4 works on 3xxx cards and above for VRAM savings).
LTX2.3 Distilled FP4ME - Distilled FP4 Mixed Extreme - 14.1GB**
LTX2.3 Full Dev FP4ME - Full Dev FP4 Mixed Extreme - 14.1GB**
LTX2.3 Official NVFP4 - From Lightricks - Updated 3/17/2026 - 21GB
LTX2.3 Dev NVFP4 - from Hippotes / LTX-2.3-various-formats - 18.15GB
LTX2.3 Dev FP4 STFO - from Kijai's Transformers only Scaled FP8 - 16.63GB**
LTX2.3 Distilled FP4 - My first FP4 - Transformers Only - 18.15GB**
** Are transformers only, or I broke vae/text projection :) These need separate VAE and text projection downloads.
I have a fix for the lora issue on FP4ME models: Comfy Bathroom, a custom lora loader with presets as well as custom config. https://huggingface.co/MrReclusive/LTX-2.3-FP4
Diffusion model loader/unet loader seems to get confused with these, use the checkpoint loader even though it has no clip/vae
Comments (56)
how do I load LTX2.3 Dev FP4 Extreme 14.13G in comfyui? I use load diffusion model, but it shows an error: RuntimeError: mat1 and mat2 shapes cannot be multiplied (10608x4096 and 2048x4096)
use checkpoint loader, even though it has no vae/clip, the unet loader seems to get confused.
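For anyone reading the error text: a matrix product needs the inner dimensions to match, and here they are 4096 and 2048. The sketch below only restates that shape rule in plain Python (no torch). It does not explain why the diffusion model loader ends up with mismatched shapes, and `matmul_shape` is a hypothetical helper, not a ComfyUI or PyTorch function.

```python
def matmul_shape(a_shape: tuple, b_shape: tuple) -> tuple:
    """Return the result shape of a 2D matrix product, or raise if incompatible."""
    a_rows, a_cols = a_shape
    b_rows, b_cols = b_shape
    # The columns of the left matrix must equal the rows of the right matrix.
    if a_cols != b_rows:
        raise ValueError(
            f"mat1 and mat2 shapes cannot be multiplied "
            f"({a_rows}x{a_cols} and {b_rows}x{b_cols})"
        )
    return (a_rows, b_cols)


# Shapes taken from the error message quoted above.
try:
    matmul_shape((10608, 4096), (2048, 4096))
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```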
ok, thank you, the output seems dark with this model
@xueqing12211 play with steps and samplers/schedulers, even with the full model with distilled lora at 0.65, still found it happy at 6 steps
can you upload your workflow for LTX2.3 Dev FP4 Extreme? thx a lot. My results are not as good as the ones you show
@xueqing12211 here's my most recent that had less stuff to turn off, lol
https://civitai.com/images/124094131
just download video and drag into comfyui.
if you are running full dev extreme, without the distilled lora, you will want to uncheck custom sigma's on the low res gen, and set scheduler and steps.
if you run the distilled lora, the current sigma's should be good.
@MrReclusive666 @MrReclusive thx for your help!
I don't get what's extreme etc... how are these different from a vanilla nvfp4?
@ForeverNecessary737716 the size. normally fp4 quantization puts these at around 19gb, i ignored the rules and pushed it further, so it's 14gb. is it going to be as good as a normal fp4 or fp8? doubt it, but it does work, and pretty well, almost everything I am doing is running from that 14gb model.
normally you don't quantize first or last blocks, or attention layers, etc.
i quantized all of it, not all of it to fp4, i quantized first/last, and attention layers to fp8, but everything else, yeah, fp4.
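The scheme described above can be sketched as a per-layer precision plan. This is only an illustration: the layer names (`blocks.N.attn...`, `blocks.N.ff...`), the block count, and the parameter counts are hypothetical and not taken from the actual LTX2.3 checkpoint, and scale tensors and metadata are ignored in the size estimate.

```python
# Hypothetical sketch: pick a storage precision per layer the way the comment
# above describes (first/last blocks and attention -> fp8, the rest -> fp4).
# Layer names and sizes are made up for illustration.

BYTES_PER_PARAM = {"fp4": 0.5, "fp8": 1.0, "bf16": 2.0}


def pick_precision(layer_name: str, num_blocks: int) -> str:
    """Return 'fp8' for sensitive layers and 'fp4' for everything else."""
    parts = layer_name.split(".")
    # Expect names like "blocks.<index>.<sublayer>..."; anything else stays at fp8.
    if len(parts) < 3 or parts[0] != "blocks" or not parts[1].isdigit():
        return "fp8"
    block_index = int(parts[1])
    is_edge_block = block_index in (0, num_blocks - 1)
    is_attention = "attn" in parts[2]
    return "fp8" if (is_edge_block or is_attention) else "fp4"


def estimate_size_gb(param_counts: dict, num_blocks: int) -> float:
    """Sum bytes over all layers using the chosen precision."""
    total_bytes = 0.0
    for name, count in param_counts.items():
        total_bytes += count * BYTES_PER_PARAM[pick_precision(name, num_blocks)]
    return total_bytes / 1e9


# Toy model: 4 blocks, each with one attention and one feed-forward layer.
toy = {}
for i in range(4):
    toy[f"blocks.{i}.attn.qkv"] = 100_000_000
    toy[f"blocks.{i}.ff.proj"] = 300_000_000

plan = {name: pick_precision(name, 4) for name in toy}
size_gb = estimate_size_gb(toy, 4)
```

In the toy model only the feed-forward layers of the two middle blocks end up in fp4, which is why a mixed file is larger than a pure fp4 one but still well below fp8.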
lol after a million math errors i gave up on this model entirely
@sneedingonmyligma420 those math errors are a comfy issue, the unet/diffusion loaders don't support ltx2.3 yet, has to be loaded in checkpoint loader.
@MrReclusive666 i believe it, the official ltx clip loader searches for the text encoder in unet models. though i'm able to run ltx2.3 gguf'd just fine i get the feeling it's not 100% working as intended anyway.
@sneedingonmyligma420 gguf loaders are different, run on entirely separate segment of code, the most recent nightly build of comfy completely broke my ability to load any gguf at all atm. could probably fix it, but haven't loaded a gguf in 8 months. so meh.
@MrReclusive666 many such cases with comfyui lately lol. everything breaks in some way for different people. i've reinstalled multiple times in the past 3 months in the wake of all the new model releases/nuked my venv when a recent update completely nuked my sage attention install.
@sneedingonmyligma420 yeah, my sage is currently dead, im waiting until i need to stop updating every couple of days before i fix it. i was fine till this new dynamic memory bull crap...
the app has me curious though, they have added so much on screen crap its hard to use comfy on a phone now, ill see if the new app feature works on mobile.
Can you run FP4 on a 9070XT 16gb? Or its just NVFP4 for nvidia gpus?
I have no idea, someone would have to let me know if fp4 works on amd.
I would test, but my last working amd card is a R9-295X2, had to switch to nvidia because cuda/iray, but did always prefer amd/ati over nvidia.. im still bitter from nvidia buying 3dfx, lol.
If it does work it would likely be slow as dirt and violate your ram, even on an nvidia card nvfp4 models did that and locked my system up until I got newer cuda/torch. you can certainly try if you feel lucky but don't get your hopes up
@FloatsYourStoat that sounds to me like it expanded back out to fp8 when trying to run, which yeah, that will wreck your system, as when that typically happens it's fp4 > bf16 > fp8. so yeah, 14gb to 46gb to 28gb, that will eat your ram and harddrive for lunch and still come back for seconds.
the NVFP4 quantization works best with the FP4 tensor cores, which only the Blackwell family of Nvidia has (RTX50..). For any other Nvidia generation, there are going to be extra calculations. But even so, I have a Blackwell GPU and I just tried a NVFP4 quantization of LTX 2.3, and the quality suffered really badly. In general, I think that the fp8-scaled is much better, even in flux2.klein, with which I have been experimenting extensively.
@aferventu807 the extra calculation is literally expand layer to fp8, run, discard. which is faster than hd/ram to gpu on the fly.
and i would be curious about how you ran the nvfp4.
in my testing, on a 4090, didn't see much of a difference between nvfp4 and fp8, other than the fp4 ran almost twice as fast because it can fit on the gpu.
@MrReclusive666 talking with chatgpt about it and from my experiments, the CPU, the PCIe generation and the latency which comes with this, matters. For me, even if I have a Blackwell GPU, I have an old I9-9900 CPU and PCIE-3. So I guess, you can't put everything under a hood, concerning speed. There are a lot of parameters. But one is for sure, that NVFP4 is premature and it is quantized for the fp4 tensors of Blackwell, every other generation would have the overhead of calculations to convert it to fp8, if they even support it.
Anyways, it's just trial and error. See what fits you and has the best quality and speed and go with it.
@aferventu807 yeah, its all about your own needs and experience.
i just tend to stand in the corner of fp4/nvfp4 right now because it solves issues with these larger models, like the fact that it allows the model to actually fully fit on 16gb vram.
and your pcie gen3 thing kind of reinforces the idea of letting the model dequantize on the gpu instead of "streaming" fp8 weights from cpu/ram: the dequant overhead on the gpu itself is less time consuming than streaming the model from cpu. but if you're going to stream in either case, yeah, just stick with fp8 or even fp16. but i could also argue the time it takes for the fp4 weights to transfer to gpu would be faster than fp8 because it's half the size.
it's just in my case, it makes more sense to load the full model to gpu and let it dequantize on the fly, but i can do that because of multi gpu, the transformer has a gpu all to itself, vae and gemma sit on other gpu's.
but i have tried the fp8 and didn't see a difference in quality (unless you are using loras, lora's overpower fp4), and fp4 ran faster because well, fp8 was 29gb, my fp4 is 13gb..
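The size argument can be put into rough numbers. A minimal sketch, assuming an effective host-to-GPU bandwidth that you supply yourself: the 12 GB/s figure and the 2 GB headroom below are placeholders, not measurements, and real throughput depends on PCIe generation, lane count, and memory pinning. The 29 GB and 13 GB sizes are the ones quoted in the comment above.

```python
def transfer_seconds(size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Time to move size_gb across a link with the given effective bandwidth."""
    if bandwidth_gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return size_gb / bandwidth_gb_per_s


def fits_in_vram(model_gb: float, vram_gb: float, headroom_gb: float) -> bool:
    """True if the model plus working headroom fits without offloading."""
    return model_gb + headroom_gb <= vram_gb


ASSUMED_BANDWIDTH = 12.0  # GB/s, placeholder value (assumption)
ASSUMED_HEADROOM = 2.0    # GB reserved for activations, placeholder (assumption)

fp8_time = transfer_seconds(29.0, ASSUMED_BANDWIDTH)
fp4_time = transfer_seconds(13.0, ASSUMED_BANDWIDTH)
ratio = fp8_time / fp4_time  # how many times longer the fp8 file takes to move

fp4_fits_16gb = fits_in_vram(13.0, 16.0, ASSUMED_HEADROOM)
fp8_fits_16gb = fits_in_vram(29.0, 16.0, ASSUMED_HEADROOM)
```

Whatever bandwidth you plug in, the ratio stays the same, since it only depends on the two file sizes.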
@MrReclusive666 Oh yes, I can confirm that FP4 has problems with LORAs. In flux2klein, I was using mainly fp4 and I did a comparison with a character LORA I was making. FP4 was deforming the character. I did the comparison with the full precision model and fp8 and it was obvious. Now I am only using fp8.
Does the multi-gpu node work again? I also have a second GPU but for a long time the node didn't work. I will give it a try again for the text encoder. Thanks!
@aferventu807 i tried updating to the official updated multigpu, it works fine, but I went back to an older version and made a change myself: rather than disabling/bypassing "dynamic memory" i made it toggle per node, so i could run clip/text encoder with dynamic memory on one gpu, but leave the transformer without dynamic memory on another (i use dis-torch to control it manually). but the updated one did work, i just wanted control.
and for the lora fp4 thing, yeah, i spent too much time finding a solution to that in my fp4me models, and it wasn't even really an issue with full step generation, it was low step gens, like the 3 step upscale pass. lora's were basically breaking the distillation, not allowing it to run in 3 steps, even though it can with no lora. after a lot of trial and error and testing different layers, different blocks, different weights, etc. it turned out to be simple, just turn off a few blocks and ramp a few others. also, never and i mean NEVER run a lora at 100% on fp4, those bf16 weights just demolish the fp4 weights, so basically you are just running pure lora. i found strengths between 25 and 50 to be ideal (with the ramp).
ltx1 lora's in ltx 2.3 was the odd one, i basically disabled half the lora and ran the other half at 15% and it feels full strength. the last few videos i did, with jinx, that's an ltx1 lora at 15% with half of it disabled, on 2.3 fp4. the ltx2 jinx lora was also present, but even it was only at 25% and mostly just some refinement that wasn't in the ltx1 lora.
@MrReclusive666 To be honest, I don't have much experience with LTX2.3. The last month I was consumed in creating my character LORA with flux2.klein. It was a nightmare but also intriguing, since there is not much experience in the community with flux2klein character training. In the first 2 weeks, I was testing every training with nvfp4 and likeness was low, until I tested the fp8 and BF16, and I realized how much time I had wasted. About the weights you were talking about, I don't think it applies to flux2klein-distilled, since I get more likeness with 1.2. Anything less than 1, goes down the drain, unless you use a reference image as well, which makes inference slower. Now I try to enrich my dataset, using the LORA and reference images of the original dataset.
But back to LTX2.3, my inference timings with fp8 transformer only+distilled LORA, is for every 121 frames (720p) less than a minute (with sage attention and without counting the vae decoding and not using latent upscaler). I am ok with that. I have worked with the model really not that much. Maybe you have some tips about the slow motion and the background music that sometimes is introduced in the clip, without prompting for that. I tried several prompts for a specific video (i2v), also with the help of chatgpt and gemini, but I couldn't get rid of this problem.
@aferventu807 yeah, the music thing, comes up a lot if you prompt "cinematic", but i have found ltx2.3 responds to no's in the prompt (so you can run with cfg 1), so most the time im prompting "cinematic no music" in the prompt, ill also add in things like "just the sound of x and y", but it still happens, it was trained on a lot of movie scenes, so, a lot of music in the training data. and yeah, ltx is fast with low frames, im just the dumbass running 720 frames. and for slowmo, honestly, that mostly comes from not upscaling, run a generation at 720p, (704p, fucking ltx) and then run same gen at half the resolution with same prompt and seed, it will be in correct speed.
hunyuan has the same issue, higher the res, the slower the motion gets, but low res + upscale works because motion exists from the lower stage, its why both ltx and hyv1.5 gave an upscale model, because its the intended workflow, its the dit design, the more time it spends on detail, the less it does with motion. it seems like it would take more time, but in general its faster, 640x352 will run 121 frames in about 15 seconds for 8-9 steps, upscale only needs 3 steps to clean up the upscale, so it runs pretty fast.
and I haven't trained specifically for flux 2 klein, but did train for flux 2 and just use that on flux 2 klein, it works on fp4 fine for me, also don't train against fp4, that i know is a bad idea. best to train against full models, or fp8 if you can't run full bf16
@MrReclusive666 I will try the upscaler, although I had bad results with LTX2.2, picture-quality wise. But, for the same video I was struggling with the slo-mo, I tried a lower resolution just to test it for slo-mo without having to wait for the "704p", and you are actually right, the movement was normal. The video was just a guy walking in the street, nothing complicated.
About klein and fp8 vs fp4, the speed difference is minimal (2-3 secs), and with fp4 for some reason my CPU works overtime. According to chatgpt, this happens because the cpu has to "orchestrate the quantization" or something like that, so it doesn't really matter. Was your flux2.dev LORA, which you used with klein, a character LORA?
@aferventu807 yes, it does use the cpu a lot because of the dequant. in one of my experimental samplers, which I made to turn LTX into an moe, I've removed that cpu hit, it's all handled by the gpu.
also, the dequant goes from fp4 to bf16 normally, im playing with fp4 to fp8 to remove some of the overhead (not an issue for Blackwell as it natively runs fp4)
i also need to make a save video node that actually uses the gpu, lol, always hated how all the video creation nodes are cpu bound, fine for short clips, but once your pushing 30 seconds that can take a bit, but it would be almost instant on gpu.
@MrReclusive666 I have a Blackwell and it still hit the CPU. It may have to do with Comfy memory management or just that fp4 is just still immature. Maybe it is just an immature format from NVIDIA to say "we got you all you suckers with Blackwell". LOL
@aferventu807 yeah, really shouldn't be hitting cpu like that on a blackwell, it should natively run in fp4, so no dequant.
but that memory management thing they added in comfy is broke as crap, i run with it off. that is likely messing with it, as comfy doesn't actually support fp4, it's kind of shoehorned in by comfy kitchen.
@MrReclusive666 I also have problems with my flux-fill workflow, which I use very often and which is really underrated by the community. Comfy is messing everything up with every update, that is why I have 3 different versions of comfy portable installed.
@aferventu807 Out of curiosity, are you up to date? most importantly cuda 130+? nvfp4 support requires cuda 130 to function properly as a minimum, the comfyui logs will complain about it on launch if you don't have it
@FloatsYourStoat thanks for asking. I work with the latest pytorch 2.11 and cu130, otherwise comfy-kitchen wouldn't work and I wouldn't be able to use nvfp4. I also compile my own wheels for sage-attention and flash-attention especially for sm120 (Blackwell architecture).
so what would the difference be between this and a gguf? since both are quantized? these are slightly bigger than ggufs as well.
this vs gguf: this is still the full model, those smaller gguf's lose a lot of quality as they get smaller. but mainly, fp4's are for people trying to avoid gguf swap, which is why i made a 14gb version. on 24gb vram you can have the full model on gpu while doing 20 second 1280x720 generations, so it runs faster.
but for those fine with gguf's, use gguf's. i use fp4 to avoid swap, but i have a multi gpu setup, so vae has its own gpu, clip/gemma3 has its own gpu, and transformers have their own gpu. that's why i can do 1536x768 20 seconds in 2 and a half minutes on a 4090, because transformer has a 4090, gemma/clip on a 4080, vae on another 4090.
i can't run that fast on gguf, specially with lora's because gguf wont ingest the lora like fp4 will, it has to actively swap, in fp4 the lora just becomes part of the model like in fp8/fp16/bf16.
but, also, if you have a 5xxx series card, these are supposed to be like twice as fast, but I wouldn't know.
@MrReclusive666 i tried the extreme distill and dev on my 4070. and surprisingly they are much faster than the GGUFs. great work on these models
@brotherel thanks, my entire reason for these is to avoid or drastically lower data swapping while running, and even if there is data swap, that data is compressed so, faster swaps.
gguf's should be comparable in speed, up until you add a lora, then gguf's slow down.
The quality is really bad on my output. Looks like it needed some more steps. But idk where to change the steps in LTX workflows^^
there is a lot of weird things i've noticed in ltx2.3, upscale breaks if you have lora's, and steps/sigmas are very fickle. and its not just the fp4's, i tried on fp8 as well, i do believe lora's are fine on gguf's though, as gguf's don't actually inject the lora.
most likely the workflow you have, isn't just doing "step" count, its custom sigmas.
@MrReclusive666 I tried your workflow but i find it very confusing
@1015 lol, yeah, i run multigpu's, so, not ideal for most people. Once im done tweaking my flow, i may make a single gpu version and share it. i finally, as of about an hour ago, fixed my issue with loras, required a custom node, but it's working with 2 stage upscale.
LTX usually has a comma-separated list of values (CSV), the values are the sigmas (e.g. 0.123, 4.567, 7.890). The number of separate numbers (three in my example) is the number of 'steps', i.e. the point at which it starts over. you should just keep the numbers that are set, unless you change to a different sigma/scheduler input, in which case it will have to be LTX-aware, I think. I tried one and it was OK (can't remember which workflow it was, sorry). But the default one, with the CSV numbers, was just as good. so it's probably not the steps, but happy to be corrected.
@Cochese9000 yeah, most people should be running custom sigma's, but you can get decent results with beta11/sgm uniform/ays 30+.
but yes, once you get your workflow going, always custom sigma.
also, watch out for differences in sigma nodes.
your sigma's would explode mine, lol, different node perhaps.
here's my first stage, anything above 1 is just a wasted step as 1 is full noise generation in the sampler.
1, 0.925, 0.909375, 0.875, 0.725, 0.421875, 0.0
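A small sketch of how a comma-separated sigma list like the one above can be sanity-checked before pasting it into a workflow. It assumes the common sampler convention that N sigmas describe N - 1 denoising steps (each step moves from one sigma to the next); how a particular LTX sigma node counts them is not confirmed here. `parse_sigmas` is a hypothetical helper, not a ComfyUI function.

```python
def parse_sigmas(text: str) -> list:
    """Parse a comma-separated sigma string into floats and validate it."""
    sigmas = [float(part) for part in text.split(",") if part.strip()]
    if len(sigmas) < 2:
        raise ValueError("need at least two sigmas to define one step")
    # Per the comment above, 1 is full noise, so values above 1 are wasted.
    if any(s > 1.0 or s < 0.0 for s in sigmas):
        raise ValueError("sigmas are expected in [0, 1]")
    # Each value must be lower than the one before it.
    if any(later >= earlier for earlier, later in zip(sigmas, sigmas[1:])):
        raise ValueError("sigmas must strictly decrease")
    return sigmas


first_stage = parse_sigmas("1, 0.925, 0.909375, 0.875, 0.725, 0.421875, 0.0")
num_steps = len(first_stage) - 1  # 7 values -> 6 steps under this convention
```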
Please tell me what to download for the RTX 5060ti?
how much vram does your card have ?
if it's the 16gb model, the extreme's should work, but not much headroom. if 8gb.. yeah, not sure how any of these would run on 8gb. you could do it, but you would rely heavily on dynamic vram or a huge offset in distorch.
@MrReclusive666 I'm afraid you're wrong. My 16GB video card renders a 10-second 1920x1080 video in about 3-4 minutes. The 23GB model. A good workflow makes all the difference.
@Renessance okay.. how was i wrong? i said it could run on 16gb...
when i reference things like head room and all that, I am referring to operating without any offset/swapping, because yes, my goal, for my setups, is no swapping of model while running, but it's also why I can do 1920x1088 10 seconds in a little over a minute.. (1080 isn't possible in ltx btw). i regularly push 1920x960 @ 30 seconds in less than 5 minutes. yes, a good workflow makes a lot of difference.
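The odd heights in this thread (704 instead of 720, 1088 instead of 1080) are consistent with dimensions being multiples of 32. A minimal sketch under that assumption: the divisor is an inference from the numbers quoted here, not something stated by the author, and `snap` is a hypothetical helper.

```python
def snap(value: int, multiple: int = 32, mode: str = "nearest") -> int:
    """Snap a pixel dimension to a multiple (assumed 32); ties round down."""
    if value <= 0 or multiple <= 0:
        raise ValueError("value and multiple must be positive")
    down = (value // multiple) * multiple
    up = down if down == value else down + multiple
    if mode == "down":
        return max(down, multiple)  # never return 0
    if mode == "up":
        return up
    if mode == "nearest":
        return up if (up - value) < (value - down) else max(down, multiple)
    raise ValueError(f"unknown mode: {mode}")


height_720 = snap(720)                 # 704 (tie between 704 and 736, rounds down)
height_720_up = snap(720, mode="up")   # 736
height_1080 = snap(1080)               # 1088
width_1920 = snap(1920)                # already a multiple of 32
```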
@MrReclusive666 Perhaps your phrase was poorly translated by Google Translate. I wrote you a response from the translator.
@Renessance Okay, My apologies, i use google translate myself for work, so i get it.
basically, use the extreme variants, they are smaller, so even if you are data swapping, they should be faster because the blocks are smaller.
@EndlessShadow nope...
https://www.techpowerup.com/gpu-specs/geforce-rtx-5060-ti-8-gb.c4246
also, remember the 3x series, god that was a mess, nvidia has no idea how to build budget gpu's
@MrReclusive666 Oh... That's good I didn't buy one. Thought they were all 16gb if they were the "ti".
@EndlessShadow no, but the ti is still "better" even if 8gb, faster vram, higher cuda core count, but that 8gb really hurts when doing anything needing a lot of vram. it was like the 3x series, they had the 12gb 3060 that was beat in every benchmark by the 8gb 3070, unless you needed that extra 4gb. it's not all about vram, there are a lot of differences between the model numbers, and the ti always kind of falls in a weird place between 2 versions, generally made from failed chips from the higher end models.
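1280*736, 10s, 60fps, 5060ti-16g: feels like it takes at least half an hour to finish, but what I don't get is why it runs at all. with wan2.2 ani these settings basically just throw an error. the ltx model isn't small either, yet it didn't oom, really amazing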