SDXL 4 Step (FP32 with Improved UNET)
Note: I used the 32-bit CLIP-G
The refiner used in the workflow is also a 32-bit GGUF
Updated CLIP-L
For 4-step use, run at CFG 1.0 - load an image to pull in the workflow (a minimal diffusers sketch also follows this section)
For NSFW images, the refiner should not be used
BRSGAN 2x can be found on Google Drive
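For reference, here is a minimal sketch of the 4-step / CFG 1.0 usage via diffusers rather than the ComfyUI workflow. The checkpoint filename and prompt are placeholders, not the author's exact setup, and a 4-step model will generally also want the sampler/scheduler the posted workflow specifies.

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Placeholder filename for the 4GB FP8 single-file checkpoint
pipe = StableDiffusionXLPipeline.from_single_file(
    "sdxl_4step_fp8.safetensors",
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    "photo of a lighthouse at sunset",  # example prompt
    num_inference_steps=4,              # 4-step model
    guidance_scale=1.0,                 # CFG 1.0 as noted above
).images[0]
image.save("out.png")
```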
2GB GGUF from FP32
Note: the 2GB version requires a separate CLIP and GGUF support; the 4GB FP8 is ready to use in any SD GUI (a GGUF inspection sketch follows this list)
Refined with baked-in FP32 LoRAs
Quantized from the FP32 SDXL model for less loss
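To confirm what is actually inside the 2GB file (quantized UNET tensors only, hence the separate CLIP), the gguf-py package from llama.cpp can read the headers. The filename below is a placeholder, not the actual download name.

```python
from gguf import GGUFReader  # pip install gguf

reader = GGUFReader("sdxl_4step_unet.gguf")  # placeholder path
for t in reader.tensors[:8]:
    # tensor name, quantization type, and shape
    print(t.name, t.tensor_type.name, list(t.shape))
```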
4GB SDXL (Full Checkpoint)
The custom CLIP is not quantized
The custom UNET is quantized to FP8, allowing a balance of size and quality (see the sketch after this list)
Works in Forge, ComfyUI, and Automatic1111
Works with LoRAs
Beta/DEIS is a good choice for img2img upscaling
Both models have improved (uncensored) female anatomy; however, the GGUF version does not do well with males.
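To make the FP8 trade-off concrete, here is a small stand-alone PyTorch sketch (not this checkpoint's code, PyTorch 2.1+ assumed): weights are stored as float8_e4m3fn at a quarter of the FP32 bytes, then upcast for the matmul - which matches the comment below that current PyTorch paths still compute in FP16/BF16.

```python
import torch

w = torch.randn(1024, 1024)                    # FP32 "weights" for illustration
w_fp8 = w.to(torch.float8_e4m3fn)              # stored form: 1 byte per value
print(w.element_size(), w_fp8.element_size())  # 4 bytes vs 1 byte per element

x = torch.randn(1, 1024, dtype=torch.bfloat16)
y = x @ w_fp8.to(torch.bfloat16)               # upcast to BF16 at compute time
print(y.shape)
```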
Comments
Interesting work, first Pony and now SDXL at 4 GB. I can't wait to try your 4-step versions of these models and draw various results. Good work man, appreciate the hard work and effort
The Step Models rely on timestep schedulers; given the LoRA is less than 400MB and is intended to work with LCM, I am not sure it would merge in well.
Can someone help me understand the goal/value/benefit here? I truly don't understand the significance of the 4GB or the added written detail. Thanks!
@moocoop The benefit would apply to 6GB 2060 and 3050 users, and to users who keep multiple models loaded into VRAM, such as PONY + XL or PONY + FLUX, and have to watch VRAM usage
What is this sorcery?
The type that uses 2 bits of mantissa precision compared to 23 bits in the original FP32 model. The sorcery is that they can predict what the number would have been with any measure of accuracy, thanks to graphing.
@Felldude What else did you do... Did you take the original SDXL model and tweak it, use a finetuned or merged model, or train your own images on top of something? Because the results are impressive.
@punkbuzter340 The CLIP and UNET have both been modified; it has multiple FP8 trainings baked in
I don't understand why I'd use the FP8 model: with the --medvram argument it is not loaded into memory, and without that argument it runs about 6 times slower for me than the classic FP16 with --medvram. Was the original idea to work with insufficient video memory?
It was to enable those with 6GB RTX cards to fit the model into VRAM, and possibly IPEX on integrated Intel GPUs - or those feeding into FLUX: you might be able to fit the 4GB model alongside an NF4 version of FLUX on a 16GB card
Only a 4090 has accelerated FP8 attention, and PyTorch doesn't support it yet, so we still upcast to FP16 or BF16
@Felldude I understand, thanks for the explanation
As far as I know, FP8 isn't really that good on some hardware, maybe somewhat older hardware



