Experimental conversion of the SDXL architecture to a 16-channel latent space
This is an experimental pretrain on top of Rouwei-0.8 that works in a 16-channel latent space and uses the Flux AE.
Goals:
Achieve better detail while keeping compute requirements low and preserving all existing knowledge and performance
Enable joint sampling with Flux/Chroma/Lumina and other models that share the same latent space
Current state:
Early alpha version, and it is pretty raw. Images may contain extra noise and artefacts in small details; the severity varies from negligible to significant. Upscaling, samplers/schedulers, styles, and even the prompt affect it.
Using GAN upscale models in pixel space instead of latent upscaling gives much smoother results; bumping the base resolution higher helps too.
Epsilon prediction for now; it can be converted to vpred or anything else in the future.
Usage:
ComfyUI
Workflow example (or just pick any image from the showcase)
Download the checkpoint (FP32 and UNet-only versions can be found in the HF repo)
Download these nodes (or just use "install missing nodes" using Comfy Manager)
Use the "SDXL 16ch loader" node to load it, then work just as you would with SDXL
DO NOT REMOVE the "Latent multiply" NODES: latents should be scaled before and after processing, just as in regular SDXL inference. This step just isn't hidden yet.
If you get the error "mat1 and mat2 shapes cannot be multiplied (_x16 and 4x3)", disable the preview option for the KSampler. It happens because the preview uses the TAESD VAE, which is designed for 4-channel latents.
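The shape mismatch behind this error can be reproduced without any model. This is a minimal sketch, assuming the preview flattens the latent to (pixels, channels) and multiplies it by a (4, 3) latent-to-RGB projection. The helper name `matmul_shape` is hypothetical and not part of ComfyUI, and the pixel count is arbitrary.

```python
def matmul_shape(a, b):
    """Return the result shape of a 2-D matrix product, or raise on a mismatch."""
    (rows, inner_a), (inner_b, cols) = a, b
    if inner_a != inner_b:
        raise ValueError(
            f"mat1 and mat2 shapes cannot be multiplied "
            f"({rows}x{inner_a} and {inner_b}x{cols})"
        )
    return (rows, cols)


# Assumed preview projection: 4 latent channels -> 3 RGB channels.
PREVIEW_PROJECTION = (4, 3)

# A 4-channel latent flattened to (pixels, channels) multiplies fine.
print(matmul_shape((15808, 4), PREVIEW_PROJECTION))

# A 16-channel latent does not: the inner dimensions (16 vs 4) disagree.
try:
    matmul_shape((15808, 16), PREVIEW_PROJECTION)
except ValueError as err:
    print(err)
```

Sampling itself is not affected; only the 4-channel preview path fails, which is why turning the preview off is enough.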
Other UIs
Since the main differences are just the tensor shapes, the VAE used, and the latent scaling factor, it should be easy to add support to any other UI.
LoRA adapters, ControlNet, IP-Adapters, and other things are untested.
Joint sampling:
Since the model operates in a 16-channel latent space similar to Flux, Chroma, Lumina-image and some others, you can implement complex workflows (if you have enough memory). This lets you use all of RouWei's knowledge of characters, styles, and concepts along with the performance of bigger models.
Here is an example workflow. Using just a few (1..4) steps from Flux, you create a rough basic composition. Then the latents go to the 16-channel SDXL model, where they are denoised (skipping the initial high-noise timesteps).
It is the simplest approach: since you don't need to reconvert latents through a series of VAEs or adapters, you can change models on every denoising step without any performance impact.
Just don't forget to apply Latent multiply nodes between transitions.
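The handoff described above can be sketched as a simple step plan. This is only an illustration: it assumes both models share one step count, which a real workflow need not do, and the function name `plan_handoff` and the model labels are hypothetical. The actual multiplier values for the rescale are not given here and are left out.

```python
def plan_handoff(total_steps, flux_steps):
    """Assign each denoising step to a model for a two-stage workflow."""
    if not 0 < flux_steps < total_steps:
        raise ValueError("flux_steps must be between 1 and total_steps - 1")
    plan = []
    for step in range(total_steps):
        model = "flux" if step < flux_steps else "sdxl_16ch"
        # Latents must be rescaled at the point where the active model changes.
        rescale = step == flux_steps
        plan.append({"step": step, "model": model, "rescale_before": rescale})
    return plan


# Two rough composition steps on Flux, the rest on the 16-channel SDXL model.
for entry in plan_handoff(total_steps=8, flux_steps=2):
    print(entry)
```

The second model starts partway through the schedule, which corresponds to skipping the initial high-noise timesteps.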
How it's made
Basically, there are no changes to the default architecture: the input and output layers are just re-initialized to the new size, then the model is trained with gradual unfreezing of blocks towards the middle.
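One possible reading of "gradual unfreezing towards the middle" is sketched below. The block names follow the diffusers SDXL UNet layout, and the pairing of outer blocks into stages is an assumption; the actual schedule used for this release is not described.

```python
def unfreeze_schedule(blocks):
    """Group blocks into stages, starting at both ends and moving to the middle."""
    stages = []
    lo, hi = 0, len(blocks) - 1
    while lo <= hi:
        stage = [blocks[lo]] if lo == hi else [blocks[lo], blocks[hi]]
        stages.append(stage)
        lo += 1
        hi -= 1
    return stages


# Block names as in the diffusers SDXL UNet (an assumption for illustration).
SDXL_BLOCKS = [
    "conv_in",
    "down_blocks.0", "down_blocks.1", "down_blocks.2",
    "mid_block",
    "up_blocks.0", "up_blocks.1", "up_blocks.2",
    "conv_out",
]

trainable = []
for stage_index, stage in enumerate(unfreeze_schedule(SDXL_BLOCKS)):
    trainable.extend(stage)  # blocks unfrozen earlier stay trainable
    print(stage_index, trainable)
```

The first stage trains only the re-initialized `conv_in` and `conv_out`, so the pretrained inner blocks are not disturbed while the new layers are still random.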
The default SDXL latent scale factor of 0.13025 doesn't work well here; 0.6 is used for this release.
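A minimal sketch of where the scale factor enters, assuming a purely multiplicative scale applied the same way as in regular SDXL inference (multiply after VAE encoding, divide before VAE decoding). The latent values are made up and the function names are hypothetical.

```python
import math

SDXL_SCALE = 0.13025   # default SDXL latent scale factor
RELEASE_SCALE = 0.6    # scale factor used for this release


def scale_latents(raw, factor=RELEASE_SCALE):
    """Scale raw VAE latents before they enter the UNet."""
    return [value * factor for value in raw]


def unscale_latents(scaled, factor=RELEASE_SCALE):
    """Undo the scaling before the latents are decoded by the VAE."""
    return [value / factor for value in scaled]


raw = [-2.5, -0.4, 0.0, 1.2, 3.1]  # made-up latent values for illustration
scaled = scale_latents(raw)
restored = unscale_latents(scaled)

# Scaling and unscaling with the same factor is a round trip.
assert all(math.isclose(a, b) for a, b in zip(raw, restored))

# How much larger the UNet inputs are compared with the SDXL default.
print(RELEASE_SCALE / SDXL_SCALE)
```

With the same raw latents, the UNet sees values roughly 4.6 times larger than it would with the SDXL default, so mixing up the two factors on either side changes what the model receives.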
This is not the optimal approach. Modifying the outer layers of the model, instead of using them 'as is', should bring improvements in the future. If you have any thoughts or ideas about it, please share them.
Training:
To train it (in the current version), all you need to do is change the number of in/out channels in the UNet config and set the scale factor to 0.6 instead of 0.13025. You should probably also check that the VAE part works properly.
(Code examples later)
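Until that code is published, the config change can be sketched as follows. This is not the author's trainer code: the `in_channels`, `out_channels` and `sample_size` keys follow the diffusers UNet config, the dict is a trimmed-down stand-in for a full SDXL config, and `vae_scale_factor` is a hypothetical setting name that will differ between trainers.

```python
import copy


def to_16ch_config(unet_config, channels=16):
    """Return a copy of a UNet config with in/out channels changed."""
    patched = copy.deepcopy(unet_config)
    patched["in_channels"] = channels
    patched["out_channels"] = channels
    return patched


# Trimmed-down stand-in for an SDXL UNet config (key names follow diffusers).
sdxl_unet_config = {"in_channels": 4, "out_channels": 4, "sample_size": 128}

unet_16ch_config = to_16ch_config(sdxl_unet_config)

# Hypothetical trainer setting; 0.6 replaces the SDXL default of 0.13025.
training_settings = {"vae_scale_factor": 0.6}

print(unet_16ch_config)
print(training_settings)
```

The latent creation code in a trainer still has to produce 16-channel latents from the Flux AE, which this sketch does not cover.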
I'm willing to help/cooperate:
Join the Discord server, where you can share your thoughts, proposals, requests, etc. Write to me directly here or DM me on Discord.
Thanks:
Part of the training was performed on Google TPUs and sponsored by OpenRoot-Compute
Personal: NeuroSenko
And many thanks to all fellow brothers who supported me before.
Donations:
BTC bc1qwv83ggq8rvv07uk6dv4njs0j3yygj3aax4wg6c
ETH/USDT(e) 0x04C8a749F49aE8a56CB84cF0C99CD9E92eDB17db
XMR 47F7JAyKP8tMBtzwxpoZsUVB8wzg2VrbtDKBice9FAS1FikbHEXXPof4PAb42CQ5ch8p8Hs4RvJuzPHDtaVSdQzD6ZbA5TZ
License:
Same viral license as for the Illustrious base.
Description
First release
Comments (18)
Always appreciate people trying the things nobody else has the patience or know-how to do. I'll be watching this experiment with great interest.
Awesome work! 👏
HELP PLZ! I'm using Flux.1_Krea_Dev FP8 SCALED for the basic latent, but when it gets to rouwei16channel's KSampler, an error occurs: "mat1 and mat2 shapes cannot be multiplied (15808x16 and 4x3)"
It looks like an issue related to loading the 16ch checkpoint. Does a regular workflow without Flux work, or does it give the same error?
The issue comes from the preview option for the KSampler, because it tries to use the TAESD VAE designed for 4-channel latents. Just turn off the preview option.
It will be necessary to perform full retraining for this architecture to be successful (or equivalent finetuning). But, I believe this is a good future, with balance between SDXL and FLUX.
It actually is a full retraining to a different latent space, just an early release. Compute requirements for it are way higher than for something like a vpred conversion. The heaviest part is done; now it needs some minor changes in the outer layers and polishing.
But this is just another piece of the puzzle (like the TE replacement) for a future large training. I don't think it makes sense to spend significant money and time on yet another SDXL tune. At the same time, training new DiT-based models looks like a dark forest due to reasons. So fixing the main issues of SDXL and training a modified architecture seems to be a good option. Even if we see new-gen models developed for anime art, it will still be useful in joint workflows due to its very high inference speed and style flexibility.
This along with Rouwei-T5Gemma is incredibly fascinating and interesting - Any plans to incorporate them both in the same model? Make a proper rouwei SDXL-Flux? :D
Yes, these are experiments to be implemented together in a future model. There may also be a replacement of a few UNet blocks with DiT, but only a very small part, to keep inference speed high and hardware requirements low.
@Minthybasis currently playing around with rouwei 16 channel using t5gemma alone, clips alone and t5gemma and clips in concat comparing results. atm concat of both clips and t5gemma seems to produce the most coherent result. this also goes for using other models with t5gemma; concat with clips produces better results.
Also getting intermittent runtime errors with rouwei-16channel, it seems like it's a toss-up for me if a workflow will run initially or not; running the same workflow without making any changes at all sometimes it will work, other times it won't. That one I don't understand.
Also I tried using the unet model alone but for some reason I couldn't get clips to produce anything besides noise so I had to get full model and load clips from there; is the full model not simply flux vae and sdxl clips (g and l) baked in? I've been using external flux vae and that seems to work just fine.
@Minthybasis Well, the error seems like it very well could be related to preview; but when it does run with preview on, it will produce something and the progress can be followed in the preview. I dunno..
Using my own clips with the u-net only is where it seems to run, but only noise is produced; it could be there's an issue with my clips or how I loaded them. I'll just use the full model with baked in clips it's no issue.
But yeah I would love to show you some of the results, especially the clips vs t5gemma vs clips+t5gemma. If I could send the pictures the workflow is embedded in the metadata.
@Phatcat Yes, if it produces only noise, it can be related to the clips. Do they work with the 4-channel version?
You can upload pictures with workflow to any image hosting that doesn't cut metadata, for example catbox.moe
@Minthybasis No.. They only work with sdxl based models apparently; so anything illustrious based will result in garbage output.. Apparently the clips are not quite the same..
Somewhere in the pipeline it seems to have had the clips unfrozen during training, perhaps even pre-illustrious.
On a related note, NoobAI is suffering from broken clips and there's a write-up on it you may or may not have seen:
https://www.reddit.com/r/StableDiffusion/comments/1o1u2zm/text_encoders_in_noobai_are_dramatically_flawed_a/
https://www.reddit.com/r/StableDiffusion/comments/1o25x9t/text_encoders_in_noobai_are_part_2/
Thank you for sharing your work.
How do I train a lora for this model?
A few edits need to be made in the trainer code:
1. Change the channel count from 4 to 16 in the model config
2. Change VaeScaleFactor from 0.13025 to 0.6
3. Adjust the part of the code that is related to latent creation.
The first two can be done pretty easily and are enough if you're using precomputed latents. The last part is more complicated; I'm going to upload some code on the weekend or after it.
MinthyBased does it again!
@Minthybasis Hey.. So... Flux 2 VAE is out..
So is Z-Image-Turbo, with more Z-Image models to come.
What does that mean for this project (16-channel; rouwei with flux vae), the t5gemma as sdxl encoder project and rouwei in general?
You moving rouwei to zit? Moving focus from flux 1 vae to flux 2 vae? Just keep working on getting rouwei-sdxl to play nice with flux 1 vae? Or something completely different?
Doesn't work well. I thought it was at least 50% standalone












