*Important: please read this before using the model. It is very experimental, as I am still trying to find the optimal settings.
You can download the CLIP text encoder here: https://huggingface.co/suzushi/miso-diffusion-m-beta
Miso Diffusion M (Beta) is an attempt to fine-tune Stable Diffusion 3.5 Medium on an anime dataset. In ComfyUI it uses as little as 2.4 GB of VRAM without the T5 text encoder. This version is a step up from the previous version (Alpha): it was trained on 160k images for 3 epochs to see how the model adapts to anime.
Recommended settings: Euler sampler, CFG 5, 28-40 steps. DPM++ 2M also works, but I haven't tested it much. Prompt with Danbooru-style tags. I recommend simply generating with a batch size of 4 to 8 and picking the best result.
Quality tag
Masterpiece, Perfect Quality, High quality, Normal Quality, Low quality
Aesthetic tag
Very Aesthetic, aesthetic
Pleasent tag
Very pleasent, pleasent, unpleasent
Additional tag: high resolution, elegant
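As a rough illustration of how the tag groups above might be combined into a Danbooru-style prompt, here is a small Python sketch. The helper name `build_prompt`, the `SETTINGS` keys, and the choice to put the quality, aesthetic, and pleasent tags first are my own assumptions, not something the model requires. The tag spellings are copied exactly as listed above, since they are presumably the literal strings seen during training.

```python
# Tag vocabularies copied verbatim from the lists above (spelling kept as-is).
QUALITY_TAGS = ["Masterpiece", "Perfect Quality", "High quality", "Normal Quality", "Low quality"]
AESTHETIC_TAGS = ["Very Aesthetic", "aesthetic"]
PLEASENT_TAGS = ["Very pleasent", "pleasent", "unpleasent"]
ADDITIONAL_TAGS = ["high resolution", "elegant"]

# Recommended generation settings from this page; the key names are hypothetical
# and do not correspond to any specific UI or library.
SETTINGS = {"sampler": "euler", "cfg": 5.0, "steps": 28, "batch_size": 4}


def build_prompt(subject_tags, quality="Masterpiece", aesthetic="Very Aesthetic",
                 pleasent="Very pleasent", additional=()):
    """Join one tag from each group with the subject tags, comma-separated.

    Placing the quality-style tags first is an assumption, not a documented rule.
    """
    if quality not in QUALITY_TAGS:
        raise ValueError(f"unknown quality tag: {quality!r}")
    if aesthetic not in AESTHETIC_TAGS:
        raise ValueError(f"unknown aesthetic tag: {aesthetic!r}")
    if pleasent not in PLEASENT_TAGS:
        raise ValueError(f"unknown pleasent tag: {pleasent!r}")
    for tag in additional:
        if tag not in ADDITIONAL_TAGS:
            raise ValueError(f"unknown additional tag: {tag!r}")
    return ", ".join([quality, aesthetic, pleasent, *subject_tags, *additional])


if __name__ == "__main__":
    print(build_prompt(["1girl", "solo", "looking at viewer"], additional=["elegant"]))
    print(SETTINGS)
```

The printed string can be pasted into the positive prompt of whatever front end you use; the settings dictionary is only a reminder of the recommended values.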
Training was done on a GH200. This time I switched the LR scheduler to cosine.
Training settings: Adafactor with a batch size of 40, lr_scheduler: cosine
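To give a feel for what these numbers imply, the sketch below estimates the number of optimizer steps and shows the shape of a plain cosine decay. It assumes exactly 160,000 images, an effective batch size of 40 with no gradient accumulation, and no warmup. The base learning rate is not stated on this page, so `base_lr` here is a placeholder, and the trainer's actual scheduler may differ in detail.

```python
import math


def total_steps(num_images, batch_size, epochs):
    """Optimizer steps, assuming one step per batch and a partial last batch per epoch."""
    return math.ceil(num_images / batch_size) * epochs


def cosine_lr(step, total, base_lr, min_lr=0.0):
    """Plain cosine decay from base_lr at step 0 to min_lr at the final step."""
    progress = min(max(step / total, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    steps = total_steps(160_000, 40, 3)  # 12000 under the assumptions above
    base_lr = 1e-4  # placeholder value, NOT the learning rate used for this model
    for s in (0, steps // 4, steps // 2, steps):
        print(s, cosine_lr(s, steps, base_lr))
```

With a cosine schedule the learning rate falls slowly at first, fastest around the midpoint, and flattens out again near the end of training.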
SD3.5-specific settings:
enable_scaled_pos_embed = true
pos_emb_random_crop_rate = 0.2
weighting_scheme = "flow"
Train Clip: true, Train t5xxl: false