In-depth retraining of Illustrious to achieve the best prompt adherence, knowledge and state-of-the-art performance.
Big dreams come true
The version number is just an index of current final release, not a fraction of the planned training.
Large-scale finetune using a GPU cluster with a dataset of ~13M pictures (~4M with natural text captions)
Fresh and vast knowledge about characters, concepts, styles, culture and related things
The best prompt adherence among SDXL anime models at the moment of release
Solved the main problems with tag bleeding and biases that are common for Illustrious, NoobAI and other checkpoints
Excellent aesthetics and knowledge across a wide range of styles (over 50,000 artists (examples), including hundreds of unique cherry-picked datasets from private galleries, including those received from the artists themselves)
High flexibility and variety without stability tradeoff
No more annoying watermarks for popular styles thanks to a clean dataset
Vibrant colors and smooth gradients without a trace of burning, full range even with epsilon
Pure training from Illustrious v0.1 without involving third-party checkpoints, Loras, tweakers, etc.
There are also some issues and changes compared to the previous version, please RTFM.
Dataset cut-off - end of April 2025.
Features and prompting:
Important change:
When you are prompting artist styles, especially mixing several, their tags MUST BE in a separate CLIP chunk. Just add BREAK after it (for A1111 and derivatives), use conditioning concat node (for Comfy) or at least put them in the very end. Otherwise, significant degradation of results is likely.
Basic:
The checkpoint works with both short, simple prompts and long, complex ones. However, if the prompt contains contradictory or weird things, they won't be ignored as with other models: they will affect the output. No guardrails, no safeguards, no lobotomy.
Just prompt what you want to see and don't prompt what shouldn't be in the picture. If you want a view from above, don't put ceiling into the positive prompt; if you want a cropped view with the head out of frame, don't give a detailed description of the character's facial features, and so on. Pretty simple, but sometimes people miss it.
Version 0.8 comes with advanced understanding of natural text prompts. It doesn't mean that you are obligated to use it, tags only - completely fine, especially because understanding of tags combinations is also improved.
Do not expect it to perform like Flux or other models based on T5 or LLM text encoders. The whole SDXL checkpoint is smaller than that text encoder alone; in addition, illustrious-v0.1, which is used as the base, has completely forgotten a lot of general things from vanilla sdxl-base.
However, even in its current state it works much better and allows you to do new things that are usually impossible without external guidance, as well as making manual editing, inpainting, etc. more convenient.
To achieve the best performance you should keep track of CLIP chunks. In SDXL the prompt is split into chunks of 75 tokens (77 including BOS and EOS) that are processed by CLIP separately and only then concatenated and passed as conditioning to the unet.
If you want to specify some features for a character/object and separate them from other parts of the prompt, make sure they are in the same chunk and optionally separate it with BREAK. It will not completely solve the problem of traits mixing, but it can reduce it and improve overall understanding, since the text encoders in RouWei are better than others at processing the whole sequence rather than individual concepts.
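The chunking described above can be sketched in a few lines. This is only an illustration that works on already-tokenized ids: the BOS/EOS values are the usual CLIP special-token ids and are an assumption here (check your tokenizer), the padding value differs between UIs and encoders, and the function names are hypothetical.

```python
from typing import List

# Assumed CLIP special-token ids; verify against the tokenizer you actually use.
BOS_ID = 49406
EOS_ID = 49407
CHUNK_SIZE = 75  # prompt tokens per chunk, 77 with BOS and EOS


def split_into_chunks(token_ids: List[int], pad_id: int = EOS_ID) -> List[List[int]]:
    """Split prompt token ids into 77-token CLIP inputs (BOS + up to 75 tokens + EOS + padding)."""
    chunks = []
    # An empty prompt still yields one fully padded chunk.
    for start in range(0, max(len(token_ids), 1), CHUNK_SIZE):
        body = token_ids[start:start + CHUNK_SIZE]
        padding = [pad_id] * (CHUNK_SIZE - len(body))
        chunks.append([BOS_ID] + body + [EOS_ID] + padding)
    return chunks


def split_with_break(segments: List[List[int]]) -> List[List[int]]:
    """Treat each BREAK-separated segment independently, so every segment starts a fresh chunk."""
    chunks = []
    for segment in segments:
        chunks.extend(split_into_chunks(segment))
    return chunks
```

A 100-token prompt becomes two chunks here, and the last 25 tokens are encoded without seeing the first 75. This is why related tags are better kept inside one chunk.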
The dataset contains only booru-style tags and natural text expressions. Despite having a share of furries, real-life photos, western media, etc., all captions have been converted to classic booru style to avoid a number of problems from mixing different systems. So e621 tags won't be understood properly.
Sampling parameters:
~1 megapixel for txt2img, any AR with resolution multiple of 32 (1024x1024, 1056x, 1152x, 1216x832,...). Euler_a, 20..28 steps.
CFG: for epsilon version 4..9 (7 is best), for vpred version, 3..5
Sigmas multiply may improve results a bit, CFG++ samplers work fine. LCM/PCM/DMD/... and exotic samplers untested.
Some schedulers don't work well.
Highresfix - x1.5 latent + denoise 0.6 or any gan + denoise 0.3..0.55.
For vpred version lower CFG 3..5 is needed!
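As a rough illustration of the resolution advice above, here is a small helper that picks a size near 1 megapixel with both sides divisible by 32, plus a x1.5 highres-fix size. The function names are hypothetical, and rounding the upscaled size to a multiple of 32 is an assumption of this sketch, not a stated requirement.

```python
def snap(value: float, multiple: int = 32) -> int:
    """Round to the nearest multiple, never going below one multiple."""
    return max(multiple, int(round(value / multiple)) * multiple)


def resolution_for(aspect_ratio: float, target_pixels: int = 1024 * 1024, multiple: int = 32):
    """Return (width, height) close to target_pixels for the given width/height ratio."""
    height = (target_pixels / aspect_ratio) ** 0.5
    width = height * aspect_ratio
    return snap(width, multiple), snap(height, multiple)


def hires_size(width: int, height: int, scale: float = 1.5, multiple: int = 32):
    """Upscaled size for a highres pass, snapped to the same multiple (assumption)."""
    return snap(width * scale, multiple), snap(height * scale, multiple)


if __name__ == "__main__":
    print(resolution_for(1.0))     # square
    print(resolution_for(3 / 2))   # landscape
    print(hires_size(1216, 832))   # x1.5 pass for a 1216x832 base
```

The results will not always match the listed presets exactly; any size near 1 megapixel with sides divisible by 32 fits the recommendation.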
Quality classification:
Only 4 quality tags:
masterpiece, best quality for positive and
low quality, worst quality for negative.
Nothing else. Actually you can even omit positive and reduce negative to low quality only, since they can affect basic style and composition.
Meta tags like lowres have been removed and don't work, better not to use them. Low resolution images have been either removed or upscaled and cleaned with DAT depending on their importance.
Negative prompt:
worst quality, low quality, watermark
That's all, no need for "rusty trombone", "farting on prey" and others. Do not put tags like greyscale, monochrome in negative unless you understand what you are doing. Extra tags from the brightness/colors/contrast section below can be used.
Artist styles:
Grids with examples, list/wildcard (also can be found in "training data").
Use with "by ", it's mandatory. It will not work properly without it.
"by " is a meta-token for styles that avoids mixing/misinterpretation with tags/characters of a similar or close name. This gives better results for styles and at the same time avoids the random style fluctuation that you may observe in other checkpoints.
Multiple artists give very interesting results and can be controlled with prompt weights and spells.
YOU MUST ADD BREAK after artists/style tags (for A1111) or concat conditioning (for Comfy) or put them in the very end of your prompt.
For example:
by kantoku, by wlop, best quality, masterpiece BREAK 1girl, ...
General styles:
2.5d, anime screencap, bold line, sketch, cgi, digital painting, flat colors, smooth shading, minimalistic, ink style, oil style, pastel style
Booru tags styles:
1950s (style), 1960s (style), 1970s (style), 1980s (style), 1990s (style), 2000s (style), animification, art nouveau, pinup (style), toon (style), western comics (style), nihonga, shikishi, minimalism, fine art parody
and everything from this group.
Can be used in combinations (with artists too), with weights, both in positive and negative prompts.
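A small sketch of how the style rules above could be applied when assembling prompts programmatically: artist names get the "by " prefix, and the style part is either separated with BREAK or moved to the very end. The function name and argument layout are hypothetical; BREAK is the A1111-style keyword, while Comfy needs a conditioning concat node instead.

```python
from typing import Iterable


def build_prompt(artists: Iterable[str], quality_tags: Iterable[str],
                 content_tags: Iterable[str], use_break: bool = True) -> str:
    """Assemble a prompt with artist/style tags isolated from the main content."""
    artist_part = [f"by {name}" for name in artists]
    quality_part = list(quality_tags)
    content_part = list(content_tags)
    if use_break:
        # Styles and quality tags in their own chunk, then the main prompt.
        return ", ".join(artist_part + quality_part) + " BREAK " + ", ".join(content_part)
    # Fallback without BREAK: artists go to the very end of the prompt.
    return ", ".join(quality_part + content_part + artist_part)


if __name__ == "__main__":
    print(build_prompt(["kantoku", "wlop"], ["best quality", "masterpiece"], ["1girl"]))
```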
Characters:
Use the full-name booru tag and proper formatting, like karin_(blue_archive) -> karin \(blue archive\), and use skin tags for better reproduction, like karin \(bunny\) \(blue archive\). An autocomplete extension might be very useful.
Most characters are recognized just by their booru tag, but it will be more accurate if you describe their basic traits. Here you can easily redress your waifu/husbando just by the prompt without suffering from the typical leaks of basic features.
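The tag formatting above (underscores to spaces, parentheses escaped with a backslash so they are not read as emphasis brackets) can be automated. A minimal sketch with a hypothetical helper name:

```python
def booru_tag_to_prompt(tag: str) -> str:
    """Convert a raw booru tag into prompt form: spaces instead of underscores, escaped parentheses."""
    text = tag.replace("_", " ")
    return text.replace("(", r"\(").replace(")", r"\)")


if __name__ == "__main__":
    print(booru_tag_to_prompt("karin_(blue_archive)"))  # karin \(blue archive\)
```

Tags where the underscore is part of the tag itself (emoticon-style tags, for example) would need to be excluded from the replacement.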
Natural text:
Use it in combination with booru tags, works great. Use only natural text after typing styles and quality tags. Use just booru tags and forget about it, it's all up to you. To get the best performance, keep track of the 75-token CLIP chunks.
About 4M of images in dataset had hybrid natural-text captions, made by Claude, GPT, Gemini, ToriiGate, then refactored, cleaned and combined with tags in different variations for augmentation.
Unlike typical captions, these contain character names, which is very useful. Better to keep it clean: a short and convenient description works best. Better not to use long and sloppy BS like
A mysteriously enchanting feminine entity of indeterminate yet youthful essence, whose celestial visage radiates with the ethereal luminescence of a thousand dying stars, blessed with locks cascading like the golden rivers of ancient mythology, perhaps styled in a manner reminiscent of contemporary fashion trends though not necessarily adhering to any specific aesthetic paradigm. Her eyes, pools of unfathomable depth and hue, sparkle with the wisdom of millennia yet maintain an innocent quality that defies temporal constraints...
For captioning you can use ToriiGate in short mode.
And don't expect it to be as good as Flux and others. It tries very hard, and after several rolls you can usually get what you want, but it is not that stable and detailed.
Lots of Tail/Ears-related concepts:
Oh yeah
tail censor, holding own tail, hugging own tail, holding another's tail, tail grab, tail raised, tail down, ears down, hand on own ear, tail around own leg, tail around penis, tailjob, tail through clothes, tail under clothes, lifted by tail, tail biting, tail penetration (including a specific indication of vaginal/anal), tail masturbation, holding with tail, panties on tail, bra on tail, tail focus, presenting own tail...
(booru meaning, not e621) and many others with natural text. The majority work perfectly, some require a lot of rolling.
Brightness/colors/contrast:
You can use extra meta tags to control it:
low brightness, high brightness, low saturation, high saturation, low gamma, high gamma, sharp colors, soft colors, hdr, sdr
They work in both the epsilon and vpred versions, and work really well.
The epsilon version relies on them too much. Without low brightness or low gamma or limited range (in negative) it might be difficult to achieve true 0,0,0 black; the same is often true for white.
Both epsilon and vpred versions have like true zsnr, full range of colors and brightness without the common flaws. But they behave differently, just try it.
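For readers unfamiliar with the term: "zero terminal SNR" means that at the last training timestep the signal is fully replaced by noise, which is what lets a model reach true black or white. The sketch below shows the commonly published beta-rescaling trick on the usual SDXL-style scaled-linear schedule. It is not the method used for this model (the epsilon version is described as using noise scheduler augmentation, with no details given), and the schedule constants are the usual defaults, assumed here for illustration.

```python
import math
from typing import List


def scaled_linear_betas(n: int = 1000, beta_start: float = 0.00085, beta_end: float = 0.012) -> List[float]:
    """Scaled-linear beta schedule (assumed SD/SDXL defaults)."""
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    return [(lo + (hi - lo) * i / (n - 1)) ** 2 for i in range(n)]


def alphas_cumprod(betas: List[float]) -> List[float]:
    """Cumulative product of (1 - beta), i.e. alpha_bar_t."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def rescale_zero_terminal_snr(betas: List[float]) -> List[float]:
    """Shift and scale sqrt(alpha_bar) so the last step has alpha_bar = 0 and the first is unchanged."""
    sqrt_ab = [math.sqrt(a) for a in alphas_cumprod(betas)]
    first, last = sqrt_ab[0], sqrt_ab[-1]
    sqrt_ab = [(s - last) * first / (first - last) for s in sqrt_ab]
    ab = [s * s for s in sqrt_ab]
    # Recover per-step alphas from the cumulative products, then betas.
    alphas = [ab[0]] + [ab[i] / ab[i - 1] for i in range(1, len(ab))]
    return [1.0 - a for a in alphas]


def snr(alpha_bar: float) -> float:
    """Signal-to-noise ratio at a timestep: alpha_bar / (1 - alpha_bar)."""
    return alpha_bar / (1.0 - alpha_bar)
```

With the original schedule the terminal SNR is small but non-zero, so some low-frequency information (such as mean brightness) leaks through during training. After rescaling it is exactly zero.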
Vpred version
The main thing you need to know: lower your CFG from 7 down to 5 (or less). Otherwise, usage is similar, with some advantages.
It seems that starting from v0.7 vpred works flawlessly now. It shouldn't suffer from ignoring tags close to the 75-token chunk borders like nai. It is more difficult to get burned images: even at cfg 7 the result is usually just over-saturated but with smooth gradients, which can be useful for some styles. Yes, it can make anything from (0,0,0) to (255,255,255). You will find the brightness meta tags described above quite useful for easier/lazy prompting; natural text expressions also work. To get the darkest image, put high brightness into negative and/or use the low brightness, low gamma tags. If you don't like very bright skin on a dark background and want to reduce contrast (or, on the contrary, enhance the effect), use hdr/sdr in negative/positive.
It was reported that in rare cases on some prompts there is a drop in contrast. Looks like other vpred models have same behaviour with such prompts, adding a "separator" closer to the border of the 75-token chunk fixes this. However, with 0.7 I haven't encountered this myself.
To launch vpred version you will need dev build of A1111, Comfy (with special loader node), Forge or Reforge. Just use same parameters (Euler a, cfg 3..5, 20..28 steps) like epsilon. No need to use Cfg rescale, but you can try it, cfg++ works great.
Base model:
The model here has had a small unet polishing pass after the main training to improve small details, bump up resolution and so on. However, you may also be interested in RouWei-Base, which can sometimes perform better on complex prompts despite minor mistakes in small details. It also comes in FP32, for example if you want to use fp32 text encoder nodes in Comfy, merge it or finetune it.
It can be found in Huggingface repo
Known issues:
Of course there are:
Artist and style tags must be separated into a different chunk from the main prompt or come very last
There may be some positional or combinational bias in rare cases, but it's not yet clear.
There are some complaints about few of the general styles.
Epsilon version relies too much on brightness meta tags, sometimes you will need to use them to get desired brightness shift
Some newly added styles/characters might not be as good and distinct as they deserve to be
To be discovered
Requests for artists/characters in future models are open. If you find an artist/character/concept that performs weakly or inaccurately, or has a strong watermark, please report it and it will be added explicitly. Follow for new versions.
JOIN THE DISCORD SERVER
License:
Same as Illustrious. Feel free to use it in your merges, finetunes, etc., but please leave a link or mention, it is mandatory.
How it's made
I'll consider to make a report or something like it later. For sure.
In short, 98% of the work is related to dataset preparation. Instead of blindly relying on loss-weighting based on tag frequency from the nai paper, a custom guided loss-weighting implementation along with an asynchronous collator for balancing were used. Ztsnr (or close to it) with epsilon prediction was achieved using noise scheduler augmentation.
Compute spent: over 8k hours of H100 (apart from research and failed attempts)
Thanks:
First of all I'd like to acknowledge everyone who supports open source and develops and improves code. Thanks to the authors of Illustrious for releasing the model, and thanks to the NoobAI team for being pioneers in open finetuning at such a scale, sharing experience, and raising and solving issues that previously went unnoticed.
Personal:
Artists who wish to remain anonymous - for sharing private works; a few anonymous persons - donations, code, captions, etc.; Soviet Cat - GPU sponsoring; Sv1. - LLM access, captioning, code; K. - training code; Bakariso - datasets, testing, advice, insights; NeuroSenko - donations, testing, code; LOL2024 - a lot of unique datasets; T.,[] - datasets, testing, advice; rred, dga, Fi., ello - donations; TekeshiX - datasets. And other fellow brothers that helped. Love you so much ❤️.
And of course everyone who gave feedback and made requests, it's really valuable.
If I forgot to mention anyone, please notify.
Donations
If you want to support - share my models, leave feedback, make a cute picture with kemonomimi-girl. And of course, support original artists.
AI is my hobby, I'm spending money on it and not begging for donations. However, it has turned into a large-scale and expensive undertaking. Consider supporting it to accelerate new training and research.
(Just keep in mind that I can waste it on alcohol or cosplay girls)
BTC: bc1qwv83ggq8rvv07uk6dv4njs0j3yygj3aax4wg6c
ETH/USDT(e): 0x04C8a749F49aE8a56CB84cF0C99CD9E92eDB17db
XMR: 47F7JAyKP8tMBtzwxpoZsUVB8wzg2VrbtDKBice9FAS1FikbHEXXPof4PAb42CQ5ch8p8Hs4RvJuzPHDtaVSdQzD6ZbA5TZ
If you can offer GPU time (A100+) - PM.
Description
Vpred for v0.8
Comments (113)
it's so peak ❤️❤️
it's so peak x2
i have a question which might be dumb. noob vpred is more trained than illust and over all a better base so why not train on top of that model?
noob v-pred is a good model to use, but not a good model to train anymore. The more a model was trained, the less it can learn.
That's a good question, not dumb at all.
The first version of RouWei was started at approximately the same time as Noob, but with slightly different goals and approaches. All subsequent versions are developments of the previous ones, since at the moment of training there was no better base model according to the specified criteria.
The NoobAI checkpoint has both a number of advantages and serious issues. After a long training, existing knowledge in the base will not play a significant role, but the inherent biases and problems may only become more pronounced and make training more difficult. Therefore, at this point, choosing another base will not bring any benefits.
Also, like it was mentioned, noob can be a bit troublesome to train.
A base model serves as a block of clay: the clay needs to be pure in substance so it can be molded into anything (meaning it has no biases), and it needs to be complete and full in volume. Otherwise, when training subsequent loras/finetunes, there are no existing weights/concepts to adjust, because they are not present in that particular base model.
but a v pred version base model might be interesting. is that possible? a base model that is trained in v pred configuration.
@dfijgklerhjkldghtjykghljg Well there is no 'base' model for vpred, it has been converted from base without extra aesthetic tuning on top of it. So the published one can be considered as the base for vpred.
Also, some issues in vpred were fixed as much as possible in a relatively small training.
I wanted to ask one question, did you use not illustrious 0.1, but version 1.0 or 2.0? I noticed that at high resolutions the model starts to produce "a lot of gray" like new versions of illustrious. I'm just curious)
As discussed previously, the history of the checkpoint is: Illustrious-v0.1 -> Rouwei 0.6 -> Rouwei 0.7 -> Rouwei 0.8. Switching to another base is tantamount to losing everything that was achieved earlier for the sake of somewhat questionable improvements.
As for the gray: it might be related to bugs in the text encoder. Briefly, it can be solved by moving some tags into another 77-token chunk. I haven't seen this in new Illustrious versions, but that's interesting. Does it only happen when increasing the resolution, or with some prompts?
@Minthybasis only with increasing resolution. I experimented with prompts and in principle with some settings reforge, forge and even standard automatic1111 at high resolutions (without upscaling) a pronounced "soap picture and grayness" begins, I noticed this in illustratious 1.0, 1.1, and 2.0, in 0.1 there was no such thing, so I thought that the original model had changed, but in your model everything does not go into grayness so much, but there is a slight tendency, perhaps this will not affect training LORAs, I have not tested it yet.
@OneRing Please write if there are issues with lora, these things need to be investigated.
@Minthybasis trained LORA, the problem practically disappears and the larger the LORA, the fewer problems.
I now use 0.8 base to train loras, awesome awesome model, will test v pred variant later on
Vpred is awesome, with the natural language added and the details, and more characters (specially nikke hehe), but I notice that most of the time it generates busty females even if I don't put the tag for it and put a lot of negatives to avoid a huge size, still, it just ignore it and make them big
Oh no, it's supposed to generate cunnies by default 😭! Just kidding. Does it occur with any specific artists or characters, or is it common in general?
@Minthybasis cunnies huh 🤨, but hey to each their own haha, but I've been testing some nikke characters Who are not that "gifted", but still even with negative they have huge breast , and no soecif artist, hit 2.5D or illustration style
putting in negative is not enough, did you go with adding (the name:1.5) in negative?
@alternative_Universe Hm, finding on danbooru characters from Nikke with actually small breasts is a little challenge, lol. Copy-pasting prompts or (re)writing manually gives exactly desired size, from flat to huge. Could you upload some examples for reproducing?
It should just work without any negatives. Except may be if you're using some breast-related tags like cleavage/paizuri/etc. which have slight bias for size, but anyway shouldn't be that bad.
@Minthybasis 2.5d style ,1girl, Privaty /(nikke/),(nikke), /(goddess of victory: nikke/) ,taking selfie in the bathroom, posing seductively, bitting own lips, front view , serious, shy, red jacket,sweaty,masterpiece, best quality, newest, absurdres, highres, high contrast,hdr
It's the prompt with negative like big breast or busty, characters like Alice,private, Dorothy etc, always makes them busty lol,don't know why it ignores the negatives
@alternative_Universe What about just using a general tag to specify the size you want? https://files.catbox.moe/gmg103.jpg
Also, I would like to point out that for escaping brackets, \ should be used instead of /, or the result will be unpredictable. Also, the tags newest, absurdres, highres were not specifically introduced in the dataset; I don't know whether using them will have a positive effect or the opposite.
@Minthybasis will try again, I used mediium size breast or natural size but they still huge lol,also,thanks for the \, I was using it wrong then this whole Time,thanks for the link I will check it
@Minthybasis sorry to bother again,but the link is not available,any chance for a reupload?, still not working with negatives, getting very busty ladies lol, so want to try that list :(
@alternative_Universe Sure, reupload in few hours when get back to pc.
@alternative_Universe Here https://files.catbox.moe/6otrqv.jpg no accidental nipples this time.
@Minthybasis you know i was expecting an advanced list with words,cant believe was that simple lol, thank sooo much, being enjoying and exploring 0.8 vpred a lot
Version 0.8 is great, better than NoobV1.1 in my opinion. It contains some concepts I like and trains better than noob.
It would be even better if we have auxiliary models such as "ip-adapter" on this basis. Will you train "ip-adapter"?
Anyway, the 0.8 version of the model already does a great job.
Thank you. I had some thoughts about training controlnet/ip adapter. Maybe later, but chances are not too high since there are already a lot of plans within limited time/compute and models from sdxl/illustrious/noob seems to work okay.
@Minthybasis Yes, most of them work fine, but relatively speaking, they are not as perfect as the 0.8 model, and they would be much more perfect if properly fine-tuned.
After testing all three versions, surprisingly, only the base model seems to work well in my runtime environment, vpred (either the file provided on civitai or hf) inexplicably burned up, epsilon had anatomical problems, and base using the same seed rendered correct anatomy, and outside of that, the base model seems to be able to handle Zero Terminal SNR (I'm not sure if this is in line with design expectations), well, what's certain is that the base model is pretty awesome!
(Translated with DeepL)
love
Vpred 0.8 is what other model makers should aspire to. Yeah, you have to use BREAK, but, let me tell you all, not a single model can do as much as RouWei can in terms of knowledge. Prompt adherence is also superb and stability is much improved.
I would also ask to improve bangs (wispy, fanged, choppy, v-bangs, loosely tucked bangs, long hair between eyes).
Can you share your settings, please? Your UI (Comfy, Forge, etc), samplings, cfg, example prompts? For me it feels like model is barely trying to follow prompt, which is very underwhelming. I understand, that it's a base model and that i'm doing something wrong probably. It's like there is zero details - maybe it's normal, because i'm used to being spoonfed by WAI, but still.
@kanareika1 Sure, I use Euler A, 28 steps, 3.5/4 CFG (helps with backgrounds), I often do 1216x832/832x1216/1024x1024 resolution. For scheduler, use sgm uniform, imo, it's the best for RouWei, karras doesn't work well, and normal creates visible artifacts.
I generate images this way:
masterpiece, best quality, (one of the General Styles) BREAK 1girl, (animal girl prompts if you have any) hair-color, hair-length, hair-style (twintails, etc), eye color, body tags (breast size, thick thighs/curvy/etc), body traits (tattoos/markings/piercings) BREAK clothes, actions, facial expressions and background.
I'm a little curious as to what was done to optimise the negative prompts during RouWei's training, and exactly what types of meta tags were removed. After some comparative generation using the base version, I found that including the negative prompts still gave a modest improvement in the quality of the generated images. The same is true for simply including high resolution tags in the positive prompt.
In my personal experience, some of the meta tags in the booru dataset reflect the ‘quality’ of the data in some way - for example
https://danbooru.donmai.us/wiki_pages/lossy-lossless
and
https://danbooru.donmai.us/wiki_pages/photoshop_(medium),
actually implies a lowres/highres effect that is more useful than their original tag.
(Translated with DeepL)
Resolution tags like absurdres, highres, lowres, etc.: although the majority of them were removed, some may still persist, and thanks to the Illustrious legacy (they work in the base) they can kind of work. If you like the effect, of course you should use them; I just want to warn that after upscaling and with some styles they can have a negative effect. The photoshop and lossy tags you mentioned might be quite useful (if they don't add unwanted biases), this is a good discovery.
One example of optimizations are newly introduced meta-tags that characterize pictures in specific ways (starting from color and finishing with peculiarities of composition, added effects and other). But they are needed for better training first of all, not for inference because when called they may have too strong effects and biases.
The words about keeping the negative prompt clean mostly relate to situations where people spam numerous tags and then complain about flexibility or other issues that are actually caused by this. If you know what you are doing, make it whatever your creativity wants.
out of curiosity, is the date metadata tag completely removed? sometimes I want to isolate and direct to a specific period of time, some artist changed their style, some copyrighted series changed its style, etc etc, I love the year 2025~2005 tag and newest <-> oldest tag.
If those tags were in the original set of tags from danbooru, they should work. But no special tags based on picture upload date were introduced.
I decided not to use them because of the need for complex augmentation to make such divisions work really well, and the possible side effects. Perhaps something like this will be introduced in the future.
@Minthybasis awesome!❤
Can someone explain the CLIP chunks to me? I'd like to understand this and BREAK better so I can hopefully generate better pictures without having to go the complex Region Prompting route.
Also, do other models merged with RouWei inherit this?
SDXL uses text encoder parts from CLIP (two of them, of different sizes, actually), which can originally process only 77 input tokens: 75 meaningful ones, excluding the BOS and EOS tokens. When you use a prompt longer than 75 tokens, it is divided into chunks of 75 tokens that are encoded separately. After that, the hidden states from the last layer before projection (or from a deeper one if clip skip is used) for each processed chunk are concatenated and used as input for the unet.
So, if you want to use some tags, describe features, things, etc. that are related to something specific, it is better to have them in the same clip chunk, so the text encoder will be able to assess them together. The same goes for splitting things that you don't want to mix, or that give bad results when placed together.
Of course, clip text encoders are quite small and dumb, and the unet has its own attention when processing combinations, but managing parts of the prompt might be quite beneficial in several cases.
Yes, it can separate the features of characters mentioned in the prompt; the outcome depends on the checkpoint. But it still will not be as stable as in models with more complex text encoders like t5 or llms, or with regional prompting.
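One way to act on this is to pack related groups of tags so that no group straddles a chunk border. The sketch below is a greedy packer over pre-counted groups; the token counts must come from a real CLIP tokenizer (and should include the separating commas), and both function names are hypothetical.

```python
from typing import List, Tuple


def pack_groups(groups: List[Tuple[str, int]], limit: int = 75) -> List[List[str]]:
    """Greedily pack (text, token_count) groups into chunks without splitting any group."""
    chunks: List[List[str]] = []
    current: List[str] = []
    used = 0
    for text, count in groups:
        if count > limit:
            raise ValueError(f"group does not fit into one chunk: {text!r}")
        if current and used + count > limit:
            # Close the current chunk; the group starts a new one.
            chunks.append(current)
            current, used = [], 0
        current.append(text)
        used += count
    if current:
        chunks.append(current)
    return chunks


def to_prompt(chunks: List[List[str]]) -> str:
    """Join chunks with the A1111-style BREAK keyword."""
    return " BREAK ".join(", ".join(chunk) for chunk in chunks)


if __name__ == "__main__":
    # Token counts here are made-up placeholders, not real tokenizer output.
    groups = [("character A description", 40), ("character B description", 30), ("background", 20)]
    print(to_prompt(pack_groups(groups)))
```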
im done... best model ever i have seen... full control of image generations, ez to add every style by lora... i love this model
I think I'm done merging models for now. Rouwei 0.8 vpred is simply the best option for anti-slop 2D image generation.
As for feedback, the masterpiece and best quality tags seem to prevent the generation of certain styles like jagged lines, flat colors, pixel art, or oekaki. While these styles might not typically be considered masterpieces, the tags have such a strong effect that I struggled to achieve my desired results even with careful use of quality tags.
Regardless, it's an amazing model when you know how to use it. Thank you for sharing it.
I can vote for your input, one of my comic book lora was facing the same situation, even when the prompt is prioritized on monochrome and greyscale, meaning these two tags were put in front of the clip text chunk, the effect of comic book drawing was not showing up, until masterpiece tag and best quality tag were taken off.
but now, just as a general common working principle, when I use my loras I do not put any broad scale comprehensive quality control meta-tag in both positive and negative prompt, let the lora work on the entire vector space, full spectrum influence on model weights.
At the moment I am training a fairly large LORA and I was so surprised that with the previous settings without any significant changes the losses decreased by about 2.5 times, this is amazing. I was wondering, in addition to increasing the dataset, were there any significant changes in the training settings that managed to achieve such an impressive result? No model even comes close to the results of Rouwei at the moment.
Glad to hear that you got good results!
Hm, honestly I don't know. The main changes in v0.8 compared with 0.7 are lots of extra augmentation, better captions and a very diverse, balanced dataset, so maybe it's some of these or all of them together.
This one and noobai are my favorites😊♥ Thank you for your hard work in bringing us such a great model!
the model is so good, it should always be available on site generation
It was available on site gen before, but because there were not enough bids it's unavailable now.
@LOL2024 yeah I know, I enjoyed that time
why do I get weird blurry outfit with comfy, with LLM clip also the same. WAI Rouwei works fine but it's based on v0.7.
Could you upload some examples of the issue with metadata?
removed all prompt and accidently hit the generate button. the model shows its true subconsciousness. it's nsfw expert 🤣
Just curious, If no prompt guidance, the model will 95% output r18 and super xxx things.
Does it mean those contents were not properly tagged during training?
Or the dataset mainly contains xxx things? (this seems unlikely)
reakaakasky No, it likely comes from caption drops, which are an important part of augmentation during training. But this phenomenon is quite strange, because most of the pictures with the 'drop_possible' flag are sfw.
incredible model <3 great prompt adherence
Hi Minthybasis, hi artists, what do you think is the difference between the vpred and epsilon version?
The main difference is the way it samples images: vpred gives more control over brightness/colors. However, some people find e-pred more creative, and it has better compatibility with popular loras.
Also there can be some slight differences in styles and default traits.
You've created something special with this model. Adding the stabilizer lora on top of it makes it spectacular.
I'd like to know if this VPRED model supports ZSNR.
I'm planning to use this model for LoRa character training.
Does it use noise offset or ZSNR?
Yes, it was trained with ztsnr, has full range and is capable of generating any brightness/saturation. You should enable the option in training.
But for noise offset - it is a crutch that should be avoided with normal vpred models. Even with e-pred, pyramid noise is way better in most cases.
What is this ZSNR and how does it affect image generation with VPRED NoobAI models? Could you please enlighten me? I have this weird noise/grainy look to my images that I wish to get rid of.
krewg Zero Terminal SNR for better colour and lighting (Lighter and darker)
try to change a sampling or lower cfg
I’ve already tried quite a few different Noob and IL mixes, but this is the first model that immediately impressed me so much with its stability and responsiveness. Your model makes me want to fully switch to it, but I’m struggling with the question of how best to train LoRAs for it. I have many LoRAs trained in OneTrainer using the base NoobAI vPred and Illustrious, but they all look rather mediocre on this model.
Should I be using RouWei itself as the base model for training, and should the parameters be similar to training on NoobAI, considering it’s a vPred model? (Though I still don’t understand how vPred RouWei works in Comfy for me without v_prediction sampling 🤔)
Most of the time I have to work with very small datasets (10–20) to create lesser-known characters in a style as close to the original as possible, so I use Prodigy to squeeze everything I can out of limited material. I’d be very grateful if you could advise me which of the attached configs (EPS IL or vPred Noob) would be better to rely on when training LoRAs for your model, or if my approach is completely wrong and I should reconsider it entirely.
IL - https://files.catbox.moe/vv9btp.json
NAI - https://files.catbox.moe/kodhms.json
In any case, thank you for such great work and good luck with future versions
Thank you for the kind words.
You should use RouWei as the base model; parameters from noob-vpred should be fine. Some differences may come from a different noise scheduler, so you can try enabling debiased noise estimation or mnsnr, or train with edm2. But all of this applies to any vpred model, including NoobAI.
The vpred version of RouWei contains a flag in its state dict. This is an unofficial standard that allows software to detect the model type and use vpred sampling by default. If you manually set e-pred sampling mode with the vpred version, it will generate only noise blobs.
Your configs look okay (not sure about loss_weight_strength, but it seems to be used only for specific loss functions); basically the only differences are the epsilon or velocity prediction types. I'm not a big expert in style LoRA training, but I haven't heard of any specific model-related nuances here.
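As an illustration of the state-dict flag mentioned above, here is a small sketch of how a loader could pick the sampling mode. The marker key names `v_pred` and `ztsnr` are an assumption based on the convention commonly used for SDXL vpred checkpoints, and `detect_prediction_type` is a hypothetical helper, so check the keys of the actual file before relying on it.

```python
def detect_prediction_type(state_dict_keys):
    """Guess the sampling mode from marker keys in a checkpoint.

    Assumption: a vpred checkpoint carries a top-level "v_pred" key and,
    if trained with zero terminal SNR, also a "ztsnr" key.
    """
    keys = set(state_dict_keys)
    is_vpred = "v_pred" in keys
    return {
        "prediction_type": "v_prediction" if is_vpred else "epsilon",
        "zero_terminal_snr": is_vpred and "ztsnr" in keys,
    }


# Toy key lists standing in for real checkpoints.
vpred_keys = ["v_pred", "ztsnr", "model.diffusion_model.input_blocks.0.0.weight"]
eps_keys = ["model.diffusion_model.input_blocks.0.0.weight"]

print(detect_prediction_type(vpred_keys))
print(detect_prediction_type(eps_keys))
```

For a real `.safetensors` file, the key list can be read without loading the weights, for example with `safe_open(path, framework="pt")` from the `safetensors` package and its `keys()` method.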
Minthybasis Thank you for the explanation! Regarding such a high loss_weight_strength, I was thinking that setting it to an elevated value might help the LoRA train more aggressively given the small dataset size, while Prodigy could soften the consequences if needed, since the risk of overfitting with it is lower than with other optimizers. Still, I can’t shake the uneasy feeling that everyone else is using Adam or something similar and speaks rather negatively about Prodigy… it kind of makes me feel like an idiot because of that
degurshaft Well, there are a lot of different optimizers and at least a dozen are actively used. Each one has its own application: Prodigy and some schedule-free optimizers can give good results with small-dataset PEFT, AdamW works great for large-scale applications, AdEMAMix can be an optimal choice for post-training, etc.
You can try different ones, compare them, and select the one with the best results for your specific case. There is nothing stupid about it.
Minthybasis Yep, I really chose Prodigy because of its advantages with small datasets. But to my disappointment, after running comparisons with identical settings and taking into account all the model's specifics you mentioned in the description, for some reason I can’t achieve visual quality comparable to generations on WAI with a LoRA trained on base IL.
If possible, could I take a little of your time and contact you on Discord to send some examples with metadata? I’d really appreciate it if you could maybe point out what I’m doing wrong
degurshaft I don't mind, but I'm not a big expert in LoRA training. It's probably better to ask people who make them regularly.
To achieve a pretty look, you can add extra tweakers, enhancers, maybe even some other styles at low weights that bring the image closer to what you want. Or try generating on merges.
>> I really chose Prodigy because of its advantages with small datasets.
doubt, personal exp: prodigy always blows up my training, so I stuck with adamw + lr 0.00005, rank/alpha = 1, bs=8, ~1000 steps when training on a tiny dataset of ~10 imgs, and let it slowly cook.
reakaakasky That’s exactly the kind of opinion about Prodigy I come across most often. Maybe you could share your config so I can try to compare? In my experience, no matter how many times I try switching back to AdamW, it still doesn’t capture the original style as well as Prodigy. I usually stick to 1000 steps too, though I’m not sure about your lr, the value you mentioned seems more like the encoder lr
degurshaft i don't have prodigy settings anymore.
iirc, prodigy can learn things very fast and very well, but usually has stability issues, e.g. when you stack it with another lora, there will be color blobs. That's why i switched back to adamw. I'm not 100% sure that was because of prodigy.
lr 0.00005, yeah, that is unet lr, very low, let it cook. I don't train TE.
reakaakasky I was actually asking specifically about AdamW. I'm actively testing it right now, and it would be interesting to see other people's approaches to training with this optimizer. I haven't noticed any artifacts when mixing a LoRA trained with Prodigy with another one.
And again about that lr: even if that value (0.00005) is for the unet, it still seems oddly low to me. Shouldn't it either match the global lr or be just slightly lower? For AdamW I use 0.0005, and the unet is set to the same. I also don't train the text encoder, but I'm thinking of trying it because I saw somewhere that activation tokens start working better.
degurshaft oh... I misunderstood.
alpha=rank=8
lr=0.00005
batch size=8
steps ~1000
min snr gamma=1
no noise offset etc. basically everything is default.
____
don't know what the "global lr" is, I guess it's just a default value used if no unet/te lr is specified.
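For context on the `min snr gamma=1` setting in the list above, here is a rough sketch of how Min-SNR loss weighting is commonly computed per timestep. This is one widespread formulation, not necessarily what any particular trainer does internally, and `min_snr_weight` is a hypothetical name.

```python
def min_snr_weight(snr, gamma=1.0, v_prediction=True):
    """Per-timestep loss weight under Min-SNR-gamma weighting.

    Assumed formulation: the SNR is clipped at gamma, then divided by
    (snr + 1) for v-prediction or by snr for epsilon-prediction.
    """
    clipped = min(snr, gamma)
    if v_prediction:
        return clipped / (snr + 1.0)
    return clipped / snr


# Low-noise step (high SNR), mid step, and a very noisy step (low SNR).
for snr in (4.0, 1.0, 0.01):
    print(snr, min_snr_weight(snr, gamma=1.0, v_prediction=True))
```

Note that at a zero-terminal-SNR step (`snr == 0`) the epsilon form divides by zero, while the v-prediction form simply gives a weight of 0. This is one reason the prediction type has to be set correctly in the trainer.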
reakaakasky Is such a low lr related to using a large batch size? I've just never really tried a batch larger than 2, since I felt that increasing it eats away at the details, especially with Prodigy (considering it requires lr=1 for all networks)
degurshaft don't know.
I usually don't care about details when training on a small dataset. As Minthy mentioned:
>> To achieve pretty look, you can add extra tweakers, enhancers, may be even some other styles that makes image closer to desired with low weights. Or try to generate on merges.
I use other LoRAs to add similar details.
The prompt adherence and responsiveness are really incredible for an SDXL model
Any workflow for Comfy please?
default workflow works well with it. latest comfyui recognizes eps/vpred on load without any special nodes
All I get is weird colored blobs. I'm using Euler A and same prompt as the sample image. It's weird because all the other illustrious based models I've tried work just fine. I'm using Invoke.
Have you checked your negative prompts? This model usually outputs bad results if the negative prompt is too long. Also try removing embeddings; some embeddings, like lazyneg, work badly with RouWei.
LOL2024 Thanks for the reply! I only have "bad quality, worst quality, watermark, artist_logo," as negative prompts and I turned off all embeddings and loras as well, trying to do bare minimum workflow. I also tried many other schedulers aside from Euler A and I don't think it could be the VAE as I already see the image forming weirdly in the preview.
rancidy164 Have you tested both EPS and V-pred, and do both give similarly corrupted results?
LOL2024 Epsilon actually works!
You should update your software because it can't detect that the model is using vpred sampling. If you're using a1111 - switch to dev branch, it works there.
I had horrible results at first, but after changing the scheduler from karras to simple, it now works as intended
https://drive.google.com/file/d/1hdfc6mJF4MEuFyKjxyDlClnC8QYwAHKY/view?usp=sharing
Crudely replaced the artist and character tags in danbooru.csv with the ones from the training data, at the cost of entry counts and aliases.
This checkpoint ignores 90% of my prompt for some reason. It gets the character, but keeps making them nude even though I describe their clothes. Also, it doesn't seem to know what 'foot focus' means.. I can tell it would be really good if it would just adhere to my prompt.
Maybe you didn't add "by" when using the artist tag? For example, you wrote "wlop" instead of "by wlop"
Can you upload some examples where you get bad results with metadata? The model has some biases and places where you can slip, which are described, but in general it should give the opposite experience to what you're getting.
can you add newest, recent, early, old etc time related tags to get different styles from certain years better?
You mean time tags in general (like for 1990s, 2010s, etc.) or for each artist style?
@Minthybasis yes getting certain artist styles is hard if their artstyles change
@mahouou That's a complex thing, because lazily introducing tags based on a timeline won't solve the problem. In v0.8 I tried to split some styles by using a combination of texture harmonics and embeddings to cluster them and split them into buckets. But I wasn't able to set up a system that would do this fully automatically for any style, without the need for manual tweaks and supervision.
I'll try to introduce more for the next version of the dataset. Btw, if you want this for some specific artists, just list them and they will be prioritized.
This model is really great, my favorite Illustrious finetune! Have you had the chance to look at the progress in regards to the new Lumina model? It seems to vastly improve over Illustrious base models and I was wondering if you ever thought about tinkering or finetuning the base Lumina model or the NetaYume model? The results look vastly more natural and less "AI" so to speak. If you have the time, I recommend taking a look :)
Thank you! Sorry for the late reply, I got mixed up by Civitai notifications.
Yes, Lumina looks very promising; one could say it is what was expected from SD3/3.5. Since it is necessary to move in that direction in any case, eventually I'll release a finetune of a DiT or some hybrid model. But no promises on timeframes, there is too much uncertainty.
Which VAE is used for this checkpoint? I can't find a VAE in the downloads.
https://civitai.com/images/102352031
This is my test generation; the quality of the details is not good.
The model is supposed to work with the standard SDXL VAE (fp16 fix, baked in, not eq). It can also be used with its forks with boosted contrast/saturation.
To improve the quality of generated pictures, it is better to give a more detailed prompt that specifies exactly what you want to get, and then use styles (one or several together).
I don't like the look of euler ancestral, what other options do I have? I used dpm 3m sde on other models but sadly it doesn't work well here
The epsilon version has some issues with unconventional samplers (actually more with schedulers). As for vpred, most should work fine. CFG++ samplers work great and give nice results, so maybe you should try them.
Do you have any plans for v0.9? I already like v0.8 a lot, just wondering since the time between v0.7 and v0.8 was like 5 months, and it's been about 5 months since v0.8.
Hi, yes, I do have plans for the future. Currently I'm testing a modification of the text encoder, which is the weakest part of SDXL, and a conversion of the whole latent space to 16 channels (to use the Flux VAE). Test variants of the text encoder are published here, and an alpha version of the 16ch pretrain will likely be released next week.
Once things are clear and the new dataset is ready, I'll run a large training for a major version with all the new features, which should be something really new.
Of course if everything goes well.
So this is like an advanced model for newbies? I just don't get the point
Hey Minthy, both v0.8 versions have issues with long_bangs: it's quite hit or miss compared to v0.7
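Summary (not a mindless translation, but written so that even simple-minded people can follow it; if you still can't understand it after this, then with that IQ you should quit image generation and go home to play with yourself):
What is this model?
It is an AI model made specifically for drawing **anime-style** pictures (based on the SDXL architecture). Think of it as a "genius artist" that has been intensively trained on a huge number of images (13 million).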
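### What makes it so good? (core strengths)
1. Obedient (high prompt adherence): it draws what you ask for and rarely "improvises" strange things. It is good at understanding both **tags** (e.g. 1girl, blue hair) and **natural language** (e.g. "a girl with blue hair").
2. Knowledgeable (broad knowledge): it recognizes over 50,000 artist styles, countless anime characters, and many specific concepts (e.g. all sorts of things involving tails/ears).
3. Draws well (excellent aesthetics): vivid colors, smooth gradients, and little tendency toward blown-out highlights or crushed blacks. Because the training data is clean, the images rarely carry annoying watermarks.
4. Clear-headed (tag bleeding solved): earlier models tended to mix up character traits (e.g. drawing character A with features of character B); this model largely solves that problem.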
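### The "golden rules" for using it (how to get good results)
This is the most important part; without it the model may not show what it can do.
1. Artist styles need "special treatment"
* Rule: if you want to imitate an artist's style (e.g. by wlop), you must add a "BREAK" after the style tags (if you use A1111-type software), or put them at the **very end of the prompt**.
* Reason: this lets the AI clearly know that "these are style references", so it does not mix them up with the character description (e.g. "1girl"). This is its biggest difference from other models.
* Wrong: 1girl, by wlop, by kantoku, ... (mixed together like this, the results get worse)
* Right: by wlop, by kantoku BREAK 1girl, ... or 1girl, ... by wlop, by kantoku
2. How to refer to artists and characters
* Artists: you must use the "by artist name" format; this is mandatory. For example, by kantoku.
* Characters: preferably use the standard Booru tag format. If the character name contains parentheses, escape them with backslashes: karin_(blue_archive) is written as karin \(blue archive\). Adding the skin/outfit tag makes it more precise, e.g. karin \(bunny\) \(blue archive\).
3. A short negative prompt is enough
* There is no need to write a long pile of miscellaneous words. In general, the negative prompt (things you don't want to see) only needs worst quality, low quality, watermark.
4. Make good use of the "brightness/color" control tags
* If you want to control how bright or dark the image is, or its colors, you can use special meta tags such as low brightness, high saturation, hdr (high dynamic range), etc. These work well in both versions.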
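The formatting rules above (the "by" prefix for artists, escaped parentheses, and style tags separated by BREAK or moved to the end) can be sketched as a small helper. The names `escape_tag` and `build_prompt` are hypothetical and the underscore handling is a simplification, so treat this as an illustration rather than an official tool.

```python
def escape_tag(tag: str) -> str:
    """Turn a Booru-style tag into prompt form.

    Underscores become spaces and parentheses are backslash-escaped.
    Simplification: emoticon tags such as ^_^ would need special handling.
    """
    text = tag.replace("_", " ")
    return text.replace("(", r"\(").replace(")", r"\)")


def build_prompt(artists, content_tags, use_break=True):
    """Join content tags and artist styles, keeping styles in their own part."""
    style = ", ".join(f"by {escape_tag(a)}" for a in artists)
    content = ", ".join(escape_tag(t) for t in content_tags)
    if not style:
        return content
    if use_break:
        # A1111-style: styles in a separate chunk before BREAK.
        return f"{style} BREAK {content}"
    # Fallback: styles at the very end of the prompt.
    return f"{content}, {style}"


tags = ["1girl", "karin_(bunny)_(blue_archive)", "blue_hair"]
print(build_prompt(["wlop", "kantoku"], tags))
print(build_prompt(["wlop", "kantoku"], tags, use_break=False))
```

In ComfyUI there is no BREAK keyword by default, so the two parts would instead be encoded separately and joined with a conditioning concat node, as the model description suggests.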
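### Two different versions: Epsilon and Vpred
The model comes in two variants with slightly different "personalities":
* Epsilon version (the mainstream one):
Recommended guidance scale (CFG Scale): around **7**.
Characteristics: stable results, but it **relies heavily on the brightness tags**. Unless you manually lower the brightness (low brightness), it may struggle to produce true pure black.
* Vpred version (the newer one, with more potential):
Recommended guidance scale (CFG Scale): **3~5**, not too high.
Characteristics: fuller colors, less likely to ruin an image (overexposure), and a wider brightness range (it can easily produce pure black and pure white). Reportedly, in rare cases the contrast may be slightly off and need some tweaking.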
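### Quick summary
* This is a top-tier SDXL anime model that is particularly good at understanding complex prompts and a huge range of artist styles.
* The core trick: separate artist style tags with "BREAK", or move them to the end of the prompt.
* The negative prompt does not need to be complicated; a few simple words are enough.
* Which version to choose? For convenience use Epsilon (remember to adjust brightness); for richer colors and stronger black/white contrast try Vpred (remember to lower the CFG).
* Remember: it is far smaller than newer models like Flux. It does its best, but it is not all-powerful, and sometimes you need several rolls to get a satisfying result.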
Mind making something like this for SD 1.5? The only reason I care is that you can run SD 1.5 on phones, and really fast too, with apps like Local Dream that use the Qualcomm NPU. Bafflingly, it honestly feels better than my GTX 1080 did back when SD 1.5 was the only option. So an updated model better fitted to memory-constrained phones would be amazing. SD 1.5 is pretty dumb, though, and I'm not sure where you would improve it, so I'm not sure if it's worth it; just throwing it out there. It could be a better, more modern base with a similar memory footprint, but that would likely require pretty much a full retraining if it's generalized. I'd really like to have an okay-ish model on my phone that I can generate away with in bed for fun. It would make my day :)
Doable? Yes, but SD 1.5 has fewer parameters, so you will never get the same knowledge as the SDXL based one.
@bl4ckfuture107 yes, but there are definitely still devices that can't run full-fat SDXL, mostly phones
@GPUPoorChad I agree with you, really. I would love to see an SD 1.5 implementation to run on my AMD NPU
Do you need to have regional prompter turned on in order for v0.8 to function properly?
No, BREAK works without any modifications; it splits the prompt into ~75-token chunks.
Any plans for anima?