This model is an Unstable Diffusion community project and wouldn't have been possible to train without their generous help and resources. A heartfelt thank you to them and everyone who's contributed their data, time, and experience!
You can run unlimited generations with this model on mage.space.
Please check the 'About this version' tab to the right for recommended positive and negative prompts and some training details.
Some advice ↓
This was trained at a 768x768 base with aspect ratio bucketing, so resolutions below 768x768 will likely give poor results that I can't support. Try 768x768, 768x960, or 960x768 as your standard resolutions (there's a generation sketch under the recommended prompts below).
If you're having trouble keeping the face in shot, try mentioning the subject's gaze ('looking at viewer', 'looking afar', 'looking to the side', etc.), facial features like their eyes, hair style, or expression, or putting 'head out of frame' in the negative prompt.
This model won't do well with most slang terms, as they likely weren't in the tagging data. Be careful recycling old prompts: explore the effect of each tag, or at least strip out the weighting, and consider starting from a sample image to get a feel for the model.
There were no watermarks in the data set, so 'watermark', 'artist name', etc. are meaningless. You don't need to beg it for 'high detail', 'photorealism', '8k', 'uhd', and so on; it'll do that right out of the box. Save yourself some token space and prompt for what you want to see.
If you use this model as part of a mix or host it on a generation service, please mention and link back to this page (especially if you're making money off of it).
Description
Recommended (but subject to change)
Positive: masterpiece, best quality, high quality, realistic
Negative: worst quality, low quality, anime, digital illustration, 3d rendering, comic panel, scanlation, multiple views, artist name, signature, error, text, cropped
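
For reference, here's a minimal generation sketch with diffusers that puts the recommended prompts and the 768 base resolution together. The checkpoint path and the subject placeholder are hypothetical; substitute whatever you downloaded and want to see.

import torch
from diffusers import StableDiffusionPipeline

# Checkpoint path is hypothetical; point it at the downloaded model file.
pipe = StableDiffusionPipeline.from_single_file(
    "path/to/model.safetensors", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="masterpiece, best quality, high quality, realistic, <your subject here>",
    negative_prompt="worst quality, low quality, anime, digital illustration, "
                    "3d rendering, comic panel, scanlation, multiple views, "
                    "artist name, signature, error, text, cropped",
    width=768,   # stay at or above the 768 training base
    height=768,
).images[0]
image.save("sample.png")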
This was trained on top of a merged base model consisting of: Grey Model, seek.art MEGA, Bara Diffusion, A Certain Model, RPG v4, and the Artstation model. These models were mixed using Bayesian Merger.
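
Bayesian Merger searches for per-block interpolation weights rather than using one global ratio, so the snippet below is only an illustration of the underlying idea: a plain weighted merge of two checkpoints' state dicts (file paths and the 0.5 weight are hypothetical).

from safetensors.torch import load_file, save_file

alpha = 0.5  # weight for model A; Bayesian Merger would tune this per block
a = load_file("model_a.safetensors")  # hypothetical paths
b = load_file("model_b.safetensors")

# Linear interpolation over the keys both checkpoints share.
merged = {k: alpha * a[k] + (1.0 - alpha) * b[k] for k in a.keys() & b.keys()}
save_file(merged, "merged.safetensors")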
~120k images were captioned using a combination of BLIP-2 sentence captioning, booru tags, quality modifiers, and sfw/risque/nsfw tags (depending on the content of the image). Booru tagging was done first, and that result was used to help condition the output of the BLIP-2 captioning. This helped with coherency and gave it a much-needed lewd tilt.
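
The exact conditioning format isn't documented here, but the idea can be sketched with Hugging Face transformers: run the tagger first, then feed the tags to BLIP-2 as a text prompt so the generated sentence stays grounded in them (the tag string below is a made-up example).

import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

image = Image.open("image.jpg").convert("RGB")
# Booru tags from an earlier tagging pass (hypothetical example).
prompt = "Tags: 1girl, beach, smile. A photo of"
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs, max_new_tokens=40)
caption = processor.decode(out[0], skip_special_tokens=True).strip()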
It was trained for 30 epochs using the EveryDream 2 fine-tuner, a base resolution of 768 (so lower resolutions aren't recommended when prompting), a batch size of 10, clip skip 1, zero offset noise set at 0.2, text encoder training turned on for three epochs, and a learning rate of 7.7e-7.
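
'Zero offset noise' refers to the offset-noise trick; EveryDream 2 applies it internally during training, but the common formulation looks roughly like this (a sketch, not the trainer's actual code):

import torch

def noise_with_offset(latents: torch.Tensor, offset: float = 0.2) -> torch.Tensor:
    # Base Gaussian noise, as in standard diffusion training.
    noise = torch.randn_like(latents)
    # Per-sample, per-channel constant shift broadcast over H and W;
    # this helps the model learn overall brightness and darkness.
    shift = offset * torch.randn(
        latents.shape[0], latents.shape[1], 1, 1,
        device=latents.device, dtype=latents.dtype,
    )
    return noise + shift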