Trained adapter that lets an LLM serve as the text encoder for Rouwei 0.8 (and other SDXL models).
Update v0.2:
New version based on the t5gemma-2b text encoder model, with improved performance.
To run it you need the t5gemma-2b encoder model (ungated mirror; download instructions below).
You also need an updated set of custom nodes to make it work.
Detailed launch instructions and prompting tips below.
What is it:
A drop-in replacement for the CLIP text encoders in SDXL models that achieves better prompt adherence and understanding.
Similar in spirit to ELLA, SDXL-T5 and likely others, but this one is focused on anime models and advanced knowledge without censorship.
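Conceptually, the adapter is a learned mapping from the LLM's hidden states into the conditioning space the SDXL UNet expects (per-token context for cross-attention plus a pooled vector). The sketch below is purely illustrative: the layer structure, names and dimensions are assumptions, not the shipped architecture.

```python
import torch
import torch.nn as nn


class LLMToSDXLAdapter(nn.Module):
    """Illustrative only: project LLM encoder hidden states into the two
    conditioning tensors an SDXL UNet expects. Dimensions are assumptions:
    ~2304 for the t5gemma-2b encoder, 2048/1280 for SDXL conditioning."""

    def __init__(self, llm_dim: int = 2304, ctx_dim: int = 2048, pooled_dim: int = 1280):
        super().__init__()
        self.proj_ctx = nn.Sequential(
            nn.Linear(llm_dim, ctx_dim),
            nn.GELU(),
            nn.Linear(ctx_dim, ctx_dim),
        )
        self.proj_pooled = nn.Linear(llm_dim, pooled_dim)

    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor):
        # Per-token context fed to the UNet cross-attention layers.
        ctx = self.proj_ctx(hidden_states)
        # Masked mean-pool over tokens for the pooled conditioning vector.
        mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)
        pooled = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return ctx, self.proj_pooled(pooled)
```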
Key features:
State-of-the-art prompt adherence and natural-language prompt understanding among SDXL anime models
Support for both long and short prompts, no 75-token-per-chunk limit
Preserves the original knowledge of styles and characters while allowing great flexibility in prompting
Support for structured prompts that describe individual features of characters, parts, elements, etc.
Maintains full compatibility with booru tags (alone or combined with NL), allowing easy and convenient prompting
How to run latest version:
1. Install/update custom nodes for Comfy
Option a: Go to ComfyUI/custom_nodes and type git clone https://github.com/NeuroSenko/ComfyUI_LLM_SDXL_Adapter
Option b: Open the example workflow, go to ComfyUI Manager and press the Install Missing Custom Nodes button.
2. Make sure you have the latest Transformers: activate the ComfyUI venv and type pip install transformers -U
3. Download the adapter and put it into /models/llm_adapters
4. Download T5Gemma
Option a: After activating the ComfyUI venv, type hf download Minthy/RouWei-Gemma --include "t5gemma-2b-2b-ul2_*" --local-dir "./models/LLM" (correct the path if needed).
Option b: Download the safetensors file and put it into ComfyUI/models/text_encoders (will be supported with the next nodes update).
5. Download a Rouwei checkpoint (vpred, epsilon or base) if you don't have one yet
6. Use any image from the showcase as a reference workflow; feel free to experiment
Instructions for previous versions based on the gemma-3-1b LLM can be found in this HF repo.
Current performance:
This version surpasses the CLIP text encoders of various models in terms of prompt understanding. It lets you specify more details and individual attributes for each character/object that are applied more or less consistently instead of at random, make a simple comic (stability varies), define positions and build more complex compositions.
However, it is still at an early stage: there can be difficulties with rare concepts (especially artist styles) and some biases. It also drives a fairly old and small UNet that needs proper training (and possibly modification), so don't expect it to perform like top-tier open-source image generation models such as Flux and QwenImage.
Usage and Prompting with examples:
The model is quite versatile and can accept various formats, including multilingual inputs or even base64.
But it is better to stick to one of the following prompting styles:
(Examples in showcase or in HF repo readme)
Natural language
kikyou (blue archive) a cat girl with black hair and two cat tails in side-tie bikini swimsuit is standing on all fours balancing on top of swim ring. She is scared with tail raised and afraid of water around.
Just plain text. It is better to avoid very short and very long prompts.
Booru tags
Regular booru tags.
Until emphasis support is added to the nodes, avoid adding \ before brackets. Also, unlike with CLIP, misspellings will likely lead to wrong results.
Combination of tags and NL:
masterpiece, best quality, by muk (monsieur).
1girl, kokona (blue archive), grey hair, animal ears, brown eyes, smile, wariza,
holding a yellow ball that resembles crying emoji
The easiest and most convenient approach for most cases.
Structured prompting:
bold line, masterpiece, classroom.
## Asuka:
souryuu Asuka Langley in school uniform with tired expression sitting at a school desk, head tilt.
## Zero two:
Zero two (darling in the franxx) in red bodysuit is standing behind and making her a shoulder massage.
It understands Markdown (# for separating sections), JSON, XML, or simple separation with new lines and :. Prompt structuring improves results when prompting several characters with individual features. Depending on the specific case it can work very stably, work above random level in most cases, or require some rerolls, but it allows achieving things that are otherwise impossible due to biases or complexity.
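If you assemble prompts programmatically, a tiny helper like the hypothetical one below can build this Markdown-style structure from a mapping of character names to descriptions; only the format matters (global part first, then one `## Name:` block per character), the helper itself is just an illustration.

```python
def build_structured_prompt(global_part: str, characters: dict[str, str]) -> str:
    """Assemble a Markdown-style structured prompt: the global part first,
    then one '## Name:' section per character. Illustrative helper only."""
    sections = [global_part.strip()]
    for name, description in characters.items():
        sections.append(f"## {name}:\n{description.strip()}")
    return "\n".join(sections)


print(build_structured_prompt(
    "bold line, masterpiece, classroom.",
    {
        "Asuka": "souryuu Asuka Langley in school uniform with tired expression "
                 "sitting at a school desk, head tilt.",
        "Zero two": "Zero two (darling in the franxx) in red bodysuit is standing "
                    "behind and making her a shoulder massage.",
    },
))
```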
All together:
Any combination of the above. Recommended for the most complex cases.
Quality tags:
masterpiece or best quality for positive.
worst quality or low quality for negative.
It is better to avoid spamming quality tags because they can cause unwanted biases.
The current custom nodes do not support prompt weights or standard "spells". Also, (brackets) should be left as is; no need to add \.
Other settings and recommendations are the same as for the original RouWei.
Knowledge and Training Dataset:
The training dataset uses about 2.7M pictures from this dataset and a few other sources. Still quite a small number.
Training and code
Forward code example, example of obtaining hidden states from t5gemma (an illustrative sketch follows below).
Sd-scripts fork for LoRA training.
(More training code/trainer fork coming soon)
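For orientation, a minimal sketch of pulling encoder hidden states from t5gemma with Transformers might look like the following; the repo id, the use of AutoModel/get_encoder(), and the prompt are assumptions for illustration, so refer to the linked examples for the authoritative code.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumed repo id / local path; point this at wherever you downloaded t5gemma.
MODEL = "google/t5gemma-2b-2b-ul2"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
encoder = model.get_encoder()  # encoder-decoder model: only the encoder is needed
encoder.eval()

prompt = "1girl, kokona (blue archive), grey hair, animal ears, smile"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    out = encoder(**inputs)

# Last-layer hidden states, shape (batch, seq_len, hidden_size);
# these are what the adapter consumes as conditioning input.
print(out.last_hidden_state.shape)
```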
Compatibility:
Designed to work with Rouwei; works with most Illustrious-based checkpoints, including NoobAI and popular merges. The UNet parts of LoRAs work, but TE parts need to be retrained.
Near future plans:
Custom node improvements, including emphasis support
Another version trained on a larger dataset to estimate capacity and decide whether to train jointly with the encoder or leave it untouched.
If no flaws are found, it will be used as the text encoder for the large training run of the next Rouwei checkpoint.
I'm willing to help/cooperate:
Join the Discord server where you can share your thoughts, make proposals, requests, etc. Write me directly here, or DM me on Discord.
Thanks:
Part of the training was performed using Google TPUs and sponsored by OpenRoot-Compute.
Personal: NeuroSenko (code), Rimuru (idea, discussions), Lord (testing), DraconicDragon (fixes, testing), Remix (nodes code)
Also many thanks to those who supported me before:
A number of anonymous persons, Bakariso, dga, Fi., ello, K., LOL2024, NeuroSenko, OpenRoot-Compute, rred, Soviet Cat, Sv1., T., TekeshiX
Donations:
BTC: bc1qwv83ggq8rvv07uk6dv4njs0j3yygj3aax4wg6c
ETH/USDT(e): 0x04C8a749F49aE8a56CB84cF0C99CD9E92eDB17db
XMR: 47F7JAyKP8tMBtzwxpoZsUVB8wzg2VrbtDKBice9FAS1FikbHEXXPof4PAb42CQ5ch8p8Hs4RvJuzPHDtaVSdQzD6ZbA5TZ
License
MIT license for adapter models.
This tool uses the original or finetuned google/t5gemma-2b-2b-ul2 and google/gemma-3-1b-it models.
Gemma is provided under and subject to the Gemma Terms of Use found at [ai.google.dev/gemma/terms](ai.google.dev/gemma/terms).