Originally Posted: https://ernie.baidu.com/blog/posts/ernie-image
ERNIE-Image is an open text-to-image model from the ERNIE-Image team at Baidu. Built on a single-stream Diffusion Transformer (DiT) with 8B parameters in a latent diffusion (LDM) framework, it ships with a lightweight Prompt Enhancer that expands brief inputs into richer, more structured prompts to better unlock the model's capabilities. With only 8B DiT parameters, ERNIE-Image achieves state-of-the-art performance among open weights text-to-image models — and it is built not just for visual appeal, but for controllability: accurate content depiction matters as much as aesthetics. In practice, it excels at complex instruction following, precise text rendering, and structured image generation — areas where many existing open weights models still fall short.
Key Features
• Competitive performance at compact scale: With only 8B DiT parameters, ERNIE-Image remains competitive with substantially larger models and achieves leading performance among open weights models on several challenging benchmarks.
• Precise text rendering: ERNIE-Image handles dense, long-form, and layout-sensitive text especially well, producing readable and faithful results in Chinese, English, and other languages.
• Robust instruction following: The model reliably handles complex prompts, multi-object relations, and knowledge-intensive descriptions, making it well suited for tasks that demand fine-grained control.
• Structured visual generation: ERNIE-Image is especially effective on images with clear layout or narrative structure, such as posters, manga/anime storyboards, multi-panel compositions, and cohesive multi-element visuals.
• Broad stylistic range: Beyond clean graphic design and illustration-style outputs, the model supports realistic photography and distinctive stylized aesthetics, including softer, more cinematic and film-like tones.
• Easy to deploy and adapt: Thanks to its compact size, ERNIE-Image runs on consumer-grade hardware (24 GB VRAM), bringing high-quality image generation within reach for research and production use. The moderate parameter count also makes fine-tuning and adaptation straightforward for researchers and developers.
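As a rough sanity check on the hardware claim, the memory needed just to hold the 8B DiT weights can be estimated from the parameter count. This is only a back-of-the-envelope sketch: the storage formats listed are assumptions (the post does not state the released precision), and the estimate ignores the text encoder, VAE, Prompt Enhancer, and activation memory.

```python
# Back-of-the-envelope estimate of weight memory for the DiT alone.
# Assumption: the formats below are illustrative; the post does not say
# which precision the released weights use.

DIT_PARAMS = 8e9  # from the post: 8B DiT parameters

# Bytes per parameter for common storage formats.
FORMATS = {"fp32": 4, "bf16": 2, "fp8": 1}


def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Return the memory needed to hold the weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3


def estimate_all(num_params: float = DIT_PARAMS) -> dict:
    """Estimate weight memory for every format, rounded to 0.1 GiB."""
    return {
        name: round(weight_memory_gib(num_params, size), 1)
        for name, size in FORMATS.items()
    }


if __name__ == "__main__":
    for name, gib in estimate_all().items():
        print(f"{name}: {gib} GiB")
```

Under these assumptions, 16-bit weights take roughly 15 GiB, which leaves some headroom on a 24 GB card for the other components, while 8-bit weights take about half that.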
Comments (25)
So we got a new image model. I wonder how this compares to Flux 2 Klein 9B and Z Image Turbo?
Very satisfactory. There’s no real censorship in place; it just needs some fine-tuning. There’s no significant degradation of image quality when using long prompts. The overall performance is quite good. But for now, the only models available by default are Asian models.
OK in general. Turbo has some diagonal artifacts, visible on night lights. It might have limb issues a bit more often than Klein and much more often than ZIT. It is OK with complex prompts, but may render an Asian person instead of an explicitly mentioned other race. It has a built-in prompt enhancer (PE), which helps with short prompts and not that much with long ones. The PE translates prompts to Chinese, BTW, but despite that it increases the diversity of faces/races. It may generate a naked woman, but may also ignore this part of the prompt. Text is better put in “”; it may render text without quotes, but at much lower quality.
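The quoting tip can be applied mechanically when building prompts. The helper below is a hypothetical sketch (the function name and phrasing "with the text" are illustrative, not part of ERNIE-Image or any library); it only wraps the text to be rendered in double quotes.

```python
def with_quoted_text(scene: str, text: str) -> str:
    """Build a prompt that puts the text to render inside double quotes.

    Hypothetical helper: the wording of the joined prompt is an assumption,
    not a format required by the model.
    """
    # Drop surrounding whitespace and any quotes the caller already added.
    cleaned = text.strip().strip('"“”')
    # Remove a trailing period/space from the scene before appending.
    return f'{scene.rstrip(". ")} with the text "{cleaned}"'


if __name__ == "__main__":
    print(with_quoted_text("A neon shop sign at night.", "OPEN 24 HOURS"))
```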
Another fast visual comparison using 20 images (test created by Gemini)
- https://wiki.liutyi.info/display/AI/ERNIE+Image+test+2.20.gemini
- https://wiki.liutyi.info/display/AI/ERNIE+Image+Turbo+test+2.20.gemini
- https://wiki.liutyi.info/display/AI/FLUX.2+Klein+9B+test+2.20.gemini
- https://wiki.liutyi.info/display/AI/FLUX.2+Klein+base+9B+test+v2.20.gemini
- https://wiki.liutyi.info/display/AI/Z+Image+Turbo+test+v2.20.gemini
Just to see how one of the top models goes through the test:
- https://wiki.liutyi.info/display/AI/Nano+Banana+2+test+v2.20.gemini
The model has a built-in prompt enhancer, so I ran the test with and without it: the same 40 prompts and the same seed, with PE on and with PE off. The same will be done for the Turbo version. The ERNIE Image Turbo test v2 without PE is now available, and test v1 is ready for Turbo with both PE on and PE off. On HF there is a demo available for Turbo to test the model.
I can't wait for the model to be quantized so it can run smoothly on my 5080. The images I have seen look promising and I will be keeping an eye out.
It is quantized already on Hugging Face. Full range of GGUF on unsloth.
@mrmrswiggly612 Thx for letting me know :D I'll check it out!
@ferretduck thx! don't know why it didn't show up when i searched civit.
Asian faces predominate.
Guess Baidu isn't a Western company.
No worse than ZiT. Just add "caucasian" to your prompt and :gasp: it stops happening.
I just did a test on HF. Result? zImage and Klein 9b are better in terms of realism, but Ernie is slightly better at prompt following: with one shot it follows your prompt accurately, while in zImage you have to try 3 times to get it right.
Impressions so far
Pros:
Blazing fast. It’s twice as fast as zimage turbo (tested on an RTX 5080).
Perfect rendering of feet and various types of socks.
Accurate prompt comprehension with solid adherence.
Once the prompt is locked in, it enters a stable "gacha" (rolling) rhythm, consistently producing high-quality results.
Cons:
Characters are unattractive and lack variety. Additionally, celebrity name keywords are ineffective.
The probability of anatomical errors (like three legs) is relatively high.
The step time is lower than ZiT?
Seems to be less censored than other models, great with text, and fairly fast (though the turbo model is not as good as base). But it is also very hard to get fine details on things, and hands and feet are not always good. It can't generate over about 1500x1500 without creating body horror.
It's also SUPER easy to train.
Are you using ai-toolkit to train? If so, can you help with decent settings for it, please? I tried a test and couldn't get any LoRAs to work with the default workflow.
Ernie Turbo works 30-50% slower than Z turbo. It is impossible to generate at high resolutions.
On my Intel Core i9-12900KS, 5090 FE, and 64 GB of Corsair Dominator Platinum DDR5 RAM at 6800 megatransfers per second, this model is super fast, but as far as NSFW, or even just sexy women in general, I still like Z-Image better.
Made me laugh. Z-Image is at the end of the tunnel in this regard. This applies to sex.
Treat me like I'm a big dummy. How do I use this in Neo? I've downloaded the model, but I can't seem to figure out the additional required files or how to set up the ERNIE interface.
I downloaded the FP8 version, but I'm pretty sure it's the prompt enhancer that keeps crashing Comfy. Are there any workflows that don't use it?





