Introduction
Everything written below is based purely on my personal experience and observations. I can’t guarantee everything is correct, so take it as reference material only. Discussion and corrections are always welcome.
As an open-source model focused on photorealistic image generation, Z-Image-Base (ZIB; official page) is easily one of the current SOTA models. Its image quality is extremely strong, and its prompt adherence and fine-detail control are honestly on another level. For example, you can modify the style of a single button purely through prompts without noticeably affecting the rest of the image.
However, ZIB is much weaker when handling spatial relationships between multiple subjects, especially multiple characters. Basic two-person poses are usually manageable, but uncommon poses often fail badly, with frequent misalignment and anatomical issues.
For threesomes or even more complex interaction scenes, no matter how detailed the prompt is, the experience feels like rolling a gacha: it can occasionally generate good results, but most outputs end up with broken anatomy or spatial errors.
Because of this, I started experimenting with ControlNet + multiple preprocessors to guide composition more reliably.
The combination I used was ZIB + 8 Steps LoRA + ControlNet. After testing them together, I found that ZIB’s ControlNet can solve certain structural problems, but the resulting image quality still often feels lacking:
overall sharpness is weaker
lighting tends to look flat and gray
prompts related to lighting respond poorly
ControlNet strength at 1.0 often produces terrible-looking results
Under the 8 Steps LoRA setup, adjusting CFG (usually between 1 and 2) can sometimes help, but the workflow still feels heavily constrained by the base image. Some reference images even produce extremely strange lighting behavior.
Overall, the experience feels very different from SDXL workflows, where high-quality results often work almost out-of-the-box with minimal tweaking. I’ve also seen similar complaints on Reddit about ZIB’s ControlNet implementation feeling relatively weak.
Another important issue is that in ZIB, generation resolution directly affects focal length and depth-of-field behavior. Different resolutions can produce dramatically different compositions and camera feel. Solving this became one of the key goals of my workflow.
Workflow
After a lot of experimentation, I ended up building a relatively simple ControlNet workflow that gave me much more satisfying results.
The core idea is straightforward:
1. Use any checkpoint you like together with the 8 Steps LoRA.
(This is extremely important. For ZIB, the 8 Steps LoRA is almost mandatory in my opinion; it significantly improves image quality and detail rendering. Of course, if your checkpoint already has something similar baked in, you don’t need to add it separately.)
2. Use ControlNet at a suitable resolution (usually a relatively low one) and perform a short initial sampling pass (around 2 steps).
This stage establishes a solid base latent with correct character positioning and spatial relationships.
3. Upscale the latent to the target resolution, then remove ControlNet and continue sampling using only the checkpoint + 8 Steps LoRA.
This preserves the structural consistency from ControlNet while dramatically improving image quality, lighting, and detail in the second pass. More importantly, you can freely adjust the second-pass resolution without heavily affecting focal length or depth-of-field behavior.
The “small-resolution first pass + large-resolution second pass” approach also works well outside of ControlNet workflows: it helps reduce the coupling between focal length behavior and image sharpness/resolution.
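To make the two-pass structure concrete, here is a minimal Python sketch. The `sample` callable is a hypothetical stand-in for whatever sampler backend you use (node UI, diffusers, custom script), and the 8x VAE downscale factor is an assumption carried over from typical latent-diffusion models, not something I can confirm for ZIB:

```python
import torch
import torch.nn.functional as F
from typing import Callable

def two_pass_generate(
    sample: Callable[..., torch.Tensor],  # hypothetical: your sampler backend
    init_latent: torch.Tensor,            # empty latent at the low first-pass resolution
    final_hw: tuple[int, int],            # final pixel resolution (height, width)
    first_steps: int = 2,
    second_steps: int = 6,
) -> torch.Tensor:
    # Pass 1: short ControlNet-guided pass at low resolution to lock in
    # character positions and spatial relationships.
    latent = sample(init_latent, steps=first_steps,
                    controlnet=True, controlnet_strength=1.0)

    # Upscale the latent itself (not the decoded image) to the target size.
    # The // 8 assumes the usual 8x VAE downscale factor.
    h, w = final_hw
    latent = F.interpolate(latent, size=(h // 8, w // 8),
                           mode="bilinear", align_corners=False)

    # Pass 2: drop ControlNet entirely and finish sampling with only the
    # checkpoint + 8 Steps LoRA to recover quality, lighting, and detail.
    return sample(latent, steps=second_steps, controlnet=False)
```

In a node-based UI this is just two sampler nodes with a latent upscale node between them; the important detail is that the upscale happens in latent space.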
Recommended Models
Personally, I use a 1:1 merge of:
Big Love
Pornmaster
I feel Big Love performs very well in anatomy and clothing structure, while Pornmaster produces character aesthetics that fit my taste better.
The merged result feels surprisingly balanced in actual use.
Recommended Sampler
Under ZIB + 8 Steps workflows, I strongly recommend samplers that inject noise at every step.
These samplers consistently produce better anatomy and micro-detail quality than more deterministic alternatives.
My personal recommendation is:
Euler A
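For what it’s worth, in ComfyUI this sampler is named `euler_ancestral`. If you drive generation from Python with diffusers instead, the Euler A equivalent is the ancestral Euler scheduler. A hedged sketch (the model path is a placeholder; I have not verified that ZIB checkpoints load into a diffusers pipeline):

```python
from diffusers import DiffusionPipeline, EulerAncestralDiscreteScheduler

# Placeholder path: swap in however you actually load your ZIB checkpoint.
pipe = DiffusionPipeline.from_pretrained("path/to/your-zib-checkpoint")

# Euler A = ancestral Euler: fresh noise is injected at every step.
pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)
```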
Key Parameters
In this workflow, there are basically only three major parameters you need to tune.
1. ControlNet Resolution (First-Pass Resolution)
The first sampling pass establishes the base composition latent, so this resolution matters a lot.
I usually default to: short edge = 768
This feels like a very balanced starting point.
In ZIB, lower resolutions effectively produce a “longer focal length / shallower depth-of-field” look:
subjects become larger
background elements become fewer and more compressed
the model focuses more attention on the main subjects
prompt responsiveness for character details improves noticeably
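If you want to derive the first-pass size from a reference image programmatically, a small helper like the one below works. This is my own convenience sketch, not part of any ZIB tooling; snapping both sides to a multiple of 64 is a conservative assumption for latent models:

```python
def first_pass_size(ref_w: int, ref_h: int,
                    short_edge: int = 768, snap: int = 64) -> tuple[int, int]:
    """Scale the reference aspect ratio so the short edge lands on `short_edge`,
    snapping both sides to a multiple of `snap`."""
    scale = short_edge / min(ref_w, ref_h)
    w = max(snap, round(ref_w * scale / snap) * snap)
    h = max(snap, round(ref_h * scale / snap) * snap)
    return w, h

print(first_pass_size(1024, 1536))  # a 2:3 reference -> (768, 1152)
```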
This parameter can be adjusted depending on your goal:
Situations where lowering or increasing this resolution helps:
You already know the kind of depth-of-field or focal feel you want
Adjust this value to match the desired camera look.
Your subject differs heavily from the reference image
For example: a different body type, a different pose, or weak prompt responsiveness. Lowering the first-pass resolution enlarges the subject and reduces the influence of the original image, making the output follow your prompt more strongly.
The reference image contains distracting or unwanted elements
Lowering the resolution can help suppress them, though in many cases adjusting ControlNet strength is even more effective.
Extremely low resolutions (for example 128) are usually too destructive
The initial latent becomes too small, causing heavy detail loss and significantly reducing adherence to the reference image.
2. ControlNet Strength
This controls how strongly ControlNet influences the generation.
I usually use: 1.0
Without the second sampling pass, 1.0 often produces awful-looking results.
But in the dual-sampling workflow, 1.0 works surprisingly well:
strong structural adherence
while still allowing the second pass to restore image quality and details
3. Final Resolution (Second-Pass Resolution)
This is your final upscale sampling resolution.
I usually use: long edge = 1536
This tends to produce clean and detailed images while keeping rendering mistakes relatively manageable.
Since the base latent structure has already been established during the first pass, the second-pass resolution has much less influence on focal length and depth-of-field behavior.
This gives you much more freedom to scale image quality independently.
Higher resolutions produce:
more sharpness
more texture detail
richer micro-details
But in very complex scenes, excessively large resolutions can also introduce:
incorrect clothing details
broken background objects
random hallucinated elements
In most cases, I avoid going beyond 1920.
The second-pass resolution generally has relatively little impact on prompt adherence.
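The second-pass size can be computed the same way as the first-pass size, this time scaling the long edge and applying the 1920 cap (again my own helper, with the same multiple-of-64 assumption):

```python
def second_pass_size(ref_w: int, ref_h: int, long_edge: int = 1536,
                     max_long_edge: int = 1920, snap: int = 64) -> tuple[int, int]:
    """Scale the reference aspect ratio to the target long edge,
    capped to avoid the hallucination issues described above."""
    scale = min(long_edge, max_long_edge) / max(ref_w, ref_h)
    w = max(snap, round(ref_w * scale / snap) * snap)
    h = max(snap, round(ref_h * scale / snap) * snap)
    return w, h
```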
Personal Experience & Tuning Tips
My default starting setup is usually:
first-pass resolution: 768
ControlNet strength: 1.0
second-pass resolution: 1536
Then I adjust from there based on the results.
If the generated subject differs too much from what I want
I primarily reduce the first-pass resolution to weaken the influence of the original image.
If that still isn’t enough — or if the resolution becomes so low that important details disappear — I also reduce ControlNet strength.
Typical lower limits for me are roughly:
first-pass resolution ≥ 384
strength ≥ 0.8
Though in special cases, I’ve gone as low as:
resolution = 256
strength = 0.5
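Collected as a starting config, with the tuning bounds from above as comments (the field names are my own, purely illustrative):

```python
from dataclasses import dataclass

@dataclass
class TuningDefaults:
    first_pass_short_edge: int = 768   # lower toward 384 (rarely 256) to weaken the reference
    controlnet_strength: float = 1.0   # lower toward 0.8 (rarely 0.5) if structure fights the prompt
    second_pass_long_edge: int = 1536  # raise toward (at most) 1920 for dense multi-character scenes
```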
If the reference image contains many characters or very small subjects
A second-pass resolution of 1536 may not provide enough detail density.
In those cases, I increase the second-pass resolution moderately to improve detail rendering.
Usually I stay below 1920.
Sampling Step Distribution
I usually use:
first pass = 2 steps
You can adjust this depending on your needs.
For example:
if adherence to the reference image is insufficient,
you can slightly increase first-pass steps
Personally, I generally keep:
first pass ≤ 3 steps
second pass ≥ 6 steps
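If you script the step split, a trivial guard that encodes these bounds (just a convenience sketch):

```python
def split_steps(first: int = 2, second: int = 6) -> tuple[int, int]:
    """Validate the two-pass step split used in this workflow."""
    assert 1 <= first <= 3, "keep the first (composition) pass at 1-3 steps"
    assert second >= 6, "the second pass needs >= 6 steps for quality to recover"
    return first, second
```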
When nothing seems to work
If repeated parameter tuning still fails to produce the result I want, I often take the best partially successful output and use it as the new reference image.
Then I repeat the process iteratively.
Surprisingly often, this works much better than endlessly fighting the original reference image.
End
Hopefully this workflow can help people struggling with ZIB ControlNet setups.
And finally, good luck to everyone — hope you all generate the images you actually want.
