Overview
This full fine-tune (FT) aims to fundamentally improve the SD1.5 model. Goals include multi-character scenes, pose diversity, stable body structure, and more.
The base model is an anime-style model incorporating elements of NAI2, and I aim for version 1 through repeated small-scale FTs of several thousand images. I plan to create several base models as raw materials, then improve the learning method while ultimately merging and adjusting them.
High-resolution output is supported to some extent, but is not recommended at all.
Unless otherwise noted, all samples are low-resolution LCM output.
Note: since this is SD1.5, put what you want in the output first in the prompt. In many cases, quality prompts are just a nuisance.
I now have five types of FT materials. I'll stop using FT model materials for now. I'll combine these five with existing materials to explore new models.
Qwen's output isn't particularly interesting, but it's stable and rarely breaks down, so I plan to use 0.3 (which may need expansion) as a base and supplement it with NSFW elements such as those in 0.4.
When combined with existing models, something like TeatimeDream Neo will be created.
ver.0.50
This is a semi-realistic version of a series in which I am trying out, on SD1.5, some of the materials used in LucidDreamer for Z-Image. With Z-Image, I tried mixing in data with strong individual characteristics, but that failed, so for SD1.5 I am training only on cleaner images.
In version 0.4, fingers and other parts started to look distorted, so I decided to roll back and remake it, but there wasn't much improvement. The cumulative training amount has decreased, so there's a possibility of a downgrade in specifications. Were the fingers in SD1.5 always this bad?
Training on whole frames alone left the hands unnaturally low quality, so I made a slight adjustment with LoRA (I had released earlier versions without worrying about this, and the rollback wasn't worthwhile).
This time, I dealt with it by adding HandEnhance. It seems to see occasional use in CIVITAI generation, though, if I may say so myself, it's difficult to use. Here I applied about 20%, mainly from ver.0, with a little under 10% of ver.1 as a supplement; using too much strongly affects the image. A perfect fix may not be possible, but it is a definite improvement.
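The blend described above (roughly 20% of one LoRA version plus a little under 10% of another) can be sketched as plain state-dict math. This is a minimal illustration with toy numpy matrices, not the actual HandEnhance weights; the low-rank update form `W + scale * (up @ down)` is the standard LoRA convention.

```python
import numpy as np

def apply_loras(base_weight, loras, scales):
    """Add several low-rank LoRA deltas to one base weight matrix.

    Each LoRA contributes scale * (up @ down), the usual low-rank update.
    """
    merged = base_weight.copy()
    for (up, down), scale in zip(loras, scales):
        merged += scale * (up @ down)
    return merged

# Toy example: a 4x4 base weight and two rank-1 stand-ins for the
# HandEnhance ver.0 and ver.1 LoRAs (hypothetical values).
rng = np.random.default_rng(0)
base = rng.normal(size=(4, 4))
lora_v0 = (rng.normal(size=(4, 1)), rng.normal(size=(1, 4)))
lora_v1 = (rng.normal(size=(4, 1)), rng.normal(size=(1, 4)))

# Roughly 20% of ver.0 plus a little under 10% of ver.1, as in the text.
merged = apply_loras(base, [lora_v0, lora_v1], [0.2, 0.08])
```

In a real merge the same update would be applied per-layer across the whole state dict, with the scale acting as the "strength" slider most UIs expose.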
Incidentally, I've also added an eSSR-type LoRA to enhance perceived resolution.
ver.0.41SR
What I learned from retraining is that even using NAI2 as a base, the resulting model is significantly inferior to 0.4SR. The base model lacks sufficient power. Therefore, I merged the additional elements from 0.31 (included in the 0.4SR base) into the partially trained 0.41SR model and ran it through the final FT, but even then, the expected level of detail accuracy was not achieved.
0.4SR is superior in terms of fidelity to prompts. While 0.41SR can produce diverse images, it is inferior in terms of fidelity. Details and body structure may be comparable or slightly improved.
On the positive side, the overall color tint has been reduced. Also, in 0.4SR the image would break down with large prompt weights like (Asian:2.0), but this tendency has disappeared.
You can produce compositions similar to 0.4SR by taking only the CLIP portion from 0.4SR (this is easy in ComfyUI, and also possible with a simple merge when using Forge and the like).
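The simple-merge route mentioned above amounts to a key-prefix swap between two checkpoint state dicts. The sketch below uses toy dictionaries standing in for real tensors; SD1.5 checkpoints typically keep the CLIP text encoder under the `cond_stage_model.` prefix, but verify the layout of your own files before relying on it.

```python
def swap_clip(unet_source, clip_source, clip_prefix="cond_stage_model."):
    """Take every tensor under the text-encoder prefix from `clip_source`
    and everything else (UNet, VAE, ...) from `unet_source`.
    """
    merged = dict(unet_source)
    for key, tensor in clip_source.items():
        if key.startswith(clip_prefix):
            merged[key] = tensor
    return merged

# Toy state dicts standing in for 0.41SR (UNet kept) and 0.4SR (CLIP taken).
model_041sr = {"model.diffusion_model.a": 1.0, "cond_stage_model.b": 2.0}
model_04sr = {"model.diffusion_model.a": 9.0, "cond_stage_model.b": 7.0}

merged = swap_clip(model_041sr, model_04sr)
```

This is the same operation the ComfyUI route performs implicitly when you load the UNet from one checkpoint and the CLIP from another.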
ver.0.4SR
This model was created by fine-tuning (FT) NeoSD ver.0.31RE on the dataset used for LucidDreamer Z ver.0.7. The Z-Image version is similar, but this one incorporates a new training method and also serves as a test of it. Because SD1.5 trains about five times faster, and Z-Image training itself is unstable, this version was released first.
The new method I've been developing for quite some time has had both successes and failures, but I've made significant progress in accelerating development speed. Diversification of composition and poses, as well as color preservation, have been achieved to some extent. However, the body structure and fingers are still somewhat unstable because the training process involves dismantling and then rebuilding them. Well, it's SD1.5, so please understand. Also, there are still poses that the model struggles with.
This time, I used a base model that retains elements of NAI2. The main motivation for introducing NAI2 previously was higher resolution, but that turned out to be inefficient, so in version 0.33 I reverted to a direction without NAI2. However, NAI2 includes elements other than higher resolution, and I think those are worth focusing on this time. In the next version, I might base the model on NAI2 itself, since that seems more advantageous assuming a large amount of training that doesn't rely on SD1.5 assets.
The current model should have around 2.5 million* training steps, yet it trains within a realistic timeframe. Of course, there are downsides to speeding things up, so I'll probably keep testing and adjusting for a while. The method is better suited to DiT models, so SD1.5 requires specific adjustments.
Also, because the dataset was configured for Z-Image, it sometimes contains excessively NSFW representations. The elements that need correction in that model are concentrated in that area. This is partly because the prompts used to generate the training images were collected from CIVITAI, so there are naturally many of those.
* (Initially I had set it at 3.5 million, but after recalculating, I lost track of the exact number of steps. I'm pretty sure about the 1.3 million for the first stage and the 400,000 for the third stage, but the second stage is uncertain. It's at least around 800,000, so I've revised it to 2.5 million.)
ver.0.34
This is a model based on version 0.33, trained to output SDXL-based images. Simple prompts produce reasonably good images, but body stability is compromised. The prompts also require a more detailed description of the scene. Noisy information, such as artist selection and quality prompts, often degrades image quality (although this is not limited to this version).
ver.0.33 Anime
This is a model in the same series as 0.31 and 0.32. The number of materials has been increased, especially the background materials.
A major change is the removal of NAI2. This was due to the decision that there was no benefit to producing large images with SD1.5. NAI2 does have other advantages, but as the ratio decreases, these also become weaker, and it is thought that they can be replaced with new materials.
The model is released as fine-tuned, without adjustment. The sample image size has been slightly increased, but HiRes.Fix and other effects have not been used.
ver.0.31RE
I am paying attention to the Qwen-image series as a model that includes many parts that are lacking in SD1.5. I think the 0.31R was also a good base model. 0.31RE is an example of its application.
Since it's Qwen-image, it's natural that there will be a lot of Asian characters, but some people may not like that. As a photo model, there are some parts that are quite imperfect.
Therefore, this version produces European-style output while preserving the composition as much as possible. Roughly speaking, you can think of it as replacing just the OUT side with the SDXL series. More precisely, I replaced about 90% of the OUT layers with a model trained on SDXL-series output based on 0.31R.
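Replacing most of the OUT side while leaving everything else intact amounts to a block-restricted weighted merge. The sketch below uses toy scalar "tensors" and the common SD1.5 key prefix `model.diffusion_model.output_blocks.`; it illustrates the idea, not the exact procedure used here.

```python
def merge_out_blocks(base, donor, ratio=0.9,
                     out_prefix="model.diffusion_model.output_blocks."):
    """Blend only the UNet OUT blocks toward `donor` at `ratio`,
    leaving IN/middle blocks, CLIP, and VAE untouched.
    """
    merged = {}
    for key, weight in base.items():
        if key.startswith(out_prefix) and key in donor:
            merged[key] = (1.0 - ratio) * weight + ratio * donor[key]
        else:
            merged[key] = weight
    return merged

# Toy example: one IN block and one OUT block per model.
base = {"model.diffusion_model.input_blocks.0": 1.0,
        "model.diffusion_model.output_blocks.0": 0.0}
donor = {"model.diffusion_model.input_blocks.0": 5.0,
         "model.diffusion_model.output_blocks.0": 10.0}

# ~90% of the OUT side comes from the donor, as described above.
merged = merge_out_blocks(base, donor, ratio=0.9)
```

Most merge UIs expose this as per-block weights (the "OUT" sliders in block-weighted merging); the code above is the same arithmetic spelled out.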
Thanks to the distilled SDXL, there are fewer cases where characters end up undressed all the time.
ver.0.31R
Ver.0.31 was like a distillation of the anime-style parts of Qwen-image. Ver.0.31R is like a distillation (sort of) of the photo-style parts of Qwen-Image. Because it's small-scale, the effect is limited, but it still produces images with the Qwen-image vibe. It also includes some AuraFlow materials.
The model is released as fine-tuned, with no special adjustments.
Naturally, the image won't be the same as Qwen-image. Faces tend to appear smaller, so HiRes.Fix is used in the sample.
ver.0.32L
I tried using LoRA to compensate for the unstable parts of 0.32. Anime images are now relatively stable, but since I packed a lot of character elements into one LoRA, there are a few more NSFW elements. It is one of the LoRAs I've been using for a while, but because I built it by crawling CIVITAI's anime drawings and captions, its NSFW elements were too strong. I adjusted the layers before using it, but to compensate I applied it a little too strongly. Even so, some images still don't look like anime drawings.
It's not a big problem. Ideally, you should adjust it using multiple LoRAs, but this produces some interesting images.
ver.0.32
When I was checking the data from ver. 0.31, I discovered that some of the caption data was missing entirely.
The extension of some images, or rather the format of the referenced files themselves, was incorrect. I thought I had corrected that and other minor issues with character codes, but there are some areas that are working well and some areas that are not working at all. In addition, the convergence rate is lower than last time. I imagine that it will probably settle down after around 150 epochs, but I extracted data from 90 epochs here.
It is disappointing that the basic issues have not changed much and that the quality cannot be said to have improved, but this version has corrected the errors in the previous data.
ver.0.31
Last time, I mentioned that the version 0.3 series, which primarily uses the output of Qwen-image, would be the base model. However, since 0.3 had an extremely small number of image resources (Qwen-image's images barely change even when the seed is changed), I added more resources and reworked the base model to create version 0.31. While stable, the Qwen images were a bit boring, but I've tried to add some variety.
In fact, version 0.3 was a model that trained with an unprecedented convergence rate, but adding more resources has made it less stable than expected. The body structure and fingers have become quite unstable.
More unexpectedly, the art style is unstable: I intended to produce stable anime images, but they sometimes come out semi-realistic. Try removing prompts like masterpiece and best quality (in some cases, adding them may work better). This may be due to remaining issues with the base model or the captions.
As such, the release of versions 0.32 and 0.33 may be on the way.
That said, as base-model material, I think 0.31 can produce images not seen in previous SD1.5 models. However, since it is unadjusted after FT, I do not recommend using it alone.
As usual, the samples are raw 512x768 LCM output. Mid-distance faces should obviously be processed with HiRes.Fix or Adetailer, but no processing has been done.
ver.0.5
This is a model with large, dynamic poses. While convergence wasn't bad, the images weren't stable, so I ended up training it for 100 epochs.
ver.0.4
This version uses different materials and more images than before. Approximately 10,000 images were used, and it took 60 epochs.
The learning convergence rate was slow, affecting the body structure and the details, but it produces beautiful images when it gets it right. It uses materials from a similar series to 0.1 and 0.2, so it produces similar images.
It has clear strengths and weaknesses in responding to each prompt, and may have some quirks. Since it's primarily for material use, I'll consider how to utilize it when merging.
ver.0.3
This is based on the output of Qwen-image. There are earlier versions, but they have a Qwen-like feel that's almost laughable, even down to the SFW elements. ver.0.3 itself was regenerated without those elements, so the Qwen feel is somewhat diminished. This time, due to issues with the Qwen environment, the VAE had problems, resulting in inferior finger accuracy and color reproduction. However, I still think it's not bad as new material for SD1.5.
ver.0.1+0.2K
A simple tweak didn't make it look very cute, so I added some cutesy LoRA (which I don't usually use because it has strong side effects). If it works, it can be used as is, but the fingers and other parts tend to break down easily. Would it be better to apply it only to the face in Adetailer? (Would it have been okay to just release the LoRA?)
ver.0.1+0.2
Merge example. This is a combination of the ver.0.1 composition and the ver.0.2 character and painting style, lightly applying my usual LoRA tools. I focused on the details of the mid-distance face and the background. I only polished up some rough edges, but I think it's good enough to be used normally.
ver.0.2_38
This version is made using a completely different material series than ver.0.1 (though there are many similar images). I think this version is more stable in terms of character and anime illustrations, but the variety of poses is inferior to ver.0.1.
ver.0.1_41
While it worked reasonably well, I felt that 100 epochs was excessive, so I reworked this version at 41 epochs, revising the materials and changing the captions. In exchange for lowering the epoch count, I increased the materials 1.5-fold (to approximately 4,500 images). I also attempted to unify the anime art style. The details are a bit sloppy and the fingers a bit unstable, but facial details can easily be corrected with HiRes.Fix or LoRA, so it shouldn't be a problem. Would a few more epochs help? If anything, the body structure tends to become unstable as epochs increase.
ver.0.1
This is the output of an anime-style model, fully fine-tuned for 100 epochs. This is my second full FT model.
It feels more stable than my first attempt, but the overall finish isn't quite there yet. It would probably be better to adjust it by merging, but I'll try FT alone for a while.
Looking back, I wonder why EtudeFT was so difficult. Perhaps it was a problem with the base model.