This is a LoRA I trained on my own computer; I am still working out how to tune the parameters and improve the dataset.
The current version is 0.1 and still needs a lot of refinement.
Description
Trained with a combination of English tags and natural-language Chinese captions generated by an LLM, for a total of 32 epochs and 10,624 steps. The flow matching loss decreased from 0.37 to 0.341..., but the ideal range would be 0.22~0.23, and a very good value would be 0.1~0.12.
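For reference, this is roughly what the flow matching objective being tracked looks like. This is only a minimal sketch, with a hypothetical `model(x_t, t, cond)` velocity-prediction signature, not the actual trainer used here.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x0, cond):
    """Minimal rectified-flow style loss: predict the velocity (noise - x0)
    at a random timestep along the straight interpolation path."""
    noise = torch.randn_like(x0)
    t = torch.rand(x0.shape[0], device=x0.device)    # uniform timesteps in [0, 1]
    t_ = t.view(-1, *([1] * (x0.dim() - 1)))         # broadcast over latent dims
    x_t = (1.0 - t_) * x0 + t_ * noise               # linear interpolation between data and noise
    target = noise - x0                              # velocity target
    pred = model(x_t, t, cond)                       # model predicts the velocity
    return F.mse_loss(pred, target)
```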
Based on my training experience with qwen-image and flux (those LoRAs are company assets, so I have not uploaded them to Civitai), I believe the issue lies in the prompt structure.
Currently, I need to find a good prompt structure to solve these training problems, but this is difficult. Both pure natural language and pure tags have inherent flaws, and they should be integrated into a better, standardized structure. The quality of the Chinese prompt captions is poor, which is why convergence is not very good...
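A rough sketch of what such a standardized hybrid caption might look like; the format, tags, and sentence below are purely illustrative and are not the structure actually used for this version.

```python
# Hypothetical hybrid caption format: a fixed tag block followed by one
# LLM-generated natural-language sentence, so the model always sees both
# signals in a consistent order.
def build_caption(tags, nl_sentence):
    tag_block = ", ".join(tags)
    return f"{tag_block}. {nl_sentence}"

caption = build_caption(
    ["1girl", "red dress", "night city", "rain"],
    "A girl in a red dress stands on a rain-soaked street at night, neon lights reflecting off the pavement.",
)
```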
In my past experience, finding a good prompt structure can speed up convergence several times over while maintaining overall quality, but that exploration takes a very long time.
Regarding the overfitting issues, I suggest lowering the LoRA weight to 0.8~0.9 or even lower for better results. I believe the current base model's fine-tuning is still hampered by a poor prompt structure, which leads to undertraining. The VAE is also part of the problem to some extent.
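If you load the LoRA through diffusers rather than a UI, lowering the weight looks roughly like this. The paths and adapter name are placeholders, and the exact pipeline class depends on the base model you use.

```python
import torch
from diffusers import DiffusionPipeline

# Placeholder paths/names; substitute the actual base model and LoRA file.
pipe = DiffusionPipeline.from_pretrained("path/to/base-model", torch_dtype=torch.bfloat16)
pipe.load_lora_weights("path/to/this-lora.safetensors", adapter_name="my_lora")
pipe.set_adapters(["my_lora"], adapter_weights=[0.8])  # 0.8~0.9 or lower to soften overfitting artifacts

image = pipe(prompt="your prompt here").images[0]
```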
If there were enough funding, the fix would be to fine-tune on millions of images with corresponding prompts. That would take roughly eight to sixteen epochs to correct, and the engineering effort would be massive, requiring strong personnel coordination, which is very difficult. The learning rate could also be lowered appropriately for more directional guidance (1e-4 with large batches -> 5e-5 or lower with large batches).
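As a minimal illustration of that learning-rate change (assuming an AdamW optimizer over the trainable LoRA parameters; the helper function and its arguments are hypothetical):

```python
import torch

def make_optimizer(lora_params, refinement: bool = False):
    """Keep the large-batch setup, but drop the learning rate for the later,
    more directional phase of training (1e-4 -> 5e-5 or lower)."""
    lr = 5e-5 if refinement else 1e-4
    return torch.optim.AdamW(lora_params, lr=lr)
```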