Just as the title says... makes folks lick armpits :) No trigger word, just natural language.
Works on both men and women. With proper prompting you can make them lick anywhere. This was my first LoRA trained: just 8 videos at 256 resolution, 20 epochs, 1.5 hrs on a 4090.
I will most likely retrain this at a higher resolution.
Workflows should be embedded in the sample videos. The training data includes captions to try.
Thanks to dtwr434 for the missionary LoRA training data; I used pretty much the same configs.
Comments (12)
Hey! Thanks for sharing your config files. I took a look, and it seems you're using the diffusion-pipe framework.
I was initially a bit confused by the resolutions setting in your dataset.toml – seeing [480, 720] globally but then [256] inside the [[directory]] block.
But yeah, the directory setting overrides the global one, so you were definitely training at 256px as you mentioned!
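For anyone else reading along, here's roughly what that part of the dataset.toml looks like (a sketch; the directory path and repeat count are placeholders, not from the actual config):

```toml
# dataset.toml (sketch)
# Global default: buckets at 480 and 720 unless a directory overrides it
resolutions = [480, 720]

[[directory]]
path = '/data/training/videos'  # placeholder path
num_repeats = 1
# The per-directory setting takes precedence, so this data trains at 256
resolutions = [256]
```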
Seeing the high gradient_accumulation_steps = 4 and especially blocks_to_swap = 20 was quite surprising for 256px on a 24GB card.
However, it all made sense when I noticed in your training.toml that you loaded the BF16 versions of the DiT (...t2v_14B_bf16.safetensors) and the LLM (...umt5-xxl-enc-bf16.safetensors). Even though you correctly set transformer_dtype = 'float8', diffusion-pipe likely couldn't apply it properly because the source file specified in transformer_path wasn't actually an FP8 file. Loading those BF16 weights uses significantly more VRAM from the start.
Just a friendly suggestion if you want to experiment in the future: try using the native FP8 weight files! The ones that work well for me in diffusion-pipe are:
Wan2_1-T2V-14B_fp8_e5m2.safetensors (make sure transformer_dtype = 'float8' is active for this!)
umt5-xxl-enc-fp8_e4m3fn.safetensors
Using these saves a massive amount of VRAM right away compared to the BF16 versions. With that extra VRAM headroom, you could almost certainly lower your gradient_accumulation_steps (maybe even to 1 or 2?) and drastically reduce or completely eliminate blocks_to_swap. This would make your training steps much, much faster!
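Concretely, the relevant training.toml lines would look something like this (a sketch based on diffusion-pipe's config layout as I understand it; paths are placeholders and exact key names may vary by version):

```toml
# training.toml (sketch) -- FP8 setup for a 24GB card
gradient_accumulation_steps = 1  # try 1 or 2 once the VRAM frees up

[model]
type = 'wan'
# Point transformer_path at an actual FP8 file so transformer_dtype takes effect
transformer_path = '/models/Wan2_1-T2V-14B_fp8_e5m2.safetensors'  # placeholder path
llm_path = '/models/umt5-xxl-enc-fp8_e4m3fn.safetensors'          # placeholder path
dtype = 'bfloat16'
transformer_dtype = 'float8'
# blocks_to_swap = 20  -- likely no longer needed; reduce or remove entirely
```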
Anyway, just sharing what works well for managing VRAM with this setup on the 4090. Cool LoRA!
Cheers
Thanks for the detailed suggestions! Will give it a try. If you do before me, please do share your training data. Thank you!
@aipm Check my LoRAs, I have always included all datasets and configs.
It's also worth noting that if you already have your datasets prepped, with your triggers, image descriptions, and everything else ready to go, you could rent a persistent storage volume from a cloud GPU service and deploy your training setup on an on-demand GPU (or a farm of them). It sounds like a bunch of money, but a 200 GB persistent drive on most clouds is less than $20/month USD, and places like Hyperstack offer GPUs for as low as $0.50/hr for an RTX A6000; an L40 can be rented for $1/hr, an A100 for $1.34/hr, and so on, in packages of 1x, 2x, 4x, and 8x GPUs.
So if you do a lot of training, you'll save a ton of time and get to use commercial cards with TONS of VRAM (the A100 has 80 GB of VRAM and almost two dozen CPU cores over at Hyperstack, for less than $1.50/hr). You'll also be able to use the full floating-point precision of the uncompressed models, which will ALWAYS yield much better LoRA results. Training on fp8 for video versus fp16 on the same dataset is a big difference. It's worth the few dollars each LoRA will cost, and you can do many more epochs with much "bigger" settings all across the board, still finish a training run in under an hour, and have a boatload of epoch saves to choose from.
Hyperstack isn't the only one either, just the one I'm using at the moment: lots of uptime, super cheap. A good way to do it is to generate your base images and prepare your data on your local machine with whatever GPU you've got (images don't take up much VRAM or resources when generating), and when you're done and have everything ready, go to your cloud, attach your drive to a virtual machine with the GPU you want, fire up the training session, and crank out tons of LoRAs. Once you get efficient with your movements, you can usually do something like 3 LoRAs with all the settings turned up, training at fp16 or fp32, within an hour. Hell, even ONE ultra-high-quality LoRA for $1.50 or less is a good deal, especially since the GPU you'd rent costs as much as a new car.
@BreezyHeezy Makes sense. Let me try that for my next LoRA. I did train a golden shower LoRA on my 4090 that turned out fantastic. Looks like Civit doesn't want those anymore :) lol
@aipm Upload to huggingface and provide us the link XD
@BreezyHeezy Yeah, you're pretty much spot on there. For anyone seriously grinding out LoRAs, especially if you're doing a lot of them or need that massive VRAM for complex datasets or higher precision, diving into cloud GPUs like Hyperstack or others definitely makes a ton of sense. The breakdown you gave shows it can be surprisingly affordable for the power you get, and the time savings plus access to beasts like the A100 with 80GB VRAM is a huge advantage. Being able to crank out multiple high-quality models quickly is a big deal for heavy users.
That being said, for folks just messing around, treating it as a hobby, or only training models now and then for fun or the community, setting up and paying for cloud instances might be overkill. If you're not constantly training, the convenience and zero extra cost of just letting it run on your home rig (even if it takes longer) is often good enough.
Now, about the whole fp16/fp32 vs fp8 precision thing – honestly, I haven't really seen a massive practical difference in the final LoRA quality myself. Maybe technically there's a difference on paper or in very specific edge cases, but visually, for the stuff I do? It's subtle at best. Certainly not enough of a difference to justify the significantly longer training times and potential cloud costs that come with fp16/fp32, at least for my workflow.
Pretty much all the LoRAs I've put out are trained on fp8, and the feedback's generally been great. People don't seem to be complaining about quality issues stemming from the precision; if anything, they seem happy with the results. So, for me and probably a lot of others just sharing stuff casually, the speed and resource efficiency win of fp8 is way more valuable right now than chasing that theoretical bump from higher precision.
So yeah, totally agree – cloud is the way to go for power users churning out models constantly or needing top-tier hardware specs. But for the casual crowd or hobbyists, sticking local and using fp8 is often perfectly fine, faster for them end-to-end, and way more accessible. Different strokes for different folks, right?
Any plans for img2vid?