Example prompt:
missionary sex, POV overhead view of a woman lying on her back with her legs spread having sex with a man. She's on the couch. She is moaning with pleasure. A man is thrusting his penis back and forth inside her pussy rapidly at the bottom of the screen. The camera is zoomed out and holding steady. Her auburn hair is long and curly.
Version 1.2 Updates:
I used the same training videos as before, but this time I blurred the faces. I hoped this would produce a version that doesn't alter faces. It definitely seems to work better with character loras, and the facial variety also seems a bit better.
I used https://github.com/ORB-HD/deface to blur the faces in the training videos, and then added "woman with a censored and blurred face" to the captions.
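A minimal sketch of the caption step, assuming each video has a matching .txt caption file in one folder (the folder layout and the function name `tag_captions` are assumptions for illustration, not taken from the actual training data):

```python
from pathlib import Path

# Phrase quoted in the notes above.
FACE_TAG = "woman with a censored and blurred face"

def tag_captions(caption_dir: str, tag: str = FACE_TAG) -> int:
    """Append `tag` to every .txt caption that lacks it; return how many files changed."""
    changed = 0
    for path in sorted(Path(caption_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8").strip()
        if tag in text:
            continue  # already tagged, leave the file alone
        base = text.rstrip(" .,")  # avoid ".," when joining
        new_text = f"{base}, {tag}" if base else tag
        path.write_text(new_text + "\n", encoding="utf-8")
        changed += 1
    return changed
```

Running it a second time on the same folder changes nothing, so it is safe to re-run after adding new clips.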
Version 1.1 Updates:
I added a few new videos to train on, zoomed in more on the movement itself to hopefully train that a bit better. I also lowered the learning rate to 5e-5 and bumped up the number of repeats to 30.
The result seems to work much better at lower strengths and hopefully better with character loras now.
Note that this training took 8 hours compared to v1.0's 1.5 hours. There's probably some sweet spot for learning rate and training time to get good results, but it would take more experimentation to figure it out.
I don't feel comfortable sharing the exact mp4s I trained on, as they were just ripped from online sites and I don't really have the rights to distribute them. However, I will include my training data for the config files and captions so that other people can more easily get into training. I was surprised at how quick and easy it was.
I included an example workflow in the training data download (I can't find a better way to upload the workflow), which shows how to make it work nicely with multiple LoRAs, and has dynamic prompt support.
v 1.1:
I trained on a 3090 using 11 three-second videos (24 FPS, at least 50 frames each), and it took around 8 hours to do 20 epochs with 30 repeats.
v 1.0:
I trained on a 3090 using 8 three-second videos (24 FPS, at least 50 frames each), and it took around an hour and a half to do 20 epochs with 10 repeats.
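To put the two runs side by side, here is a rough bit of bookkeeping using only the numbers above. It assumes one training step per sample (batch size 1, no gradient accumulation), which is not stated anywhere here, so treat the per-sample times as ballpark figures only.

```python
# Rough bookkeeping for the two runs described above.
# Assumption: one training step per sample (batch size 1, no gradient
# accumulation); the real config may differ.

def total_samples(num_videos: int, repeats: int, epochs: int) -> int:
    """Number of training samples seen over the whole run."""
    return num_videos * repeats * epochs

runs = {
    "v1.0": {"videos": 8, "repeats": 10, "epochs": 20, "hours": 1.5},
    "v1.1": {"videos": 11, "repeats": 30, "epochs": 20, "hours": 8.0},
}

for name, r in runs.items():
    n = total_samples(r["videos"], r["repeats"], r["epochs"])
    secs = r["hours"] * 3600 / n
    print(f"{name}: {n} samples, about {secs:.1f} s per sample")
```

Under that assumption v1.1 sees roughly four times as many samples as v1.0 (6600 vs 1600), which accounts for most of the jump from 1.5 to 8 hours.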
Comments (35)
What checkpoint are you using for inference? I also have an RTX 3090, but I am new to HunyuanVideo. There are bf16, fp8, GGUFs, fast hunyuan, and possibly more.
Personally, I'm using https://huggingface.co/Kijai/HunyuanVideo_comfy/blob/main/hunyuan_video_720_cfgdistill_fp8_e4m3fn.safetensors.
I haven't kept up with all the new versions coming out, so I'm not going to claim it's the best, it's just what I grabbed first when it became available.
If you download the training data, I included the workflow I used and you can see what models I'm using and their settings.
Hi, thanks for the files. Any tips on the captioning?
I'm not an expert or anything. I just wrote what I saw. I tried to describe what the woman looked like a bit, and the background, so it wouldn't be locked into those things. I captioned it manually since there were only 8 videos to do.
@dtwr434 Thanks! Makes sense, I'm trying to use ToriiGate-v0.3 right now, seems promising.
if the lora weight is even a bit smaller than 1, it becomes hard to get actual back and forth motion, sadly. good try anyway, keep it up!
after further testing: 1 almost guarantees good back and forth motion, from 1 down to 0.85 it gets progressively worse, and at 0.85 it's almost still
@TurboCoomer Yeah, it's a little weird how quickly it drops off, but is there a reason not to just leave it at 1? I've had pretty good success leaving it at 1 even when combining it with other loras. Just make sure you're using the "HunyuanVideo Lora Block Edit" node when using multiple loras.
@dtwr434 the reason is that at 0.85+ it starts affecting body and face generation too much, making the result look significantly closer to its dataset I guess, rather than just giving the general scene and motion. I was combining this only with the fast lora (negative value), without block edit, though I've used other loras the same way and they ran well. I'll take a look at the block edit node, thanks
Probably training longer at a lower LR would help with this. 2e-4 is fairly aggressive so it's overfitting a bit to the 8 samples. I might try a run at the same approach with 1e-5 but need to take time to get together a similar dataset.
this is so good 11/10
Do you know what the resolution of the training videos was?
The most I can process at once on my 4090 is 34-frame clips. I'm impressed you were able to deal with 50 frames.
The videos themselves were 480p, however I specified 244 as the resolution in the config file. I think this may cause it to resize it automatically when training.
thank you for posting the training settings! did you try training this with "train only double blocks"? supposedly it helps make it less blurry, haven't tried it myself yet.
No, that's not something I've heard of doing during training. However, when doing inference, I use double blocks only, so maybe that accomplishes the same thing? I'm not sure.
@dtwr434 probably. i'm setting up diffusionpipe with the gradio UI (https://github.com/alisson-anjos/diffusion-pipe-ui) and will test it at some point
Is it possible to train a lora with a 16 GB VRAM card (4080 Super)?
I'm not sure. Maybe if you reduce the resolution and number of frames enough? I'm not sure what the actual minimum amount of VRAM would be. Whether it would produce anything of value is another question. I guess you could give it a shot and see what happens.
@dtwr434 Ah okay, thanks a lot. Is there a guide or workflow for this?
@caycay43 Actually, it sounds like the person that made the cowgirl lora did it on a 16GB card, so I guess it's doable. Sounds like you need 240 resolution and 17 frames for video based on what they said.
I haven't looked for any guides, so I'm not sure about that. I just followed the instructions here https://github.com/tdrussell/diffusion-pipe, and then read the comments in the example config files and tried something. I think the resolution and frame bucket settings are the ones that have the biggest impact on the amount of VRAM you need.
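As a rough way to compare settings, the sketch below counts pixels per training sample for the two configurations mentioned in this thread. It treats the resolution value as the side of a square frame and takes 50 frames for the 3090 run, both of which are assumptions; it is only a proxy for relative size and does not predict actual VRAM use.

```python
def pixel_frames(resolution: int, frames: int) -> int:
    """Pixels per training sample, treating each frame as a square of side `resolution`."""
    return resolution * resolution * frames

# Numbers quoted in this thread; how the trainer buckets them is an assumption.
baseline = pixel_frames(244, 50)  # resolution 244, 50-frame clips (3090 run)
reduced = pixel_frames(240, 17)   # resolution 240, 17 frames (reported for a 16 GB card)

print(f"reduced / baseline = {reduced / baseline:.2f}")
```

By this measure the 16 GB settings carry roughly a third of the data per sample, almost entirely because of the lower frame count rather than the resolution.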
@dtwr434 I've been trying for days to follow what that person said, and as far as I can see it seems to be impossible. I have a 16 GB card and I run Ubuntu natively on my PC (not WSL2), and I can't even make it train on a single image with the resolution below 240, let alone a 17-frame video. It immediately hits OOM on the very first training step. My only guess as to why is that the NVIDIA drivers on Linux don't support offloading memory to RAM when you hit the cap, while the same drivers on Windows do, so under WSL2 you hit the cap and offload.
@asdrabael Ah, that's too bad.
@dtwr434 you're telling me. i gathered a whole dataset to try and make a doggystyle lora and just plain can't do it. making me wish i'd paid a little more and just got a 3090.
@asdrabael Damn. You could look into renting something with a beefy GPU and training it on that. I think some people are doing that. It might even end up better since you could likely bump up the resolution more.
@sgil There's also this, but I haven't tried it. https://github.com/kohya-ss/musubi-tuner
@dtwr434 I'm messing with that now. It has way more tools for offloading when you need to. I can actually train on it
what is your resolution and learning rate?
Fixed: It needs the VAE from:
https://huggingface.co/Kijai/HunyuanVideo_comfy/tree/main
I already had a file with that exact name, but I had downloaded it from the ComfyUI example link below. Now I'm just waiting on the download of the extra files.
I seem to be having trouble with the VAE Loader. It looks like I have the hunyuan_video_vae_bf16.safetensor which was recommended by the official ComfyUI workflow: https://comfyanonymous.github.io/ComfyUI_examples/hunyuan_video/
I get this error message (just a tiny snippet):
HyVideoVAELoader
Error(s) in loading state_dict for AutoencoderKLCausal3D: Missing key(s) in state_dict: "encoder.down_blocks.0.resnets.0.norm1.weight"....
Or am I supposed to use the models from Kijai (I've installed the nodes to ComfyUI with ComfyUI Manager): https://github.com/kijai/ComfyUI-HunyuanVideoWrapper
I'm using your workflow in your Training Data zip.
I thought it might have been the checkpoint at first, but I've already gotten the checkpoint you use:
https://huggingface.co/Kijai/HunyuanVideo_comfy/blob/main/hunyuan_video_720_cfgdistill_fp8_e4m3fn.safetensors. So that's why I'm asking if I need a different VAE.
thanks! riding cowgirl Lora would be nice too
Someone else already made a pretty good one for that: https://civitai.com/models/1058513/pov-cowgirl-position-hunyuan-video.
@dtwr434 https://www.pornhub.com/view_video.php?viewkey=649152fb681df Thanks, it is indeed a good one. Although I prefer a slightly different sex position with knees higher (like squatting). I have RTX 3090 and can train this Lora. Any guides how to do it?
@guy33 I don't know about any formal guides, but you can download my training settings from here and see how I configured everything. The rest is just a matter of cutting up 2 or 3 second videos at 24 FPS at the moment that shows the movement the best.
Personally, I used ffmpeg to do this, but there are probably fancier tools for video editing. Example:
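A minimal sketch of that kind of cut, driven from Python. This is not the exact command I ran; the file names, start time, and the helper names `build_clip_command` and `cut_clip` are placeholders, and dropping the audio with `-an` is an assumption that the trainer only needs the video stream.

```python
import subprocess

def build_clip_command(src, start, duration, dst, fps=24):
    """Build an ffmpeg command that cuts `duration` seconds from `src` at `fps`."""
    return [
        "ffmpeg",
        "-ss", str(start),     # seek to the start point (seconds or HH:MM:SS)
        "-i", src,             # input file
        "-t", str(duration),   # length of the clip in seconds
        "-r", str(fps),        # output frame rate
        "-an",                 # drop the audio track (assumed not needed)
        dst,
    ]

def cut_clip(src, start, duration, dst, fps=24):
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits with an error."""
    subprocess.run(build_clip_command(src, start, duration, dst, fps), check=True)
```

For example, `cut_clip("source.mp4", "00:01:05", 3, "clip_001.mp4")` would write a 3 second, 24 FPS clip starting at 1:05, which works out to about 72 frames.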
Amazing... doggystyle next? :P
Thank you very much for the training example. What did you use to train it? Is there a tutorial for it?
small question about used dataset: do the videos used in the dataset show women's faces, or only the lower part of their bodies?