~Amorous Lesbian Kisses~
Update: Y'all, the Wan version is pretty fire and I'm quite pleased with it. I'm gonna try to replicate those results for Hunyuan now!
Kisses for Wan: It's been a long time coming, but I've finally created a Wan version of this model! It seems competent for both T2V and I2V. A big key was using 16fps, Wan's native frame rate, so if you train Wan I'd definitely recommend that! FWIW my example videos have been interpolated to 32fps using https://github.com/GSeanCDAT/GIMM-VFI, which is really excellent. Anyway, I trained it with Musubi Tuner at 480x272, 69 frames at 16fps, on 30 videos for 2400 steps at 2e-5 with loraplus of 4. I removed the leading "amorous kissing" but otherwise the prompting format remains the same:
"close up of two young women tongue kissing. The woman on the left has red hair and is wearing a black lace choker, the woman on the right is Indian, with beautiful light skin and long straight black hair."
Tongue kissing, making out, kissing, wide shot, medium shot, and close up should all be hotwords! Wan especially picked up the "making out" keyword really nicely, and if you include it you will get lots of caresses and touches. It also manages tongue interactions better than Hunyuan. My examples were made with Musubi Tuner in about 20 minutes each!
I use Musubi with a scheduled CFG: I apply CFG on the first ten steps and the last three, and on every other step in between. This gains good speed without sacrificing much, if any, quality! I've also been experimenting with skip layer guidance, which is curious and seems to really boost quality.
Oh, I also use fp8 scaled, which is a huge boon. Musubi's implementation is online, which means you start with the full model (not the pre-scaled ones). It keeps some smaller but very important params in full precision while quantizing the weights themselves to fp8, maintaining only 2.5% quantization error (vs 12.5% for a naive cast to e4m3fn!). I've run several same-seed comparisons and it's not just good in the numbers, it's consistently the closest to the full unquantized model of any method I've tried. Comfy has fp8 scaled too, but it's done differently (the weights are saved scaled and you just load that), and I hear it's really good too. Hurray for democratizing access!
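The scheduled CFG described above can be sketched roughly as follows. This is an illustration only, not Musubi Tuner's code: the function names are hypothetical, and which of the alternating middle steps keep CFG is an assumption, since the text does not say.

```python
def cfg_schedule(num_steps, head=10, tail=3):
    """Return a list of booleans: True where a step runs with CFG.

    Assumed reading of the schedule: CFG on the first `head` steps and the
    last `tail` steps, and on every other step in between.
    """
    schedule = []
    for i in range(num_steps):
        in_head = i < head
        in_tail = i >= num_steps - tail
        # Middle section alternates; skipping CFG on the first middle step
        # is an assumption, the opposite parity is equally plausible.
        alternate = (i - head) % 2 == 1
        schedule.append(in_head or in_tail or alternate)
    return schedule


def model_evaluations(schedule):
    """A CFG step needs two forward passes (cond + uncond), a plain step one."""
    return sum(2 if use_cfg else 1 for use_cfg in schedule)


schedule = cfg_schedule(50)
print(sum(schedule), "of", len(schedule), "steps use CFG")
print(model_evaluations(schedule), "forward passes instead of", 2 * len(schedule))
```

Under these assumptions a 50-step run uses CFG on 31 steps and needs 81 forward passes instead of 100, which is where the speedup would come from.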
Original/Hunyuan:
This has been a tough nut to crack, likely because of the complex hand and tongue movements involved. Base Hunyuan will make simple platonic kisses but not much more. This LORA is focused on creating amorous, sexual kisses and making out between women. It was trained on my RTX 4070 Ti SUPER 16GB with Musubi Tuner in 12 hours. This is the first revision worth sharing; it's not perfect but can definitely make some nice things! Expect updates! Caption/prompting format:
"amorous kissing, medium shot of two nude young women tongue kissing and making out with each other in a living room. The woman on the left has her brunette hair in pigtails and a tattoo on her arm while the woman on the right has brunette hair in a ponytail. Behind them a couch with some pillows and some plants can be seen."
"amorous kissing, wide shot of two women laying on a gray couch in each other's arms, making out and tongue kissing passionately. They both have brunette hair, one is wearing a colorful haltertop and shorts and the other is wearing a white dress"
"amorous kissing, close up of two women kissing sensually in front of a bright window. The woman on the left has red hair and is wearing a black jacket, the woman on the right is wearing a beanie and thick black glasses. Both of them are wearing mascara"
Small note: "making out" was used to indicate lots of caresses and occasional sexual touches accompanying the kissing, but I don't think it took super well in this first revision! "tongue kissing" was used when there was a lot of visible, outside the mouth tongue action, "kissing" if not as much or it's contained inside the mouth. "wide shot" was used if the full body is visible, "medium shot" for waist up, and "close up" for the close ups. Oh and "passionately" was used as a modifier if the kisses were extra enthusiastic compared to the dataset overall.
Recommendations:
Weight: 0.8-1.0
Flow shift: ~9.0 @ 544p
Guidance: <= 7.0 (Too much creates more issues with hands)
Steps: 50
Frames: 61-129 (longer may or may not work, wasn't trained)
*Reports and my experiments indicate that Teacache may create issues with the LORA so please try without it if possible.
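The frame counts on this page (61 and 129 here, and 1, 41, 69, 129 in the training buckets) all have the form 4n+1, which is the clip length HunyuanVideo works with. A small helper for snapping an arbitrary length into the recommended range might look like this. `snap_frames` is a hypothetical name, not part of any tool, and the 24fps output rate is an assumption taken from the training clips.

```python
def snap_frames(requested, lo=61, hi=129):
    """Clamp to [lo, hi] and round down to the nearest 4n+1 frame count."""
    clamped = max(lo, min(hi, requested))
    return (clamped - 1) // 4 * 4 + 1


for seconds in (2.5, 4, 6):
    frames = snap_frames(round(seconds * 24))  # 24 fps assumed
    print(seconds, "s ->", frames, "frames")
```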
Dataset consisted of 26 high quality videos of women of various ages and races sharing various types of amorous kisses and making out from various distances in various states of undress. The source data was preprocessed with ffmpeg into the training clips, which were each 144 frames long at 24fps, showing only the action of interest with no scene cuts or dramatic camera movements. Further, they were cropped to show only the women in order to add some aspect ratio variation, as 95% of the source was 16:9 before processing.
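The exact ffmpeg invocations are not given, so the following is only a sketch of how one such clip could be cut. The file names, start time, and crop rectangle are made up for illustration; the ffmpeg options themselves (`-ss`, `-vf` with `fps` and `crop`, `-frames:v`, `-an`) are standard.

```python
import shlex


def clip_command(src, dst, start_s, crop=None, frames=144, fps=24):
    """Build an ffmpeg command that cuts one fixed-length training clip.

    crop is an optional (width, height, x, y) tuple in source pixels.
    """
    filters = [f"fps={fps}"]
    if crop is not None:
        w, h, x, y = crop
        filters.append(f"crop={w}:{h}:{x}:{y}")
    return [
        "ffmpeg", "-y",
        "-ss", str(start_s),          # seek to the start of the action
        "-i", src,
        "-vf", ",".join(filters),     # resample the frame rate, then crop
        "-frames:v", str(frames),     # stop after a fixed number of frames
        "-an",                        # drop audio
        dst,
    ]


cmd = clip_command("source.mp4", "clip_000.mp4", start_s=12.5, crop=(1080, 1080, 420, 0))
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it. Picking the start time by hand per clip is what keeps scene cuts and camera moves out of the result.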
Training config:
Network dimension: 36
Network alpha: 1
Learning rate: 2.4e-4
Optimizer: came_pytorch.CAME
Optimizer args: weight_decay=0.01, eps=(1e-30,1e-16), betas=(0.9,0.999,0.9999)
Steps: 2400
Warmup steps: 100
Scheduler: Constant with warmup
discrete_flow_shift: 7.0
timestep_sampling: shift
VRAM savings: --blocks_to_swap 31, --split_attn, --flash_attn
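Two of these settings interact in a way that is easy to miss, so here is a small sketch. It assumes the usual kohya-style convention that the LoRA update is scaled by alpha / dim, and the usual definition of a constant-with-warmup schedule (linear ramp, then flat). The function names are hypothetical.

```python
def lora_scale(alpha, dim):
    """Scale applied to the LoRA update under the alpha / dim convention."""
    return alpha / dim


def lr_at(step, base_lr=2.4e-4, warmup_steps=100):
    """Constant-with-warmup: linear ramp from 0, then flat at base_lr."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr


print("LoRA scale:", lora_scale(alpha=1, dim=36))
for step in (0, 50, 100, 2400):
    print(step, lr_at(step))
```

With alpha 1 and dimension 36 the update is scaled by roughly 0.028, which is likely part of why a comparatively high learning rate such as 2.4e-4 is workable here.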
Dataset was listed four times in the toml to allow processing different frame bucket lengths at different resolutions:
[general]
caption_extension = ".txt"
enable_bucket = true
bucket_no_upscale = false
[[datasets]]
video_directory = "/home/blyss/projects/art/extra/dataset/AmorousLesbianKisses"
cache_directory = "/home/blyss/projects/art/extra/dataset/AmorousLesbianKisses/cache0"
resolution = [480, 272]
target_frames = [129]
frame_extraction = "head"
batch_size = 1
[[datasets]]
video_directory = "/home/blyss/projects/art/extra/dataset/AmorousLesbianKisses"
cache_directory = "/home/blyss/projects/art/extra/dataset/AmorousLesbianKisses/cache1"
resolution = [640, 360]
target_frames = [69]
frame_extraction = "uniform"
frame_sample = 2
batch_size = 1
[[datasets]]
video_directory = "/home/blyss/projects/art/extra/dataset/AmorousLesbianKisses"
cache_directory = "/home/blyss/projects/art/extra/dataset/AmorousLesbianKisses/cache2"
resolution = [848, 480]
target_frames = [41]
frame_extraction = "uniform"
frame_sample = 2
batch_size = 1
[[datasets]]
video_directory = "/home/blyss/projects/art/extra/dataset/AmorousLesbianKisses"
cache_directory = "/home/blyss/projects/art/extra/dataset/AmorousLesbianKisses/cache3"
resolution = [1280, 720]
target_frames = [1]
frame_extraction = "uniform"
frame_sample = 2
batch_size = 2
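As a rough sanity check, the four entries above can be turned into a steps-per-epoch estimate. This sketch assumes all 26 videos contribute to every entry, that "head" yields one clip per video for a single target length, that gradient accumulation is 1, and that batches always fill completely (aspect-ratio bucketing may break that for the batch_size = 2 entry).

```python
import math

NUM_VIDEOS = 26  # from the dataset description

# (target_frames, frame_extraction, samples_per_video, batch_size) per [[datasets]] entry
DATASETS = [
    (129, "head", 1, 1),
    (69, "uniform", 2, 1),
    (41, "uniform", 2, 1),
    (1, "uniform", 2, 2),
]


def steps_per_epoch(num_videos, datasets):
    total = 0
    for _frames, _mode, samples, batch_size in datasets:
        items = num_videos * samples              # clips extracted for this entry
        total += math.ceil(items / batch_size)    # optimizer steps for this entry
    return total


per_epoch = steps_per_epoch(NUM_VIDEOS, DATASETS)
print(per_epoch, "steps per epoch,", round(2400 / per_epoch, 1), "epochs in 2400 steps")
```

Under those assumptions one epoch is 156 steps, so the 2400-step run corresponds to a bit over 15 passes through the data.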
Description
First viable version. Definitely usable but has some occasional issues, especially with the "making out" tag: hands sometimes... do un-hand-like things.
Comments (13)
Great Lora!
One of my favorite lora to use now. Nice work
When I enable TeaCache and set it to 0.15 on the HunyuanVideoWrapper node, the videos turn into slow motion.
can you explain what
frame_extraction = "uniform"
frame_sample = 2
do?
frame_sample = 2 means two sets of frames will be extracted from each source video, and uniform extraction means they will be uniformly spaced across the source. So for instance, for the [69] bucket, from each input video it will extract two clips of 69 frames spaced evenly apart (they can overlap) given the available frames in the input (in this case 144, so likely it took the first and last 69 frames). There are a few different extraction types, each with their own options, but it's kind of hard to explain them with words. A visual guide is here:
https://github.com/kohya-ss/musubi-tuner/blob/main/dataset/dataset_config.md#frame_extraction-options
Edit: For my specific use case, head only extracts one time for each N specified in [N, N, N]. So for [1, 35, 69] it extracts a 1 frame sample, a 35 frame sample, and a 69 frame sample /all starting at the beginning of the video./ But with uniform, I can specify how many samples I want to take and so get more variation from other parts of the vid. Note that for the very long frame bucket I only did a "head" sample, but for the shorter frame buckets I took 2 uniform samples from each. That was just my best guess at balancing it, though it does seem to have worked pretty well. Without the lowres 129f bucket in play, it struggled to make longer kisses. Note also I did a batch_size of 2 for the single frame buckets since they need very little VRAM comparatively. It was my hope this might help generalization. Given some of the things people are making, it definitely didn't hurt!
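The difference between the two modes can be sketched in a few lines. This is a simplified model of the behaviour described above, not Musubi Tuner's actual code: it assumes evenly spaced start frames between the first and last possible position, and that the source video is at least as long as the target. Each result is a (start_frame, length) pair.

```python
def head_starts(target_frames_list):
    """'head': one clip per target length, each starting at frame 0."""
    return [(0, n) for n in target_frames_list]


def uniform_starts(total_frames, target_frames, frame_sample):
    """'uniform': frame_sample clips with evenly spaced start frames."""
    last_start = total_frames - target_frames
    if frame_sample == 1:
        return [(0, target_frames)]
    return [
        (int(i * last_start / (frame_sample - 1)), target_frames)
        for i in range(frame_sample)
    ]


print(head_starts([1, 35, 69]))
print(uniform_starts(144, 69, 2))  # [(0, 69), (75, 69)]
print(uniform_starts(144, 41, 2))
```

For a 144-frame source and the [69] bucket this gives clips starting at frames 0 and 75, which matches the "first and last 69 frames" guess above.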
@blyss Thanks for the detailed explanations! How do you know how many epochs/steps you should train a LoRA for? For example, at the moment I am training a LoRA on 57 images for 500 epochs, and 14 hours have already passed - steps: 48% [13650/28500 steps, avr_loss=0.122]. What was your avr_loss at the end of training?
@guy33 That's a LOT of steps! I have been targeting around 2000 steps with my runs so far with LR around 2e-4. Where my loss starts and ends seems to vary with the dataset. For this video dataset here, I started around 0.1 and by the end was around 0.065. However for an image only dataset I've been working with it seems to start around .19 and end around .12.
For my image only dataset, I've been having trouble reaching full convergence though. It gets like 90% and then just peters out. For instance it learns the woman's face and hair quickly but not the two dark freckles on her face even after the full run. I'm wondering if discrete_flow_shift might need to be tweaked when training image only. I've also been testing training only the double blocks with --network_args exclude_patterns=[r'.*single_blocks.*'] for image only dataset to try to achieve better motion when training with images. Lots of testing going on!
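To see what that exclude pattern selects, here is a small sketch. The module names are hypothetical stand-ins, and the exact matching function Musubi Tuner uses is an assumption; with `.*` on both ends of the pattern the result is the same for any of Python's regex matching functions.

```python
import re

exclude_patterns = [r".*single_blocks.*"]

# Hypothetical module names, for illustration only.
module_names = [
    "double_blocks.0.img_attn_qkv",
    "double_blocks.19.txt_mlp.fc1",
    "single_blocks.0.linear1",
    "single_blocks.39.linear2",
]


def is_excluded(name, patterns):
    """True if the whole module name matches any exclude pattern."""
    return any(re.fullmatch(p, name) for p in patterns)


trained = [n for n in module_names if not is_excluded(n, exclude_patterns)]
print(trained)
```

Only the double-block modules remain, so the single blocks keep their base-model weights.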
Edit: Updates:
It seems that for fine details further settings refinement is required. Per the discussion at https://github.com/kohya-ss/musubi-tuner/issues/54 I'm training the image only set I mentioned above with timestep_sampling sigmoid and discrete_flow_shift 1.0, and the samples show it's learning her face much better, including her freckles, already at only step 400. However loss is looking WONKY so... we'll see how it goes. It does seem that training a character/image only LORA on just the double blocks helps preserve motion when using the resulting model!
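The effect of the shift value on which timesteps get trained can be illustrated with the commonly used flow-shift warp, t' = s·t / (1 + (s − 1)·t). This sketch assumes timesteps are first drawn as the sigmoid of a standard normal sample and then warped; whether that matches Musubi Tuner's sampling code exactly is an assumption. Here t near 1 means mostly noise.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def apply_shift(t, shift):
    """Flow-shift warp of a timestep in [0, 1]; shift=1.0 leaves t unchanged."""
    return shift * t / (1.0 + (shift - 1.0) * t)


def mean_timestep(shift, n=10_000, seed=0):
    """Average sampled timestep for a given shift (Monte Carlo estimate)."""
    rng = random.Random(seed)
    samples = (apply_shift(sigmoid(rng.gauss(0.0, 1.0)), shift) for _ in range(n))
    return sum(samples) / n


print("shift 1.0:", round(mean_timestep(1.0), 3))
print("shift 7.0:", round(mean_timestep(7.0), 3))
```

A shift of 7.0 pushes most samples toward the high-noise end, where coarse structure and motion are decided, while 1.0 leaves them centred, spending more training on the low-noise steps where fine details such as freckles are resolved. That would be consistent with the observation above, though it is only one possible explanation.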
Please do a WAN Version. THX !
I have attempted this but so far results haven't been great with my dataset. There is https://civitai.com/models/1333275?modelVersionId=1574869 and https://civitai.com/models/1392031?modelVersionId=1573366 for Wan though!
Ohh, thank you for your reply. Great news. I will try!
@jenf I made it anyway! Thanks for kicking my ass to finally get it done lol