About this version
I trained using the newly re-captioned dataset from the 5B model. The result is incredibly good, and for the first time I'm pretty happy with it. Give it a try. I haven't tested I2V, but it should work for that too. Most examples use the lightning speed LoRA and low resolution (480x832).
Trigger word: PENISLORA
What can this lora do?
This lora can add ????????????? to both men and women viewed from the front/side. Other angles such as POV may have a backwards ????? head.
Other things it can now do:
Side view of the penis
Cumming / Cumshots
Blowjobs (it's captioned for the words "???????" and "??????????")
What can't it do?
There is no penetration in the training data, and nothing from a POV angle, though there are a few images from above and 1 POV video in the training data.
Sometimes ???????? with ??????? have the ????? slip out of the closed mouth.
Recommended Settings
It works pretty well with the new lightning dyno high model; I'll link to it in my example workflow. I like to use the dyno high model (no lightning lora), then for the low model I use the lightning v2 lora on the regular 2.2 low base model.
Dataset
84 images at 512x resolution
43 videos at 256x resolution
(I let DP pick the aspect ratio automatically)
This is the exact same dataset as the 2.2 5B model. I made no changes.
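For reference, a minimal diffusion-pipe dataset config with automatic aspect-ratio bucketing might look something like the sketch below. This is my assumption of the shape, not the exact file used here, and the paths are placeholders:

```toml
# dataset.toml — hypothetical sketch, not the exact config for this lora.
# enable_ar_bucket lets diffusion-pipe pick the aspect ratio per item automatically.
resolutions = [512]
enable_ar_bucket = true

# video clips are grouped into frame-count buckets; images use the 1-frame bucket
frame_buckets = [1, 33]

[[directory]]
path = '/path/to/images'   # placeholder path
num_repeats = 1

[[directory]]
path = '/path/to/videos'   # placeholder path
num_repeats = 1
```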
Training
I used the default diffusion pipe settings.
[optimizer]
type = 'adamw_optimi'
lr = 2e-5
betas = [0.9, 0.99]
weight_decay = 0.01
eps = 1e-8
I was baffled why the high model was taking so long to train until I realized, after over 60 hours of training, that I had put my videos in the images directory. That meant the high model was being trained only on videos, and twice (once at a very high resolution). Once I fixed this, I went back and trained from 11K steps up to around 13K with the images in the training data. To be honest, the high model was fine without them.
For the low model, I trained it properly with videos and images the whole way. Around 6K steps in I raised the image resolution from 512 to 1024 and didn't get an OOM (it fit in about 24GB exactly). I trained it to around 10.5K steps. Based on some advice, I also trained the low model on the full timestep range (0 to 1 instead of 0 to 0.85); it may hand off better from high to low when using the speed-up lora at low step counts.
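If I understand diffusion-pipe's Wan 2.2 support correctly, the timestep range is set per expert in the model section; a sketch of the full-range setting described above (the min_t/max_t key names are my assumption, so check the diffusion-pipe docs before copying):

```toml
[model]
type = 'wan'
# train the low-noise expert on the full timestep range instead of the
# usual 0–0.85 split (key names assumed, verify against diffusion-pipe)
min_t = 0.0
max_t = 1.0
```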
I might do another version with more angles, such as POV and from behind, to make this work in any situation. In that case I don't think it needs 10K steps per training session; epochs around 5K steps looked fine.
The results
I think it was a combination of the improved captioning and the 2.2 base model being better, but this lora turned out really well.
Description
About this version
I decided to try the new Wan 2.2 5B model. This uses an improved version of the dataset from the 14B lora. For advice on how to prompt with the model, check the description of the 14B version. It says "Wan Video 14B i2v 480p" as the base model because civitai doesn't have an option for 5B yet. FYI, it works for both image and text.
WARNING:
This lora is far from "done", and more often than not you will not get a great result, but I don't know if I want to put more resources into getting it to a better place. Give it a try and let me know what you think of the 5B model; I'm not sure why the 1.3B/14B trained so much more easily than this 5B model.
Trigger word: PENISLORA
What can this lora do?
This lora can add ????????????? to both men and women viewed from the front/side. I've also had luck mentioning "a man with an ????? ????? and a woman with no ?????" to keep them both from having one.
Other things it can do in theory but doesn't do that well:
Side view of the ????? (very hit or miss)
Cum (I haven't tested it a lot, but it's definitely not as good as the 14B model, though it is captioned for it!).
Blowjob (it's captioned for the words "???????" and "??????????", but you should still describe the action, i.e. "a woman puts a ????? into her mouth and gives a ?????????? ????????", etc.).
What can't it do?
There is no penetration in the training data, and nothing from a POV angle, though there are a few images from above and 1 POV video in the training data.
Recommended Settings
As of this post, Wan 2.2 is very new, so just use Kijai's wrapper and his example workflow for 5B (or you can use my example workflow in the data I'll attach to this post along with the captions).
Dataset
84 images at 512x512 resolution (an improvement of around 13 images from last time)
43 videos at 640x480 resolution (an improvement of 3 videos from last time).
I went through all the captions and, for the most part, completely rewrote them. I found 2-3 dataset items that were incorrect and removed them. I also went into the images and photoshopped out some watermarks and tattoos, so in theory the dataset is much cleaner now. The only mistake I made was keeping all the videos at 16fps instead of the 24fps that 5B is trained on. I didn't notice any negative effect on motion, though, so I think it's OK.
I was too tired to caption new data from scratch, so I cropped my images in BIRME, then used JoyCaption to get a baseline caption describing all the objects in each photo. Then I asked ChatGPT to fix the captioning. Finally I went in, fixed all the wrong info, and rewrote the "SUBJECT" of each prompt properly (since JoyCaption gets it wrong 100% of the time). It was a small dataset addition, so I didn't feel the need to try a local LLM.
Training
5B training is straightforward and identical to 14B or 1.3B training. Just download the model data and you're on your way. I would say it trains very fast. I decided to up my batch size from 1 to 4 and repeats from 1 to 4. I also trained it in float8, which, looking back, was not necessary. It took around 21GB of VRAM on my 3090 with these settings.
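A rough sketch of how those settings might map onto diffusion-pipe's config files (paths are placeholders, and I'm assuming the float8 option is transformer_dtype, so double-check against the diffusion-pipe docs):

```toml
# config.toml — hypothetical sketch of the settings mentioned above
micro_batch_size_per_gpu = 4     # batch size raised from 1 to 4

[model]
type = 'wan'
transformer_dtype = 'float8'     # in hindsight probably unnecessary

# dataset.toml
[[directory]]
path = '/path/to/dataset'        # placeholder path
num_repeats = 4                  # repeats raised from 1 to 4
```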
I decided to try the automagic optimizer for the first time. It's neat because it automatically adjusts your learning rate on the fly. You could see the LR increase over time, then arch back down when the loss plateaued. I think I will use it from now on.
[optimizer]
type = 'automagic'
lr = 2e-5
weight_decay = 0.01
eps = 1e-8
Around epoch 38 I changed the batch size, which screwed up the tensor graph for some reason, but we saw a steady loss curve until about epoch 150 (~6K steps). Every now and then you can see a few quick drops in loss followed by it going right back up (that's when I added new data). Overall it started to plateau around this time, and that's where I got a bit frustrated with the process; the 1.3B and 14B trained so much more nicely. By the time the loss started to plateau, the ????? shaft shape was in pretty good form.
The results
Without this lora you will get deformed meat cylinders, but the ????? you get here is hit or miss. The shaft will usually be fine, while ????????? are a bit off. This time around I captioned them more often, which may have been a negative. The ????? head, unless seen from the front, will often lack detail. With 1.3B and 14B I think the success rate of a good ????? was really high, but here it's pretty much 50/50. The further away you get from the ?????, the higher the chance of failure; close-ups look better.
