LTX Example Workflow
https://paaster.io/698559390a7a8c3988ee9e91#1NG7332gvq5L4A9yTnKw4INMZ00tgOFaCAEFqU8d-as
WTF?!
Why not ;)
I'm back again with a new experimental escapade that doesn't seem to be done commonly on here: a custom CogVideo Wan Video LTX-2 LoRA!
Training (LTX-2)
This LoRA was trained on a dataset of 143 clips with hand-revised captions.
These clips came from 54 unique source videos:
Real Life (42 source videos => 74 clips, ~50%): mainly amateur clips from Reddit/RedGifs/Pornhub, and a couple of studio-shot videos
Anime (7 source videos => 14 clips, ~10%): drawn/3D animated clips
Furry (5 source videos => 45 clips, ~40%): 3D animations
Around 80% of these clips had their own audio.
Trained for 4,000 steps (4.5 hours on a runpod H200 SXM) using the official training script: https://github.com/Lightricks/LTX-2/blob/main/packages/ltx-trainer/docs/quick-start.md
Rank 32
Training (Wan2.1 I2V [OLD])
This LoRA was trained with around 110 or so clips of up to 11s, some of real life amateur/porn videos, some of animations
990 steps (90 epochs)
Trained on 1 x A100 SXM4 using https://github.com/tdrussell/diffusion-pipe
Training (CogVideo [OLD])
This LoRA was trained with 19 videos of solo male masturbation, mainly amateur vids.
4,000 steps
Trained on an H100 for around 18 hours, using https://github.com/a-r-r-o-w/cogvideox-factory
Description
One of the first Wan2.1 I2V 14B LoRAs in existence!
Comments (21)
I just tried this lora, and it gives numerous errors saying the lora key isn't loaded.
Hi OP! Not sure if you have any fix for this but when I use this Lora with 2d images, it makes the characters talk a lot. An example is this https://civitai.com/images/71688819
Any recommendations?
Hey :)
Uhhh, I think it was just the training data to be honest. It was mostly IRL vids with fewer 2d animations since the IRL was way easier to source. I think most animated clips I included featured the characters talking, so that would make sense.
I'm planning on making another Wan LoRA somewhat related to this, but as for quick fixes, I'm not really sure -- maybe you could try to use OpenPose to control the facial movements?
But yeah, the 2d part of the dataset was pretty limited.
Any chance of a wan 2.2 i2v/t2v version of this? or release of the training data so we can retrain on the latest version?
Another W for MEN
With 143 data clips, and these were the "best" outputs you could get in LTX2? That's scary...
It's still very much a work in progress!
There's quite a lot that can be improved. I think mixing the furry/animation/real life stuff probably confused it during training. Plus, all the hyperparameters are the defaults — there were no images, etc. Potentially this just needed longer to train because of how big the set is — idk.
If there is any interest in the dataset itself, I can try and find a way of publishing it.
@Spat1984 990 steps is almost nothing. You don't get really considerable results until 2-3k, and for something like this, since it's a very foreign concept, you'd want double that. Unless you're just tuning a small style, or doing audio in LTX2, since that stuff trains faster.
I'm wondering if LTX2 is really going to take off in the end. It's clearly more advanced than Wan 2.2 in "theory", but are people going to stick with it long enough for it to have a baseline as adaptable as Wan 2.2? If so, that would be incredible. But where it currently is, I feel like the amount of work to shift it might end up being more like Flux, which was never really transformed to be highly adaptable and NSFW.
@whitespider9999 idk why producers don't see potential in using both. LTX2 is theoretically and seamlessly infinite already, and it does extensions of Wan outputs and adds audio if you want. With some latent manipulation cleverness you can set up 1-2 minute generations; you just need a few motion LoRAs to make it all work. I jumped ship because I was tired of 5 second gens with only 3 seconds of good motion that I'd have to extract and edit, and SVI is way more troublesome than it's made out to be.
@tenstrip 👋 The WAN LoRA was trained for 990 steps. This one was for 4,000 steps.
@Spat1984 Just in case it helps: IDK about LTX2 specifics, or video models in general, but with image models I have found that mixing realistic, illustration, 3D, etc. is perfectly fine if you don't train it all in one go. When training image models, I have had better results if I train on just one set of similar images (same pose, concept, shot type, style, whatever), and then continue the training from the resulting LoRA with the next set of images. Even with just photos, I usually train wide shots first, then medium shots, and then closeups, so it learns the details once it already knows something about whatever you are training. It's a bit more time consuming, especially because you need to split the dataset into different subsets and manually do several training passes... And I don't even know if it will improve anything with LTX2, or any video models, but it could be worth trying. With Flux and Chroma especially, it does seem to learn better.
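For anyone who wants to try the staged approach described above, here is a rough sketch of just the bookkeeping: split the clips into subsets and chain each pass onto the previous checkpoint. The `group` labels, file names, and `plan_stages` helper are all made up for illustration; the actual resume option depends on whichever trainer you use and is not shown here.

```python
def plan_stages(clips, stage_order):
    """Group clips by their 'group' label and return one training stage per group.

    Each stage records the checkpoint it should resume from (None for the first),
    so the passes can be run one after another.
    """
    by_group = {}
    for clip in clips:
        by_group.setdefault(clip["group"], []).append(clip["path"])

    stages = []
    previous = None
    for index, group in enumerate(stage_order):
        paths = by_group.get(group, [])
        if not paths:
            continue  # nothing to train on for this subset
        output = f"lora_stage{index}_{group}.safetensors"  # hypothetical file name
        stages.append(
            {"group": group, "clips": paths, "resume_from": previous, "output": output}
        )
        previous = output
    return stages


# Hypothetical dataset: wide shots first, then medium shots, then closeups.
clips = [
    {"path": "clip_001.mp4", "group": "wide"},
    {"path": "clip_002.mp4", "group": "closeup"},
    {"path": "clip_003.mp4", "group": "medium"},
    {"path": "clip_004.mp4", "group": "wide"},
]
for stage in plan_stages(clips, ["wide", "medium", "closeup"]):
    print(stage["group"], len(stage["clips"]), "resume from:", stage["resume_from"])
```

Whether this ordering actually helps a video model is untested here; the sketch only makes the "train one subset, then continue from that LoRA" loop concrete.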
@PepitoPalotes Yea, that's exactly what I wanted to try next. I think the dataset is decent, I just haven't gotten it to learn as effectively as it could. I will probably try training again using that method in like a few weeks.
@Spat1984 lol my bad. Did you follow their guidance and do a bunch of complex tagging about the character/setting/other random stuff? I'm thinking that's not actually a good idea, just very basic motion tagging only for loras.
@tenstrip Np :) Yea, they are the long descriptive-type captions.
I mostly followed their guide, including using their automated captioning script. But since vanilla Qwen2.5-Omni isn't consistently great at NSFW tagging, I spent a good few hours manually editing/adding NSFW details to the captions.
I actually have no idea about what would be better. I opted for the long-form stuff simply because in the last LoRA, I did short tag-like prompts and wanted to do it differently this time.
@Spat1984 Trying to get way higher resolution always helps, and don't train the full-size videos into buckets if you're doing that: crop squares around the concept and probably just focus on the shot of a hand going up and down the shaft from a few consistent angles at first. Use simple tagging like "a man strokes his penis with his hand." I already did simple tags, but there was a Reddit post going into detail about why auto-tagging isn't good for LoRA concepts; it's mainly for massive datasets. https://www.reddit.com/r/StableDiffusion/comments/1qftepq/you_are_making_your_loras_worse_if_you_do_this/
@tenstrip I've been looking around and some of these LoRAs seem to be pretty good. I probably just didn't have the skill to pull off anything good myself yet, so I projected that onto everything else. I'll give it another shot. I might not have really understood my own workflow.
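A minimal sketch of the "crop squares around the concept" idea, assuming you already have a bounding box for the region of interest in pixel coordinates (how you get that box, by hand or with a detector, is up to you). The `square_crop` helper and its `margin_px` parameter are hypothetical names, not part of any trainer.

```python
def square_crop(frame_w, frame_h, box, margin_px=0):
    """Return (left, top, side) of a square crop centred on `box`.

    `box` is (x0, y0, x1, y1) in pixels. The square is as large as the longer
    side of the box plus a margin, capped at the frame size and shifted so that
    it stays inside the frame.
    """
    x0, y0, x1, y1 = box
    side = max(x1 - x0, y1 - y0) + 2 * margin_px
    side = min(side, frame_w, frame_h)
    center_x = (x0 + x1) // 2
    center_y = (y0 + y1) // 2
    left = min(max(center_x - side // 2, 0), frame_w - side)
    top = min(max(center_y - side // 2, 0), frame_h - side)
    return left, top, side


# 1920x1080 frame, region of interest 300x200 px, 30 px margin on each side.
print(square_crop(1920, 1080, (800, 400, 1100, 600), margin_px=30))  # (770, 320, 360)
```

The returned values can be passed to whatever tool does the actual cropping; the same crop should be applied to every frame of a clip so the subject doesn't jitter.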
@tenstrip - I hate to get involved with experts here... but step count alone is meaningless. Without knowing the other directly related parameters, you have no idea how much work gets done in 990 steps.
Edit - "Trying to get way higher resolution always helps" what does this mean?
@Roscoe_P_Cold_Train I'm assuming the dataset is good; if it's not then yeah steps wouldn't matter and nothing would matter. Dataset is actually like 99% of the training. Resolution is just bucket size, training larger is always going to show a lot more detail and usually a sharper level of detail. I haven't tried a super high LR on LTX2 though, I've only tried 2 with a slightly higher LR and it still learned a lot of points in the dataset at different speeds, with sound and more familiar human motions coming through a lot quicker. 990 steps is a low total step count on like any kind of training unless it's on 5 images of one person.
@tenstrip - I only meant that batch size, gradient accumulation steps (GAS), and repeats are all factored into the math that determines how much work is done on each pass.
Additionally, too high of a resolution for training is counter-productive and can lead to overfitting on the quality of your data instead of the content of your data.
If you have 100 images at batch = 1 and GAS = 1 and repeats = 1... that's 100 steps per epoch.
But if you use batch 8 and GAS 4 and 2 repeats.... you get 7 steps per epoch.
The point is that "number of steps" is meaningless without knowing these other parameters.
990 steps could be one epoch, or it could be 1000 epochs, we have no idea without knowing the configuration.
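The arithmetic above can be sketched in a few lines. This assumes a "step" means one optimizer update and that a partial final batch still counts as a step; some trainers count steps differently, so treat it as an illustration and not any specific tool's definition.

```python
import math


def steps_per_epoch(num_items, batch_size=1, grad_accum=1, repeats=1):
    """Optimizer steps needed to see every item `repeats` times."""
    samples = num_items * repeats
    effective_batch = batch_size * grad_accum
    return math.ceil(samples / effective_batch)


print(steps_per_epoch(100))                                         # 100
print(steps_per_epoch(100, batch_size=8, grad_accum=4, repeats=2))  # 7
```

For reference, the Wan run in the description lists 990 steps over 90 epochs, which works out to 11 steps per epoch.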
Training at high resolution is often why some LoRAs on civit ruin simple outputs.
300 pixels is plenty to train a face. 200 pixels is enough to train tons of concepts.
The base model is responsible for resolution at inference time.
There is no practical reason to use a huge training resolution.
@Roscoe_P_Cold_Train Yeah, that's the common image model training understanding, but I'm talking about video/flow model training, especially 'i2v' training, where it's a different story. Also, these bucket sizes are really tiny. I'm not talking about 2k; we're talking 256x256 vs. 768x768 here. I was training a LoRA for Wan that introduces and draws a penis (it doesn't have to be sexual, it could have been anything); it draws a new object into the scene of an i2v output. Training at 512 resulted in the object being blurry and disfigured. I took the sharpest points in the dataset and made a 1024x1024 bucket for those, and remade the rest from 512x512 to 768x768. That version gave vastly superior output and detail: the new object was drawn into the scene with subtle details, even when drawn in a scaled-down version. A lot of the flow data is math that is vectorized and can scale up or down, so it's best to teach it detailed and large. Batch size is also irrelevant for video model training; it's always gonna be 1 x steps. If you're higher than batch 1, you're on a supercomputer or something.
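A rough sketch of the bucket split described above: the sharpest clips go into a 1024x1024 bucket and everything else into 768x768. The sharpness scores, the 20% cutoff, and the `assign_buckets` helper are assumptions for illustration; the comment doesn't say how the sharpest clips were picked.

```python
def assign_buckets(sharpness_by_clip, top_fraction=0.2, high=1024, base=768):
    """Map each clip to a square bucket size based on its sharpness rank.

    `sharpness_by_clip` maps clip name -> sharpness score (higher is sharper).
    The top `top_fraction` of clips get the `high` bucket, the rest get `base`.
    """
    ranked = sorted(sharpness_by_clip, key=sharpness_by_clip.get, reverse=True)
    n_high = max(1, round(len(ranked) * top_fraction)) if ranked else 0
    return {
        clip: (high, high) if rank < n_high else (base, base)
        for rank, clip in enumerate(ranked)
    }


# Hypothetical scores for five clips; only the sharpest one lands in 1024x1024.
scores = {"a.mp4": 0.9, "b.mp4": 0.5, "c.mp4": 0.2, "d.mp4": 0.1, "e.mp4": 0.05}
for clip, bucket in assign_buckets(scores).items():
    print(clip, bucket)
```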

