V2!
V1 had some issues with blocky images, so I removed low-res images from the training set, which made the end result slightly blurry. I also trained with a lower learning rate of 2e-4 for 100 epochs.
NOTE: there was a bug in the training script, so for now use this code at the specific commit below. I will update with V3 soon.
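For orientation only, a HunyuanVideo LoRA training run with musubi-tuner looks roughly like the sketch below. This is an assumption pieced together from the repo's documented workflow, not the exact command used for this LoRA; flag names can change between commits, so verify against the pinned commit's README.

# Rough sketch of a musubi-tuner HunyuanVideo LoRA run (flag names are assumptions
# from the repo docs -- verify with `python hv_train_network.py --help`).
accelerate launch hv_train_network.py \
  --dit ./hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states.pt \
  --dataset_config ./dataset.toml \
  --network_module networks.lora --network_dim 32 \
  --learning_rate 2e-4 --max_train_epochs 100 \
  --mixed_precision bf16 --sdpa --gradient_checkpointing \
  --output_dir ./output --output_name lora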
Use "Video of a transgender woman" at the beginning of the prompt to trigger it.
git clone https://github.com/kohya-ss/musubi-tuner.git
cd musubi-tuner
git checkout fd70762
pip install -r requirements.txt

python hv_generate_video.py --fp8 --video_size 1280 720 --video_length 120 --infer_steps 30 \
  --prompt "Video of a transgender woman with fair skin and long, straight white hair, styled with white cat ears. She is dressed in a revealing, white lingerie set, featuring a frilly, off-shoulder crop top that exposes her midriff and a matching ruffled mini skirt. She is also wearing white fishnet stockings that reach just below her knees. Her makeup is bold, with dark eyeliner, mascara, and pink lipstick, complementing her cat-themed costume. She has several tattoos visible on her arms, including a script tattoo on her left arm and a circular tattoo on her right forearm. Her miniskirt is lifted to reveal her erect penis. The background is dimly lit with a purple hue. The setting appears to be indoors, likely a bedroom or a private space, with some indistinct furniture and decor visible. The overall atmosphere of the image is playful and provocative, enhanced by the cat ears and lingerie. The woman's pose is confident and slightly provocative, with one leg raised, adding to the overall seductive tone." \
  --save_path "./videos/" --output_type video \
  --dit ./hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states.pt --attn_mode sdpa \
  --vae ./hunyuan-video-t2v-720p/vae/pytorch_model.pt --vae_chunk_size 32 --vae_spatial_tile_sample_min_size 128 \
  --text_encoder1 ./split_files/text_encoders/llava_llama3_fp16.safetensors \
  --text_encoder2 ./split_files/text_encoders/clip_l.safetensors \
  --seed 69 --lora_multiplier 0.8 --lora_weight ./lora.safetensors

See https://github.com/kohya-ss/musubi-tuner?tab=readme-ov-file#inference for more info. The repo also has a converter to convert the LoRA to diffusion-pipe/ComfyUI format.
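If you need the ComfyUI/diffusion-pipe key format, the conversion is roughly as below. The script name and flags are my reading of the repo's converter and may differ at this commit, so treat them as assumptions and check the README.

# Convert the musubi-tuner LoRA to the "other" (diffusers/ComfyUI) key format.
# Script name and flags are assumptions -- verify with `python convert_lora.py --help`.
python convert_lora.py --input ./lora.safetensors --output ./lora_comfy.safetensors --target other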
Comments (5)
Well, for one, you don't necessarily need 5 seconds -- that's a longer video. 2 seconds plus ping-pong can give you a half-natural 4-second video, which immediately cuts your processing from 30 minutes to 15. You can also lower your VAE Decode overlap: the default is 64, but 32 works. I also use a tile size of 128 instead of 256.
You could also lower the steps. With FastLora or FastModel, you can sometimes get results at 10 steps that are as good as 20.
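For anyone unfamiliar, ping-pong just plays the clip forward and then reversed to double its apparent length. A minimal ffmpeg sketch (filenames are placeholders):

# Append a reversed copy of the clip so it plays forward, then back.
ffmpeg -i clip.mp4 -filter_complex "[0:v]reverse[r];[0:v][r]concat=n=2:v=1:a=0[v]" -map "[v]" clip_pingpong.mp4

Note that the reverse filter buffers the whole clip in memory, which is fine for clips this short.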
Interesting. Will try it out.
Generally I'll make quick 5-step videos of low length, only a second or so, to find a seed I like. Then you can crank it up to make a longer natural video. It seems like a Flux model underneath, so I'd assume Flux prompts work well enough.
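A sketch of that workflow, reusing the flags from the full command above (assumes you're in the musubi-tuner directory; the seed and preview size here are just illustrative):

# Collect the shared model/LoRA flags once so the two runs below stay short (paths as above).
MODEL_FLAGS="--fp8 --attn_mode sdpa \
  --dit ./hunyuan-video-t2v-720p/transformers/mp_rank_00_model_states.pt \
  --vae ./hunyuan-video-t2v-720p/vae/pytorch_model.pt --vae_chunk_size 32 --vae_spatial_tile_sample_min_size 128 \
  --text_encoder1 ./split_files/text_encoders/llava_llama3_fp16.safetensors \
  --text_encoder2 ./split_files/text_encoders/clip_l.safetensors \
  --lora_multiplier 0.8 --lora_weight ./lora.safetensors"
PROMPT="Video of a transgender woman ..."  # paste the full prompt here

# Seed hunt: few steps, ~1 second of frames, cheap to iterate
python hv_generate_video.py $MODEL_FLAGS --prompt "$PROMPT" --video_size 544 320 --video_length 25 --infer_steps 5 --seed 1234 --output_type video --save_path ./videos/

# Keeper: same seed, cranked back up to full steps, size, and length
python hv_generate_video.py $MODEL_FLAGS --prompt "$PROMPT" --video_size 1280 720 --video_length 121 --infer_steps 30 --seed 1234 --output_type video --save_path ./videos/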
Yeah, I noticed that messing with the video length drastically improved generation times. I usually stay between 73 and 109 frames, and my VAE decode tile is set to the same as makia's. When the VAE decode tile is set too high, it just gets stuck decoding for a long time.
Found a pretty crazy combo:
208x368 pixels @ 133 frames with 16 steps at 18 fps takes 3 minutes,
so that's a 7-second video without ping-pong.
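That combo translated into the script's flags, reusing $MODEL_FLAGS and $PROMPT from the sketch earlier. The height/width order and the --fps flag are assumptions, so check `python hv_generate_video.py --help`:

# Assumed mapping of the combo above; --video_size is read here as height width.
python hv_generate_video.py $MODEL_FLAGS --prompt "$PROMPT" --video_size 368 208 --video_length 133 --infer_steps 16 --fps 18 --seed 1234 --output_type video --save_path ./videos/

133 frames at 18 fps works out to about 7.4 seconds, which lines up with the 7-second figure.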