    Gurren Lagann / Anime Style Wan 2.2 14B Lora - 5B (BETA)
    NSFW

    What's new

    10/15 - Trained a Qwen version for 14.5k steps using AI Toolkit.

    9/11 - Trained the High Lora V1.1 to 117K steps and fixed many motion and quality issues. Try it out, as the high makes a lot of changes for this lora. Alternative epochs between 125-300 can be found here for high if you wanna test earlier versions.

    FYI reward MPS loras released, try them on 0.5 strength. Not super tested with this lora though

    What is this lora

    This is a style lora used to recreate the style of Gainax's 2007 anime Tengen Toppa Gurren Lagann. This is one of my favorite series and holds a special place in my heart as the show where anime really hit home for me. It's sort of a deconstruction of the mecha anime series that Gainax was responsible for creating in its own way: it explores all the overused tropes of the genre but executes them perfectly. It's the perfect companion piece to their masterpiece “Gunbuster”; watch both and you can see what I mean. Every arc is a story about triumph over overwhelming oppression and grief, and then the reset and kick in the ass we need to pick ourselves up and overcome it again. They take you to the darkest places emotionally and then shoot you right back up to the top.

    The art style of the show plays around really well with lighting in both dark and bright scenes. The motion and animation style is very interesting too: things move really fast, in quick bursts of action. They do some kind of wide shot with fast animation for action, then cut to a reaction with a medium close-up of the characters. The animation is top notch, and it's funny that you can tell when they’ve spent all their budget, because you get some really poorly done work in the next episode, but it's all worth it for those 5-10 mins of S-tier animation.

    The purpose of this style lora is to capture the style of the show: visuals and motion. It's not a character lora, but with proper prompting the characters will come out.

    Trigger word: GurrenLagannStyle

    You do not need to add any other descriptions for anime or animation style in the prompt; the lora should produce the style without any other prompting. In fact, I would recommend against adding anime keywords to the prompt, as they will create more of a bias from the base model, which is now trained on anime much better than before. The trigger word may not even be needed, but I put it in anyway.

    All the characters from the first season of the show are in the training data, so there is no data from after the time skip. There is data from the Yoko “Pieces of Sweet Stars” music video. I elected not to include Parallel Works since the style is not the same. The images are from the show, while the clips are all from the remastered movie #1, so there are some new scenes too.

    Here is how to recreate some characters (check the captions data for more)

    Yoko:

    A woman with long red spiky hair tied in a long pony tail and chopsticks and skull accessory, red flame-patterned black bikini top, light pink scarf, black shorts with a white studded belt, pink thigh-high stockings, fingerless black gloves, and white and red boots. She holds a massive dark grey, hexagonal-barreled rifle.

    Simon:

    A young man (or “boy”) with spiky dark blue hair, a blue jacket over his bare torso, and red goggles on his head 

    Kamina:

    A muscular man with spiky blue hair and blue spiral tattoos. He is wearing orange frameless pointed triangle sunglasses and a red tattered cape. He has bandages on his forearms

    Nia:

    A young woman with wavy blonde and light blue hair and teal eyes with red pupils cross-like in a  floral shape. She wears a pink and white dress with a large gold belt and cuffs. And an elaborate golden collar with red and green gems, a red tie, and a pink and white hair accessory.

    Gurren & Lagann:

    A humanoid mecha (you might want to describe the face on its torso etc., but it's not captioned much). Be sure to mention the samurai horns on the head if you want them. Every form of them is in the data too (flying mode, battleship etc.)

    Viral:

    A man with shaggy blonde hair that covers one eye. He wears a jacket with a white fur-lined collar and red shoulder pads.

    Mecha:

    All mecha in the dataset are labeled using “mecha”. You can say “whale-like mecha” or “turtle-like mecha” etc. to get different types. Probably all the different ones are in the training data. Just use the word “mecha” to trigger it.

    Beastmen = “creature”, i.e. “turtle-like creature”, etc.

    Buta:

    A small, brown, pill-shaped pink pig-mole creature with two long, thin antennae, a curly tail, whiskers and round sunglasses

    Lordgenome:

    A large, extremely muscular man with a shaved head, a dark stylized beard, and intense, light-colored eyes. He is shirtless wearing a dark garment with two large, silver, U-shaped bracelets on his arms. (I guess I missed captioning the beard, try adding that word in too)

    There are more; I should have covered every character, major or minor, in the first season. Try describing them yourself or check the captions.

    Recommended Settings:

    Do NOT use euler, it will distort all the motion. Use the dpm++_sde sampler, and split from the high to the low model at the 11th step of 20 steps. I find the best results with shift 8 on high and shift 6 on low, though 8/8 shift is ok too. Shift 5 has distortion. Too few frames may mean the style won't trigger in the low model, so try to keep above 40 frames (ideally 65-81), but test and let me know. Adding “a little red mecha toy in the background” will trigger the style 100% of the time. Prompts for sexual content or nudity, which are not in the training data, might require this mecha toy workaround to trigger the lora.

    This lora was extensively tested without the lightning/lightx loras, but it should work fine with them. I need some time to test lightning/lightx, but my opinion is that they both heavily modify the style in different ways. Not using them is recommended for those reasons, but they do look fine (just different), so you can give them a try. Let me know what combination looks best. I think you can be the judge. Personally, I find the result without those loras is best, but if you must use them, then lightx 1.5 on high / lightning 1.0 on low is not bad, though the color is a bit saturated.

    Here is a link to a gallery which shows how they affect the lora.

    1.) Default Setting
    Just run the lora with no other loras and it will work fine. And it will retain the closest look and feel to the original source material. On a 3090 it takes over 20 mins for a 720p video to generate.

    20 steps (11 steps on high / 9 steps on low), 3.5 CFG, NO NAG, dpm++_sde, shift 8 on high and then shift 6 on low

    Benefits: Closer to trained data. You get all the 2.2 benefits like motion, quality, camera control etc.

    Negatives: Slower, more resource intense
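The default split above (20 steps, switching from the high to the low model at step 11) can be written down as a small settings helper. This is only a sketch of the arithmetic: `StageSettings` and `split_steps` are hypothetical names, not part of ComfyUI or any Wan node pack, and the values are the ones recommended above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageSettings:
    """Hypothetical container for one sampling stage (not a ComfyUI type)."""
    model: str       # "high" or "low" noise model
    start_step: int  # first step handled by this stage (inclusive)
    end_step: int    # step where this stage stops (exclusive)
    shift: float
    cfg: float
    sampler: str


def split_steps(total_steps: int = 20, switch_step: int = 11,
                high_shift: float = 8.0, low_shift: float = 6.0,
                cfg: float = 3.5, sampler: str = "dpm++_sde"):
    """Return (high, low) stage settings for a two-stage Wan 2.2 run."""
    if not 0 < switch_step < total_steps:
        raise ValueError("switch_step must fall strictly inside the step range")
    high = StageSettings("high", 0, switch_step, high_shift, cfg, sampler)
    low = StageSettings("low", switch_step, total_steps, low_shift, cfg, sampler)
    return high, low


high, low = split_steps()
print(high.end_step - high.start_step, "steps on high,",
      low.end_step - low.start_step, "steps on low")
# prints: 11 steps on high, 9 steps on low
```

The same helper covers the 8/8 shift variant by passing `low_shift=8.0`.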

    2.) Lightx2V Wan 2.1 Lora Optimization

    1.) This lora (gurren lagann style lora) (strength 1.0 on both high and low)

    2.) Wan21_T2V_14B_lightx2V_cfg_step_destill_lora_rank32 (strength 1.0 on both, use the same lora file on both high/low)

    7 Steps ( 3 / 4), though you can try 4/4 or 2/2. CFG 1 with NAG

    Benefits: Can complete higher resolution with fewer steps. The motion is retained and the style is closer to default than with the lightning lora.

    Negatives: Lightx2V is a Wan 2.1 lora so I think you downgrade the output to look more like 2.1 than 2.2. I feel also that the colors are a bit dark. It adds some weird snow effect sometimes which can be mitigated by increasing strength on the lightx2v loras.

    3.) Lightning 1.1 Wan 2.2 Lora Optimization

    7 Steps ( 3 / 4), though you can try 4/4 or 2/2. CFG 1 with NAG

    1.) This lora (gurren lagann style lora) (strength 1.0 on both high and low)

    2.) Wan 2.2 Lightning v1.1 loras (strength 1.0 on both high and low)

    Benefits: Can complete higher resolution with fewer steps. It kinda makes colors brighter and less saturated if you like that aesthetic. It's a 2.2 lora, so you technically get benefits from 2.2 wan, but it's kind of not working properly.

    Negatives: It affects the style heavily. It still looks anime retro, but the colors are brighter than the source material. The motion is HEAVILY reduced.

    4.) Mixed approach: Lightx2V on 1.5 str on high / Lightning on 1 str on low.

    Benefits: Fewer steps for less resources.

    Negatives: Saturated colors. You’re mixing 2.1 with 2.2 loras which makes them more like wan 2.1. Some of the motion distortion is reduced compared to no loras.

    5.) Other 2.1 loras

    Dataset:

    441 images directly screen captured from the show at 1920 x 1080

    134 videos with clips directly taken from the show using PySceneDetect at 1920 x 1080 and converted to 16fps via ffmpeg.
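As an illustration of the 16fps conversion step, here is a minimal sketch that builds an ffmpeg command using the `fps` video filter. The exact flags used for this dataset are not stated, so the filter choice, the `-an` flag (dropping audio), and the file names are all assumptions.

```python
import shlex
from pathlib import Path


def ffmpeg_16fps_command(src: Path, dst: Path, fps: int = 16) -> list[str]:
    """Build an ffmpeg command that resamples a clip to a fixed frame rate."""
    return [
        "ffmpeg",
        "-i", str(src),
        "-vf", f"fps={fps}",  # fps filter drops or duplicates frames to hit the target rate
        "-an",                # assumption: audio is not needed for training clips
        str(dst),
    ]


# Hypothetical file names, for illustration only.
cmd = ffmpeg_16fps_command(Path("clip_001.mp4"), Path("clip_001_16fps.mp4"))
print(shlex.join(cmd))
# To actually run the conversion: subprocess.run(cmd, check=True)
```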

    I wanted to keep everything within 24GB to fit on my 3090 for local training.

    So I set the below settings in my dataset.toml file

    Images at [512] resolution with enable_ar_bucket = true (which allows diffusion-pipe to set the resolution at 16:9 for me at 512 standard).

    Videos I kept within these frame buckets [8, 12, 16, 24, 32, 48] and resolutions = [256].

    Using handbrake, I went through every clip I chose and chopped them down to these frame buckets. Most of them landed in 32 / 48 frames. I had a few 80+ frames videos which I chopped up into 48 and 32 frame clips.
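The chopping above was done by hand in HandBrake, but the arithmetic of fitting a clip into the frame buckets can be sketched in a few lines. The greedy largest-bucket-first rule here is an assumption for illustration; it happens to reproduce the 80-frame case (48 + 32) described above.

```python
FRAME_BUCKETS = [8, 12, 16, 24, 32, 48]


def split_into_buckets(total_frames: int, buckets=FRAME_BUCKETS) -> list[int]:
    """Greedily cut a clip into segments whose lengths are all bucket sizes.

    Leftover frames shorter than the smallest bucket are discarded.
    """
    sizes = sorted(buckets, reverse=True)
    segments = []
    remaining = total_frames
    while remaining >= sizes[-1]:
        # Take the largest bucket that still fits in what is left.
        size = next(s for s in sizes if s <= remaining)
        segments.append(size)
        remaining -= size
    return segments


print(split_into_buckets(80))  # [48, 32]
print(split_into_buckets(90))  # [48, 32, 8]
```

In practice you would still pick the cut points by eye so that each segment contains a coherent motion, which is something this sketch cannot decide.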

    Doing the above allowed me to train with around 22/24gb of vram, with no block swap! I think it was around 2 weeks straight of training since I had to trash half the work.

    Captioning:

    I used Google Gemini via AI Studio with the prompt below, and I fed it videos and images in batches of 5. It used to always go off the rails around 150K tokens in, but now it seems fine without any re-prompting (I may have had to re-prompt once). The captions came out 80% there; I did a small brush-up on most of them, and a few had to be completely redone by hand.

    You are an advanced image captioner for WAN AI video generation models. Your goal is to create vivid, cinematic, highly detailed captions for training loras in wan 2.2 T2V 14B model with diffusionpipe therefore your captions follow wans syntax. Our goal for this time is to create a style lora for the anime series "Tengen Toppa Gurren Lagann". You will get fed video clips from the show. Never use any character names, purely describe each caption generically so that in training it will pick up the style of the way things are created. Do not use phrases like "or" when describing be precise and choose a description you think is closest. Do not refer to the subject as "the subject" state simply "a man wearing" or "a woman in a car" etc. refer to adult male as "man" and an adult woman as "a woman" you can use modifier like "young woman" or "girl" but lets not use male or female. also be precise dont say "appears to be" etc. Make sure you describe everything aside from the style including the clothing they wear in detail.

    Prompt Rules:

    Every prompt must begin with: "GurrenLagannStyle".

    Use clear, simple, direct, and concise language. No metaphors, exaggerations, figurative language, or subjective qualifiers (e.g., no "fierce", "breathtaking").

    Our purpose is to describe everything in the image or video, with special attention to describing the people whenever they are present. Describe each individual piece of clothing including the colors and positions. We want a standard description of their appearance and usual clothes, but at the same time we need to describe the environment as that is part of the style as well.

    Describe what is in the image, but not what the image is. Such as "A photo depicting a cosplay of" is wrong. Just say "Live action Bowsette..." and then describe the image.

    When an exaggerated or "chibi" face or depiction is shown make sure to note it in the captioning. Lets be uniform in our word choices when possible.

    Prompt length: No length, long and detailed is perfectly fine. Stick with the structure of the wan reference documents.

    Follow this structure:

    Prompt = Subject(Subject Description) + Scene(Scene Description) + Motion(Motion Description)+Aesthetic Control + Stylization

    Subject Description : Details about the subject’s appearance, described using adjectives or short phrases. For example: "A black-haired Miao girl wearing ethnic minority clothing" or "A flying fairy from another world, dressed in tattered yet elegant attire, with a pair of strange wings made of rubble fragments."
    Scene Description : Details about the environment where the subject is located, described using adjectives or short phrases.
    Motion Description : Describes the characteristics of movement, including amplitude, speed, and effects of the motion. Examples: "Violently swaying," "Slowly moving," or "Shattering glass."
    Aesthetic Control: Includes elements like Light Source, Lighting Environment, Shot Size (Framing), Camera Angle, Lens, and Camera Movement. For common cinematic terms, please refer to the Prompt Dictionary below.
    Stylization : Describes the visual style of the scene, such as "Cyberpunk," "Line-drawing illustration," or "Post-apocalyptic style." See the Prompt Bank below for common styling examples.

    Composition and Perspective (framing)
    Choose from: Close-up | Medium shot | Wide shot | Low angle | High angle | Overhead | First-person | FPV | Bird’s-eye | Profile | Extreme long shot | Aerial

    Motion (cinematic movement) (only used when describing video sources)
    Use: Dolly in | Dolly out | Zoom-in | Zoom-out | Tilt-up | Tilt-down | Pan left | Pan right | Follow | Rotate 180 | Rotate 360 | Pull-back | Push-in | Descend | Ascend | 360 Orbit | Hyperlapse | Crane Over | Crane Under | Levitate | Arc |

    Describe clearly how the camera moves and what it captures. Focus on lighting, mood, particle effects (like dust, neon reflections, rain), color palette if needed. Be visually descriptive, not emotional. Keep each motion or camera movement concise — each representing about 5 seconds of video.

    Use simple prompts, like you're instructing a 5-year old artist but follow Wan principles for syntax and wording so the lora can be properly trained with this caption data you're creating . Reference the attached images/videos and caption them. Format the captions as a prompt, so we dont need the label of scene subject action etc for the captions themselves. For example (From the raven lora we captioned for in the past)

    Raven, with pale lavender skin and her short, dark purple angular hair, is shown in a yoga pose resembling an upward--cut legs. A small, dark purple bowtie is at her neck, and white cuffs are on her wrists. Tall, dark purple bunny ears are perched on top of her head. Her hands are raised on either side of her headfacing dog, against a plain white background. A red gem is on her forehead. She wears her black long-sleeved leotard, a gold-colored belt with visible red gems, and dark blue cuffs with gold and red circular details on her wrists. Her body is arched, supported by her arms straight down to the floor and the tops of her bare feet. Her head is lifted, looking forward and slightly upwards with a surprised or inquisitive expression, her mouth slightly open. The Camera is waist height and lower looking up at Raven in a semi profile view. Camera Tracking Shot.

    Sample Prompt:
    GoldenBoyStyle. Interior setting. A young man with short dark hair, a red baseball cap backwards, wears a light green t-shirt. His face has an extreme comedic expression of lecherous excitement, with wide, crazed eyes, a broad, toothy grin, and prominent red blush marks on his cheeks. He is holding an open, dark brown notebook with a white pen, writing intently. Close-up shot, focusing on his exaggerated facial expression. Static Camera.
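Since the captions only came out about 80% right, a quick mechanical check for the most explicit rules in the prompt above can help find the ones that need a manual pass. This is a rough sketch, not part of the original workflow: `check_caption` and the pattern list are hypothetical, and the checks only cover wording the prompt bans outright (a missing trigger word, "or", "appears to be", "the subject", "male"/"female").

```python
import re

TRIGGER = "GurrenLagannStyle"

# Wording the captioning prompt asks the model to avoid.
BANNED_PATTERNS = {
    "hedging ('appears to be')": re.compile(r"\bappears to be\b", re.IGNORECASE),
    "alternatives ('or')": re.compile(r"\bor\b", re.IGNORECASE),
    "generic subject ('the subject')": re.compile(r"\bthe subject\b", re.IGNORECASE),
    "male/female": re.compile(r"\b(?:male|female)\b", re.IGNORECASE),
}


def check_caption(caption: str, trigger: str = TRIGGER) -> list[str]:
    """Return a list of rule violations found in one caption (empty if none)."""
    problems = []
    if not caption.lstrip().startswith(trigger):
        problems.append(f"does not begin with {trigger}")
    for name, pattern in BANNED_PATTERNS.items():
        if pattern.search(caption):
            problems.append(f"contains banned wording: {name}")
    return problems


good = "GurrenLagannStyle. A man with spiky blue hair stands on a cliff. Wide shot. Static Camera."
bad = "A female character who appears to be running."
print(check_caption(good))  # []
print(check_caption(bad))
```

A check like this cannot judge whether a description is accurate, so it complements the manual brush-up rather than replacing it.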

    Training info:

    I am going to keep the details short because I intend to release an article soon as a tutorial of how to train an anime style lora in wan 2.2 using this lora as an example and it will be super detailed. So I think you can reference that article when it's released. I will edit here with a link when it's out and link it as a resource on the model page. Short and sweet info for now:


    [model]

    type = 'wan'

    ckpt_path = '/data/trainingstuff/wan2.2_base_checkpoint'

    transformer_path = '/data/trainingstuff/wan2.2_base_checkpoint/low_noise_model'

    #transformer_path = '/data/trainingstuff/wan2.2_base_checkpoint/high_noise_model'

    dtype = 'bfloat16'

    transformer_dtype = 'float8'

    timestep_sample_method = 'logit_normal'

    #min_t = 0.875

    #max_t = 1

    min_t = 0

    max_t = 0.875

    # The high model settings are commented out. To train the high model, uncomment them and comment out the low noise transformer_path and min_t / max_t settings.

    [adapter]

    type = 'lora'

    rank = 32

    dtype = 'bfloat16'

    [optimizer]

    type = 'adamw_optimi'

    lr = 2e-5

    betas = [0.9, 0.99]

    weight_decay = 0.01

    eps = 1e-8
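The high/low switch described in the config comment above comes down to two values per model: the transformer subfolder and the timestep range. A small sketch of that mapping, using the paths and numbers from the config; the `STAGE_OVERRIDES` table and `model_overrides` function are hypothetical helpers, not part of diffusion-pipe.

```python
# Values copied from the config above; the dictionary layout is illustrative.
STAGE_OVERRIDES = {
    "low": {"subdir": "low_noise_model", "min_t": 0.0, "max_t": 0.875},
    "high": {"subdir": "high_noise_model", "min_t": 0.875, "max_t": 1.0},
}


def model_overrides(stage: str,
                    ckpt_path: str = "/data/trainingstuff/wan2.2_base_checkpoint") -> dict:
    """Return the [model] keys that differ between the high and low training runs."""
    if stage not in STAGE_OVERRIDES:
        raise ValueError(f"stage must be one of {sorted(STAGE_OVERRIDES)}")
    entry = STAGE_OVERRIDES[stage]
    return {
        "transformer_path": f"{ckpt_path}/{entry['subdir']}",
        "min_t": entry["min_t"],
        "max_t": entry["max_t"],
    }


for stage in ("low", "high"):
    print(stage, model_overrides(stage))
```

Everything else in the config ([adapter], [optimizer], dtypes) stays the same for both runs.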

    Let's talk briefly about the loss graphs, since I will put more details in the guide later.

    Low graph:

    It zig-zags downward. I think this is what we should expect from low graphs from now on. It kinda looks like it's going flat and then suddenly drops 0.001 or so. This honestly can keep going, so I’ll train it a bit more until I see a negative effect, but the style is there now, so it's ok by me to stop around 17k steps.

    High graph:

    Yep, this is what a normal 2.2 high graph looks like. It does a C-shape until it starts to flat-line. I trained it up to around 17k steps.

    High / Low Testing:

    I will put more info in the guide I'm working on. Short and sweet info here now.

    Having to test 2 different loras is super stressful and difficult, and the rules for character loras do not apply here. The advice I was seeing for 2.2 character loras is to train the high as little as you can and test whether it shows a blurry output; if you start to see character features, then it's overtrained. BUT with an anime style lora, if the features / details are not present in the high model, then the low model will look off and the style will not be there. So I am of the opinion that you need to just train both A TON and then do trial and error on the high and low. Probably use the same high-step low epoch and then test various epochs on the high model, i.e. LOW epoch 125 with HIGH epoch 5, 30, 100, 125 etc., and see what “looks” the closest. Also take note that this is not just a style lora but also a motion lora (remember I mentioned the fast movements earlier).

    So I run in batches of 4 of various high epoch on the high model with the same low model epoch. I recommend “extreme close up medium shot” on 832 x 480 to see how the style is.
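The batch described above (one fixed low epoch against several high epochs) can be laid out as a simple test grid before queueing the runs. This is a sketch only: `build_test_grid` is a hypothetical helper, and the fixed seed is an assumption added so the runs differ only in the high epoch.

```python
from itertools import product


def build_test_grid(low_epoch, high_epochs, prompts,
                    width=832, height=480, seed=0):
    """Pair one fixed low epoch with every high epoch for each test prompt."""
    return [
        {
            "low_epoch": low_epoch,
            "high_epoch": high_epoch,
            "prompt": prompt,
            "width": width,
            "height": height,
            "seed": seed,  # assumption: same seed so only the high epoch changes
        }
        for high_epoch, prompt in product(high_epochs, prompts)
    ]


grid = build_test_grid(
    low_epoch=125,
    high_epochs=[5, 30, 100, 125],
    prompts=["GurrenLagannStyle. Extreme close up medium shot of a woman with long red spiky hair."],
)
for run in grid:
    print(run["low_epoch"], run["high_epoch"], f'{run["width"]}x{run["height"]}')
```

With one prompt and four high epochs this gives the batch of 4 mentioned above; adding a second prompt (for example one aimed at motion) doubles it.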

    You can see the far left matches the show style best, even if the character isn’t matching Yoko properly (that can be fixed by better prompting and seeds). Maybe high 55 isn't bad either; it may be worth doing another run of tests between epochs 55 and 125 to home in on the better one. Also remember this isn't a character lora, this is for style and motion. As for motion, I do the same as above and look at how stiff the camera or the character movements become. There is also some motion distortion that I was not able to completely remove, and the high seems to make it more apparent sometimes, so I look out for that too. Run those tests and pick the best.

    Here is a great example of how much the high lora affects the final style (high 30 looks like a completely different character style).

    To make things simple my advice is: train both a ton (17K steps in this case), use the most trained low epoch and then test various epochs of high against it. Then when you find the high you could go back and test it with the low. Though I didn’t really do that, I think you can just train the low a bunch until you notice something going wrong. Also the loss # itself does NOT matter. What is important to look at is the TREND. It needs to follow the patterns in the example graphs. But you get great results at 0.1 loss, unlike in 2.1 wan where you aimed for 0.01 or 0.02. This still needs more time to find conclusions, but from this lora we found the best result is just the highest trained low/high together…
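To look at the trend rather than the raw loss number, one option is to smooth the logged loss with a moving average and compare the start and end of the smoothed curve. This is a minimal sketch of that idea, not the tooling used for the graphs above; the loss values in the example are synthetic and made up purely to imitate a noisy downward zig-zag.

```python
def moving_average(values, window):
    """Simple moving average; returns len(values) - window + 1 points."""
    if window <= 0:
        raise ValueError("window must be positive")
    smoothed = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]  # drop the value that left the window
        if i >= window - 1:
            smoothed.append(running / window)
    return smoothed


def loss_trend(values, window=100):
    """Change in smoothed loss from the first to the last window (negative = improving)."""
    smoothed = moving_average(values, window)
    if len(smoothed) < 2:
        raise ValueError("need more loss values than the window size")
    return smoothed[-1] - smoothed[0]


# Synthetic example data: slow downward drift plus a zig-zag around 0.12.
losses = [0.12 - 0.00001 * step + (0.01 if step % 2 else -0.01) for step in range(1000)]
print(f"trend over run: {loss_trend(losses):+.4f}")
```

A negative trend with a loss near 0.1 would match what is described above for Wan 2.2, whereas judging single raw values would mostly show the zig-zag noise.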

    Closing Thoughts:

    I don’t think I was 100% successful, but it’s good enough for now until I can learn more. All in all, this lora does need more time and testing, but for my mental health I need to take a break from it. I will go back and test more and learn more about how wan 2.2 works. There is some distortion in the eyes when characters are far away at lower resolutions. Fast movements also have some distortion. I think it's not as noticeable when things are moving, though, just like in traditional animation: if you freeze a fast-moving frame it will look odd (see the examples posted online from classic episodes of The Simpsons). I threw away almost 40K steps trying to fix it. Both the high and low are actually VERSION 2. I will provide some alternative epochs for the high lora, and you can experiment and let me know what works best. I will probably post any updates as new versions.

    Special Thanks:

    Thank you so much to everyone in the training channel on the Banodoco discord server. You guys gave me advice to fix so many problems, and it was nice to check in on the progress and get your feedback. As always, everything here is not new; it's all based on the research and work of Seruva19, so check out his loras and read his really detailed write-ups to get it straight from the source. And always a big thank you to Kijai for his help answering questions and making great nodes.

    Description

    5B version using the same dataset from the 2.2 14B version.

    This was more for fun than anything else, I wanted to see what the limits of the 5B model are when it comes to training style. Often the people come out sorta 3D anime style and no amount of training fixes it. Don't have high expectations for this... I think for environmental shots it might have some potential due to just how high the resolution can be pumped up to.

    Recommended settings:

    Steps 50 (yes I know a lot), you can get away with 30 steps though.

    CFG: 5 ; Shift: 8

    Scheduler: dpm++_sde

    If you wanna use turbowan, I recommend either 4 steps with flowmatch_distill or, probably best, 8-10 steps with dpm++_sde (CFG 1, Shift 8). Turbowan is super finicky: sometimes the output is amazing, and often it's very distorted. If you know the best settings for turbowan, please let me know and I'll update here.

    If the style doesn't trigger try adding "2d style anime" or "animated" to the prompt. It shouldn't need that word though.

    Training:

    Trained locally on a 3090 for around 60 hours (22K steps) using the same dataset as the 2.2 14B version. It may be overtrained; I will include some epochs in the "training data" that you can test. Epoch 333 had good loss, and things kind of flat-lined around epoch 300, but I think epoch 223 is quite good too. Check the training data zip, compare them, and let me know which is best.

    The settings are nothing special, except that I changed micro_batch_size_per_gpu from 1 to 8 since it was only using 10GB of VRAM. With this setting it uses around 23GB of VRAM, just enough without block swap.

    micro_batch_size_per_gpu = 8

    [model]

    type = 'wan'

    ckpt_path = '/data/trainingstuff/wan2.2_5B_base_checkpoint'

    transformer_path = '/data/trainingstuff/wan2.2_5B_base_checkpoint'

    dtype = 'bfloat16'

    transformer_dtype = 'float8'

    timestep_sample_method = 'logit_normal'

    [adapter]

    type = 'lora'

    rank = 32

    dtype = 'bfloat16'

    [optimizer]

    type = 'adamw_optimi'

    lr = 2e-5

    betas = [0.9, 0.99]

    weight_decay = 0.01

    eps = 1e-8