Wan 2.2 largely works if you use the wan2.2_t2v_low_noise_14B file in the Model Loader node, and it has a much more photorealistic look. It also seems to significantly reduce color drift if you keep the resolution above 720p. Wan 2.1 seems better for LoRAs and has a more neutral look, though. Specifically, Wan 2.1 1.3B seems best for LoRAs if you are trying to do something drastically different, as the larger models seem more rigid and locked down.
This is a workflow I posted earlier on Reddit/Github:
https://www.reddit.com/r/StableDiffusion/comments/1k83h9e/seamlessly_extending_and_joining_existing_videos/
https://github.com/ali-vilab/VACE/issues/45
It exposes a somewhat understated feature of WAN VACE: temporal extension. It is underwhelmingly described as "first clip extension." Even worse, the first-clip example the VACE devs posted only does image-to-video, because their example input is just a single frame. In actuality, it can auto-fill pretty much any missing footage in a video that is masked out, whether that's full frames missing between existing clips or masked-out regions (faces, objects). It seems the VACE devs themselves do not fully realize what their model can do, since none of their examples explore this.
It's better than Image-to-Video / Start-End Frame because it maintains the motion from the existing footage (and also connects it to the motion in later clips).
Watch this video to see how the source video (left) and mask video (right) look. The missing footage (gray) is in multiple places (gaps between clips, a masked-out face, etc.), and all of it is then filled in by VACE in one shot.
It takes in two videos: your source video, with the missing frames/content in gray, and a black-and-white mask video (the missing gray content recolored to white). I usually make the mask video by setting brightness to -999 or something to that effect on the original, while recoloring the gray to white.
Make sure to keep it at about 5 seconds to match Wan's default output length (81 frames at 16 fps, or the equivalent if the FPS is different). You can download VACE's example clip here for the exact length and gray color (#7F7F7F) to use in the source video: https://huggingface.co/datasets/ali-vilab/VACE-Benchmark/blob/main/assets/examples/firstframe/src_video.mp4
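As a rough illustration of the frame math above, here's a minimal Python sketch of the source/mask layout. The 81-frame budget and the #7F7F7F gray value are from the post; the function names are made up, and frames are modeled as single RGB tuples rather than real images (a real pipeline would operate on actual video frames, e.g. via a Load Video node):

```python
GRAY = (0x7F, 0x7F, 0x7F)   # VACE's "missing footage" color (#7F7F7F)
BLACK = (0, 0, 0)
WHITE = (255, 255, 255)
WAN_FRAMES = 81             # ~5 s at 16 fps (Wan's default output length)

def pad_with_gray(frames, total=WAN_FRAMES):
    """Append gray placeholder frames until the clip fills Wan's budget."""
    missing = max(0, total - len(frames))
    return frames + [GRAY] * missing

def derive_mask(frames):
    """Mask video: white wherever the source is the gray placeholder,
    black everywhere else (the 'brightness -999' trick, in effect)."""
    return [WHITE if f == GRAY else BLACK for f in frames]

# Example: a 2 s clip (~33 frames) extended to the full 81-frame window.
src = pad_with_gray([(10, 20, 30)] * 33)
mask = derive_mask(src)
print(len(src), mask.count(WHITE))  # 81 48
```

The point is just that the mask is fully determined by where the gray is, so however you place the gray (end, start, middle, boxes over objects), the same thresholding step produces the matching mask.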
In the workflow itself, I recommend setting Shift to 1 and CFG to around 2-3 so that it primarily focuses on smoothly connecting the existing footage. I found that higher values sometimes introduced artifacts.
Tips to maximize video quality and minimize loss of details or color-drifting:
Keep CFG 2-3 and Shift=1 to retain as much detail from the existing footage as possible.
Render at 1080p resolution to minimize color drift. CausVid helps reduce the render time by over 5x (8 steps instead of 50).
Use the Color Match node in ComfyUI on the MKL setting to reduce the drift (not always applicable if the scene changes a lot).
In a video editor, post-correct the hue by about 2-7 and desaturate slightly to counteract the drift.
Start the scene with regular I2V when possible (no color drift) and mask new changes in with VACE, feathering to blend the pieces and using as much of the I2V scene as possible. Alternatively, extend in FramePack with Video Input or in SkyReels V2 to get a "skeleton" of the scene without color drift, then patch changes in with VACE.
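For the hue/desaturation touch-up in the last tips, here's a minimal per-pixel sketch using Python's stdlib colorsys. The offset values are just the rough 2-7 range mentioned above, and the function name is made up; actual grading would of course happen in your video editor or a ComfyUI color node:

```python
import colorsys

def counteract_drift(rgb, hue_shift_deg=-4.0, desat=0.95):
    """Shift hue a few degrees and slightly desaturate one RGB pixel
    to counteract color drift (per the tips above)."""
    r, g, b = (c / 255.0 for c in rgb)
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    h = (h + hue_shift_deg / 360.0) % 1.0  # small hue rotation
    s *= desat                             # mild desaturation
    return tuple(round(c * 255) for c in colorsys.hsv_to_rgb(h, s, v))

# A drifted warm pixel nudged back toward neutral; neutral grays pass
# through unchanged since their saturation is already zero.
print(counteract_drift((200, 120, 100)))
print(counteract_drift((128, 128, 128)))
```

The exact hue direction and amount depend on which way your footage drifted, so treat the defaults as placeholders to eyeball against the original colors.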
Models to download:
models/diffusion_models: Wan 2.1/2.2 T2V (pick 1, matching VACE's 14B/1.3B below):
Wan 2.2 T2V Low Noise 14B FP16: https://huggingface.co/Comfy-Org/Wan_2.2_ComfyUI_Repackaged/blob/main/split_files/diffusion_models/wan2.2_t2v_low_noise_14B_fp16.safetensors
Wan 2.2 T2V Low Noise 14B FP8: https://huggingface.co/Comfy-Org/Wan_2.2_ComfyUI_Repackaged/blob/main/split_files/diffusion_models/wan2.2_t2v_low_noise_14B_fp8_scaled.safetensors
Wan 2.1 14B FP16: https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/blob/main/split_files/diffusion_models/wan2.1_t2v_14B_fp16.safetensors
Wan 2.1 14B FP8: https://huggingface.co/Kijai/WanVideo_comfy/blob/main/Wan2_1-T2V-14B_fp8_e4m3fn.safetensors
Wan 2.1 1.3B FP16: https://huggingface.co/IntervitensInc/Wan2.1-T2V-1.3B-FP16/blob/main/diffusion_pytorch_model.safetensors
Wan 2.1 1.3B FP8: https://huggingface.co/Kijai/WanVideo_comfy/blob/main/Wan2_1-T2V-1_3B_fp8_e4m3fn.safetensors
models/diffusion_models: WAN VACE (pick 1, matching Wan's 14B/1.3B above):
14B BF16: https://huggingface.co/Kijai/WanVideo_comfy/blob/main/Wan2_1-VACE_module_14B_bf16.safetensors
14B FP8: https://huggingface.co/Kijai/WanVideo_comfy/blob/main/Wan2_1-VACE_module_14B_fp8_e4m3fn.safetensors
1.3B BF16: https://huggingface.co/Kijai/WanVideo_comfy/blob/main/Wan2_1-VACE_module_1_3B_bf16.safetensors
models/text_encoders: umt5-xxl-enc (pick 1):
BF16: https://huggingface.co/Kijai/WanVideo_comfy/blob/main/umt5-xxl-enc-bf16.safetensors
FP8: https://huggingface.co/Kijai/WanVideo_comfy/blob/main/umt5-xxl-enc-fp8_e4m3fn.safetensors
models/vae: WAN 2.1 VAE (all WAN versions):
https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/blob/main/split_files/vae/wan_2.1_vae.safetensors
models/loras: WAN CausVid V2 14B T2V, reduces steps to 8 (for Wan 2.1 14B only): https://huggingface.co/Kijai/WanVideo_comfy/blob/main/Wan21_CausVid_14B_T2V_lora_rank32_v2.safetensors
Tutorial/walkthrough videos:
Comments (19)
So this is just for the 1.3b?
Yeah, there isn't an update to VACE yet; hopefully soon.
Isn't this exactly what the WanVideo VACE Start To End Frame node does?
No. As I said in the description, start/end frames do not maintain the motion from the existing video (you get the reset/snap feel of the video being spliced).
@pftq Aaaah ok thanks. Gonna have to try that out later!
If I understand the workflow correctly, in order to join my 2 short videos, I first need to combine them into one video, insert gray frames between them, and then put that video into the Load Video node, right?
Also, how do you generate the mask video based on the source video? I'm trying to seamlessly join my 2 videos (one is 2 seconds, the other is 3 seconds) using the workflow, but I'm stuck on the part about the black-and-white mask video. No offense, but the tutorial video itself is not really helpful.
Yes to the gray frames between the two video clips. For the mask video, you want to recolor the gray to white and make everything else black. I usually do so by setting brightness to -999 or something to that effect. The first YouTube video has the mask and source video side by side so you can see how it translates.
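A minimal Python sketch of that join-plus-mask step, assuming 16 fps and using made-up helper names (frames are modeled as simple labels rather than real images, just to show the layout):

```python
GRAY = "#7F7F7F"  # the placeholder color between the two clips

def join_with_gap(clip_a, clip_b, gap_frames):
    """Source video: clip A, then gray placeholder frames, then clip B."""
    return clip_a + [GRAY] * gap_frames + clip_b

def to_mask(source):
    """Mask video: gray recolored to white, everything else crushed to
    black (the same result as the brightness -999 trick in an editor)."""
    return ["white" if f == GRAY else "black" for f in source]

# A 2 s clip + 16 gray frames (1 s at 16 fps) + a 2 s clip = 80 frames,
# just under Wan's 81-frame budget.
src = join_with_gap(["a"] * 32, ["b"] * 32, 16)
print(len(src), to_mask(src).count("white"))  # 80 16
```

The mask is derived mechanically from where the gray sits, which is why any editor trick that turns gray to white and everything else to black works.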
@pftq Ok got it. Btw, how many gray frames do you think would be enough to achieve a smooth transition? Is 16 frames (1 sec), 32 frames (2 sec), or 8 frames (0.5 sec) enough?
And do you use other software outside ComfyUI to insert the gray frames and recolor them, or does ComfyUI itself have this kind of functionality? A node, maybe.
@fronyax I use After Effects - but really any video program will do. I thought about making nodes to create gray frames between two videos, but then it would make people not realize you could put the gray "anywhere" in the video and not just in one spot. For example, you might want gray frames at the start or at the end as well. And of course, putting boxes over the face or objects to inpaint. This feature was already very undersold as "first clip extension" and there are already a lot of general VACE workflows for more limited use cases, so I wanted to really show what VACE could do here unrestricted.
I think the number of existing frames matters more than the gray frames; the existing frames are what the model uses to figure out what's going on. For that, I usually have at least 1 second of footage. The number of gray frames really just comes down to how you want to time the scene (I don't know what you are making). It's more about whether there's enough time for the start footage to connect to the end footage in a logical way. It'll fail harder if it's forced to teleport someone from A to B because it only had a split second to do it, for example.
@fronyax After playing with pftq's workflow for a while (and failing because my system chokes on the Wan Wrapper process due to heavy VRAM usage) I decided to work up a version using native nodes that also automates the join part as mentioned by pftq in this thread. If you want to try my workflow, it's here: https://civitai.com/models/1695320
@darkroast175696 Oh, thank you. I have been trying to automate this process but couldn't wrap my head around the method of gathering the required frames from the two videos to be joined, generating the connecting video, and then putting the changes into an mp4. Thank you both for your hard work.
So, if I take a 5 second video of my dog walking and I append like 100 frames of gray fullscreen frame to the end of it and run this, will it try to add 100 frames that matches the original video... i.e., adds 100 frames of the dog walking? I guess I should just try it out...
That's the idea, yes. With regular I2V start/end frame, you might get the walking motion reset or jitter, but this should keep it going smoothly with the previous motion in mind. That said, it should be 81 frames max due to the Wan limitation.
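A tiny sketch of that 81-frame budget in Python (the constant and function name are just illustrative):

```python
WAN_MAX = 81  # Wan's per-generation frame cap mentioned above

def gray_frames_to_append(existing_frames, requested):
    """Clamp the number of gray extension frames appended to a clip so
    the combined video stays within one Wan generation."""
    budget = max(0, WAN_MAX - existing_frames)
    return min(requested, budget)

# A 5 s / 16 fps clip is ~80 frames, so only 1 gray frame fits; trim
# the source clip first if you want a longer extension in one pass.
print(gray_frames_to_append(80, 100))  # 1
print(gray_frames_to_append(40, 100))  # 41
```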
@pftq thanks for the reply. I tried it and it worked pretty good for my use case. I was actually able to 'expand' all 437 frames of my source video (granted its a small 512x512 video that I expanded to 512x768). Thx for the workflow!
@jm112368767 Great to hear!
Thanks for sharing. It worked perfectly
What cinematic finetuning of Skyreels? :O
I wanted to say thanks for the workflow and the lesson in Vace's capabilities. Because my hardware really struggles with the WanWrapper nodes workflows, I made a simplified variation using native Comfy nodes that only focuses on joining two videos together. That way I could automate the creation of the masks and save myself a bunch of time and effort in video editing tools since I'm a novice at those. If you want to play with it, it's here: https://civitai.com/models/1695320
There are so many possibilities with this, from just extending a video to more frames to out-filling the video to a bigger size, and so on. I'm sure I'll be playing with this system for a while. Thanks again for sharing!