SCAIL-2 Two-Person Biological Long-Video Driving Workflow

Watch the full video first to see how this SCAIL-2 two-person long-video driving workflow behaves in practice. It shows the two characters in a reference image following a two-person driving video, while the workflow helps keep the left and right character assignments, pose structure, and temporal continuity stable as the video is extended.

This ComfyUI workflow is designed for SCAIL-2 two-person biological long-video driving. Its main purpose is to transfer global two-person motion from a driving video onto two characters in a reference image. This is not a local replacement workflow. It is a skeleton-guided animation workflow where both characters follow the full-body movement from the driving video while keeping their assigned identities and visual roles.

The workflow is built around wan2.1_14B_SCAIL_2_fp8_scaled.safetensors as the main SCAIL-2 model. It also uses WAN VAE, UMT5 XXL WAN text encoding, CLIP Vision identity encoding, SAM3 subject tracking, SCAIL2ColoredMask, WanSCAILToVideo, SamplerCustom, VAEDecode, ForLoop continuation, frame trimming, color matching, video combining, and original audio restoration. A LoRA enhancement chain is kept as well, including LightX2V, WanAnimate relight, Wan2.2 Lightning I2V, FastWan 480p, Wan21 PusaV1, Wan2.2 Fun InP, and stage-based enhancement LoRAs.

The first important rule of the workflow is size alignment. Both the reference image and the driving video are aligned to 512×896 before entering SAM3, CLIPVision, and SCAIL. This avoids mismatched masks, unstable pose conditioning, and identity drift caused by inconsistent input dimensions.
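The sketch below illustrates one common way to bring inputs of different shapes to a single 512×896 frame: scale until the target is fully covered, then center-crop. It assumes 512 is the width and 896 the height, and the function name `cover_crop_box` is hypothetical; the workflow's own resize node may pad or crop differently.

```python
TARGET_W, TARGET_H = 512, 896  # assumption: width 512, height 896 (portrait)


def cover_crop_box(src_w: int, src_h: int,
                   dst_w: int = TARGET_W, dst_h: int = TARGET_H):
    """Hypothetical helper: scale-to-cover, then center-crop to dst_w x dst_h.

    Returns (scaled_w, scaled_h, left, top): the size after uniform scaling and
    the top-left corner of the dst_w x dst_h crop inside the scaled image.
    """
    # Use the larger scale factor so neither side ends up smaller than the target.
    scale = max(dst_w / src_w, dst_h / src_h)
    scaled_w = max(dst_w, round(src_w * scale))
    scaled_h = max(dst_h, round(src_h * scale))
    # Center the crop window; any excess is split between both sides.
    left = (scaled_w - dst_w) // 2
    top = (scaled_h - dst_h) // 2
    return scaled_w, scaled_h, left, top


if __name__ == "__main__":
    # A 1080x1920 driving video and a 1024x1792 reference image both end up 512x896.
    print(cover_crop_box(1080, 1920))  # (512, 910, 0, 7)
    print(cover_crop_box(1024, 1792))  # (512, 896, 0, 0)
```

Applying the same rule to both inputs is the point: whatever the source shapes, the reference image and every driving frame land on an identical 512×896 grid before masks and pose conditioning are computed.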

The second key rule is two-person tracking. SAM3 is configured with max_objects=2. SCAIL2ColoredMask uses object_indices=0,1 and sort_by=left_to_right. This means the workflow treats the left and right subjects as separate controlled identities. The goal is to reduce identity mixing, role swapping, clothing confusion, and left-right character instability during two-person motion transfer.
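To make the left-to-right rule concrete, here is a small sketch of how two tracked subjects could be ordered by the horizontal center of their bounding boxes and mapped to object indices 0 and 1. The `TrackedObject` type and `assign_left_to_right` function are hypothetical; they only mirror the behavior described for `sort_by=left_to_right`, not the actual code of SAM3 or SCAIL2ColoredMask.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TrackedObject:
    track_id: int   # hypothetical tracker ID, in arbitrary detection order
    x_min: float    # left edge of the subject's bounding box, in pixels
    x_max: float    # right edge of the subject's bounding box, in pixels

    @property
    def center_x(self) -> float:
        return (self.x_min + self.x_max) / 2


def assign_left_to_right(objects: List[TrackedObject],
                         expected: int = 2) -> Dict[int, int]:
    """Map object index (0 = leftmost) to tracker ID, like sort_by=left_to_right."""
    if len(objects) != expected:
        # With max_objects=2, any other count usually means tracking needs a manual check.
        raise ValueError(f"expected {expected} tracked subjects, got {len(objects)}")
    ordered = sorted(objects, key=lambda obj: obj.center_x)
    return {index: obj.track_id for index, obj in enumerate(ordered)}


if __name__ == "__main__":
    detections = [TrackedObject(7, 600, 800), TrackedObject(3, 50, 250)]
    print(assign_left_to_right(detections))  # {0: 3, 1: 7}
```

Sorting by position rather than by detection order is what makes index 0 always mean "the left person", regardless of which subject the tracker happened to find first.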

The workflow uses replacement_mode=false, which means it focuses on two-person skeleton guidance rather than local character replacement. The reference image provides the two target characters, the driving video provides the motion, and the colored mask system links both sides together before entering WanSCAILToVideo.
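As a rough mental model of a colored mask, the toy sketch below merges two per-person binary masks into one RGB map, giving each object index its own color so the left and right subjects stay distinguishable in a single conditioning image. The palette, the `colored_mask` function, and the rule that a later index wins where masks overlap are all assumptions for illustration; the real colors and encoding used by SCAIL2ColoredMask may differ.

```python
from typing import List, Tuple

RGB = Tuple[int, int, int]
Mask = List[List[int]]  # binary mask, rows of 0/1 values

# Hypothetical palette: index 0 (left subject) red, index 1 (right subject) blue.
PALETTE: List[RGB] = [(255, 0, 0), (0, 0, 255)]
BACKGROUND: RGB = (0, 0, 0)


def colored_mask(masks: List[Mask]) -> List[List[RGB]]:
    """Merge per-person masks into one RGB map; later indices overwrite overlaps."""
    height, width = len(masks[0]), len(masks[0][0])
    out = [[BACKGROUND] * width for _ in range(height)]
    for index, mask in enumerate(masks):
        color = PALETTE[index]
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    out[y][x] = color
    return out


if __name__ == "__main__":
    left = [[1, 0], [1, 0]]
    right = [[0, 1], [0, 1]]
    for row in colored_mask([left, right]):
        print(row)
```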

The long-video structure is one of the strongest parts of the workflow. The first segment is 65 frames and establishes the two characters, their identity relationship, pose guidance, mask relationship, and motion direction. Each continuation segment is 81 frames. Each loop removes 5 overlapping frames, so every loop effectively adds 76 new frames. The loop count is calculated as max(1, ceil((F - 65) / 76)), where F is the loaded frame count of the driving video, so the workflow generates 65 + 76 × loops frames in total, which is always at least F. This makes the workflow suitable for longer dance videos and longer two-person action clips.
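The frame arithmetic can be checked with a few lines of Python. `loop_count` reproduces the loop-count formula; `stitch` is a hypothetical sketch that assumes the 5 dropped frames are the leading frames of each continuation segment (the ones that repeat the previous segment's tail), which the description does not state explicitly.

```python
import math
from typing import List, Sequence

FIRST_SEGMENT = 65       # frames in the first generated segment
CONTINUATION = 81        # frames in each continuation segment
OVERLAP = 5              # frames removed per loop
NEW_PER_LOOP = CONTINUATION - OVERLAP  # 76 new frames per loop


def loop_count(f: int) -> int:
    """Number of continuation loops for a driving video with f loaded frames."""
    return max(1, math.ceil((f - FIRST_SEGMENT) / NEW_PER_LOOP))


def generated_frames(f: int) -> int:
    """Total frames produced before any trimming: 65 + 76 * loops."""
    return FIRST_SEGMENT + NEW_PER_LOOP * loop_count(f)


def stitch(first: Sequence, continuations: List[Sequence]) -> list:
    """Assumption: drop the first OVERLAP frames of every continuation segment."""
    out = list(first)
    for segment in continuations:
        out.extend(segment[OVERLAP:])
    return out


if __name__ == "__main__":
    for f in (65, 141, 142, 300):
        print(f, loop_count(f), generated_frames(f))
```

Because the loop count rounds up, the generated length always covers F (for example, F = 300 gives 4 loops and 369 frames), and the workflow's frame-trimming step is presumably what brings the result back to the source length. The max(1, ...) floor also means a clip shorter than 65 frames still runs one continuation loop and produces 141 frames.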

The final video output uses the final generated frame sequence, original video audio, and the unified frame-rate node. This keeps the rendered result aligned with the source rhythm and avoids manually rebuilding the audio track.
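The sketch below shows why trimming matters for audio sync. It assumes the final video keeps only the first F generated frames and that the driving video was loaded at the same 24 fps used for output; `trim_to_source` and `audio_drift_seconds` are hypothetical helpers, not nodes in the workflow.

```python
from typing import Sequence

FPS = 24  # unified output frame rate listed in the workflow features


def trim_to_source(frames: Sequence, source_frame_count: int) -> list:
    """Keep only as many generated frames as the driving video had."""
    return list(frames[:source_frame_count])


def audio_drift_seconds(frame_count: int, audio_seconds: float, fps: int = FPS) -> float:
    """Positive: video runs longer than the audio; negative: audio runs longer."""
    return frame_count / fps - audio_seconds


if __name__ == "__main__":
    # A 10 s driving clip at 24 fps has F = 240 frames; the loop math yields 293.
    generated = list(range(293))
    final = trim_to_source(generated, 240)
    print(audio_drift_seconds(len(generated), 10.0))  # roughly 2.21: video outlasts audio
    print(audio_drift_seconds(len(final), 10.0))      # 0.0
```

The same reasoning explains the unified frame rate: if a 30 fps source were loaded without resampling and written out at 24 fps, the same frames would play 25% longer than the original audio.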

Main features:

SCAIL-2 two-person long-video driving workflow
Two-person full-body skeleton guidance
Reference image + two-person driving video structure
512×896 unified input alignment
SAM3 max_objects=2 subject tracking
Left-to-right identity assignment
SCAIL2ColoredMask dual-person mask control
replacement_mode=false for global motion driving
CLIP Vision reference identity encoding
WanSCAILToVideo first segment generation
65-frame initial segment
81-frame continuation segment
5-frame overlap removal
ForLoop long-video continuation
Original audio restored in final output
Unified 24fps frame-rate control
Multi-LoRA enhancement chain
WAN VAE and UMT5 WAN text encoder support

Suggested workflow:

Prepare one clear two-person reference image and one two-person driving video. The reference image should show both characters clearly, ideally full-body or near full-body, with minimal occlusion. The driving video should likewise contain two visible people with readable movement, stable framing, and limited background clutter. Start with the default 512×896 alignment. Before rendering, confirm that SAM3 tracks both subjects and that the left-right assignment is correct. If the two identities swap, check the subject positions and the left-to-right sorting. If the motion becomes unstable, switch to a cleaner driving video with less crossing and occlusion. Run a short test segment before launching the full long-video loop.
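Before committing to the full loop, a quick check for left-right swaps can save a render. The sketch assumes you can export per-frame bounding boxes for tracked objects 0 and 1 (a hypothetical data format; `find_swaps` is not part of the workflow) and flags every frame where object 0 is no longer left of object 1, which is where crossing or occlusion is most likely to cause role swapping.

```python
from typing import List, Tuple

Box = Tuple[float, float]  # (x_min, x_max) of one subject's bounding box


def center(box: Box) -> float:
    return (box[0] + box[1]) / 2


def find_swaps(tracks: List[Tuple[Box, Box]]) -> List[int]:
    """Return indices of frames where object 0's center is not left of object 1's."""
    return [i for i, (obj0, obj1) in enumerate(tracks) if center(obj0) >= center(obj1)]


if __name__ == "__main__":
    tracks = [
        ((0, 100), (300, 400)),    # clearly separated
        ((150, 250), (200, 300)),  # overlapping but still in order
        ((350, 450), (100, 200)),  # the subjects have crossed
    ]
    print(find_swaps(tracks))  # [2]
```

An empty result is a reasonable precondition for the long run; a non-empty one points to the frames worth inspecting, or suggests choosing a driving clip with less crossing.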

⚙️ RunningHub Workflow

Try the workflow online right now — no installation required.
👉 Workflow: https://www.runninghub.ai/post/2064973862177173505?inviteCode=rh-v1111

If the results meet your expectations, you can later deploy it locally for customization.

🎁 Fan Benefits: Register to get 1000 points, plus 100 points for every daily login, and enjoy 48 GB RTX 4090 performance!

📺 Bilibili Updates (Mainland China & Asia-Pacific)

If you're in mainland China or the Asia-Pacific region, you can watch the video below to see the workflow demonstration and creative breakdown.
📺 Bilibili Video: https://www.bilibili.com/video/BV1w2Ei6pEsJ/

☕ Support Me on Ko-fi

If you find my content helpful and want to support future creations, you can buy me a coffee ☕.
Every bit of support helps me keep creating — just like a spark that can ignite a blazing flame.
👉 Ko-fi: https://ko-fi.com/aiksk

💼 Business Contact

For collaboration or inquiries, please contact aiksk95 on WeChat.

I keep updating model resources on Quark Drive:
👉 https://pan.quark.cn/s/20c6f6f8d87b
These resources are mainly intended for local users, to make creating and learning easier.

Files

scail2TwoPerson_v10.json
