This ComfyUI workflow is designed for single-person InfiniteTalk native looping, audio-driven talking video generation, and long-duration digital human video production. The main goal of this workflow is to generate a talking character video from a start image and an audio file, then continue the motion through a native loop structure so creators can extend the video duration more naturally instead of being limited to one short fixed segment.
Unlike a simple image-to-video talking-head workflow, this graph is built around continuation rather than a single short speech clip. It uses InfiniteTalk-style video conditioning, audio-encoder features, previous-frame continuation, and repeated generation passes to help maintain character identity, mouth movement, facial motion, and temporal continuity across longer outputs. This makes it suitable for AI presenters, digital humans, narration avatars, virtual hosts, character dialogue clips, product explanation videos, and long-form talking-head content.
The workflow is built on the Wan 2.1 InfiniteTalk structure. It uses the Wan video model route, UMT5 text encoder, Wan VAE, wav2vec2 audio encoder, InfiniteTalk model patch, image input, mask input, audio input, sampler control, and CreateVideo output nodes. The core generation node is WanInfiniteTalkToVideo, which receives the model, model patch, positive and negative conditioning, VAE, audio encoder output, start image, optional previous frames, mask information, width, height, video length, motion frame count, and audio scale.
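To make the graph structure easier to picture, here is a minimal API-format sketch of that core node in Python, built only from the inputs listed above. The node ids, example values, and exact input key names are illustrative assumptions; the real InfiniteTalk node pack may name them differently.

```python
# Illustrative only: an API-format sketch of the core InfiniteTalk wiring.
# Keys follow the inputs listed above; the actual node pack may use
# slightly different names, so treat this as a conceptual map, not a spec.
infinite_talk_node = {
    "class_type": "WanInfiniteTalkToVideo",
    "inputs": {
        "model": ["wan_model_loader", 0],           # Wan 2.1 video model
        "model_patch": ["infinitetalk_patch", 0],   # InfiniteTalk model patch
        "positive": ["umt5_positive", 0],           # UMT5 text conditioning
        "negative": ["umt5_negative", 0],
        "vae": ["wan_vae_loader", 0],
        "audio_encoder_output": ["wav2vec2_encode", 0],
        "start_image": ["load_image", 0],
        "previous_frames": ["previous_segment", 0], # optional continuation input
        "mask": ["speaker_mask", 0],
        "width": 480,                               # example portrait values
        "height": 832,
        "length": 81,                               # frames per segment (example)
        "motion_frame_count": 9,
        "audio_scale": 1.0,
    },
}
```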
The key concept is native looping. In many talking video workflows, the user generates a short segment and then manually loops it in video editing software. That often creates visible cuts, frozen transitions, or repeated motion patterns. This workflow is designed to continue from previous generated frames, making the next segment more connected to the earlier motion. The previous_frames input allows the workflow to reuse the earlier visual state as part of the next generation stage, which is important for longer talking videos.
This workflow is especially useful when the user wants to build an “unlimited duration” talking character pipeline. In practical terms, long videos can be produced by segmenting the audio and generating multiple connected clips. Each segment can use the previous frames from the earlier output to maintain continuity. The result is not literally infinite in one single sampling run, but it provides a repeatable native-loop method for extending the video in a more controlled way.
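The continuation idea can be summarized in a short driver sketch. The `generate_segment` function below is a hypothetical stand-in for one full run of the graph (conditioning, sampling, decoding); only the looping and the previous-frame hand-off are the point here.

```python
# Conceptual sketch of the native-loop idea: each segment is generated from
# an audio chunk plus the tail frames of the previous segment.
# generate_segment() is a hypothetical stand-in for one run of the graph
# (WanInfiniteTalkToVideo -> SamplerCustomAdvanced -> VAEDecode).

def generate_long_video(start_image, audio_chunks, generate_segment, motion_frames=9):
    all_frames = []
    previous_frames = None  # first segment starts from the still image only
    for chunk in audio_chunks:
        frames = generate_segment(
            start_image=start_image,
            audio=chunk,
            previous_frames=previous_frames,
        )
        all_frames.extend(frames)
        # Carry the last few frames forward so the next segment continues
        # the existing motion instead of restarting from the still image.
        previous_frames = frames[-motion_frames:]
    return all_frames
```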
The workflow starts with the input image. The start image defines the character identity, face, clothing, body framing, and visual style. A clear portrait or half-body image usually works best. The face should be visible, the mouth area should not be blocked, and the character should not be too small in the frame. Stable front-facing or slightly angled portraits are usually better for talking video than extreme angles or heavily occluded faces.
The audio file drives the speaking performance. The workflow uses an audio encoder, such as wav2vec2-chinese-base_fp16, to extract speech features from the input audio. These audio features are then passed into InfiniteTalk so the generated video can respond to the voice rhythm. This is important for lip movement, mouth timing, facial expression changes, and natural speech pacing. Clean audio usually gives better results than noisy or heavily compressed audio.
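For intuition, this is roughly what the audio-encoder stage computes, shown here outside ComfyUI with a public wav2vec2 checkpoint from Hugging Face. The workflow itself uses its own audio encoder node with the wav2vec2-chinese-base_fp16 checkpoint, so treat this purely as a conceptual illustration.

```python
# Rough illustration of wav2vec2 feature extraction outside ComfyUI.
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

waveform, sr = torchaudio.load("speech.wav")
waveform = torchaudio.functional.resample(waveform, sr, 16000).mean(dim=0)  # mono, 16 kHz

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    features = model(**inputs).last_hidden_state  # (1, time_steps, hidden_dim)
print(features.shape)
```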
The workflow also includes audio passthrough into the final video output. CreateVideo combines the generated image frames with the audio track, allowing the final video to keep the original speech or narration. This is important for digital human workflows, because the visual output must stay synchronized with the audio. A talking video workflow is only useful if the generated mouth movement and the final audio remain aligned.
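Inside the workflow, CreateVideo handles this muxing step. If you ever need to rebuild the clip outside ComfyUI, for example after retouching exported frames, an equivalent ffmpeg call might look like the sketch below; paths and the 25 fps value are example settings.

```python
# Rebuild the final clip from exported frames plus the original audio.
import subprocess

subprocess.run([
    "ffmpeg", "-y",
    "-framerate", "25",                 # must match the workflow fps
    "-i", "frames/frame_%05d.png",      # decoded frames from VAEDecode
    "-i", "speech.wav",                 # original narration audio
    "-c:v", "libx264", "-pix_fmt", "yuv420p",
    "-c:a", "aac",
    "-shortest",                        # stop at whichever stream ends first
    "talking_video.mp4",
], check=True)
```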
The text encoder route uses UMT5 for Wan conditioning. The prompt can describe the character, visual style, expression, scene, lighting, camera stability, and desired performance. For a single-person talking video, the prompt should usually remain stable and restrained. Overly complex prompts may cause identity drift or unstable motion. A good prompt should describe a single person speaking naturally, with subtle head movement, stable camera, clear face, natural mouth movement, and consistent lighting.
The negative conditioning route is used to suppress common problems such as distorted mouth movement, unnatural facial expression, jitter, flickering, identity drift, face deformation, unstable eyes, duplicated teeth, incorrect mouth shape, broken jaw motion, or excessive head movement. For talking video workflows, the most important visual details are face stability, mouth quality, eye consistency, and smooth frame-to-frame motion.
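As a concrete starting point, a prompt pair in the spirit of the two paragraphs above might look like the following; adjust the wording to your character and scene.

```python
# Example prompt pair; tune per character, scene, and voice style.
positive_prompt = (
    "a single person speaking naturally to the camera, subtle head movement, "
    "clear face, natural mouth movement, stable camera, consistent soft lighting"
)
negative_prompt = (
    "distorted mouth, bad lip sync, flickering face, unstable eyes, duplicated teeth, "
    "deformed jaw, identity drift, excessive head movement, blurry face, unnatural expression"
)
```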
The workflow includes mask logic. Mask inputs can help define the character or speaker region when needed. In more complex InfiniteTalk setups, masks are often used to separate speaker areas or control motion regions. For a single-person workflow, the mask can help guide where the model should focus character motion and where it should remain stable. This can reduce unwanted background movement and help preserve the main subject.
The video generation settings include width, height, length, motion frame count, and audio scale. These values are important because talking video generation depends heavily on duration and timing. In the included setup, the workflow uses a portrait-friendly generation structure with 25 fps video creation. The length and motion frame count determine how much video is generated per segment. The audio scale affects how strongly the audio controls the character motion.
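A quick way to sanity-check these values is to compare the frame count against the audio chunk duration at 25 fps. The 4n+1 frame rounding in the sketch below is an assumption borrowed from common Wan video setups and may not apply to your exact node version.

```python
# Sanity check: at 25 fps the generated frame count should roughly cover
# the audio chunk that drives the segment.
fps = 25
segment_audio_seconds = 3.24                       # one audio chunk (example)

raw_frames = round(segment_audio_seconds * fps)    # 81 frames for ~3.24 s
length = (raw_frames // 4) * 4 + 1                 # snap to 4n+1 if the node expects it (assumption)
print(raw_frames, length)
```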
The sampler section uses CFGGuider, BasicScheduler, KSamplerSelect, RandomNoise, and SamplerCustomAdvanced. This gives the workflow more direct control over the generation process than a simplified one-click sampler node: CFGGuider sets the prompt guidance strength (CFG), BasicScheduler defines the step and sigma schedule, KSamplerSelect chooses the sampling algorithm, RandomNoise controls seed variation, and SamplerCustomAdvanced performs the main latent sampling step.
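Conceptually, the chain wires together as in the API-format sketch below; node ids and the numeric values (steps, cfg, sampler name) are placeholders rather than the workflow's actual settings.

```python
# Illustrative API-format sketch of the custom sampling chain.
sampler_section = {
    "noise":   {"class_type": "RandomNoise",    "inputs": {"noise_seed": 42}},
    "sampler": {"class_type": "KSamplerSelect", "inputs": {"sampler_name": "euler"}},
    "sigmas":  {"class_type": "BasicScheduler",
                "inputs": {"model": ["model", 0], "scheduler": "simple",
                           "steps": 20, "denoise": 1.0}},
    "guider":  {"class_type": "CFGGuider",
                "inputs": {"model": ["model", 0], "positive": ["pos", 0],
                           "negative": ["neg", 0], "cfg": 6.0}},
    "sample":  {"class_type": "SamplerCustomAdvanced",
                "inputs": {"noise": ["noise", 0], "guider": ["guider", 0],
                           "sampler": ["sampler", 0], "sigmas": ["sigmas", 0],
                           # assuming the InfiniteTalk node provides the initial latent
                           "latent_image": ["wan_infinitetalk", 0]}},
}
```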
The workflow also includes multiple generation stages. One stage can generate an initial talking clip from the start image and audio features. Another stage can continue from previous frames, creating a more native loop or continuation effect. This is the key reason the workflow is useful for long-duration content: it is structured around continuation rather than only one isolated short output.
After generation, VAEDecode converts the latent video back into frames. CreateVideo then builds the final video from frames and audio. This makes the workflow usable as a complete production route: input image and audio, generate speaking video, continue through looping logic, then export a playable video.
Main features:
- Single-person InfiniteTalk video workflow
- Native looping and continuation structure
- Audio-driven talking character generation
- Start image to speaking video
- Previous-frame continuation support
- Suitable for longer digital human videos
- WanInfiniteTalkToVideo core generation node
- Wan 2.1 InfiniteTalk model patch support
- wav2vec2 Chinese audio encoder support
- UMT5 Wan text encoder support
- Wan VAE decoding support
- Mask input support for controlled speaker region
- SamplerCustomAdvanced generation pipeline
- CFGGuider and BasicScheduler control
- CreateVideo output with audio
- Useful for AI presenters, virtual hosts, narration avatars, and talking character clips
Recommended use cases:
Single-person digital human video, talking avatar generation, AI presenter video, virtual host narration, product explanation video, education video narration, short-form talking character content, long-form voice-driven video, Chinese speech-driven talking head generation, AI video podcast avatar, social media narration, Bilibili creator content, YouTube talking video, RunningHub workflow publishing, and Civitai demo video showcases.
Suggested workflow:
Start by preparing a clean character image. A front-facing or slightly angled portrait is recommended. The character should have a clear face, visible mouth area, stable lighting, and enough resolution. Avoid images with blocked mouths, extreme side angles, heavy motion blur, very small faces, or complex face accessories that may interfere with lip movement.
Prepare the audio file next. Use clean speech with stable volume and limited background noise. The workflow can follow voice rhythm more reliably when the speech is clear. If the audio contains music, echo, multiple voices, or heavy background noise, the lip movement may become less stable. For the first test, use a short clean audio clip before trying longer narration.
Write a simple and stable prompt. For talking videos, the prompt should not overcomplicate the visual scene. Describe the character speaking naturally, facing the camera, with subtle head movement, clear facial expression, stable lighting, and no excessive body motion. If the character needs a specific style, include it clearly but avoid changing identity details too strongly.
Use the negative prompt to suppress face and mouth problems. Useful negative terms include distorted mouth, bad lip sync, flickering face, unstable eyes, broken teeth, duplicated mouth, deformed jaw, excessive head shaking, face drift, identity change, blurry face, and unnatural expression. This helps keep the talking video cleaner.
Generate the first segment from the start image and audio. Check whether the character identity is preserved, whether the mouth follows the voice, and whether the motion is stable. If the first segment is not good, adjust the prompt, source image, audio quality, or seed before continuing into longer output.
Use the continuation / previous-frame section for native looping. After the first segment is generated, use its ending frames or previous frames as context for the next segment. This helps the next generation start from a related visual state instead of restarting from the original still image every time. This is the main workflow logic for extending duration.
For longer videos, split the audio into manageable segments. Generate each segment in sequence, using continuation frames to maintain visual flow. This is usually more stable than trying to generate a very long video in one pass. It also makes it easier to fix one failed segment without regenerating the entire video.
Keep fps consistent across segments. The workflow uses CreateVideo with a 25 fps setting. If you change fps, make sure the audio duration and frame count still match properly. Inconsistent frame rate settings can cause timing drift, lip sync problems, or visible cuts between segments.
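A small sketch of both steps: splitting the narration into fixed-length chunks with ffmpeg and confirming each chunk maps to a whole number of frames at 25 fps. Durations and file names are example values.

```python
# Split narration into chunks and report the matching frame count per chunk.
import subprocess

fps = 25
chunk_seconds = 8
total_seconds = 40   # full narration length (example)

for i, start in enumerate(range(0, total_seconds, chunk_seconds)):
    subprocess.run([
        "ffmpeg", "-y", "-i", "narration.wav",
        "-ss", str(start), "-t", str(chunk_seconds),
        "-c", "copy", f"chunk_{i:02d}.wav",
    ], check=True)
    print(f"chunk_{i:02d}: {chunk_seconds * fps} frames at {fps} fps")
```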
Check each loop or continuation point carefully. Look for mouth continuity, face shape, eye direction, head position, lighting stability, and background consistency. If the transition feels sudden, use a better ending frame, reduce motion intensity, or generate a shorter continuation segment.
Use audio scale carefully. A stronger audio scale can make the character respond more clearly to the speech, but too much influence may create exaggerated mouth or facial movement. A lower audio scale may make the video more stable but less expressive. Test this value based on the voice style and desired performance.
For AI presenters and virtual hosts, keep the camera stable and avoid excessive action prompts. For character dialogue, you can allow more expression, but still keep the face readable. For product explanation videos, use clean background, stable lighting, and a calm delivery style. For social media clips, short segments with strong identity preservation are usually easier to control.
This workflow is designed for creators who need a practical InfiniteTalk pipeline for single-person long-duration talking video. It combines start image conditioning, audio feature extraction, Wan InfiniteTalk generation, previous-frame continuation, mask control, sampler control, VAE decoding, and final video creation into one workflow. It is especially useful for building repeatable AI talking-character videos without manually rebuilding every segment from scratch.
🎥 YouTube Video Tutorial
Want to know what this workflow actually does and how to start fast?
This video explains what the tool is, how to launch the workflow instantly, and shares my core design logic — no local setup, no complicated environment.
Everything starts directly on RunningHub, so you can experience it in action first.
👉 YouTube Tutorial: https://youtu.be/OjsHOyPtF0s
Before you begin, I recommend watching the video thoroughly — getting the full context helps you understand the tool faster and avoid common detours.
⚙️ RunningHub Workflow
Try the workflow online right now — no installation required.
👉 Workflow: https://www.runninghub.ai/post/2018256491320451073?inviteCode=rh-v1111
If the results meet your expectations, you can later deploy it locally for customization.
🎁 Fan Benefits: Register to get 1000 points, plus 100 points for each daily login, and enjoy 4090-level performance with 48 GB of VRAM!
📺 Bilibili Updates (Mainland China & Asia-Pacific)
If you’re in the Asia-Pacific region, you can watch the video below to see the workflow demonstration and creative breakdown.
📺 Bilibili Video: https://www.bilibili.com/video/BV1mwFLzTELL/
☕ Support Me on Ko-fi
If you find my content helpful and want to support future creations, you can buy me a coffee ☕.
Every bit of support helps me keep creating — just like a spark that can ignite a blazing flame.
👉 Ko-fi: https://ko-fi.com/aiksk
💼 Business Contact
For collaboration or inquiries, please contact aiksk95 on WeChat.
📦 Quark Netdisk Model Resources
I keep updating model resources on Quark Netdisk (夸克网盘):
👉 https://pan.quark.cn/s/20c6f6f8d87b
These resources are mainly intended for local users, to make creation and learning easier.
