Watch the full video first if you want to understand how this LTX 2.3 single-person digital human workflow works in practice. The video shows how one character image and one audio file can be turned into a stable speaking or singing digital human clip, how the new 10-second likeness system improves identity consistency, and how to run the workflow online without rebuilding a complex local ComfyUI environment.
This ComfyUI workflow is designed for LTX 2.3 single-person digital human generation. Its main purpose is to take a still portrait or character image, combine it with an audio track, and generate an audio-driven video where the character can speak, sing, or perform subtle digital-human motion while keeping the face, clothing, framing, and visual identity stable.
The workflow is built around the LTX 2.3 distilled 1.1 route. It uses the LTX 2.3 video checkpoint, Gemma3 fp8 text encoder, LTX Audio VAE, LTXVConditioning, LTXVImgToVideoConditionOnly, VBVR I2V LoRA, LTXVConcatAVLatent, LTXVSeparateAVLatent, ManualSigmas, CFGGuider, SamplerCustomAdvanced, LTXVLatentUpsampler, tiled decoding, and final video output. The graph also includes MelBandRoFormer audio processing, which makes the workflow more useful for singing, vocal-driven clips, and cleaner audio-based digital human generation.
The input structure is simple but practical: one image controls the character identity, and one audio file controls the speaking or singing duration. The workflow reads the audio duration and uses that timing to build the video generation structure. This reduces manual frame calculation problems and makes the generated clip easier to align with the audio.
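The timing logic described above can be sketched in a few lines of Python. This is a minimal illustration, not the workflow's actual node code: the 24 fps default and the rule that the frame count takes the form 8k + 1 are assumptions about the LTX pipeline, and `wav_duration_seconds` and `frames_for_audio` are hypothetical helper names.

```python
import math
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def frames_for_audio(duration_s: float, fps: float = 24.0, step: int = 8) -> int:
    """Smallest frame count of the form step * k + 1 that covers the audio.

    fps=24 and step=8 are assumptions, not values read from the workflow.
    """
    if duration_s <= 0:
        raise ValueError("audio duration must be positive")
    # Round away floating-point noise before taking the ceiling.
    needed = math.ceil(round(duration_s * fps, 6))
    k = math.ceil(max(needed - 1, 0) / step)
    return step * k + 1
```

Under these assumptions, a 10-second audio clip at 24 fps maps to 241 frames, so the video is never shorter than the audio it has to follow.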
The most important update is the 10-second likeness preservation system. The workflow includes 10S likeness guide and anchor modules, which are used to keep the character face and identity closer to the original reference image during the generation process. This is especially important for digital human videos, because even small changes in facial structure, eyes, mouth shape, or hairstyle can make the character feel inconsistent.
The generation pipeline uses a three-stage rendering structure. The first stage builds the initial motion and character composition. The second stage performs latent-space upscaling and stronger identity stabilization, using the 10S likeness guidance and weak anchor logic to reduce drifting. The third stage applies final high-definition refinement with lighter identity constraint, improving detail while trying not to destroy the established face and motion.
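The three stages can be laid out as a small resolution plan. The sketch below assumes a single 2x latent upscale between stage 1 and stage 2, a final stage that refines at the same size, and frame dimensions snapped to multiples of 32. None of these numbers are read from the graph, and `Stage`, `snap`, and `plan_stages` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    width: int
    height: int
    identity_constraint: str  # qualitative label taken from the description


def snap(value: float, multiple: int = 32) -> int:
    """Round to the nearest positive multiple (32 is an assumed constraint)."""
    return max(multiple, int(round(value / multiple)) * multiple)


def plan_stages(final_width: int, final_height: int, upscale: int = 2) -> list[Stage]:
    """Lay out three stages, assuming one latent upscale after stage 1."""
    base_w = snap(final_width / upscale)
    base_h = snap(final_height / upscale)
    out_w, out_h = base_w * upscale, base_h * upscale
    return [
        Stage("1: motion and composition", base_w, base_h, "not described"),
        Stage("2: latent upscale and stabilization", out_w, out_h,
              "10S likeness guide + weak anchor"),
        Stage("3: HD refinement", out_w, out_h, "lighter constraint"),
    ]
```

Because of the snapping, the planned output can differ slightly from the requested size: a 1280x720 request yields a 640x352 first stage and a 1280x704 final stage in this sketch.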
Compared with ordinary image-to-video workflows, this graph is better suited to digital human production. A basic I2V workflow can animate a portrait, but it may not handle audio duration, mouth movement, singing rhythm, identity consistency over longer clips, or HD refinement well. This workflow combines image reference control, audio-aware generation, likeness preservation, latent upscaling, and staged sampling into one reusable pipeline.
This workflow is suitable for AI presenters, virtual hosts, singing characters, talking avatars, product explainers, narration videos, short drama characters, Bilibili demonstrations, YouTube content, RunningHub releases, and Civitai workflow publishing.
Main features:
LTX 2.3 single-person digital human workflow
One image + one audio input
Speaking, narration, and singing video generation
LTX 2.3 distilled 1.1 checkpoint route
Gemma3 fp8 text encoder
LTX Audio VAE support
Audio Duration automatic timing logic
MelBandRoFormer audio processing support
LTXVImgToVideoConditionOnly image guidance
VBVR I2V LoRA for video stability
10S LikenessGuide identity preservation
10S LikenessAnchor weak anchor control
Three-stage rendering and HD refinement
LTXVLatentUpsampler high-resolution transition
AV latent concatenation and separation
Suggested workflow:
Prepare a clean single-person image first. The face should be clear, the mouth area should be unobstructed, and the lighting should be even. Then prepare a clean audio file. For talking-head narration, use clear speech with low background noise. For singing, use a clean vocal track whenever possible, because the mouth and expression follow the audio rhythm more easily. Load the image and audio into the workflow, then write a prompt describing the character, expression, camera framing, lighting, speaking style, and motion intensity. Start with a short test clip. If the face drifts, keep the 10S likeness modules active and tone down aggressive motion wording. If the result is too static, add subtle head movement, natural blinking, and controlled mouth performance to the prompt. Once the first stage is stable, continue into latent upscaling and final HD refinement.
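For a local ComfyUI install, the load-and-prompt steps above can be scripted against an API-format export of the graph. This is a hedged sketch: it assumes the export uses the core `LoadImage`, `LoadAudio`, and `CLIPTextEncode` nodes, that the positive prompt node is titled "Positive Prompt" (a hypothetical title), and that the server listens on ComfyUI's default address. `patch_workflow` and `build_request` are illustrative helper names, and the image and audio files are assumed to already sit in ComfyUI's input folder.

```python
import copy
import json
import urllib.request


def patch_workflow(workflow: dict, image_name: str, audio_name: str, prompt: str) -> dict:
    """Return a copy of an API-format workflow with new inputs filled in."""
    patched = copy.deepcopy(workflow)
    for node in patched.values():
        inputs = node.setdefault("inputs", {})
        kind = node.get("class_type")
        title = node.get("_meta", {}).get("title")
        if kind == "LoadImage":
            inputs["image"] = image_name
        elif kind == "LoadAudio":
            inputs["audio"] = audio_name
        elif kind == "CLIPTextEncode" and title == "Positive Prompt":
            # "Positive Prompt" is an assumed node title, not taken from the graph.
            inputs["text"] = prompt
    return patched


def build_request(workflow: dict, server: str = "http://127.0.0.1:8188") -> urllib.request.Request:
    """Build (but do not send) the POST request that queues a workflow."""
    body = json.dumps({"prompt": workflow}).encode("utf-8")
    return urllib.request.Request(
        f"{server}/prompt",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result of `build_request` to `urllib.request.urlopen` would queue the job on a running server. Keeping the patch step separate from the network call makes it easy to inspect the edited JSON before a short test run.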
⚙️ RunningHub Workflow
Try the workflow online right now — no installation required.
👉 Workflow: https://www.runninghub.ai/post/2061362758670708737?inviteCode=rh-v1111
If the results meet your expectations, you can later deploy it locally for customization.
🎁 Fan Benefits: Register to get 1000 points, plus 100 points for each daily login, and enjoy 48 GB 4090 performance!
📺 Bilibili Updates (Mainland China & Asia-Pacific)
If you’re in the Asia-Pacific region, you can watch the video below to see the workflow demonstration and creative breakdown.
📺 Bilibili Video: https://www.bilibili.com/video/BV1nVVr6QEd8/
☕ Support Me on Ko-fi
If you find my content helpful and want to support future creations, you can buy me a coffee ☕.
Every bit of support helps me keep creating — just like a spark that can ignite a blazing flame.
👉 Ko-fi: https://ko-fi.com/aiksk
💼 Business Contact
For collaboration or inquiries, please contact aiksk95 on WeChat.
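I keep model resources updated on Quark Netdisk (夸克网盘):
👉 https://pan.quark.cn/s/20c6f6f8d87b
These resources are mainly intended for local users, to support creation and learning.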