LTX2.3 3.5 Single-Person Digital Human Audio-Driven Talking and Singing Workflow

Watch the full video first if you want to understand how this LTX2.3 3.5 single-person digital human audio-driven workflow works in practice. The video shows how one portrait image or avatar frame can be animated by an audio file, creating a talking or singing digital human with mouth movement, facial performance, and staged high-definition video refinement.

This ComfyUI workflow is designed for single-person digital human generation using LTX2.3 3.5. Its main purpose is to turn a still avatar image into a talking or singing video driven by audio. Compared with a normal image-to-video workflow, this graph focuses more on mouth opening, lip movement, audio timing, facial stability, identity preservation, and final video polish.

The workflow is built around ltx-2.3-22b-dev-dare-ties-distilled-1.1.safetensors as the main LTX2.3 video checkpoint. The text encoding route uses gemma_3_12B_it_fp8_e4m3fn.safetensors through the LTX AV text encoder loader. The workflow also uses LTXVAudioVAE to build and decode the audio latent structure, allowing the audio route and video route to remain connected during generation.

The input section uses one avatar or portrait image as the visual reference. The image is preprocessed through LTXVPreprocess and then used by LTXVImgToVideoConditionOnly to guide the video generation. This helps the generated character stay closer to the original avatar, keeping the face, general identity, hairstyle, outfit, and composition more consistent.

The audio section is one of the most important parts of the workflow. The uploaded audio is used as the timing source for the video. Audio Duration is used to calculate the target length, and the workflow includes a tail-frame crop mechanism so the decoded result can be cut according to the audio-driven frame count. This prevents the final MP4 from keeping unnecessary overflow frames after the actual audio performance ends.
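
The frame-count and tail-crop logic described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the workflow's actual node math: the 24 fps default, the function names, and the rule that the sampler wants a frame count of the form 8n + 1 are all assumptions, so check them against the values in your own graph.

```python
import math


def target_frames(audio_seconds: float, fps: float = 24.0) -> int:
    """Frames needed to cover the audio clip (fps value is an assumption)."""
    # round() guards against float noise such as 0.1 * 30 = 3.0000000000000004
    return max(1, math.ceil(round(audio_seconds * fps, 6)))


def padded_frames(frames: int, block: int = 8) -> int:
    """Round up to the next length of the form block * n + 1 (assumed constraint)."""
    return math.ceil((frames - 1) / block) * block + 1


def crop_tail(decoded_frames: list, keep: int) -> list:
    """Drop overflow frames that extend past the end of the audio."""
    return decoded_frames[:keep]


if __name__ == "__main__":
    wanted = target_frames(5.2)          # frames that match the audio length
    generated = padded_frames(wanted)    # frames the sampler would produce
    frames = crop_tail(list(range(generated)), wanted)
    print(wanted, generated, len(frames))
```

The point of the sketch is the ordering: the generated length may exceed the audio-driven length, and the crop is applied after decoding so the saved video ends with the audio.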

The workflow also includes MelBandRoFormer audio processing, a source-separation step commonly used to isolate vocals from a mixed track. This makes the audio route better suited to speech or singing material, especially when the creator wants the digital human to follow the vocals rather than the backing music. The workflow is suitable for short voice clips, AI songs, character performance videos, creator avatars, virtual presenters, and talking-head style content.

The LoRA route is arranged for lip-sync and performance control. The workflow includes an IC LipDub LoRA route, with ltx-2.3-22b-ic-lora-lipdub-0.9.safetensors used as the mouth-opening and lip-performance control module. It also keeps a full staged LoRA structure for the single-audio digital human pipeline, separating the first-stage similarity chain, second-stage refinement chain, and third-stage lip-preserving high-definition chain.

The generation route uses a three-stage rendering structure. Stage 1 builds the base talking or singing motion from the avatar and audio condition. Stage 2 performs latent upscaling and refinement through the LTX2.3 spatial upscaler. Stage 3 focuses on final high-definition cleanup while preserving mouth movement and audio alignment. The final section decodes the video through VAEDecodeTiled, decodes audio through LTXVAudioVAEDecode, and exports the result through video output nodes.
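
As a rough way to reason about the three stages, the sketch below plans a resolution for each one. It is only an illustration: the stage labels, the snapping of width and height to multiples of 32, and the choice to keep Stage 3 at the Stage 2 resolution are assumptions, while the x2 factor comes from the spatial upscaler named in this workflow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    width: int
    height: int


def snap(value: int, multiple: int = 32) -> int:
    """Round down to a multiple (assumed size constraint), never below one multiple."""
    return max(multiple, (value // multiple) * multiple)


def plan_stages(base_width: int, base_height: int, scale: int = 2) -> list[Stage]:
    """Return a hypothetical per-stage resolution plan."""
    w, h = snap(base_width), snap(base_height)
    return [
        Stage("stage1_base_motion", w, h),                    # avatar + audio condition
        Stage("stage2_latent_upscale", w * scale, h * scale), # x2 spatial upscaler
        Stage("stage3_hd_polish", w * scale, h * scale),      # assumed: same size as stage 2
    ]


if __name__ == "__main__":
    for stage in plan_stages(540, 960):
        print(stage)
```

Planning the sizes up front makes it easier to estimate memory use before running the heavier Stage 2 and Stage 3 passes.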

Main features:

LTX2.3 3.5 single-person digital human workflow
Avatar image to talking video generation
Audio-driven mouth opening and performance
Suitable for speech, singing, and character presentation
ltx-2.3-22b-dev-dare-ties-distilled-1.1.safetensors support
Gemma3 FP8 text encoder support
LTXVAudioVAE audio latent route
LTXVConditioning video-aware prompt control
LTXVPreprocess avatar image preparation
LTXVImgToVideoConditionOnly image-guided generation
IC LipDub LoRA for lip-sync control
MelBandRoFormer audio processing support
Audio Duration based frame calculation
Tail-frame crop after decoding
Three-stage rendering structure
Stage 1 strong similarity generation
Stage 2 latent upscale refinement
Stage 3 lip-preserving HD polish
LTXVSeparateAVLatent audio-video split
LTXVConcatAVLatent audio-video reconstruction
ltx-2.3-spatial-upscaler-x2-1.1 support
VAEDecodeTiled memory-friendly decoding
LTXVAudioVAEDecode final audio decode
CreateVideo / SaveVideo / VHS final output
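
For a local deployment, it can help to confirm that the model files named above are present before loading the graph. The following is a minimal sketch: the file stems are taken from this page, but the function name, the role labels, and the idea of searching one models root recursively are assumptions. Because the extension of the spatial upscaler is not stated here, the check matches on the file stem only.

```python
from pathlib import Path

# File stems as named on this page; extensions are deliberately not assumed.
REQUIRED = {
    "main checkpoint": "ltx-2.3-22b-dev-dare-ties-distilled-1.1",
    "text encoder": "gemma_3_12B_it_fp8_e4m3fn",
    "lip-sync LoRA": "ltx-2.3-22b-ic-lora-lipdub-0.9",
    "spatial upscaler": "ltx-2.3-spatial-upscaler-x2-1.1",
}


def find_missing(models_root: Path) -> list[str]:
    """Return the roles whose file was not found anywhere under models_root."""
    present = {p.name for p in models_root.rglob("*") if p.is_file()}
    return [
        role
        for role, stem in REQUIRED.items()
        if not any(name.startswith(stem) for name in present)
    ]


if __name__ == "__main__":
    # "ComfyUI/models" is a hypothetical path; point this at your own install.
    for role in find_missing(Path("ComfyUI/models")):
        print(f"missing: {role}")
```

An empty result means every named file was found somewhere under the given folder; it does not verify that each file sits in the subfolder its loader node expects.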

Suggested workflow:

Prepare one clear portrait or avatar image first. A front-facing or slightly angled face works best, especially when the mouth area is visible and not covered. Then upload a clean audio file. Start with short speech or singing clips before testing longer performances. Write a prompt that describes the character, expression, lighting, camera framing, and performance mood, and avoid motion instructions that conflict with each other. If the mouth movement is too weak, strengthen the prompt around speaking, singing, natural lip movement, and expressive facial performance. If identity drift appears, reduce aggressive style changes and keep the prompt closer to the original avatar. For best results, use clean vocals, stable portraits, and moderate performance intensity.

⚙️ RunningHub Workflow

Try the workflow online right now — no installation required.
👉 Workflow: https://www.runninghub.ai/post/2068714653911445505?inviteCode=rh-v1111

If the results meet your expectations, you can later deploy it locally for customization.

🎁 Fan Benefits: Register to get 1000 points, plus 100 points for each daily login, and enjoy 4090 performance with 48 GB of VRAM!

📺 Bilibili Updates (Mainland China & Asia-Pacific)

If you’re in the Asia-Pacific region, you can watch the video below to see the workflow demonstration and creative breakdown.
📺 Bilibili Video: https://www.bilibili.com/video/BV1xw7F6XE7K/

☕ Support Me on Ko-fi

If you find my content helpful and want to support future creations, you can buy me a coffee ☕.
Every bit of support helps me keep creating — just like a spark that can ignite a blazing flame.
👉 Ko-fi: https://ko-fi.com/aiksk

💼 Business Contact

For collaboration or inquiries, please contact aiksk95 on WeChat.

I keep model resources updated on Quark Drive:
👉 https://pan.quark.cn/s/20c6f6f8d87b
These resources are mainly intended for local users, for creation and learning.

Files

ltx2335SinglePersonDigital_v10.json
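
Before loading the workflow file locally, a quick inventory of its node types shows which custom node packs are needed. The sketch below is hedged: it assumes the JSON follows one of the two common ComfyUI export layouts (a UI export with a top-level "nodes" list whose entries carry a "type", or an API export mapping node ids to objects with a "class_type"), and the function names are illustrative.

```python
import json
from collections import Counter


def node_types(workflow: dict) -> Counter:
    """Count node types in either assumed ComfyUI export layout."""
    if isinstance(workflow.get("nodes"), list):
        # UI export: {"nodes": [{"type": "..."}, ...]}
        return Counter(n.get("type", "?") for n in workflow["nodes"])
    # API export: {"<id>": {"class_type": "...", "inputs": {...}}, ...}
    return Counter(
        v["class_type"]
        for v in workflow.values()
        if isinstance(v, dict) and "class_type" in v
    )


def load_node_types(path: str) -> Counter:
    """Read a workflow JSON file and count its node types."""
    with open(path, encoding="utf-8") as fh:
        return node_types(json.load(fh))


if __name__ == "__main__":
    counts = load_node_types("ltx2335SinglePersonDigital_v10.json")
    for name, count in sorted(counts.items()):
        print(f"{count:3d}  {name}")
```

Node types that the local install does not recognize, such as the LTXV or MelBandRoFormer nodes, point to the custom node packs that still need to be installed.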
