Try the workflows online first:
Workflow: Text-to-Speech – LongCat AudioDiT TTS (Auto Translation)
Experience link: https://www.runninghub.ai/post/2039728380941176834/?inviteCode=rh-v1401
Workflow: Single-Person Voice Cloning – LongCat-AudioDiT (Auto Translation)
Experience link: https://www.runninghub.ai/post/2039728236652924929/?inviteCode=rh-v1401
Workflow: Two-Person Voice Cloning – LongCat-AudioDiT – Dialogue – Script – Screenplay
Experience link: https://www.runninghub.ai/post/2039728409407918081/?inviteCode=rh-v1401
Workflow: All-Purpose Image Pro – Text-to-Image – Single, Double, Triple, Quadruple Images – Image Editing
Experience link: https://www.runninghub.ai/post/2026244873988345857/?inviteCode=rh-v1401
Workflow: Lip-Sync Speaking & Singing – LTX2.3 Image-to-Digital Human – Auto Expansion – Module Optimization – No Subtitles
Experience link: https://www.runninghub.ai/post/2038618856104665090/?inviteCode=rh-v1401
Workflow: AA – Various Small Tools for Image, Audio, Video Processing (Continuously Updated)
Experience link: https://www.runninghub.ai/post/2027021102093967362/?inviteCode=rh-v1401
This set includes three ComfyUI workflows based on Meituan's open-source LongCat-AudioDiT, a SOTA-level TTS model with high timbre fidelity and fast inference.
Workflows included:
Text-to-Speech – Convert any text to speech. Supports BF16/FP32. More sampling steps generally give better quality, at the cost of longer generation time.
Single Voice Cloning – Upload a reference voice, write your target text, and generate cloned speech. Includes ASR (Qwen) for automatic transcription of the sample.
Dual Voice Cloning – Generate a dialogue between two speakers. Mark each line of the prompt with a speaker tag so the model can tell the two speakers apart.
Important notes:
⚠️ Numbers – Do NOT use Arabic numerals (e.g., 123). Always write numbers in their spoken Chinese form (e.g., "一百二十三"). Otherwise the output will be garbled.
Voice loudness – The workflow automatically normalizes loudness to avoid pops and clipping. Keep the default reduction amount unless your source is extremely loud (the waveform shows red in audio editors).
Model precision – BF16 works well; FP32 requires ~20GB VRAM.
Random seed – Changing the seed varies the output somewhat, but gender and tone cannot be specified directly.
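For the note on numbers above, converting digits by hand gets tedious on long scripts. The following is a minimal Python sketch, not part of the workflows, that rewrites runs of ASCII digits as spoken Chinese numerals before you paste the text into the prompt. The function names (int_to_chinese, spell_out_numbers) are hypothetical helpers made up for this example. It assumes plain non-negative integers below 10**12, drops leading zeros, and uses 二 rather than 两; decimals, phone numbers, and years read digit by digit still need manual editing.

```python
import re

DIGITS = "零一二三四五六七八九"
SMALL_UNITS = ["", "十", "百", "千"]   # units inside a four-digit section
BIG_UNITS = ["", "万", "亿"]           # units between four-digit sections


def _section_to_chinese(n):
    """Convert 0 < n < 10000 to Chinese, e.g. 105 -> 一百零五."""
    out = []
    zero_pending = False
    for pos in range(3, -1, -1):
        digit = (n // 10 ** pos) % 10
        if digit == 0:
            # Only remember an inner zero; leading zeros are ignored.
            if out:
                zero_pending = True
        else:
            if zero_pending:
                out.append("零")
                zero_pending = False
            out.append(DIGITS[digit] + SMALL_UNITS[pos])
    return "".join(out)


def int_to_chinese(n):
    """Convert an integer 0 <= n < 10**12 to its spoken Chinese form."""
    if n < 0 or n >= 10 ** 12:
        raise ValueError("supported range is 0 <= n < 10**12")
    if n == 0:
        return "零"
    # Split into four-digit sections, lowest section first.
    sections = []
    while n > 0:
        sections.append(n % 10000)
        n //= 10000
    out = ""
    for i in range(len(sections) - 1, -1, -1):
        sec = sections[i]
        if sec == 0:
            continue
        # A gap before this section (skipped section or leading zero) is read as 零.
        if out and (sec < 1000 or sections[i + 1] == 0):
            out += "零"
        out += _section_to_chinese(sec) + BIG_UNITS[i]
    # 10-19 (and 10万-19万, etc.) are read 十..., not 一十...
    if out.startswith("一十"):
        out = out[1:]
    return out


def spell_out_numbers(text):
    """Replace every run of ASCII digits in text with Chinese numerals."""
    return re.sub(r"[0-9]+", lambda m: int_to_chinese(int(m.group())), text)


print(spell_out_numbers("我有123个苹果"))
```

Run the script text through spell_out_numbers once, then read the result over before generating, since the right reading of a number depends on context.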
Recommended for:
Film dubbing, AI-generated dialogue scenes, podcast-style dual-voice content.
Combine with LTX 2.3 image-to-digital-human animation to create cinematic conversation scenes. For best results, use single-voice cloning for each fixed shot; dual-voice cloning may cause unwanted shot transitions in LTX 2.3.
Requirements:
ComfyUI with custom nodes (ASR, LongCat, etc.)
LongCat model files (see project page in comments)
Timestamps (in video):
Text-to-speech setup → Voice cloning → Dual dialogue → Integration with LTX 2.3
Enjoy making AI-powered film shorts! Feel free to ask questions below.
Description
LongCat-AudioDiT