LongCat TTS Workflows: Text-to-Speech, Single & Dual Voice Cloning for Film Dubbing

Try the workflows online first:

Workflow: Text-to-Speech – LongCat AudioDiT TTS (Auto Translation)

Try it: https://www.runninghub.ai/post/2039728380941176834/?inviteCode=rh-v1401

Workflow: Single-Person Voice Cloning – LongCat-AudioDiT (Auto Translation)

Try it: https://www.runninghub.ai/post/2039728236652924929/?inviteCode=rh-v1401

Workflow: Two-Person Voice Cloning – LongCat-AudioDiT – Dialogue – Script – Screenplay

Try it: https://www.runninghub.ai/post/2039728409407918081/?inviteCode=rh-v1401

Workflow: All-Purpose Image Pro – Text-to-Image – Single, Double, Triple, Quadruple Images – Image Editing

Try it: https://www.runninghub.ai/post/2026244873988345857/?inviteCode=rh-v1401

Workflow: Lip-Sync Speaking & Singing – LTX2.3 Image-to-Digital Human – Auto Expansion – Module Optimization – No Subtitles

Try it: https://www.runninghub.ai/post/2038618856104665090/?inviteCode=rh-v1401

Workflow: AA – Various Small Tools for Image, Audio, Video Processing (Continuously Updated)

Try it: https://www.runninghub.ai/post/2027021102093967362/?inviteCode=rh-v1401

This set includes three ComfyUI workflows built on LongCat-AudioDiT, Meituan's open-source, state-of-the-art TTS model with high timbre fidelity and fast inference.

Workflows included:

Text-to-Speech – Convert any text to speech. Supports BF16 and FP32 precision. More sampling steps generally give better quality, at the cost of longer generation time.

Single Voice Cloning – Upload a reference voice, write your target text, and generate cloned speech. Includes ASR (Qwen) for automatic transcription of the sample.

Dual Voice Cloning – Generate a dialogue between two speakers. Mark each line of the prompt with its speaker tag so the model can tell the two speakers apart.

Important notes:

⚠️ Numbers – Do NOT use Arabic numerals (e.g., 123). Always write numbers in their spoken Chinese form (e.g., "一百二十三"). Otherwise the output will be garbled.

Voice loudness – The workflow automatically normalizes loudness to avoid pops and clipping. Keep the default reduction amount unless your source is extremely loud (its waveform shows red in an audio editor).

Model precision – BF16 works well; FP32 requires ~20GB VRAM.

Random seed – Changing the seed varies the output somewhat, but you cannot specify gender or tone directly.
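
For the note on numbers above, the conversion can be scripted before pasting text into the workflow. The following is a minimal Python sketch, not part of the workflows themselves; the function names `int_to_chinese` and `spell_out_numbers` are hypothetical. It assumes plain non-negative integers below 10^12 and does not handle decimals, or phone numbers and years that are read digit by digit.

```python
import re

DIGITS = "零一二三四五六七八九"
SMALL_UNITS = ["", "十", "百", "千"]   # units inside one 4-digit section
BIG_UNITS = ["", "万", "亿"]           # units between 4-digit sections

def _section_to_chinese(n):
    """Spell out 1..9999, inserting a single 零 for inner runs of zeros."""
    out = ""
    zero_pending = False
    for pos in range(3, -1, -1):
        digit = (n // 10 ** pos) % 10
        if digit == 0:
            if out:                      # ignore leading zeros
                zero_pending = True
        else:
            if zero_pending:
                out += "零"
                zero_pending = False
            out += DIGITS[digit] + SMALL_UNITS[pos]
    return out

def int_to_chinese(n):
    """Spell out an integer 0 <= n < 10**12 in spoken Chinese form."""
    if n < 0 or n >= 10 ** 12:
        raise ValueError("only 0 <= n < 10**12 is supported in this sketch")
    if n == 0:
        return "零"
    sections = []
    while n > 0:
        sections.append(n % 10000)
        n //= 10000
    out = ""
    skipped = False
    for idx in range(len(sections) - 1, -1, -1):
        sec = sections[idx]
        if sec == 0:
            skipped = True
            continue
        # A gap before this section is read as 零 (e.g. 10001 -> 一万零一).
        if out and (sec < 1000 or skipped):
            out += "零"
        out += _section_to_chinese(sec) + BIG_UNITS[idx]
        skipped = False
    # 10..19 (and 10万..19万) are usually spoken without the leading 一.
    if out.startswith("一十"):
        out = out[1:]
    return out

def spell_out_numbers(text):
    """Replace every run of ASCII digits in text with its spoken form."""
    return re.sub(r"[0-9]+", lambda m: int_to_chinese(int(m.group())), text)

print(spell_out_numbers("我有123个苹果"))  # 我有一百二十三个苹果
```

Review the converted text before generating, since the right reading of a number depends on context.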
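
For the loudness note above, it can help to see what a gain reduction does to a reference clip. The sketch below is an illustration only: it uses simple peak normalization, which is an assumption and not necessarily the method the workflow's node applies. The function name `normalize_peak` and the -3 dBFS default are hypothetical, and samples are assumed to be floats where 1.0 is full scale.

```python
def normalize_peak(samples, target_dbfs=-3.0):
    """Scale samples so the loudest one sits at target_dbfs (0 dBFS = 1.0)."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)              # silence: nothing to scale
    target = 10 ** (target_dbfs / 20)     # convert dBFS to linear amplitude
    gain = target / peak
    return [s * gain for s in samples]

# A clip that exceeds full scale (1.2 > 1.0) is pulled back under it.
clip = [0.5, -1.2, 0.3]
safe = normalize_peak(clip)
print(max(abs(s) for s in safe) < 1.0)  # True
```

A lower target leaves more headroom, which matches the advice to reduce further only when the source is extremely loud.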

Recommended for:

Film dubbing, AI-generated dialogue scenes, podcast-style dual-voice content.

Combine with LTX 2.3 for image-to-digital human animation to create cinematic conversation scenes. For best results, use single cloning per fixed shot – dual cloning may cause unwanted frame transitions in LTX 2.3.
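
One way to follow the single-cloning-per-shot advice is to split a two-speaker script into one job per shot before running the single voice cloning workflow. The Python sketch below is a planning helper only and does not call ComfyUI or LongCat; the function name `plan_single_clone_jobs`, the job fields, and the file names are all hypothetical.

```python
from itertools import groupby

def plan_single_clone_jobs(turns, reference_audio):
    """Group consecutive lines by the same speaker into one job per shot.

    turns: list of (speaker, line) tuples in script order.
    reference_audio: dict mapping speaker -> reference voice file.
    """
    jobs = []
    for speaker, group in groupby(turns, key=lambda turn: turn[0]):
        if speaker not in reference_audio:
            raise KeyError(f"no reference voice for speaker {speaker!r}")
        text = " ".join(line for _, line in group)
        jobs.append({
            "shot": len(jobs) + 1,
            "speaker": speaker,
            "reference": reference_audio[speaker],
            "text": text,
        })
    return jobs

script = [
    ("A", "Did you bring the map?"),
    ("A", "We leave at dawn."),
    ("B", "It is in my bag."),
]
voices = {"A": "voice_a.wav", "B": "voice_b.wav"}  # hypothetical file names
for job in plan_single_clone_jobs(script, voices):
    print(job["shot"], job["speaker"], job["text"])
```

Each job then maps to one single-cloning run and one fixed shot in LTX 2.3, so a speaker change always falls on a cut.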

Requirements:

ComfyUI with custom nodes (ASR, LongCat, etc.)

LongCat model files (see project page in comments)

Timestamps (in video):

Text-to-speech setup → Voice cloning → Dual dialogue → Integration with LTX 2.3

Enjoy making AI-powered film shorts! Feel free to ask questions below.

Files

longcatTTSWorkflowsTextTo_v10.zip