    Qwen3-TTS Multi-Speaker Dialogue Workflow - v1.0
    NSFW


    This ComfyUI workflow is designed for multi-speaker dialogue generation with Qwen3-TTS. Instead of generating only a single cloned voice, this workflow builds a small voice role system inside ComfyUI, allowing different characters to speak different lines in one script. It is suitable for AI short drama dialogue, character conversation, comedy sketches, narration scenes, audio drama tests, digital human voice production, and AI video dubbing workflows.
    
    The core idea is simple: prepare several reference voices, extract clean vocals from each reference audio, automatically transcribe each reference voice with Whisper, build a role bank, then feed a multi-character script into the dialogue inference node. The workflow will generate a complete dialogue audio output where each speaker uses the corresponding cloned voice.
    
    This workflow is built around Qwen3-TTS and the Qwen3-TTS ComfyUI nodes. It uses FB_Qwen3TTSVoiceClonePrompt to create a voice clone prompt for each character, FB_Qwen3TTSRoleBank to bind cloned voices to role names, and FB_Qwen3TTSDialogueInference to generate the final multi-speaker dialogue. The included example uses role names such as 御姐 (a mature female voice), 小林 (Xiao Lin), and 旁白 (the narrator), showing how different character labels can be connected to different voice references.
    
    A key advantage of this workflow is that it turns voice cloning into a role-based dialogue pipeline. In a normal TTS workflow, you usually generate one voice at a time. If you want a conversation between several characters, you need to manually generate each line separately, manage different speakers, export many audio files, and assemble them in editing software. This workflow reduces that workload by using a script format with role names. Once the role bank is prepared, the dialogue inference node can read the script and generate the conversation automatically.
    
    The reference voice preparation stage uses LoadAudio, MelBandRoFormer, and Apply Whisper. LoadAudio imports the reference audio sample. MelBandRoFormer separates the vocal part from background music or mixed audio. Apply Whisper transcribes the extracted speech into reference text. This is important because voice cloning quality depends not only on the reference audio, but also on the accuracy of the reference transcript. When the reference audio and reference text match well, the cloned voice is usually more stable.
    
    MelBandRoFormer is especially useful when the reference audio is not perfectly clean. Many useful voice samples come from videos, movies, games, livestreams, or mixed audio clips. These clips may contain background music, ambience, sound effects, or room noise. The vocal separation stage helps isolate the human voice before it is used for cloning. This makes the reference signal cleaner and reduces the chance of the TTS model copying unwanted background elements.
    
    Apply Whisper is used to automatically create the reference transcript. This makes the workflow easier to use because the user does not need to manually type every reference line. The workflow uses Whisper Large V3 with automatic language detection, which makes it useful for Chinese, English, and mixed-language reference clips. However, it is still recommended to check the transcript before final generation. If Whisper makes mistakes, manually correcting the reference text can improve voice cloning consistency.
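
If you want to pre-clean a transcript before using it as reference text, a small helper like the one below can strip the non-speech tags Whisper sometimes emits. This is a hypothetical utility for illustration, not part of the workflow itself:

```python
import re

def clean_reference_text(transcript: str) -> str:
    """Normalize an ASR transcript for use as voice-clone reference text.

    Strips bracketed non-speech tags such as [music] or (laughs) that
    Whisper sometimes emits, then collapses runs of whitespace.
    """
    text = re.sub(r"[\[(][^\])]*[\])]", " ", transcript)  # drop [tags] / (tags)
    return re.sub(r"\s+", " ", text).strip()
```

Even with cleanup like this, a final manual read-through of the transcript is still worthwhile.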
    
    The role bank is the central management module. FB_Qwen3TTSRoleBank can connect multiple voice clone prompts and assign each one a role name. In the included setup, three roles are used: 御姐, 小林, and 旁白. The node also provides slots for more roles, so the workflow can be expanded to handle larger casts. This is useful for AI audio drama, short video skits, NPC dialogue, character storytelling, or multi-person narration.
    
    The final generation stage is handled by FB_Qwen3TTSDialogueInference. This node receives the role bank and a structured script. The script can contain lines such as “旁白:...”, “御姐:...”, and “小林:...”. The workflow reads each line, matches it to the corresponding role, generates speech with the correct cloned voice, inserts pauses between dialogue lines, and optionally merges the outputs into one final audio file. In the included setup, the pause time is set around 0.5 seconds, merge output is enabled, and batch generation is used for faster processing.
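
Conceptually, the inference stage behaves like the loop below. This is only an illustrative sketch: `synthesize` is a stand-in for the real Qwen3-TTS call, the 24 kHz sample rate is an assumption, and the actual node's internals may differ:

```python
# Conceptual sketch of the dialogue inference loop; `synthesize` stands in
# for the real Qwen3-TTS call and returns dummy samples here.
SAMPLE_RATE = 24_000  # assumed output rate; match your model's actual rate

def synthesize(voice: str, text: str) -> list[float]:
    """Placeholder for TTS: real code would synthesize `text` with `voice`."""
    return [0.1] * (len(text) * 100)

def render_dialogue(
    script: list[tuple[str, str]],
    role_bank: dict[str, str],
    pause_seconds: float = 0.5,
    merge: bool = True,
):
    """Generate each line with its role's voice; optionally merge with pauses."""
    pause = [0.0] * int(pause_seconds * SAMPLE_RATE)
    segments = [synthesize(role_bank[role], line) for role, line in script]
    if not merge:
        return segments            # one clip per line, for separate editing
    merged: list[float] = []
    for i, seg in enumerate(segments):
        if i:
            merged.extend(pause)   # silent gap between speakers
        merged.extend(seg)
    return merged
```

The key point the sketch captures is that each script line is routed to its role's cloned voice, and merging simply concatenates segments with silence in between.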
    
    This workflow is especially useful for AI video creators. A lot of AI video content needs voice, but single-speaker narration is not enough for every project. Short dramas, comedy scenes, character interaction videos, product storytelling, game-style dialogue, and digital human performances often need multiple speakers. This workflow gives creators a practical way to generate those multi-role audio tracks directly inside ComfyUI.
    
    It is also useful for testing character voice design. You can assign one voice to a narrator, another voice to a young male character, another voice to a mature female character, and more voices for additional roles. Then you can test the same script with different voice combinations. This makes it easier to build reusable audio assets for AI short films, Bilibili videos, YouTube skits, visual novels, game prototypes, and social media content.
    
    The workflow also includes LayerUtility: PurgeVRAM V2 nodes. These are used to clear cache and unload models between stages. Audio workflows with TTS, vocal separation, and Whisper can consume a lot of VRAM, especially when several reference voices are processed in one graph. PurgeVRAM nodes help reduce memory pressure and make the workflow more suitable for online deployment or cloud environments.
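
Outside ComfyUI, the same idea can be approximated in plain Python. This sketch only mirrors what a PurgeVRAM-style step does and is not the node's actual implementation:

```python
import gc

def purge_vram() -> None:
    """Drop unreferenced Python objects, then ask CUDA to release cached memory."""
    gc.collect()
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()   # release cached allocator blocks
            torch.cuda.ipc_collect()   # clean up inter-process CUDA handles
    except ImportError:
        pass  # torch not installed, so nothing GPU-side to clear
```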
    
    Main features:
    
    - Qwen3-TTS multi-speaker dialogue workflow
    
    - Multiple cloned voices in one ComfyUI graph
    
    - Role-based speaker management
    
    - FB_Qwen3TTSVoiceClonePrompt for individual voice reference setup
    
    - FB_Qwen3TTSRoleBank for character-role binding
    
    - FB_Qwen3TTSDialogueInference for full script generation
    
    - Supports narrator and multiple character voices
    
    - MelBandRoFormer vocal separation
    
    - Whisper Large V3 automatic transcription
    
    - LoadAudio reference voice input
    
    - Pause control between dialogue lines
    
    - Merge output option for final full dialogue audio
    
    - Batch generation support
    
    - PurgeVRAM nodes for better memory management
    
    - SaveAudio export for final audio output
    
    Recommended use cases:
    
    AI short drama dialogue, multi-character dubbing, audio drama production, comedy dialogue generation, character skit voiceover, visual novel dialogue, game NPC voice lines, digital human conversation, AI animation dubbing, narration with multiple roles, Bilibili and YouTube short video voice production, role-based TTS testing, voice style comparison, and ComfyUI audio workflow research.
    
    Suggested workflow:
    
    Start by preparing reference audio for each character. Each reference voice should be clear, stable, and preferably contain only one speaker. Short clips with clean speech and minimal background noise are usually better than long messy recordings. If the reference contains background music or sound effects, use the MelBandRoFormer stage to extract the vocal track.
    
    After loading each reference audio, let Whisper transcribe the vocal result. Check the transcript carefully. If the transcription is incorrect, correct the reference text before running the final voice clone prompt. Accurate reference text helps the model understand the speaker’s voice more reliably.
    
    Next, create one voice clone prompt for each character. Each FB_Qwen3TTSVoiceClonePrompt node stores the reference audio, reference text, model choice, device, precision, and voice prompt settings. In this workflow, the model choice is set to 1.7B, the device is set to automatic, and the precision is set to bf16. This setup is useful for balancing quality and performance.
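
In plain form, each voice clone prompt roughly corresponds to a settings bundle like the one below. The key names are illustrative only, not the node's actual widget names:

```python
# Illustrative only; the actual FB_Qwen3TTSVoiceClonePrompt widget names may differ.
voice_clone_prompt = {
    "model": "1.7B",          # model size used in this workflow
    "device": "auto",         # let the node pick CPU/GPU
    "precision": "bf16",      # balances quality and performance
    "reference_audio": "refs/narrator_vocals.wav",  # vocal-separated clip
    "reference_text": "checked Whisper transcript of the clip",
}
```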
    
    Then connect the character voices into the role bank. Give each role a clear name, such as Narrator, Girl, Boy, Host, Customer, Robot, Villain, or any custom character name. The role name must match the speaker labels used in the script. If the role name in the script does not match the role bank, the workflow may not assign the correct voice.
    
    Write the script in a structured format. Each line should begin with the character name, followed by a colon, then the dialogue. For example:
    
    Narrator: The city was still awake at midnight.
    Girl: Why are you standing under the streetlight?
    Boy: I am waiting for a signal from space.
    
    This format helps the dialogue inference node know which voice should speak each line. Keep each line reasonably short for better rhythm and more stable generation. Long monologues can be split into several shorter lines.
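
As a sketch of how such a script can be split into (speaker, line) pairs, the helper below accepts both ASCII and full-width colons and flags labels missing from the role bank. It is illustrative, not the node's actual parser:

```python
def parse_script(text: str) -> list[tuple[str, str]]:
    """Split 'Role: line' text into (role, dialogue) pairs.

    Accepts both the ASCII colon ':' and the full-width colon '：'
    so Chinese-language scripts work as well.
    """
    pairs = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        role, sep, dialogue = line.partition(":")
        if not sep:
            role, sep, dialogue = line.partition("：")
        if not sep:
            raise ValueError(f"no speaker label in line: {line!r}")
        pairs.append((role.strip(), dialogue.strip()))
    return pairs

def check_roles(pairs: list[tuple[str, str]], role_bank: dict) -> set[str]:
    """Return speaker labels used in the script but missing from the role bank."""
    return {role for role, _ in pairs} - set(role_bank)
```

Running a check like `check_roles` before generation catches the mismatched-label problem described above before any audio is synthesized.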
    
    Use pause_seconds to control the gap between speakers. A short pause creates a faster comedy rhythm. A longer pause creates a more cinematic or dramatic feeling. In the included setup, 0.5 seconds is a good general-purpose value. You can increase it for narration-heavy scenes or reduce it for quick banter.
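
Pause length maps directly to a count of silent samples. Assuming a 24 kHz output rate (check your model's actual rate), 0.5 s works out to 12,000 zero-valued samples between lines:

```python
def pause_samples(pause_seconds: float, sample_rate: int = 24_000) -> int:
    """Number of zero-valued samples inserted between two dialogue lines."""
    return int(pause_seconds * sample_rate)
```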
    
    Enable merge_outputs when you want one complete dialogue audio file. This is useful for video editing because you can export a single final track and sync it with images, animation, lip-sync video, or subtitles. If you want to edit each line separately, you can disable merging and export separate segments.
    
    For best results, test with a short script first. After confirming that every role speaks with the correct voice, expand the script. If one character sounds wrong, check the role name, reference audio quality, Whisper transcript, and voice clone prompt connection. If the rhythm feels unnatural, adjust the pause time and split long lines into shorter dialogue units.
    
    This workflow is designed for creators who need practical multi-character voice generation inside ComfyUI. It combines voice reference preparation, vocal separation, automatic transcription, role binding, dialogue parsing, multi-speaker synthesis, memory cleanup, and final audio export into one usable pipeline. It is especially useful for AI creators who want to produce short dramas, comedy conversations, digital human skits, character narration, and voice assets for video generation workflows.
    
    Responsible use note: only clone voices that you own, have permission to use, or are legally allowed to reproduce. Do not use this workflow to impersonate real people without consent, mislead audiences, commit fraud, or create deceptive audio. For public-facing content, it is recommended to disclose when a voice is AI-generated.
    
    🎥 YouTube Video Tutorial
    
    Want to know what this workflow actually does and how to start fast?
    This video explains what the tool is, how to launch the workflow instantly, and shares my core design logic — no local setup, no complicated environment.
    Everything starts directly on RunningHub, so you can experience it in action first.
    👉 YouTube Tutorial: https://youtu.be/iHM2VOtUAZ0
    
    Before you begin, I recommend watching the video thoroughly — getting the full context helps you understand the tool faster and avoid common detours.
    
    ⚙️ RunningHub Workflow
    
    Try the workflow online right now — no installation required.
    👉 Workflow: https://www.runninghub.ai/post/2015350699822944258/?inviteCode=rh-v1111
    
    If the results meet your expectations, you can later deploy it locally for customization.
    
    🎁 Fan Benefits: Register to get 1,000 points, plus 100 points per daily login, and enjoy RTX 4090 performance with 48 GB of memory!
    
    📺 Bilibili Updates (Mainland China & Asia-Pacific)
    
    If you’re in the Asia-Pacific region, you can watch the video below to see the workflow demonstration and creative breakdown.
    📺 Bilibili Video: https://www.bilibili.com/video/BV132zxBsEAX/
    
    ☕ Support Me on Ko-fi
    
    If you find my content helpful and want to support future creations, you can buy me a coffee ☕.
    Every bit of support helps me keep creating — just like a spark that can ignite a blazing flame.
    👉 Ko-fi: https://ko-fi.com/aiksk
    
    💼 Business Contact
    
    For collaboration or inquiries, please contact aiksk95 on WeChat.
    
    
    I keep updating model resources on Quark Netdisk:
    👉 https://pan.quark.cn/s/20c6f6f8d87b
    These resources are mainly intended for local users, for creation and learning.
    
    
    


    Workflows
    Qwen

    Details

    Downloads
    20
    Platform
    CivitAI
    Platform Status
    Available
    Created
    5/9/2026
    Updated
    5/14/2026
    Deleted
    -

    Files

    qwen3TTSMultiSpeaker_v10.zip

    Mirrors

    HuggingFace (1 mirror)
    CivitAI (1 mirror)