Qwen3-TTS Voice Cloning Workflow is a ComfyUI audio generation workflow designed for reference-based voice cloning, text-to-speech generation, vocal tone imitation, and multilingual narration production using Qwen3-TTS. The workflow allows users to upload a reference audio sample, extract the vocal part, automatically transcribe the reference speech, and then generate new speech using the cloned voice characteristics.
This workflow is built around Qwen/Qwen3-TTS-12Hz-1.7B-Base. It uses Qwen3TTSModelLoader to load the TTS model, Qwen3TTSVoiceClone to generate cloned speech, MelBandRoFormer to separate vocals from mixed audio, Whisper Large V3 to transcribe the reference audio, and SaveAudio nodes to export the final generated audio. The goal is to make voice cloning more practical inside ComfyUI, especially for creators who want to test narration, character dubbing, voiceover generation, audiobook-style speech, short video narration, and multilingual audio production.
The core logic is simple: upload a reference voice, clean the audio, extract the human vocal track, automatically transcribe the spoken reference text, then provide a new script for Qwen3-TTS to speak in a similar voice style. Because Whisper produces the reference transcript automatically, there is usually no need to type it by hand, which makes the workflow easier to use and better suited to fast testing.
The workflow includes several repeated voice cloning branches, which means users can test different reference voices, different scripts, and different languages inside one graph. Each branch follows a similar structure: LoadAudio imports the reference audio, MelBandRoFormer separates vocals from instruments or background music, Apply Whisper transcribes the vocal audio, CR Prompt Text provides the target text, Qwen3TTSVoiceClone generates the cloned voice, and SaveAudio exports the final result.
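As a quick orientation, one branch can be summarized as an ordered list of nodes. The snippet below is plain Python documentation of that structure, not the actual ComfyUI node API:

```python
# Ordered summary of one voice cloning branch. The node names match the
# workflow, but this listing is purely illustrative documentation.
BRANCH = [
    ("LoadAudio",          "import the reference audio clip"),
    ("MelBandRoFormer",    "separate vocals from music and background noise"),
    ("Apply Whisper",      "transcribe the vocal track with Whisper Large V3"),
    ("CR Prompt Text",     "provide the new target script"),
    ("Qwen3TTSVoiceClone", "generate speech in the cloned voice"),
    ("SaveAudio",          "export the final audio file"),
]
for node, role in BRANCH:
    print(f"{node:<20} {role}")
```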
MelBandRoFormer is an important part of this workflow. Many reference audio samples are not perfectly clean. They may contain music, ambience, background noise, film sound effects, or mixed audio. MelBandRoFormer helps separate the vocal component from the instrumental or background layer, giving the TTS voice cloning node a cleaner reference signal. A cleaner reference voice usually improves voice consistency, pronunciation stability, and tone similarity.
Whisper transcription is another useful part of the workflow. For voice cloning, the model often benefits from knowing what the reference speaker actually said. If the reference text does not match the reference audio, voice cloning can become less stable. By using Apply Whisper with Whisper Large V3, the workflow can automatically generate the reference transcript. This is especially useful for long samples, multilingual clips, dialogue lines, and reference audio downloaded from other production materials.
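For reference, a transcription step equivalent in spirit to Apply Whisper can be reproduced outside the graph with the Hugging Face pipeline. This is a stand-alone sketch, not the node's internal code, and the audio path is a hypothetical placeholder:

```python
from transformers import pipeline

# Generic Whisper Large V3 transcription via the Hugging Face ASR pipeline.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    chunk_length_s=30,  # chunked decoding for clips longer than 30 seconds
    device=0,           # GPU index; use device=-1 to run on CPU
)
result = asr("reference_vocals.wav")  # hypothetical path to the extracted vocals
print(result["text"])                 # review this transcript before generating
```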
The Qwen3TTSVoiceClone node is the main generation module. It receives the Qwen3-TTS model, the processed reference audio, the new target text, and the reference transcript. It then generates a new audio output using the cloned voice characteristics. The node also includes practical controls such as language mode, maximum generation token count, seed, and model unloading behavior. In the included setup, the language is set to automatic, the maximum generation token count is set to 2048, and the model runs on CUDA with fp16 precision for faster GPU inference.
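Summarized as data, the configuration described above looks roughly like this. The key names are illustrative labels for the node's widgets, not a documented API, and the seed value is a placeholder:

```python
# Illustrative snapshot of the Qwen3TTSVoiceClone settings in the included setup.
VOICE_CLONE_SETTINGS = {
    "model": "Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    "language": "auto",        # automatic language detection
    "max_new_tokens": 2048,    # maximum generation token count
    "seed": 0,                 # fix to reproduce a take, randomize to explore
    "device": "cuda",
    "dtype": "fp16",           # half precision for faster GPU inference
    "unload_model": True,      # assumed label for the node's unload control
}
print(VOICE_CLONE_SETTINGS)
```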
This workflow is useful for both Chinese and English speech generation. The example prompt sections include Chinese dramatic lines, Chinese narration, tongue-twister style testing, and English dialogue-style text. This makes the workflow suitable for testing emotion, rhythm, pronunciation, long-form narration, character performance, and cross-language voice behavior. Users can compare how the cloned voice handles different scripts, pacing, and emotional intensity.
The workflow is especially useful for AI video creators. Many AI video workflows can generate images, characters, and motion, but still need natural voiceover. This Qwen3-TTS workflow can provide the audio layer for AI short films, digital human videos, character dialogue, product narration, tutorial voiceovers, game NPC lines, cinematic trailers, and social media content. It can also be paired with lip-sync workflows, digital human workflows, or video editing pipelines.
Main features:
- Qwen3-TTS voice cloning workflow for ComfyUI
- Uses Qwen/Qwen3-TTS-12Hz-1.7B-Base
- Reference audio to cloned voice generation
- Text-to-speech generation with reference voice style
- MelBandRoFormer vocal separation
- Whisper Large V3 automatic transcription
- CR Prompt Text script input
- Qwen3TTSVoiceClone generation node
- Multiple voice cloning branches in one workflow
- Supports Chinese and English voice testing
- SaveAudio output for final generated speech
- CUDA + fp16 model loading setup
- Useful for narration, dubbing, dialogue, and AI video voiceover
- Suitable for RunningHub online use and ComfyUI local deployment
Recommended use cases:
Voice cloning tests, AI narration, short video voiceover, digital human speech, character dubbing, audiobook-style narration, game NPC dialogue, cinematic trailer voice, multilingual TTS testing, Chinese text-to-speech, English text-to-speech, AI film dialogue, lip-sync audio preparation, podcast-style synthetic voice, educational content narration, product explanation voiceover, and ComfyUI audio workflow testing.
Suggested workflow:
Start by preparing a clean reference audio file. A short clip with clear speech, stable volume, and minimal background noise usually works best. If the audio contains music or ambience, the MelBandRoFormer branch can help isolate the vocal part, but the result will still depend on the quality of the original recording. For best results, use a reference voice with clear pronunciation and limited overlapping sound.
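Before loading the clip, a quick offline check of duration, sample rate, and peak level can catch obvious problems. A minimal torchaudio sketch with a hypothetical file path (the graph itself imports audio through the LoadAudio node):

```python
import torchaudio

# Quick sanity check on the reference clip before running the graph.
waveform, sample_rate = torchaudio.load("reference_voice.wav")  # hypothetical path
duration_s = waveform.shape[1] / sample_rate
peak = waveform.abs().max().item()
print(f"{sample_rate} Hz, {duration_s:.1f} s, peak {peak:.2f}")
# A clip with clear speech, moderate length, and a peak well below 1.0
# (no clipping) is a better starting point for cloning.
```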
Load the reference audio into the workflow. The audio will pass through MelBandRoFormer to separate the voice from background audio. The extracted vocal track is then sent to Whisper for automatic transcription. Check the Whisper output before generation. If the transcript is wrong, manually correct it, because accurate reference text can improve voice cloning stability.
Write the new target text in the CR Prompt Text node. This is the script you want the cloned voice to speak. For short tests, use one or two sentences first. For longer narration, split the script into smaller sections. This helps reduce rhythm problems, long pauses, mispronunciation, or unstable emotional delivery.
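A simple way to split a long script is by sentence-ending punctuation. A minimal helper sketch, assuming a rough character budget per chunk:

```python
import re

# Split a long script into shorter generation chunks at sentence boundaries;
# purely illustrative pre-processing before pasting into CR Prompt Text.
def split_script(text: str, max_chars: int = 200) -> list[str]:
    sentences = re.split(r"(?<=[.!?。!?])\s*", text.strip())
    chunks: list[str] = []
    current = ""
    for s in sentences:
        if not s:
            continue
        if current and len(current) + len(s) + 1 > max_chars:
            chunks.append(current)
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks

print(split_script("First sentence. Second one! 第三句。", max_chars=20))
```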
Choose the language mode. The included workflow uses automatic language detection, which is useful for mixed testing. For more controlled results, you can specify the target language if your node setup supports it. Chinese dramatic monologue, English dialogue, narration, and tongue-twister testing can all be used to evaluate different aspects of the cloned voice.
Use the seed control for repeatable testing. If you find a result with good tone and rhythm, keep the seed fixed. If the delivery sounds unnatural, randomize the seed and test again. The maximum generation token count controls how long the generation can be, but longer scripts may be less stable, so shorter segments are recommended for production use.
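In the graph, the seed is a widget on Qwen3TTSVoiceClone; as a generic PyTorch illustration of the fixed-versus-random pattern:

```python
import random
import torch

seed = 123456                           # keep fixed to reproduce a good take
# seed = random.randint(0, 2**32 - 1)   # randomize to explore new deliveries
torch.manual_seed(seed)
random.seed(seed)
print(f"using seed {seed}")
```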
After generation, use SaveAudio to export the final result. You can then use the generated audio in video editing software, digital human workflows, lip-sync workflows, or AI short film production. For best production quality, do a light post-processing pass such as loudness normalization, noise reduction, EQ, or compression if needed.
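For the loudness normalization pass, one option outside ComfyUI is pyloudnorm. A minimal sketch, assuming a WAV export from SaveAudio and a target of -16 LUFS (a common choice for voice content):

```python
import soundfile as sf
import pyloudnorm as pyln

# Loudness-normalize the exported speech with pyloudnorm (ITU-R BS.1770).
data, rate = sf.read("generated_speech.wav")     # hypothetical SaveAudio output
meter = pyln.Meter(rate)
loudness = meter.integrated_loudness(data)
normalized = pyln.normalize.loudness(data, loudness, -16.0)
sf.write("generated_speech_norm.wav", normalized, rate)
# Note: normalization can push peaks above 1.0; apply a limiter or lower
# the target level if the output clips.
```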
For emotional voice cloning, the reference audio should contain the type of emotion you want to reproduce. A calm reference voice is better for narration. A dramatic reference voice is better for character monologues. A fast and clear reference voice is better for tongue-twister or pronunciation stress tests. The model can imitate vocal tone better when the reference sample already contains similar rhythm and emotion.
For responsible use, only clone voices that you own, have permission to use, or are legally allowed to reproduce. Do not use this workflow to impersonate real people without consent, mislead audiences, commit fraud, or create deceptive audio. For public-facing content, it is recommended to clearly disclose synthetic or AI-generated voice when appropriate.
This workflow is designed as a practical Qwen3-TTS voice cloning pipeline for ComfyUI users. It combines audio loading, vocal separation, automatic transcription, reference-based voice cloning, and audio export into one graph. It is useful for creators who need fast AI voice production, voiceover testing, character audio generation, and reusable narration assets for image, video, and digital human workflows.
🎥 YouTube Video Tutorial
Want to know what this workflow actually does and how to start fast?
This video explains what the tool is, how to launch the workflow instantly, and shares my core design logic — no local setup, no complicated environment.
Everything starts directly on RunningHub, so you can experience it in action first.
👉 YouTube Tutorial: https://youtu.be/iHM2VOtUAZ0
Before you begin, I recommend watching the video in full; getting the complete context helps you understand the tool faster and avoid common pitfalls.
⚙️ RunningHub Workflow
Try the workflow online right now — no installation required.
👉 Workflow: https://www.runninghub.ai/post/2014976486570205186?inviteCode=rh-v1111
If the results meet your expectations, you can later deploy it locally for customization.
🎁 Fan Benefits: Register to get 1000 points, plus 100 points per daily login, and enjoy RTX 4090 performance with 48 GB of memory!
📺 Bilibili Updates (Mainland China & Asia-Pacific)
If you’re in the Asia-Pacific region, you can watch the video below to see the workflow demonstration and creative breakdown.
📺 Bilibili Video: https://www.bilibili.com/video/BV132zxBsEAX/
☕ Support Me on Ko-fi
If you find my content helpful and want to support future creations, you can buy me a coffee ☕.
Every bit of support helps me keep creating — just like a spark that can ignite a blazing flame.
👉 Ko-fi: https://ko-fi.com/aiksk
💼 Business Contact
For collaboration or inquiries, please contact aiksk95 on WeChat.
📦 Model Resources (Quark Drive)
I keep model resources updated on Quark Drive (夸克网盘):
👉 https://pan.quark.cn/s/20c6f6f8d87b
These resources are mainly intended for local users, for creation and learning.
