Qwen3-TTS + RVC Ultimate Pack V2 (Director's Mode)

Qwen3-TTS + RVC Ultimate Pack V2 (Director's Mode) - v1.0 - Starter Pack (GTX

# 🎬 UPDATE V2.0 (Jan 29, 2026) - DIRECTOR'S MODE

The Ultimate Voice Workflow just got a massive upgrade.

Now integrating RVC (Retrieval-based Voice Conversion) directly inside ComfyUI.

🚀 3 Modes in 1 Workflow

This isn't just an update; it's the ultimate pack. You can switch between 3 distinct modes using the Fast Bypasser:

🎙️ Voice Design (Text-to-Speech): Create high-quality voices from scratch using prompts.
👯 Classic Cloning (Audio-to-Speech): The original V1 method. Quick and easy cloning using a reference audio file.
🎭 Director's Mode (Qwen + RVC): [NEW] The advanced mode where you design the performance and paint the voice texture using RVC models.

(Watch the video above for a full tutorial on how to use the Director's Mode)

---

### 🤯 The Problem with Standard Cloning

Usually, when you clone a voice, the AI tries to copy the accent and the tone of the reference audio.

* If your reference is boring, the result is boring.

* If your reference has a heavy accent, the result will have it too.

### 💡 The Solution: Director's Mode (V2)

This workflow separates the Acting from the Timbre.

1. Direct the Actor: Use Qwen3's "Voice Design" node to generate the perfect performance (whispers, shouts, sadness, speed) using a generic high-quality voice.

2. Apply the Mask: The workflow automatically feeds that performance into RVC, which applies the target character's voice (e.g., Michael Jackson, Darth Vader, or your own) over the performance.

Result: Perfect acting, perfect character voice, zero accent bleed.

---

## 🚀 What's New in V2?

* ✅ RVC Integration: Load .pth and .index models directly in ComfyUI.

* ✅ Director's Mode: A specific group set up to pipe Qwen3 output into RVC.

* ✅ Smart Settings: Optimized Pitch, Index, and Protection settings for realistic results.

* ✅ Low VRAM Optimized: Still runs perfectly on a GTX 1060 (6GB).

* ✅ Bypass Groups: Easily toggle RVC on/off to save resources while testing prompts.

---

## ⚠️ BEFORE YOU RUN (Important)

When you load this workflow, some nodes might turn RED. This is normal!

It happens because the workflow is looking for my audio files and my RVC models.

To fix it:

1. Load Audio Node: Upload your own reference audio.

2. Load RVC Model Node: Select your own .pth and .index files (you need to download RVC voice models and put them in your ComfyUI/models/rvc folder).

---

## ⚙️ Requirements

To make the magic happen, you need these Custom Nodes (Install via ComfyUI Manager):

1. ComfyUI-Qwen3-TTS (by DarioFT) - The brain.

2. ComfyUI-RVC (or similar RVC suite) - The voice changer.

3. rgthree-comfy - For the bypass switches.

---

## 💡 How to Use (Step-by-Step)

1. Voice Design (Text-to-Speech) - (Blue Group)

- Type your text.

- Describe the acting in the prompt box (e.g., "A terrified whisper, breathing heavily").

- Generate the audio to check the performance.

2. RVC (Director's Mode) - (Purple Group)

- Enable the RVC Group using the Fast Bypasser on the left.

- Load your target voice model (e.g., Deadpool.pth).

- 🧠 SMART SETTINGS (Don't guess!):

- I included a note node inside the workflow called "🤔 How to use this".

- Copy the prompt from that note and paste it into ChatGPT, Gemini, or Grok.

- The LLM will analyze your character and give you the exact Pitch, Index, and Qwen Instructions to get the best result.

- Watch the video at 03:05 to see this in action!

---

### ❤️ Support the Project

If this workflow saved you time or improved your projects:

👍 *Thumbs Up** and Review (It helps a lot with visibility!)

⚡ *Buzz:** If you are feeling generous, some Buzz helps me test new models and create V3!

Enjoy being the Director!

@Video_Maker

Description

Initial release.

- Includes two workflows: Voice Design (Text-to-Speech) and Voice Cloning (Audio-to-Audio).

- Optimized settings for 6GB VRAM GPUs (tested on GTX 1060).

- Uses the standard 1.7B Model (no GGUF needed).

FAQ

Comments (11)

carlitajohn89931Jan 26, 2026· 2 reactions

CivitAI

is the optimized version for low end pc does degrade quality be honest, im seeking the highest quality no matter how resources it takes what u recommend?

Video_Maker

Author

Jan 26, 2026· 1 reaction

I just run some tests and the quality of voice cloning in the 0.6B version is so similar to the 1.7B i couldn't tell the difference. The only thing I noticed was a slight cadency difference in the speech pattern.

LabernethJan 29, 2026· 4 reactions

CivitAI

Good workflow. Thumbed!

Both workflows work pretty well, but how do I add emphasis to the voice cloning part of your workflow? Like speak sad, agressive, etc., because there is no "Instruct" input.

Video_Maker

Author

Jan 29, 2026· 1 reaction

That's some crazy timing! 😅 I was actually already working on V2 to solve exactly this limitation, and your comment gave me the extra motivation to finish and release it sooner. Since you nailed the exact issue I was fixing, I even featured your comment in the new explanation video, read by Qwen3 TTS obviously! (I hope you like the voice I gave you😂) Check out the Director's Mode in the new update—it does exactly what you wanted (adds emotion/instruct to clones).

LabernethJan 30, 2026· 1 reaction

@Video_Maker Looks promising and it works..., kind of... I get an error: VoiceFixerNode - Failed to download VoiceFixer models. I googled it but couldn't find a solution. Can you help me out?

Video_Maker

Author

Jan 30, 2026· 1 reaction

@Laberneth Hey! Thanks for reporting. That error happens when the node fails to auto-download the cleaning model from HuggingFace (server timeout or permission issues).

The Quick Fix: You don't strictly need that node! It's just an optional denoiser. Just select the VoiceFixerNode, press Ctrl+M to Mute/Bypass it. The workflow will work perfectly fine without it. Let me know if that gets you running!

LabernethJan 30, 2026· 1 reaction

@Video_Maker Thank you. Ya. It works without the Fixer node.

specsix381Feb 25, 2026

@Video_Maker Awesome workflow! I am confused by this statement.

"Check out the Director's Mode in the new update—it does exactly what you wanted (adds emotion/instruct to clones)."

I do not see anywhere in the workflow that to add emotion to a pre-cloned voice. I only see the RVC section which is for pre-trained RVC voices, not the clone that we just created in the workflow. What am I missing here?

Video_Maker

Author

Feb 25, 2026

@specsix381 i explain in the video, the cloning function on qwen tts can sound monotone in comparison with the voice designed ones (where you can request specific details like "happy" or "wispering" for example) so the rcv part is meant to fix that aspect. You design a voice that is similar to your target voice and after it you run this audio trough the RCV part.

specsix381Feb 25, 2026

@Video_Maker Designing voice via text to mimic the target voice is almost impossible. That is the entire point of using reference audio of the exact voice. Voices are very nuanced - not something you can just get right via a description. Thanks for the effort anyways.

Video_Maker

Author

Feb 26, 2026

@specsix381 The designed voice just needs to be "close enough" so it shines with RVC. In the video I have in the workflow description I used Chat GPT to help prompting in voice design to achive something that has the characteristics of Michael Jackson and the "magic" happens in RVC with the model trained on his voice. The downside of this is that you need to find a RCV model of the voice you're trying to clone, so if the person you're cloning doesn't have a RVC trained on their voice you're limited to the voice clone side of the workflow or you can look into training an RVC...

Workflows

Qwen

by Video_Maker

Download (Beta) View on CivitAI