Changelog
Version 1.0.3: Connected both steps so no more re-uploading is required. Just upload your video in Step 1 and hit Run.
Version 1.0.2: Changed VHS nodes to VHS ffmpeg nodes to avoid color drift (thank you LastAssignment). Also changed FPS flow from 24 to 25 to more closely align to MMAudio specs.
Version 1.0.1: RIFE Group output was set to 8fps by accident. Changed it to 24fps.
Version 1.0: Initial release
A TRIBUTE TO GOONERS EVERYWHERE
Your WAN 2.2 video is great. It looks awesome. But where's the sound? We moved from images to videos, and WAN 2.2 is incredible for video. The missing piece...AUDIO!
This is my first article ever, so I'm sorry if I made any mistakes. Please leave a comment if I've made an error or if you need any help. For your reference, I'm running:
ComfyUI 0.3.68
Torch 2.9
CUDA 13
Python 3.13.9
Sage Attention 2.2
NVIDIA 5070 Ti (16gb vram)
And here are the custom nodes (3 in total):
ComfyUI-VideoHelperSuite 1.7.7 (https://github.com/Kosinkadink/ComfyUI-VideoHelperSuite)
ComfyUI-MMAudio Nightly (https://github.com/kijai/ComfyUI-MMAudio)
I recommend manually git cloning this node pack into your /ComfyUI/custom_nodes folder and then installing the requirements.txt file using your embedded Python. I'm on portable Comfy, so the command would look something like this:
"C:\ComfyUI\python_embeded\python.exe" -m pip install -r "C:\ComfyUI\ComfyUI\custom_nodes\ComfyUI-MMAudio\requirements.txt"
ComfyUI-VFI Unknown (https://github.com/GACLove/ComfyUI-VFI)
I think there's a more popular RIFE custom node that a lot of other people use, but I couldn't figure out how to get fractional multiples for interpolation with it (16 -> 25fps is a ~1.56x interpolation). This node allows it.
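To make the fractional multiple concrete, here is a minimal arithmetic sketch. It only computes the ratio and an approximate output frame count; it does not describe how the ComfyUI-VFI node works internally. The 81-frame clip length is a hypothetical example value, not something taken from the workflow.

```python
from fractions import Fraction


def interpolation_factor(source_fps: int, target_fps: int) -> Fraction:
    """Exact ratio between the target and source frame rates."""
    return Fraction(target_fps, source_fps)


def frames_after_interpolation(source_frames: int, source_fps: int, target_fps: int) -> int:
    """Approximate frame count that keeps the clip duration unchanged."""
    duration = Fraction(source_frames, source_fps)  # clip length in seconds
    return round(duration * target_fps)


factor = interpolation_factor(16, 25)
print(factor, float(factor))                    # 25/16 1.5625
# Hypothetical 81-frame clip at 16fps (about 5 seconds)
print(frames_after_interpolation(81, 16, 25))   # 127
```

Because 25/16 is not a whole number, a node that only offers 2x or 4x multipliers cannot hit 25fps directly, which is why a node with explicit source and target FPS inputs is convenient here.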
Onto the workflow...
------------------------------------
This workflow handles two jobs:
Fix WAN 2.2’s native 16fps output by interpolating it to 25fps with RIFE.
Generate synced audio with MMAudio using the final 25fps video.
The setup is plug-and-play. Drop in your WAN video → interpolate → feed it into MMAudio → get synced output. The included notes explain the reasoning for FPS, step settings, and seed behavior.
What this workflow covers:
RIFE interpolation from 16 → 25 fps.
MMAudio sampler
After some further testing, 50-100 steps work well. The node runs pretty fast in general, and it's also worth experimenting with CFG (4.5 - 8). 100 steps and CFG 8 work well for high-quality output and better prompt adherence.
Automatic audio + video combine at 25fps.
Optional re-interpolation afterward if you want 30fps+ output.
You can plug your finished 25fps video into the 'Step 1: Rife Interpolation' group and just change the 'source_fps' to 25 and the 'target_fps' to 30.
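As a rough illustration of what changing 'source_fps' to 25 and 'target_fps' to 30 means, the sketch below maps each output frame back to a position between two source frames. This is a plain timestamp mapping written for illustration; the function name is hypothetical and the RIFE node may choose its frame positions differently.

```python
from fractions import Fraction


def source_positions(num_source_frames: int, source_fps: int, target_fps: int):
    """For each output frame, return (left_index, right_index, blend).

    blend is how far the output frame sits between the two source frames
    (0.0 means exactly on the left frame).
    """
    step = Fraction(source_fps, target_fps)  # source frames advanced per output frame
    last = num_source_frames - 1
    positions = []
    n = 0
    while n * step <= last:
        pos = n * step
        left = int(pos)                      # pos is never negative, so this is floor
        right = min(left + 1, last)
        positions.append((left, right, float(pos - left)))
        n += 1
    return positions


# 6 source frames at 25fps re-timed to 30fps
for left, right, blend in source_positions(6, 25, 30):
    print(left, right, round(blend, 3))
```

Only every sixth output frame lands exactly on a source frame in the 25 to 30 case; the rest are synthesized in between, so expect slightly softer motion than the 25fps version.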
Required MMAudio files
Download all of these into:
ComfyUI/models/mmaudio
MMAudio NSFW Model (fine-tuned off the base model)
MMAudio VAE (fp16)
MMAudio Synchformer (fp16)
https://huggingface.co/Kijai/MMAudio_safetensors/resolve/main/mmaudio_synchformer_fp16.safetensors
MMAudio CLIP Encoder (fp16)
Nvidia BigVGAN v2 44kHz 128band 512x
This seems to be required for MMAudio to work. You can manually download all the files, git clone the repo, or use the HuggingFace CLI tool (huggingface-cli download nvidia/bigvgan_v2_44khz_128band_512x --local-dir ComfyUI/models/mmaudio/bigvgan_v2_44khz_128band_512x). The repo should end up as its own folder inside ComfyUI/models/mmaudio.
https://huggingface.co/nvidia/bigvgan_v2_44khz_128band_512x
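If nodes fail to load, a quick sanity check of the folder can rule out misplaced files. The sketch below is only an assumption-laden helper: it checks for the one file name given above (the Synchformer), for the BigVGAN repo folder, and for at least four .safetensors files in total (model, VAE, Synchformer, CLIP encoder, assuming all four are distributed as .safetensors). The function name and the relative path are hypothetical; adjust the path to your install.

```python
from pathlib import Path

# The only exact names stated in this article; the other files vary by download.
KNOWN_FILE = "mmaudio_synchformer_fp16.safetensors"
BIGVGAN_DIR = "bigvgan_v2_44khz_128band_512x"


def check_mmaudio_folder(folder) -> list:
    """Return a list of human-readable problems; an empty list means it looks complete."""
    folder = Path(folder)
    if not folder.is_dir():
        return [f"missing folder: {folder}"]
    problems = []
    if not (folder / KNOWN_FILE).is_file():
        problems.append(f"missing file: {KNOWN_FILE}")
    if not (folder / BIGVGAN_DIR).is_dir():
        problems.append(f"missing BigVGAN folder: {BIGVGAN_DIR}")
    weights = list(folder.glob("*.safetensors"))
    if len(weights) < 4:  # assumption: model, VAE, Synchformer, CLIP encoder
        problems.append(f"expected at least 4 .safetensors files, found {len(weights)}")
    return problems


if __name__ == "__main__":
    # Hypothetical relative path; run from the directory that contains ComfyUI/
    for line in check_mmaudio_folder("ComfyUI/models/mmaudio") or ["looks complete"]:
        print(line)
```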
Bonus
Once you've created a good MMAudio track, there are some further steps you can take depending on what you'd like to create.
1. Import your audio/video into some type of software (CapCut/Shotcut) and layer on some music in the background. I've done this with a few of my videos. I added a 'radio' filter to make it seem like the music was kinda tinny and playing in the background.
2. Layer other audio tracks alongside the NSFW audio track. You can see KaptainSisay very elegantly did something like that here (https://civarchive.com/images/110700679)
Comments (85)
Thank you for this, it's great! Working well on a portable install, takes around 20-30s to generate a 7s audio clip.
Do you have any advice for improving the audio quality once I get some that i like? Like, is there a way to essentially "upscale" the audio after generation?
automatic synched audio generated? that's some wan 2.5 level stuff :)
i added a fast group bypasser switch, and put RIFE in its own group so it didn't redo RIFE every time i ran mmaudio
it would be nice if you can add an nsfw qwen video describer then feed it the information you want to extract just for MMaudio, like "explain the video in a very concise way" , i always have problems finding the one that reads nsfw content.
anyone tested this with 8gb vram?
any prompt examples?
I don't get it. Somehow the workflow isn't fully connected. The image input on "Video Combine" is flashing. Where does it go? I've only just started with videos :)
lol i literally cannot find the output. any idea where it could be? it's not where it's supposed to be...
got the error:
**error solved**
error(s) in loading state_dict for MMAudio: Missing key(s) in state_dict: "clip_input_proj.2.w1.weight", "clip_input_proj.2.w2.weight", "clip_input_proj.2.w3.weight", "text_input_proj.2.w1.weight", "text_input_proj.2.w2.weight", "text_input_proj.2.w3.weight". Unexpected key(s) in state_dict: "clip_input_proj.1.w1.weight", "clip_input_proj.1.w2.weight", "clip_input_proj.1.w3.weight", "text_input_proj.1.w1.weight", "text_input_proj.1.w2.weight", "text_input_proj.1.w3.weight". size mismatch for t_embed.mlp.0.weight: copying a param with shape torch.Size([896, 256]) from checkpoint, the shape in current model is torch.Size([896, 896])
**error was caused by outdated nodes: update the nodes from the nodes list tab and double-check the model name**
Even a braindead Degen like myself got this to work. You're an absolute legend m8
nice but has anyone had any luck prompting out the moaning and chinese speaking?
Bro, how do I do video at 60 fps? The sound is really bad and doesn't match the video. But at 25-30 fps, everything is fine. Could you help or tell me what's wrong?
this looks great. Is it possible to always get the same voice? Or maybe that requires training with a dataset with just the one voice?
This is so simple and quick to use that I am kicking myself for not using it sooner.
Perfect workflow. A tip I just discovered: type just "music" in the negative prompt and you will get a clearer result more often
Not sure I get it, but why final video combine node produces 3 files - original video, image (last frame?) and video with sound attached. Is there a way to save only the final video with sound?
Please fix the output location to export into default "output" folders, rather than dumping to the appdata\local\temp\etc. directories.
All you need to do is enable the 'save_output' toggle on the final Video Combine node. Then the output video will save into /output
the RIFEInterpolation custom node seems to be missing for me, and I can't seem to find it on the web to download separately. I have found "RIFE VFI" nodes, but I don't think that's the same thing as what's needed for this workflow.
https://github.com/GACLove/ComfyUI-VFI
git clone this repo into your /custom_nodes folder
@SeoulSeeker thanks!
I got this error:
'audio_input_proj.0.bias'
DM me, I need more information to help
so i dont have ComfyUI/models/mmaudio ? So where do i put the audio files
just make a new folder in /models called mmaudio and put all the files there!
hello sir, everything was working perfectly just the other day, but now it's not working. i think there was a node update. (i'm new) i would sincerely appreciate a workflow update. thank you for your hard work as always
try updating ComfyUI to the latest version
It feels like the model recognizes 2.5D, 3D, and realistic input images much better than pure 2D ones. What do you think?
Probably, I wouldn't reasonably expect the training set to include a lot of 2D content, mostly real-life videos if anything
So I use the smoothing WF to create my wan2.2 animations. Does anyone have a WF that incorporates mmaudio smoothly into a full gen workflow, rather than a standalone WF to run?
You can just copy the content of the workflow and add it after your VAE decode. So your decoded images will get passed into the input of the MMAudio sampler
@SeoulSeeker Cool ty. Would the prompt for the wan animation ever be adequate for the prompt for the audio? Or will that always require a manual/independent prompt?
@Delavestra best practice would probably be an independent prompt I’d say. You really want to provide a varied and complete spectrum of what would be heard in the audio. I use two wildcard nodes for oral sex and regular sex, I might just update the workflow description with that info because it works pretty well and basically eliminates the need to type a prompt
It keeps giving me this error: MMAudioFeatureUtilsLoader
BigVGAN._from_pretrained() missing 2 required keyword-only arguments: 'proxies' and 'resume_download'
Any tip?
Do you have a folder called 'bigvgan_v2_44khz_128band_512x' in /models/mmaudio?
@SeoulSeeker yes
I'm seeing the same issue. This is new. This workflow worked for me about a month ago.
I also have nvidia/bigvgan_v2_44khz_128band_512x
Fixed by cloning https://github.com/kijai/ComfyUI-MMAudio in custom_nodes
@dolphinblowgal382 Still having the issue after cloning https://github.com/kijai/ComfyUI-MMAudio in custom_nodes
@elmejor9369 a newer comment from a person with the same problem just said 'yah i had to manualy clone the MMaudio repo not install from manager'
@SeoulSeeker -nope, still nothing, i've clone it manually and still gives the same error.
@elmejor9369 are you on a new-ish Comfy version? Try updating to the latest stable version
Here's the fix, download the v3 workflow and the instructions are in the folder: https://civitai.com/models/2105650/wan-22-svi-2-pro-nsfw-workflow-with-10-steps-i2v-model-long-videos-up-to-1-minute-nsfw-loras-fp8-gguf-upscale-with-interpolation-mmaudio
i keep getting
BigVGAN._from_pretrained() missing 2 required keyword-only arguments: 'proxies' and 'resume_download'
i did download all 4 models clip vae and such
yah i had to manualy clone the MMaudio repo not install from manager
@migero731 thank you for replying with your fix!
@SeoulSeeker np nothing like learning from your mistakes :D
btw is 100 samples ok ? it seems to produce bad sounds 300 seems better to me
@migero731 it all depends on the seed and your prompt, ideally higher steps should be better but there are often diminishing returns and sometimes it can overcook the result. i usually do 50-100 steps myself
Can I change the voice, or is it only randomly selected via a seed?
Keeping a fixed seed should help a little, but I haven't found a way to reliably control the voice
i created the mmaudio folder and placed 4 files in there. however the nodes are still missing? did i do something wrong?
Me too. Reinstalling via the custom nodes manager gives the same error.
You need to install all missing nodes from the 'ComfyUI custom nodes manager'.
OK, I finally fixed it with this command.
1. Open the 'python_embeded' folder. Ex: " C:\ComfyUI\ComfyUI-Easy-Install\python_embeded "
2. Type cmd in the File Explorer address bar at the top and press Enter. Ex: replace " C:\ComfyUI\ComfyUI-Easy-Install\python_embeded " with " cmd ".
3. Paste this command. Ex: "python.exe -m pip install -r C:\ComfyUI\ComfyUI-Easy-Install\ComfyUI\custom_nodes\comfyui-mmaudio\requirements.txt"
This fixes: BigVGAN._from_pretrained() missing 2 required keyword-only arguments: 'proxies' and 'resume_download'
OK, so for anyone really new to this like me who doesn't understand what people might consider basic (cloning), here is a fix for the above error.
Open your Comfy directory, go into custom_nodes (if you're using Pinokio like me: C:\pinokio\api\comfy.git\app\custom_nodes), then find and delete "ComfyUI-MMAudio". (If it's not there, skip that step.)
Then right-click anywhere inside the custom_nodes window and select "Open in Terminal" to open a command prompt. Type in "git clone https://github.com/kijai/ComfyUI-MMAudio"
(this should then install). It worked fine for me after this.
If you're still having issues, check that the requirements are installed:
"pip install -r requirements.txt" within the folder.
Hope this helps (sorry if this isn't helpful, This error been driving me nuts).
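To check whether the requirements really are installed, something like the following sketch can be run with the same Python that ComfyUI uses (for a portable install, the embedded python.exe). It is a simplified helper, not part of ComfyUI or MMAudio: it only reads plain package names from a requirements file, ignores version pins, and skips option lines and URL-based entries. The function names are hypothetical.

```python
import re
from importlib import metadata


def requirement_names(text: str) -> list:
    """Extract bare package names from the contents of a requirements file."""
    names = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()       # drop comments
        if not line or line.startswith("-") or "://" in line:
            continue                                # skip blanks, options, URL installs
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(match.group(0))
    return names


def missing_packages(names: list) -> list:
    """Return the names that are not installed in the current environment."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        with open(sys.argv[1], encoding="utf-8") as handle:
            print(missing_packages(requirement_names(handle.read())))
```

Save it under any file name and pass the path to the node pack's requirements.txt as the only argument; an empty list means every listed package was found.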
im having issue on step 2 mmaudio
free_upper_bound + pytorch_used_bytes[device] <= device_total INTERNAL ASSERT FAILED at "C:\\actions-runner\\_work\\pytorch\\pytorch\\pytorch\\c10\\cuda\\CUDAMallocAsyncAllocator.cpp":563, please report a bug to PyTorch.
looks like an OOM error to me. try clearing your model/execution cache and run it again with other programs closed
You just uploaded a video that was too long.
@stenny13street654 yes, i tried with a 5 second video and worked
Works fairly well! I wonder how hard it is to fine-tune the mmaudio model some more, feels like it could be quite a bit better. Thanks for uploading and explaining
an error MMAudioModelLoader
Error(s) in loading state_dict for MMAudio: Missing key(s) in state_dict: "clip_input_proj.2.w1.weight", "clip_input_proj.2.w2.weight", "clip_input_proj.2.w3.weight", "text_input_proj.2.w1.weight", "text_input_proj.2.w2.weight", "text_input_proj.2.w3.weight". Unexpected key(s) in state_dict: "clip_input_proj.1.w1.weight", "clip_input_proj.1.w2.weight", "clip_input_proj.1.w3.weight", "text_input_proj.1.w1.weight", "text_input_proj.1.w2.weight", "text_input_proj.1.w3.weight". size mismatch for t_embed.mlp.0.weight: copying a param with shape torch.Size([896, 256]) from checkpoint, the shape in current model is torch.Size([896, 896]).
Same error here, I had to install official model
@la440soundproject752 I have already solved this problem. It's an issue with the plugin environment. You can send this error report to DeepSeek or other AI, and it will guide you step by step on how to install the environment
Is there a way to run it with 8GB VRAM? If so, how should I config it?
You can just give it a try. I can't say whether it will work, but worth a shot
Can someone tell me what caused this error? "An error happened while trying to locate the files on the Hub and we cannot find the appropriate snapshot folder for the specified revision on the local disk. Please check your internet connection and try again."
## Stack Trace
```
  File "D:\ComfyUI-aki-v2\ComfyUI-aki-v2\ComfyUI\execution.py", line 515, in execute
    output_data, output_ui, has_subgraph, has_pending_tasks = await get_output_data(prompt_id, unique_id, obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, v3_data=v3_data)
  File "D:\ComfyUI-aki-v2\ComfyUI-aki-v2\ComfyUI\execution.py", line 329, in get_output_data
    return_values = await _async_map_node_over_list(prompt_id, unique_id, obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, v3_data=v3_data)
  File "D:\ComfyUI-aki-v2\ComfyUI-aki-v2\ComfyUI\execution.py", line 303, in _async_map_node_over_list
    await process_inputs(input_dict, i)
  File "D:\ComfyUI-aki-v2\ComfyUI-aki-v2\ComfyUI\execution.py", line 291, in process_inputs
    result = f(**inputs)
  File "D:\ComfyUI-aki-v2\ComfyUI-aki-v2\ComfyUI\custom_nodes\ComfyUI-MMAudio-main\nodes.py", line 243, in loadmodel
    snapshot_download(
  File "D:\ComfyUI-aki-v2\ComfyUI-aki-v2\python\Lib\site-packages\huggingface_hub\utils\_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "D:\ComfyUI-aki-v2\ComfyUI-aki-v2\python\Lib\site-packages\huggingface_hub\_snapshot_download.py", line 248, in snapshot_download
    raise LocalEntryNotFoundError(
```
Lip syncing is virtually non-existent with the fine-tune. Moans are driven almost entirely by body motion. There's still plenty (and it is fun) to work with, but it severely limits its usefulness
Works great. Thanks for sharing !
What prompt have you used in mmaudio sampler node? The prompt shows on demo video is for wan i think.
I use a wildcard node with two different prompt blocks for oral sex and regular sex. These can easily be modified with an LLM like Grok
Oral sex:
Generate intimate, rhythmic sounds: {wet skin slapping|throbbing flesh slapping|slick body impacts} at a {slow sensual pace|steady moderate pace|building fast pace}, syncing with thrusts, {sensual female orgasmic moans that rise and fall|breathy female gasps building to cries|playful female whimpers turning to moans} with {rising intensity|teasing build-up|explosive peaks}, faint {squelching with each movement|wet friction sounds|slick glide noises}. Add {breathy whispers of dirty talk|soft bed creaks under motion|fabric rustle against skin}. Keep it close-mic’d and natural, with {subtle echo in room|raw unfiltered intimacy|layered breathy reverb}.
Regular sex:
Generate intimate oral sex sounds: {wet slurping on flesh|sloppy sucking noises|deepthroat glide sounds} at a {slow teasing pace|steady rhythmic pace|building intense pace}, syncing with bobbing motion, {sensual female moans muffled around shaft|breathy female hums building to gasps|playful female gagging turning to whimpers} with {rising intensity|teasing build-up|explosive peaks}, prominent {saliva drooling and dripping|wet suction pops|throat gurgles and licks}. Add {breathy dirty talk encouragements|soft hand stroking sounds|fabric rustle or knee shifts}. Keep it close-mic’d and natural, with {subtle room ambiance|raw wet intimacy|layered breathy reverb}.
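For readers unfamiliar with the {a|b|c} wildcard syntax used in these blocks, the sketch below shows the basic idea: each braced group is replaced by one randomly chosen option. This is a simplified stand-in written for illustration, not the actual wildcard node; real wildcard nodes may also support nesting, weights, or file-based lists. The template text is a neutral, made-up example.

```python
import random
import re

# Matches one {option|option|...} group with no nested braces
CHOICE = re.compile(r"\{([^{}]*)\}")


def expand_wildcards(template: str, seed=None) -> str:
    """Replace every {a|b|c} group with one randomly picked option."""
    rng = random.Random(seed)  # a fixed seed gives a repeatable prompt
    return CHOICE.sub(lambda m: rng.choice(m.group(1).split("|")), template)


template = "{soft|heavy} footsteps on {gravel|wood}, {distant traffic|light rain}"
print(expand_wildcards(template, seed=1))
print(expand_wildcards(template))  # different result on most runs
```

Each run produces a different combination, which is why this approach removes most of the need to type a fresh audio prompt per video.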
@SeoulSeeker thanks! I will try it
I'm getting:
RIFEInterpolation
module 'torch' has no attribute 'nullcontext'
Any ideas? Thanks in advance.
Looks like your PyTorch might be out of date. You might need a fresh Comfy install, or update PyTorch individually (might cause problems)
But I just installed Comfy, and this workflow was working fine last week.
This workflow used to work for me about a month ago, but when I tried using it now a bunch of nodes no longer worked (MMAudioLoader, FeatureUtilLoader). I tried updating in the manager settings and rolling back to previous versions, but that didn't seem to work. I think this issue occurred after the major ComfyUI update.
Update ComfyUI to v0.18.5. I recommend cloning/copying your old installation first. Then navigate to the comfyui/user folder and search for config.ini. Edit the security level in config.ini to: = weak. Save the file so you can install the MMAudio-Suite from Kijai; otherwise ComfyUI blocks the installation. Install ComfyUI-MMAudio from KJ first. You'll notice the nodes don't show up. Then install ComfyUI-MMAudio-Suite from Takenoko3333 on top. Close ComfyUI and change the security level back to: = normal. Hope this helps m8.
@Weaze Changing the security level worked! Thank you so much!
Is it possible to only get the sounds of the bodies interacting? I don't want any voices, moaning etc.
You can try putting voices/moaning/etc. in the negative prompt, though I think the dataset for the NSFW fine-tune is pretty heavily biased towards vocalization
Hey, MMAudio generates only sound effects if I'm right. Is there any chance to make the characters speak what I wrote in the prompt?
Not with this workflow/model. You'd probably want to try LTX 2.3 for that
Here's a very cool trick with MMAudio that I used on some of my videos, if any of you want it.
First you will need editing software. Maybe you can do it within ComfyUI, but I don't know how.
Because MMAudio generates sounds synced to motion, the model usually gets pretty confused if you have too much going on or if your subject/action is too small or too far from the "lens".
So the trick is to edit your video: crop it to the part you want and zoom in so there is only one single, clear motion, face, expression, etc. for the model to understand.
Then run this edited video through the MMAudio generation; most of the time the result is way better.
And then of course you will need to combine the original video with the audio track in your editor. It's a bit of extra work, but I prefer that to playing the "please give me a good seed" game
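For anyone who wants to script the crop instead of eyeballing it in an editor, the geometry is simple. The sketch below only computes a crop rectangle for a given center point and zoom factor, clamped so it stays inside the frame; the function name and the sample coordinates are hypothetical, and applying the crop is left to whatever editor or crop node you use.

```python
def zoom_crop_box(width: int, height: int, center_x: int, center_y: int, zoom: float):
    """Return (left, top, crop_width, crop_height) for a zoomed-in crop.

    zoom=2 keeps half the width and half the height around the center point.
    """
    if zoom < 1:
        raise ValueError("zoom must be >= 1")
    crop_w = round(width / zoom)
    crop_h = round(height / zoom)
    # Center the box on the point, then clamp it to the frame edges
    left = min(max(center_x - crop_w // 2, 0), width - crop_w)
    top = min(max(center_y - crop_h // 2, 0), height - crop_h)
    return left, top, crop_w, crop_h


# Hypothetical 1280x720 frame, zooming 2x on the middle of the frame
print(zoom_crop_box(1280, 720, 640, 360, 2))   # (320, 180, 640, 360)
```

Keeping the same aspect ratio as the source, as this sketch does, makes it easier to line the generated audio back up with the uncropped video afterwards since the duration and frame rate are untouched.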
Great idea, thank you for the comment!
@SeoulSeeker you're welcome. I've posted a workflow with this idea, using an Ultralytics detector to crop the face and generate only with that part, if you want to check it out.
For some reason, though I installed everything, it still can't find these nodes:
- MMAudioFeatureUtilsLoader
- MMAudioModelLoader
- MMAudioSampler
Any ideas why that would be?
The transformers package needs to be a 4.x version (e.g., 4.45.0 or 4.46.0). Try changing it.
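To see which transformers version your ComfyUI environment actually has, a small check like the one below can be run with the same Python that ComfyUI uses. The 4.x requirement is taken from the comment above and has not been verified against the node pack; the helper names are hypothetical.

```python
from importlib import metadata


def major_version(version: str) -> int:
    """Return the leading number of a version string such as '4.45.0'."""
    return int(version.split(".", 1)[0])


def transformers_is_4x():
    """True/False for an installed transformers, None if it is not installed."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None
    return major_version(installed) == 4


if __name__ == "__main__":
    print("transformers is 4.x:", transformers_is_4x())
```

If this prints False, pinning the package with pip (for example to one of the versions mentioned above) in the same environment is the change the comment is suggesting.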