🎬 AI Music Video Workflow (ComfyUI)
Turn your favorite tracks into fully AI-generated cinematic music videos, automatically, right inside ComfyUI. NO POST EDITING NEEDED.
This workflow takes a reference image and an audio file, then generates a lip-synced video that matches the lyrics, mood, and scene dynamics. The process is about 95% automated.
For some reason, most of the example videos are not showing up; my guess is they are too long. You can find them all here.
A high-level walkthrough can be found here.
Need help or have questions? Please reach out on Discord.
✨ What It Does
🎭 Keeps your reference image as the main performer across all scenes.
🎶 Splits audio into lyric-synced snippets for perfect timing.
🖋️ Uses a custom Prompt Creator node that sends custom instructions to an LLM node to build cinematic prompts from the lyrics and your style choices.
🎥 Generates scene-by-scene visuals, then combines them into a seamless final video.
The samples I provided were all created inside ComfyUI with NO post edits.
On a 5090 it took around 2 hours for the full song.
More examples can be found here and more will be added as I make them.
🔧 Key Features
Reference Image Control – Import your character photo (headshot recommended); the workflow auto-removes the background and resizes it for clean framing.
Audio Handling – Automatic vocal/instrument separation, Whisper V3 transcription, advanced settings for lyric overlap, and fallback options.
Prompt Creator – Flexible scene builder with fields for style, theme, lighting, camera motion, outfits, and more for a custom look.
Auto Queueing – Handles multi-run videos seamlessly for long audio files.
Final Render Automation – Collects all video chunks, merges them, and saves your finished video as FINAL_VIDEO.mp4.

This workflow uses the Native Gemini LLM API node by default, which receives detailed instructions generated by the Prompt Creator node. You can swap Gemini out for another LLM if you prefer, but the instruction sets are fairly complex, and most local models struggle to follow them reliably. If you'd rather not use an LLM at all, you can enter prompts manually instead; just reach out on Discord for extra guidance and tips. For context, I've spent only $5 so far, which has powered 50+ videos, and I still have credit left, so it's been very cost-effective.
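To make the "lyric-synced snippets" idea concrete, here is a minimal sketch (not the workflow's actual node code) of how Whisper-style transcription segments might be grouped into snippets of a target length. The function name and the `max_len` parameter are illustrative assumptions; the real logic lives inside the custom nodes.

```python
# Illustrative sketch only: group Whisper-style transcription segments
# into lyric-synced snippets no longer than max_len seconds.
# The actual workflow performs this inside custom ComfyUI nodes.

def group_segments(segments, max_len=6.0):
    """segments: list of dicts with 'start', 'end', 'text' (Whisper output shape)."""
    snippets, current = [], None
    for seg in segments:
        if current is None:
            current = dict(seg)
        elif seg["end"] - current["start"] <= max_len:
            # Extend the current snippet while it stays under the limit.
            current["end"] = seg["end"]
            current["text"] += " " + seg["text"]
        else:
            snippets.append(current)
            current = dict(seg)
    if current:
        snippets.append(current)
    return snippets
```

Each resulting snippet carries its own start/end timing, which is what lets every generated scene stay in sync with the vocals.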
🚀 Quick Start
Upload your reference image.
Load your audio file.
Set your folder name (e.g., the song title).
Fill in the Prompt Creator fields (style, mood, shots, etc.).
Hit Run; everything else is automated.
The workflow will auto-queue middle runs for long audio files.
For the final pass, it will tell you which groups to mute.
Simply follow the on-screen instructions, hit run again, and the workflow finishes the process automatically. (You do not have to wait for runs to finish. You just mute and hit run once more.)
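For intuition on why long songs need multiple queued runs, here is a rough back-of-the-envelope sketch. The `frames_per_run` and `fps` values are made-up illustrative numbers, not constants from the workflow:

```python
import math

# Hypothetical sketch: estimate how many queued runs a long song needs,
# given how many frames one run generates and the video frame rate.
# frames_per_run and fps are illustrative values, not workflow constants.

def estimate_runs(audio_seconds, frames_per_run=97, fps=25):
    seconds_per_run = frames_per_run / fps
    return math.ceil(audio_seconds / seconds_per_run)
```

Under these assumed numbers, a 3-minute song would need a few dozen runs, which is exactly the kind of tedium the auto-queue feature removes.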
🎵 Creative Workflow Tip
Just like real music videos, you don’t have to stick to one pass. You can run the same audio file multiple times with different reference images or styles — for example:
One pass with the lead singer as the performer.
Another pass featuring a band member or supporting character.
Additional passes experimenting with different themes, outfits, or camera styles.
Later, you can edit these separate video runs together, cutting between performances or blending visual moods — exactly how professional music videos are produced with multiple takes.
📦 Required Custom Nodes
This workflow relies on a set of custom nodes I built specifically for it.
You'll need to install them before running the workflow:
👉 ComfyUI-VRGameDevGirl Custom Nodes (GitHub)
They can also be installed via the manager.
These nodes handle:
Audio splitting, transcription, and auto-queueing
Smart folder management and metadata tracking
Popup instructions for multi-run projects
Scene sync and frame adjustments for HuMo compatibility
Video combining and more
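The final video-combining step is conceptually similar to an ffmpeg concat. The sketch below shows that idea under assumptions of my own (same-codec `chunk_*.mp4` files in one folder, an `ffmpeg` binary on the PATH); the actual node handles this internally:

```python
from pathlib import Path

# Sketch of a final merge step: concatenate same-codec video chunks
# with ffmpeg's concat demuxer. Folder layout and filenames here are
# illustrative assumptions, not the workflow's real internals.

def merge_chunks(folder, output="FINAL_VIDEO.mp4"):
    folder = Path(folder)
    chunks = sorted(folder.glob("chunk_*.mp4"))
    list_file = folder / "concat.txt"
    # concat demuxer input: one "file '<name>'" line per chunk, in order.
    list_file.write_text("".join(f"file '{c.name}'\n" for c in chunks))
    return ["ffmpeg", "-f", "concat", "-safe", "0",
            "-i", str(list_file), "-c", "copy", str(folder / output)]
```

The returned command list could then be executed with `subprocess.run(cmd, check=True)`; `-c copy` avoids re-encoding, which is why the merge is fast.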
👉 Join the discord community for support, tips and tricks.
✅ In Summary
This workflow is designed for creators, musicians, and visual storytellers who want to merge AI visuals with music. With automatic transcription, smart prompt handling, and seamless video assembly, you can focus on creative direction while the workflow handles the heavy lifting.
Comments (15)
That looks amazing, I want to try it out but can't spend the cash. I will try to switch the Gemini node to a local LLM node and hope it works :)
It's sooo cheap. $5 will last you many, many videos. I put in $5 and made over 50 videos. Local models are not smart enough to handle the complex instructions. You could manually prompt though: just find the text encoder nodes in the groups, unpin them, expand them, and unplug the noodle from the input. Then you can just disable auto Q and run each set manually using manual prompts or prompts from ChatGPT. I'm working on some new nodes, though, that will let you just use GPTs and won't need an LLM. Just an extra step or two, but it works.
@vrgamedevgirl It is cheap, but I'm cheaper :) I made a pledge to myself to try to do everything locally after spending too much on a GPU. Anyways, you're right about local models. I'm using qwen3-4b to keep as much VRAM available as possible, and pushing it twice (the second time asking it to adhere to formatting) seems to work OK. This workflow is great though, really inspiring me to try to build some of those longer WFs I've been too lazy to build.
@vrgamedevgirl would love a gpt version of this - interesting concept :)
@dannyboy33 Could you share your workflow for this? I would prefer using a local model as well.
@veldierin The workflow is in the workflows folder when you get the custom nodes. There is also a new manual workflow that does not use an LLM; instead you can use the GPTs I created, which work with the free version of ChatGPT.
Incredible work. The workflow is something of a mess when you haven't built it yourself. Finding my way around is hard, but man... this is very good!!!
Thanks!!! Also, if you follow the video walkthrough it's very easy. It's not a mess per se, it's just very complex.
And you don't even have to look at most of the nodes. Just steps 1-4, and then one or two nodes need to be touched. It's a well-organized workflow; you just need to watch the walkthrough video.
@vrgamedevgirl Love it :)
That's a revolution, Dorothy! About the batch image node for the Gemini input, do we have to create one, or is it somewhere in this oceanic workflow?
You don't need to connect any images to the Gemini node. You don't really have to do anything besides the main reference image, song, and folder, then make sure you mute groups when needed. I would reach out on Discord for support, as I'm there pretty much every day.
The link to the server is in the description.
And thanks!!! :)
This is so epic, thank you for spending the time to make it. I do have an issue: no matter what I do, I can't seem to find the GeminiNode. I've even installed a few off of GitHub, but to no avail. I'd like to try running this from a local LLM, and I'm not good with ComfyUI nodes (I'll break it). How would I use a local LLM with this? Much appreciated, and thank you!
Hey! I'm guessing you need to update ComfyUI to get the Gemini node. You can use other LLM nodes, but the instructions are very complex and most smaller models just can't handle them. I would recommend reaching out on Discord; the link to the server is in the description.
Whoa! How did I miss this? This is so promising, I hope it works for my setup!
Never mind, figured it out. The only problem now is I'm not using an LLM currently. It would be awesome to use LM Studio if that were possible, and just have the gen pause after it generated all the prompts so you can dump it from memory. But anyway, I can't figure out how to edit prompts without an LLM.