Gemma 4 Text Generation ComfyUI workflow | Image-Text-Audio Analysis Tool

Turns images, video, and audio into coherent text output quickly.

Who it's for: creators who want this pipeline in ComfyUI without assembling nodes from scratch. Not for: anyone expecting one-click results with zero tuning - you still choose inputs, prompts, and settings.

Open preloaded workflow on RunComfy

Why RunComfy first
- Fewer missing-node surprises - run the graph in a managed environment before you mirror it locally.
- Quick GPU tryout - useful if your local VRAM or install time is the bottleneck.
- Matches the published JSON - the zip follows the same runnable workflow you can open on RunComfy.

When downloading for local ComfyUI makes sense: you want full control over models on disk, batch scripting, or offline runs.

How to use (local ComfyUI)
1. Load inputs (images/video/audio) in the marked loader nodes.
2. Set prompts, resolution, and seeds; start with a short test run.
3. Export from the Save / Write nodes shown in the graph.
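For batch scripting, the same steps can be driven from a script instead of the UI. The sketch below is a minimal example, not part of the published workflow: it assumes a local ComfyUI server on its default address (http://127.0.0.1:8188) and a workflow saved in API format. The file name gemma4_workflow_api.json in the usage note is a hypothetical placeholder.

```python
import json
import urllib.request
import uuid


def build_prompt_request(workflow: dict, server: str = "http://127.0.0.1:8188") -> urllib.request.Request:
    """Wrap an API-format workflow in a POST request for ComfyUI's /prompt endpoint."""
    payload = {"prompt": workflow, "client_id": str(uuid.uuid4())}
    return urllib.request.Request(
        f"{server}/prompt",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def queue_workflow(path: str, server: str = "http://127.0.0.1:8188") -> dict:
    """Load a workflow JSON file and submit it; returns the server's JSON reply."""
    with open(path, encoding="utf-8") as f:
        workflow = json.load(f)
    with urllib.request.urlopen(build_prompt_request(workflow, server)) as resp:
        return json.loads(resp.read())
```

Calling queue_workflow("gemma4_workflow_api.json") with the server running would queue one run. The request is built in its own function so the payload can be inspected without a server.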

Expectations - First run may pull large weights; cloud runs may require a free RunComfy account.

Overview

This workflow generates text grounded in image, audio, and video inputs. You can use it to analyze media, summarize reviews, or prototype lightweight chatbots that answer from the supplied context. It combines ComfyUI nodes for text generation, CLIP loading, and transcription. It suits designers and developers who want quick, context-aware text generation for LLM testing or multimodal research.

Key nodes in the Gemma 4 Text Generation ComfyUI workflow

TextGenerate (#1)
Drives the final output and is where most tuning lives. Set the maximum tokens to control how long the response can be, and the sampling temperature to control how exploratory it is. Enable the optional reasoning mode if you want more step-by-step thinking before the answer. For implementation details, see the ComfyUI text generation node source code.
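If you script runs against an API-format export of the graph, these settings can be changed without opening the UI. The sketch below is illustrative only. The class name TextGenerate comes from the node list above, but the input names max_length and temperature are assumptions, so check the node in your own graph for the real names.

```python
import copy


def override_inputs(workflow: dict, class_type: str, **overrides) -> dict:
    """Return a copy of an API-format workflow with inputs overridden on matching nodes."""
    patched = copy.deepcopy(workflow)
    matched = False
    for node in patched.values():
        if node.get("class_type") == class_type:
            node.setdefault("inputs", {}).update(overrides)
            matched = True
    if not matched:
        raise KeyError(f"no node with class_type {class_type!r}")
    return patched


# Hypothetical input names and values, used only to show the shape of the data.
example = {"1": {"class_type": "TextGenerate", "inputs": {"max_length": 256, "temperature": 0.7}}}
short_run = override_inputs(example, "TextGenerate", max_length=64, temperature=0.2)
```

Working on a deep copy leaves the loaded workflow untouched, so one base file can feed several runs with different settings.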
CLIPLoader (#3)
Selects and loads the Gemma 4 E4B encoder package needed for text and multimodal understanding. If you maintain models locally, place the file under:
ComfyUI/models/text_encoders/gemma4_e4b_it_fp8_scaled.safetensors
After selection, you rarely need to revisit this node unless you switch model variants.
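Before a first local run, it can help to confirm that the file is where the loader expects it. This is a small sketch that uses the path given above. The comfy_root argument is an assumption about where your ComfyUI folder lives.

```python
from pathlib import Path


def text_encoder_path(comfy_root, filename: str = "gemma4_e4b_it_fp8_scaled.safetensors") -> Path:
    """Build the expected location of the encoder file under a ComfyUI install."""
    return Path(comfy_root) / "models" / "text_encoders" / filename


def check_text_encoder(comfy_root) -> bool:
    """Report whether the encoder file exists at the expected location."""
    path = text_encoder_path(comfy_root)
    if path.is_file():
        print(f"found: {path}")
        return True
    print(f"missing: {path}")
    return False
```

For example, check_text_encoder("ComfyUI") checks relative to the current directory. Pass an absolute path if your install lives elsewhere.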
GetVideoComponents (#7)
Useful when you want the model to consider video. It exposes frames and audio so you can condition TextGenerate on both. If your clip is long, sample fewer frames for faster turnaround; if you need finer detail, sample more frames at the cost of speed.
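The frame-count trade-off can be shown with a small helper that picks evenly spaced frame indices. This is a generic sketch and does not describe how GetVideoComponents or any other node selects frames. The function name and the max_frames parameter are hypothetical.

```python
def pick_frame_indices(total_frames: int, max_frames: int) -> list[int]:
    """Pick up to max_frames indices spread evenly across a clip."""
    if total_frames <= 0 or max_frames <= 0:
        return []
    if max_frames >= total_frames:
        # Short clip: keep every frame.
        return list(range(total_frames))
    step = total_frames / max_frames
    return [int(i * step) for i in range(max_frames)]


print(pick_frame_indices(100, 5))  # [0, 20, 40, 60, 80]
```

A smaller max_frames gives the model less to process and a faster run. A larger value keeps more visual detail.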

Notes

See the RunComfy page for this workflow for the latest node requirements.

Files

gemma4TextGenerationComfyui_v10.zip