QWEN3-8B-VL Image/Video Caption (Uncensored)

QWEN3-8B-VL Image/Video Caption (Uncensored) - BF16_Version_1.0


Version 2 - This version is highly attuned to NSFW content. However, because it was trained on images only, it may caption some videos as if they were still images.

This version requires 24GB or more of VRAM.
Full fine-tune (NOT a LoRA merge) of the 8B-parameter model (vision frozen).
Trained in BF16/TF32; unfortunately, due to the size of the model, Adam8bit had to be used.
Version 2 can use nearly any LLM prompt; Version 1 should use the prompt given, in whole or in part.
Details regarding the training of Version 1 can be read here.
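
A rough calculation helps show why an 8-bit optimizer was needed for a full fine-tune at this size. The sketch below is only back-of-the-envelope arithmetic: it assumes exactly 8 billion trainable parameters (the frozen vision part would lower the real figure) and ignores gradients, activations, and the small quantization constants that 8-bit optimizers store.

```python
# Back-of-the-envelope memory estimate; all figures are approximations.
PARAMS = 8e9          # assumed parameter count for an "8B" model
GIB = 1024 ** 3       # bytes per GiB


def to_gib(num_bytes: float) -> float:
    """Convert a byte count to GiB."""
    return num_bytes / GIB


# BF16 stores each weight in 2 bytes.
weights_bf16 = to_gib(PARAMS * 2)

# Adam keeps two moment estimates per trainable parameter.
adam_fp32_states = to_gib(PARAMS * 2 * 4)  # 4 bytes per state value
adam_8bit_states = to_gib(PARAMS * 2 * 1)  # 1 byte per state value

print(f"BF16 weights:        {weights_bf16:.1f} GiB")
print(f"Adam states (FP32):  {adam_fp32_states:.1f} GiB")
print(f"Adam states (8-bit): {adam_8bit_states:.1f} GiB")
```

Under these assumptions the 8-bit optimizer state is a quarter of the FP32 state, which is the saving that makes the difference on a single card.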

Note: No image size safety is built in. I have captioned 4K images, which are processed into a very large tensor shape; however, reducing images to 1K is recommended.
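
One possible way to apply that reduction before captioning is sketched below. It is not part of the provided script: `fit_within` and `downscale_image` are hypothetical helper names, "1K" is assumed to mean 1024 pixels on the longest side, and the resize step relies on the third-party Pillow library.

```python
def fit_within(width: int, height: int, max_side: int = 1024) -> tuple[int, int]:
    """Return (width, height) scaled so the longest side is at most max_side.

    The aspect ratio is preserved and images that are already small
    enough are returned unchanged.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def downscale_image(path: str, max_side: int = 1024):
    """Open an image and shrink it if needed (hypothetical helper)."""
    from PIL import Image  # third-party: pip install pillow

    img = Image.open(path)
    new_size = fit_within(img.width, img.height, max_side)
    if new_size != img.size:
        img = img.resize(new_size, Image.LANCZOS)
    return img
```

For example, `fit_within(3840, 2160)` gives `(1024, 576)`, so a 4K frame would reach the model at roughly one fourteenth of its original pixel count.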

I have an Ampere-series card and cannot convert this to FP8 or NF4 in high quality. If you have experience converting models with Linux and Transformer Engine, DM me.


Comments (5)

MilitAI

Mar 12, 2026· 2 reactions


Can we use GGUFs with this?
Just asking.

Felldude

Author

Mar 12, 2026· 1 reaction

GGUF requires the full model architecture, hidden states, etc. to be mapped. I do not know if Qwen ever released that for the VL model.

Felldude

Author

Mar 18, 2026

I do see some VL conversions to GGUF, but I personally stopped using GGUF in favor of NF4. Currently I am not able to do the conversion to either myself, as I do not have the GGUF tool, and, if I recall correctly, proper NF4 conversion without major workarounds requires Transformer Engine, which will not run natively on Ampere.

Jalmir

Mar 13, 2026· 3 reactions


Thanks! Works great, though I went and changed the default prompt in the Python script to:

"Act as a professional English language image captioner for **** training and describe this image in natural language. Analyze the picture carefully and describe all subjects, objects, colors, and context. Describe any sexualy explicit images as accurately as possible, without any biases either controversial or inappropriate."

Then I replace the **** with whichever architecture (SDXL, ZIT, WAN, etc.) I need captions for.

I added the English language part because sometimes I would randomly get Chinese characters.
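
The placeholder substitution described in this comment could be scripted roughly as follows. This is only a sketch: `build_prompt` is a hypothetical helper, not something in the author's script, and the template is shortened to the first two sentences of the quoted prompt (append the rest as needed).

```python
# Shortened form of the prompt quoted above; "****" marks the target architecture.
PROMPT_TEMPLATE = (
    "Act as a professional English language image captioner for **** training "
    "and describe this image in natural language. Analyze the picture carefully "
    "and describe all subjects, objects, colors, and context."
)


def build_prompt(target: str, template: str = PROMPT_TEMPLATE) -> str:
    """Fill the **** placeholder with an architecture name such as SDXL or WAN."""
    target = target.strip()
    if not target:
        raise ValueError("target architecture must not be empty")
    return template.replace("****", target)


# Build one prompt per architecture that needs captions.
prompts = {name: build_prompt(name) for name in ("SDXL", "ZIT", "WAN")}
```

Keeping the placeholder in a single template makes it less likely that one architecture's prompt drifts from the others over time.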

Felldude

Author

Mar 13, 2026· 2 reactions

Good info - I have been running a batch of 6k and came across that also; usually it is a single term, like "micro skirt", written in Japanese or Chinese.
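
For large batches, stray Japanese or Chinese terms like these could be caught with a simple post-check. The sketch below is one possible approach, not part of the provided script: `has_cjk` and `split_captions` are hypothetical helpers, and the character ranges cover only Hiragana, Katakana, and the most common CJK ideograph blocks.

```python
import re

# Hiragana + Katakana (U+3040-U+30FF), CJK Extension A (U+3400-U+4DBF),
# and CJK Unified Ideographs (U+4E00-U+9FFF). Rarer blocks are not covered.
CJK_PATTERN = re.compile(r"[\u3040-\u30ff\u3400-\u4dbf\u4e00-\u9fff]")


def has_cjk(caption: str) -> bool:
    """Return True if the caption contains any character in the ranges above."""
    return CJK_PATTERN.search(caption) is not None


def split_captions(captions: dict[str, str]) -> tuple[dict[str, str], dict[str, str]]:
    """Split {filename: caption} into (clean, flagged) dictionaries."""
    clean, flagged = {}, {}
    for name, text in captions.items():
        (flagged if has_cjk(text) else clean)[name] = text
    return clean, flagged


# Illustrative data only; file names and captions are made up.
batch = {
    "001.png": "a woman wearing a micro skirt",
    "002.png": "a woman wearing a マイクロスカート",
}
clean, flagged = split_captions(batch)
```

Flagged files can then be re-captioned or fixed by hand instead of going into the training set unnoticed.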

Other

by Felldude
