Multimodal Tutorial

Getting started

1. Find a multimodal model

GGUF models with vision capabilities are uploaded to Hugging Face alongside an mmproj file.

For instance, unsloth/gemma-3-4b-it-GGUF has this:

[Screenshot: the repository file list, including the GGUF quantizations and an mmproj-F16.gguf file]

2. Download the model to user_data/models

As an example, download

https://huggingface.co/unsloth/gemma-3-4b-it-GGUF/resolve/main/gemma-3-4b-it-Q4_K_S.gguf?download=true

to your text-generation-webui/user_data/models folder.
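
If you prefer to script the download, a minimal Python sketch using the huggingface_hub library could look like this (it assumes you run it from the directory that contains text-generation-webui):

```python
from huggingface_hub import hf_hub_download

# Download the quantized GGUF model into the web UI's models folder
hf_hub_download(
    repo_id="unsloth/gemma-3-4b-it-GGUF",
    filename="gemma-3-4b-it-Q4_K_S.gguf",
    local_dir="text-generation-webui/user_data/models",
)
```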

3. Download the associated mmproj file to user_data/mmproj

Then download

https://huggingface.co/unsloth/gemma-3-4b-it-GGUF/resolve/main/mmproj-F16.gguf?download=true

to your text-generation-webui/user_data/mmproj folder. Rename it to mmproj-gemma-3-4b-it-F16.gguf so it is easy to identify later.
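
The mmproj file can be fetched and renamed the same way; here is a sketch under the same assumptions as above:

```python
from pathlib import Path
from huggingface_hub import hf_hub_download

# Download the mmproj file into the web UI's mmproj folder
path = Path(hf_hub_download(
    repo_id="unsloth/gemma-3-4b-it-GGUF",
    filename="mmproj-F16.gguf",
    local_dir="text-generation-webui/user_data/mmproj",
))

# Give the file a recognizable name
path.rename(path.with_name("mmproj-gemma-3-4b-it-F16.gguf"))
```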

4. Load the model

  1. Launch the web UI
  2. Navigate to the Model tab
  3. Select the GGUF model in the Model dropdown:
     [Screenshot: the Model dropdown]
  4. Select the mmproj file in the Multimodal (vision) menu:
     [Screenshot: the Multimodal (vision) menu]
  5. Click "Load"

5. Send a message with an image

Select your image by clicking on the 📎 icon and send your message:

[Screenshot: a chat message with an image attached]

The model will reply with a description of the image contents:

[Screenshot: the model's reply describing the image]

Multimodal with ExLlamaV3

Multimodal also works with the ExLlamaV3 loader (the non-HF one).

No additional files are necessary: just load a multimodal EXL3 model and send an image.

Examples of models that you can use:

Multimodal API examples

On the page below, you can find some ready-to-use examples:

Multimodal/vision (llama.cpp and ExLlamaV3)
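
As a rough idea of what those examples look like, here is a minimal Python sketch that sends an image through the OpenAI-compatible API. It assumes the web UI was started with the --api flag, that the server is listening on the default port 5000, and that example.jpg is a placeholder path to a local image:

```python
import base64
import requests

# Assumes the web UI was started with the --api flag
# (OpenAI-compatible server on http://127.0.0.1:5000 by default)
URL = "http://127.0.0.1:5000/v1/chat/completions"

# Encode a local image as base64 so it can be embedded in the request
with open("example.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                },
            ],
        }
    ],
    "max_tokens": 300,
}

response = requests.post(URL, json=payload)
print(response.json()["choices"][0]["message"]["content"])
```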
