Multimodal: Submit list/batch of images (aka "video") to SmolVLM-2 Video model #13672
Unanswered · asked by omarwahby-telestream in Q&A
Good evening all, I hope this message finds everyone well!
I’m a llama.cpp user interested in running the SmolVLM-2 model through llama.cpp to do inference on a video file. While searching for a way to do this, I found this Python example that uses SmolVLM-2 to generate a highlight reel from a video:
https://huggingface.co/spaces/HuggingFaceTB/SmolVLM2-HighlightGenerator/blob/main/app.py
The call at the line below simply passes the path to the video file as the "video" parameter, which the Python transformers module then processes:
https://huggingface.co/spaces/HuggingFaceTB/SmolVLM2-HighlightGenerator/blob/main/app.py#L58
Within the transformers module, the video file is handled here:
https://github.com/huggingface/transformers/blob/main/src/transformers/models/smolvlm/processing_smolvlm.py#L131
SmolVLM first samples frames from the input video, then preprocesses them, and finally sends the frames along with the prompt text for inference.
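To make the frame-sampling step concrete, here is a minimal standalone sketch of uniform frame sampling in plain Python. This is my own illustrative helper, not the actual transformers implementation: it only computes which frame indices to keep, leaving decoding and preprocessing aside.

```python
# Sketch of the uniform frame-sampling step as a plain-Python helper.
# The transformers processor handles this internally; this standalone
# version (names are my own) just picks evenly spaced frame indices.

def sample_frame_indices(total_frames: int, num_frames: int) -> list[int]:
    """Pick `num_frames` indices spread evenly across `total_frames`."""
    if total_frames <= 0 or num_frames <= 0:
        return []
    if num_frames >= total_frames:
        # Short clip: keep every frame.
        return list(range(total_frames))
    step = total_frames / num_frames
    # Take the midpoint of each of the `num_frames` equal slices.
    return [int(step * i + step / 2) for i in range(num_frames)]
```

For example, `sample_frame_indices(300, 8)` picks 8 frames spread across a 10-second clip at 30 fps; the sampled frames would then be preprocessed and sent with the prompt.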
I would like to do the same thing with llama-server: pass a text prompt and a video file to this model and get back a text description of what is happening in the video.
Is there currently a way to do this in llama-server, or does this require development on the llama.cpp side? I would greatly appreciate any insights into how to get this working, or any pointers in the right direction. Thank you very much for your help!
Update (5/21/2025):
I also saw in the PR linked below that llama-server may already support sending a sequence of multiple images along with a text prompt in a single API call: #13050. However, I'm not able to find any documentation on the correct request format for sending a sequence of images.
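For what it's worth, here is the request shape I have been assuming: an OpenAI-style chat completion payload where the user message `content` is a list containing one text part followed by several `image_url` parts carrying base64 data URIs. llama-server exposes a `/v1/chat/completions` endpoint; whether a given build accepts multiple `image_url` parts this way is exactly what I'm unsure about, so treat this as an untested sketch.

```python
import base64
import json

# Hedged sketch: an OpenAI-style chat request carrying several images as
# base64 data URIs. Whether a given llama-server build accepts multiple
# image_url parts in one message is an assumption, not confirmed.

def build_multi_image_request(prompt: str, image_bytes_list: list[bytes]) -> dict:
    content = [{"type": "text", "text": prompt}]
    for raw in image_bytes_list:
        b64 = base64.b64encode(raw).decode("ascii")
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
        })
    return {"messages": [{"role": "user", "content": content}]}

# The payload would then be POSTed with any HTTP client, e.g.:
#   requests.post("http://localhost:8080/v1/chat/completions",
#                 json=build_multi_image_request("Describe the clip.", frames))
payload = build_multi_image_request("What happens across these frames?",
                                    [b"frame0", b"frame1"])
print(json.dumps(payload)[:80])
```

If this is roughly the right format, it would pair naturally with frame sampling: extract N frames from the video, JPEG-encode each, and pass the list of encoded frames here.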