Multimodal: Submit list/batch of images (aka "video") to SmolVLM-2 Video model #13672
Unanswered · asked by omarwahby-telestream in Q&A
Good evening all, I hope this message finds everyone well!
I’m a llama.cpp user interested in running the SmolVLM-2 model through llama.cpp to do inference on a video file. While searching for a way to do this, I found this Python example that uses SmolVLM-2 to generate a highlight reel from a video:
https://huggingface.co/spaces/HuggingFaceTB/SmolVLM2-HighlightGenerator/blob/main/app.py
The call at the line below simply passes the path to the video file as the "video" parameter, which the Python transformers module then processes:
https://huggingface.co/spaces/HuggingFaceTB/SmolVLM2-HighlightGenerator/blob/main/app.py#L58
Within the transformers module, the video file is handled here:
https://github.com/huggingface/transformers/blob/main/src/transformers/models/smolvlm/processing_smolvlm.py#L131
SmolVLM first samples frames from the input video, then preprocesses them, and finally sends the frames along with the prompt text for inference.
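To make the frame-sampling step concrete, here is a minimal standalone sketch of uniform frame sampling in plain Python. This is my own illustrative helper, not the actual transformers implementation: it only computes which frame indices to keep, leaving decoding and preprocessing aside.

```python
# Sketch of the uniform frame-sampling step as a plain-Python helper.
# The transformers processor handles this internally; this standalone
# version (names are my own) just picks evenly spaced frame indices.

def sample_frame_indices(total_frames: int, num_frames: int) -> list[int]:
    """Pick `num_frames` indices spread evenly across `total_frames`."""
    if total_frames <= 0 or num_frames <= 0:
        return []
    if num_frames >= total_frames:
        # Short clip: keep every frame.
        return list(range(total_frames))
    step = total_frames / num_frames
    # Take the midpoint of each of the `num_frames` equal slices.
    return [int(step * i + step / 2) for i in range(num_frames)]
```

For example, `sample_frame_indices(300, 8)` picks 8 frames spread across a 10-second clip at 30 fps; the sampled frames would then be preprocessed and sent with the prompt.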
I would like to do the same thing with llama-server: pass a text prompt and a video file to this model and get back a text description of what is happening in the video.
Is there currently a way to do this in llama-server, or does this require development on the llama.cpp side? I would greatly appreciate any insights into how to get this working, or any pointers in the right direction. Thank you very much for your help!
Update (5/21/2025):
I also saw in the PR linked below that llama-server may already support sending a sequence of multiple images along with a text prompt in a single API call: #13050. However, I'm not able to find any documentation on the correct request format for sending a sequence of images.
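For what it's worth, here is the request shape I have been assuming: an OpenAI-style chat completion payload where the user message `content` is a list containing one text part followed by several `image_url` parts carrying base64 data URIs. llama-server exposes a `/v1/chat/completions` endpoint; whether a given build accepts multiple `image_url` parts this way is exactly what I'm unsure about, so treat this as an untested sketch.

```python
import base64
import json

# Hedged sketch: an OpenAI-style chat request carrying several images as
# base64 data URIs. Whether a given llama-server build accepts multiple
# image_url parts in one message is an assumption, not confirmed.

def build_multi_image_request(prompt: str, image_bytes_list: list[bytes]) -> dict:
    content = [{"type": "text", "text": prompt}]
    for raw in image_bytes_list:
        b64 = base64.b64encode(raw).decode("ascii")
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
        })
    return {"messages": [{"role": "user", "content": content}]}

# The payload would then be POSTed with any HTTP client, e.g.:
#   requests.post("http://localhost:8080/v1/chat/completions",
#                 json=build_multi_image_request("Describe the clip.", frames))
payload = build_multi_image_request("What happens across these frames?",
                                    [b"frame0", b"frame1"])
print(json.dumps(payload)[:80])
```

If this is roughly the right format, it would pair naturally with frame sampling: extract N frames from the video, JPEG-encode each, and pass the list of encoded frames here.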