There could be different causes:
Could you provide a basic example using [...]?
Try this and see if the results are good:
llama-server -m model.gguf --pooling last
curl http://localhost:8080/embeddings -H "Content-Type: application/json" -d '{"input": ["hello"]}'
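To judge whether "the results are good", it helps to compare the vector returned by the server against a reference embedding of the same text from the original HF model. A minimal sketch of that comparison, using only the Python standard library. The helper names `post_json` and `cosine_similarity` are made up for illustration, and the exact layout of the JSON response depends on the endpoint and server version, so extracting the vector from it is left as an assumption in the comments:

```python
import json
import math
import urllib.request


def post_json(url, payload):
    """POST a JSON payload and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Usage (needs the server from the command above to be running):
#   response = post_json("http://localhost:8080/embeddings", {"input": ["hello"]})
#   print(response)  # inspect the layout first, then pull out the vector
#   score = cosine_similarity(server_vector, reference_vector)
# Here server_vector and reference_vector are placeholders for the llama.cpp
# output and the HF output for the same input text.
```

A score close to 1.0 for the same input suggests the two pipelines agree; a low score points at a difference in attention masking, pooling, or tokenization.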
Can someone help me understand llama.cpp modifications needed to support Llama3-based LLM2Vec?
Background
We have a project that uses Llama3-based LLM2Vec as the text embedding model for a text-to-content model. The model is trained with the original Llama3-based LLM2Vec on HF, but we would like to make deployment easier by running LLM2Vec with llama.cpp.
AFAIK, LLM2Vec is just a fine-tune of the original LLM architecture (Llama3 in our case), but with bidirectional attention. Because llama.cpp supports Llama3, I am hoping that running Llama3-based LLM2Vec with llama.cpp is not too much work.
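For anyone unfamiliar with the difference, here is a toy illustration of causal versus bidirectional attention masks. It is plain Python and not llama.cpp code; `attention_mask` and `masked_softmax` are hypothetical helpers written only for this example:

```python
import math


def attention_mask(n_tokens, causal):
    """Return an n x n matrix; mask[i][j] is True when token i may attend to token j."""
    return [
        [(j <= i) if causal else True for j in range(n_tokens)]
        for i in range(n_tokens)
    ]


def masked_softmax(scores, mask_row):
    """Softmax over the positions allowed by mask_row; masked positions get weight 0."""
    allowed = [s for s, m in zip(scores, mask_row) if m]
    peak = max(allowed)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) if m else 0.0 for s, m in zip(scores, mask_row)]
    total = sum(exps)
    return [e / total for e in exps]


causal = attention_mask(3, causal=True)
bidirectional = attention_mask(3, causal=False)

# With equal scores, the first token attends only to itself under a causal mask,
# but spreads its attention over all three tokens under a bidirectional mask.
first_causal = masked_softmax([0.0, 0.0, 0.0], causal[0])
first_bidir = masked_softmax([0.0, 0.0, 0.0], bidirectional[0])
print(first_causal)
print(first_bidir)
```

If the runtime applies the causal mask to a model that was fine-tuned with the full mask, early tokens never see later ones, so the embeddings can differ noticeably from the HF reference even when the weights are identical.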
What has been done so far
Changed convert_hf_to_gguf.py to set the flag for non-causal attention.
Made LLM_ARCH_LLAMA aware of the optional causal attention flag in llama-model.cpp.
Observations
Let me know what I am missing here and/or ideas to try.