What it takes to support bidirectional Llama3 for LLM2Vec #13368

nv-jsaito · 2025-05-08T00:31:20Z

Can someone help me understand what llama.cpp modifications are needed to support Llama3-based LLM2Vec?

Background

We have a project that uses Llama3-based LLM2Vec as the text embedding for a text-to-content model. While the model is trained with the original Llama3-based LLM2Vec on HF, we would like to make deployment easier by running LLM2Vec with llama.cpp.

AFAIK, LLM2Vec should just be finetuning on top of the original LLM architecture (Llama3 in our case) but with bidirectional attention. Because llama.cpp supports Llama3, I am hoping that running Llama3-based LLM2Vec with llama.cpp is not too much work.
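To make the causal versus bidirectional distinction concrete, here is a minimal sketch of the two attention masks. The helper name `attention_mask` is hypothetical and is not part of llama.cpp or LLM2Vec; it only illustrates which positions each token is allowed to attend to.

```python
def attention_mask(n_tokens: int, causal: bool) -> list[list[int]]:
    """Build an n_tokens x n_tokens mask (hypothetical helper, for illustration).

    mask[i][j] == 1 means token i may attend to token j.
    """
    return [
        [1 if (not causal or j <= i) else 0 for j in range(n_tokens)]
        for i in range(n_tokens)
    ]


# Causal: each token sees only itself and earlier tokens (lower triangle).
causal_mask = attention_mask(4, causal=True)
# Bidirectional: every token sees every other token (all ones).
bidirectional_mask = attention_mask(4, causal=False)

for name, mask in (("causal", causal_mask), ("bidirectional", bidirectional_mask)):
    print(name)
    for row in mask:
        print(" ", row)
```

With a causal mask only the last token has seen the whole prompt, whereas with a bidirectional mask every token has, which is one reason the choice of pooling interacts with the attention setting.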

What has been done so far

Merged the LLM2Vec finetuned parameters into the original Meta-Llama-3-8B-Instruct and saved the result as an HF model
- I confirmed this merged model outputs exactly the same results as the original non-merged model
Converted the above to GGUF with a modified convert_hf_to_gguf.py, shown below, that sets the flag for non-causal attention

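@Model.register("LlamaBiModel")
class LlamaBiModel(LlamaModel):
    # Reuse the stock Llama architecture; only the attention flag differs.
    model_arch = gguf.MODEL_ARCH.LLAMA

    def set_gguf_parameters(self):
        super().set_gguf_parameters()
        # Record non-causal (bidirectional) attention in the GGUF metadata.
        self.gguf_writer.add_causal_attention(False)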

Made LLM_ARCH_LLAMA aware of the optional causal attention flag in llama-model.cpp

Observations

The embeddings from the above seem to be garbage
Interestingly, when I turn causal attention back on, the model performs close to expectations, although it loses some details in the text prompts
With or without causal attention, the embeddings are significantly off numerically from the original HF model
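One way to quantify "numerically off" is to compare a llama.cpp embedding against the HF reference for the same text. The sketch below uses cosine similarity (insensitive to scale, which helps if the two sides normalize their outputs differently) alongside the maximum absolute difference. The function names and the sample vectors are placeholders for illustration, not real model outputs.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embedding dimensions differ")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cannot compare a zero vector")
    return dot / (norm_a * norm_b)


def max_abs_diff(a, b):
    """Largest element-wise absolute difference (sensitive to scale)."""
    if len(a) != len(b):
        raise ValueError("embedding dimensions differ")
    return max(abs(x - y) for x, y in zip(a, b))


# Placeholder vectors standing in for the HF reference and llama.cpp output.
reference = [0.12, -0.40, 0.33, 0.05]
candidate = [0.11, -0.42, 0.30, 0.07]

print("cosine similarity:", cosine_similarity(reference, candidate))
print("max abs diff:     ", max_abs_diff(reference, candidate))
```

A cosine similarity close to 1 with a large absolute difference would point at a scaling or normalization mismatch, while a low cosine similarity suggests a deeper problem such as tokenization, prompt format, pooling, or attention masking.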

Let me know what I am missing here and/or ideas to try.

ggerganov · 2025-05-22T05:37:18Z

Maintainer

There could be different causes:

Tokenization problem
Incorrect prompt format
Incorrect pooling type (most likely)
etc.
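To illustrate why the pooling type matters, here is a small sketch of two common strategies, mean pooling and last-token pooling, applied to the same per-token embeddings. The function names and toy vectors are made up for illustration; which pooling the HF reference pipeline actually uses is something to check on that side and then match with the `--pooling` option.

```python
def mean_pool(token_embeddings):
    """Average the per-token vectors into one sentence vector."""
    n_tokens = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(tok[d] for tok in token_embeddings) / n_tokens for d in range(dim)]


def last_pool(token_embeddings):
    """Use the final token's vector as the sentence vector."""
    return list(token_embeddings[-1])


# Toy per-token embeddings: 3 tokens, 2 dimensions each.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

print("mean pooling:", mean_pool(tokens))
print("last pooling:", last_pool(tokens))
```

The two results differ even on this toy input, so a pooling mismatch alone is enough to make embeddings disagree with the reference.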

Could you provide a basic example using llama-server + curl where you generate the embeddings of some text and it does not match a reference result that you expect? This will help find the root cause of the issue.

Try using this and see if the results are good:

llama-server -m model.gguf --pooling last

curl http://localhost:8080/embeddings -H "Content-Type: application/json" -d '{"input": ["hello"]}'
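For scripted comparisons, the same request can be sent from Python with only the standard library. This is a sketch that mirrors the curl command above; the helper names are hypothetical, and the response is returned as parsed JSON without assuming its exact layout, so inspect it before indexing into it.

```python
import json
import urllib.request


def build_embeddings_request(texts, base_url="http://localhost:8080"):
    """Build the same POST request as the curl command (hypothetical helper)."""
    payload = json.dumps({"input": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_embeddings(texts, base_url="http://localhost:8080"):
    """Send the request to a running llama-server and return the parsed JSON."""
    request = build_embeddings_request(texts, base_url)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


# Requires llama-server to be running locally, so it is left commented out:
# result = fetch_embeddings(["hello"])
# print(json.dumps(result)[:200])
```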
