
Commit 7cb9f3d

Add Llama CPP
* added a Llama CPP invocation layer.
* Readme section.
* Tutorial notebook
1 parent 89a2522 commit 7cb9f3d

File tree

9 files changed: +942 -1 lines changed


README.md

Lines changed: 4 additions & 0 deletions
@@ -53,6 +53,10 @@ For a brief overview of the various unique components in fastRAG refer to the [C
 <td><a href="components.md#fastrag-running-llms-with-onnx-runtime">ONNX Runtime</a></td>
 <td><em>Running LLMs with optimized ONNX-runtime</td>
 </tr>
+<tr>
+<td><a href="components.md#fastrag-running-rag-pipelines-with-llms-on-a-llama-cpp-backend">Llama-CPP</a></td>
+<td><em>Running RAG Pipelines with LLMs on a Llama CPP backend</td>
+</tr>
 <tr>
 <td colspan="2"><strong><em>Optimized Components</em></td>
 </tr>

components.md

Lines changed: 31 additions & 0 deletions
@@ -146,6 +146,37 @@ PrompterModel = PromptModel(
 )
 ```
 
+## fastRAG Running RAG Pipelines with LLMs on a Llama CPP backend
+
+To run LLMs effectively on CPUs, especially on client-side machines, we offer a method for running LLMs using the [llama-cpp](https://github.com/ggerganov/llama.cpp) library.
+We recommend checking out our [tutorial notebook](examples/client_inference_with_Llama_cpp.ipynb) with all the details, including steps such as downloading GGUF models.
+
+### Installation
+
+Run the following command to install the required dependencies:
+
+```
+pip install -e .[llama_cpp]
+```
+
+For more information regarding the installation process, we recommend checking out the [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) repository.
+
+
+### Loading the Model
+
+Now that our model is downloaded, we can load it in our framework by specifying the ```LlamaCPPInvocationLayer``` invocation layer.
+
+```python
+PrompterModel = PromptModel(
+    model_name_or_path="models/marcoroni-7b-v3.Q4_K_M.gguf",
+    invocation_layer_class=LlamaCPPInvocationLayer,
+    model_kwargs=dict(
+        max_new_tokens=100
+    )
+)
+```
+
 ## Optimized Embedding Models
 
 Bi-encoder Embedders are key components of Retrieval Augmented Generation pipelines. Mainly used for indexing documents and for online re-ranking. We provide support for quantized `int8` models that have low latency and high throughput, using [`optimum-intel`](https://github.com/huggingface/optimum-intel) framework.
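Editor's note: the "Loading the Model" snippet added above omits the surrounding imports. A minimal end-to-end sketch, assuming Haystack v1-style `PromptModel`/`PromptNode` APIs and the `fastrag.prompters` re-export introduced later in this commit (the model path is just the example GGUF file used in the docs):

```python
# Hedged sketch, not part of the commit: wire the llama.cpp-backed PromptModel
# into a PromptNode and run a single prompt.
from haystack.nodes import PromptModel, PromptNode

from fastrag.prompters import LlamaCPPInvocationLayer

# Load a local GGUF model through the llama.cpp invocation layer.
prompter_model = PromptModel(
    model_name_or_path="models/marcoroni-7b-v3.Q4_K_M.gguf",
    invocation_layer_class=LlamaCPPInvocationLayer,
    model_kwargs=dict(max_new_tokens=100),
)

# Wrap the model in a PromptNode and generate a completion for one prompt.
prompt_node = PromptNode(prompter_model)
print(prompt_node("Explain in one sentence what a GGUF model file is."))
```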

examples.md

Lines changed: 2 additions & 1 deletion
@@ -7,7 +7,8 @@
 | RAG pipeline with FiD generator | [:notebook_with_decorative_cover:](examples/fid_promping.ipynb) |
 | RAG pipeline with REPLUG-based generator | [:notebook_with_decorative_cover:](examples/replug_parallel_reader.ipynb) |
 | RAG pipeline with LLMs running on Gaudi2 |[:notebook_with_decorative_cover:](examples/inference_with_gaudi.ipynb) |
-| RAG pipeline with quantized LLMs running on ONNX-running backend | [:notebook_with_decorative_cover:](examples/inference_with_gaudi.ipynb) |
+| RAG pipeline with quantized LLMs running on ONNX-running backend | [:notebook_with_decorative_cover:](examples/rag_with_quantized_llm.ipynb) |
+| RAG pipeline with LLMs running on Llama-CPP backend | [:notebook_with_decorative_cover:](examples/client_inference_with_Llama_cpp.ipynb) |
 | Optimized and quantized Embeddings models for retrieval and ranking | [:notebook_with_decorative_cover:](examples/optimized-embeddings.ipynb) |
 | RAG pipeline with PLAID index and ColBERT Ranker | [:notebook_with_decorative_cover:](examples/plaid_colbert_pipeline.ipynb) |
 | RAG pipeline with Qdrant index | [:notebook_with_decorative_cover:](examples/qdrant_document_store.ipynb) |

examples/client_inference_with_Llama_cpp.ipynb

Lines changed: 747 additions & 0 deletions
Large diffs are not rendered by default.

fastrag/prompters/__init__.py

Lines changed: 1 addition & 0 deletions
@@ -3,5 +3,6 @@
 from fastrag.prompters.invocation_layers.gaudi_hugging_face_inference import (
     GaudiHFLocalInvocationLayer,
 )
+from fastrag.prompters.invocation_layers.llama_cpp import LlamaCPPInvocationLayer
 from fastrag.prompters.invocation_layers.ort import ORTInvocationLayer
 from fastrag.prompters.invocation_layers.vqa import VQAHFLocalInvocationLayer
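For reference (not part of the diff), the effect of this one-line change is that the new invocation layer becomes importable from the package root:

```python
# After this commit, the class is re-exported by fastrag.prompters, so it can be
# imported without spelling out the full module path.
from fastrag.prompters import LlamaCPPInvocationLayer
```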
fastrag/prompters/invocation_layers/llama_cpp.py

Lines changed: 117 additions & 0 deletions

@@ -0,0 +1,117 @@

```python
import logging
import sys
from typing import Dict, List, Optional, Union

from haystack.lazy_imports import LazyImport
from haystack.nodes.prompt.invocation_layer import PromptModelInvocationLayer
from haystack.nodes.prompt.invocation_layer.hugging_face import HFLocalInvocationLayer

with LazyImport("Install llama_cpp using 'pip install -e .[llama_cpp]'") as llama_cpp_import:
    from llama_cpp import Llama

logger = logging.getLogger(__name__)


class LlamaCPPInvocationLayer(HFLocalInvocationLayer):
    """
    An invocation layer that loads a local GGUF model with llama.cpp
    (via llama-cpp-python) and runs generation on the CPU.
    """

    def __init__(
        self,
        model_name_or_path: str = "llama-model.gguf",
        max_length: int = 100,
        use_auth_token: Optional[Union[str, bool]] = None,
        **kwargs,
    ):
        PromptModelInvocationLayer.__init__(self, model_name_or_path)

        self.llm = Llama(model_path=model_name_or_path)
        self.max_length = max_length
        self.max_new_tokens = kwargs.get("max_new_tokens", 100)

        # Additional properties required by the invocation layer interface
        self.model_max_length = kwargs.get("model_max_length", sys.maxsize)
        self.generation_kwargs = kwargs

    def _ensure_token_limit(
        self, prompt: Union[str, List[Dict[str, str]]]
    ) -> Union[str, List[Dict[str, str]]]:
        """Ensure that the length of the prompt and answer is within the max tokens limit of the model.
        If needed, truncate the prompt text so that it fits within the limit.

        :param prompt: Prompt text to be sent to the generative model.
        """
        model_max_length = self.model_max_length
        tokenized_prompt = self.llm.tokenize(bytes(prompt, "utf-8"))
        n_prompt_tokens = len(tokenized_prompt)
        n_answer_tokens = self.max_length
        if (n_prompt_tokens + n_answer_tokens) <= model_max_length:
            return prompt

        logger.warning(
            "The prompt has been truncated from %s tokens to %s tokens so that the prompt length and "
            "answer length (%s tokens) fit within the max token limit (%s tokens). "
            "Shorten the prompt to prevent it from being cut off.",
            n_prompt_tokens,
            max(0, model_max_length - n_answer_tokens),
            n_answer_tokens,
            model_max_length,
        )

        decoded_string = self.llm.detokenize(
            tokenized_prompt[: model_max_length - n_answer_tokens]
        ).decode("utf-8")
        return decoded_string

    def invoke(self, *args, **kwargs):
        """
        Take a prompt and return a list of generated texts using the local llama.cpp model.
        :return: A list of generated texts.

        Note: only a subset of the generation kwargs (e.g. max_new_tokens, return_full_text)
        is honoured by the llama.cpp call; other kwargs are ignored.
        """
        output: List[Dict[str, str]] = []
        stop_words = kwargs.pop("stop_words", [])

        generated_texts = []
        if kwargs and "prompt" in kwargs:
            prompt = kwargs.pop("prompt")

            generation_kwargs = self.generation_kwargs
            model_input_kwargs = {
                key: kwargs[key]
                for key in [
                    "return_tensors",
                    "return_text",
                    "return_full_text",
                    "clean_up_tokenization_spaces",
                    "truncation",
                    "generation_kwargs",
                    "max_new_tokens",
                    "num_beams",
                    "do_sample",
                    "num_return_sequences",
                    "max_length",
                ]
                if key in kwargs
            }

            generation_kwargs.update(model_input_kwargs)
            model_input_kwargs = generation_kwargs

            echo = model_input_kwargs.get("return_full_text", False)
            max_tokens = model_input_kwargs.get("max_new_tokens", self.max_new_tokens)

            output = self.llm(
                prompt,  # Prompt text
                max_tokens=max_tokens,  # Generate up to max_tokens new tokens
                stop=stop_words,  # Stop generation at any of the stop words
                echo=echo,  # Echo the prompt back in the output when return_full_text is set
            )  # Generate a completion; create_completion can also be called directly

            generated_texts = [output["choices"][0]["text"]]

        return generated_texts
```
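As an illustration of the interface above (not part of the commit), a minimal sketch of driving the layer directly, assuming the `llama_cpp` extra is installed and a GGUF model file exists at the given path:

```python
# Hedged sketch: exercise LlamaCPPInvocationLayer as defined above.
from fastrag.prompters.invocation_layers.llama_cpp import LlamaCPPInvocationLayer

layer = LlamaCPPInvocationLayer(
    model_name_or_path="models/marcoroni-7b-v3.Q4_K_M.gguf",
    max_new_tokens=64,  # forwarded via **kwargs and used as the default max_tokens
)

# invoke() expects the prompt as a keyword argument and returns a list of strings.
texts = layer.invoke(prompt="List two advantages of running LLMs with llama.cpp on CPUs.")
print(texts[0])
```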

scripts/optimizations/Llama_CPP.md

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@

# Running RAG Pipelines with LLMs on a Llama CPP backend

To run LLMs effectively on CPUs, especially on client-side machines, we offer a method for running LLMs using the [llama-cpp](https://github.com/ggerganov/llama.cpp) library.
We recommend checking out our [tutorial notebook](../../examples/client_inference_with_Llama_cpp.ipynb) with all the details, including steps such as downloading GGUF models.

## Installation

Run the following command to install the required dependencies:

```
pip install -e .[llama_cpp]
```

For more information regarding the installation process, we recommend checking out the [llama-cpp-python](https://github.com/abetlen/llama-cpp-python) repository.

## Downloading GGUF models

In order to use LlamaCPP, download a GGUF model, the format llama.cpp expects for inference:

```
huggingface-cli download TheBloke/Marcoroni-7B-v3-GGUF marcoroni-7b-v3.Q4_K_M.gguf --local-dir ./models --local-dir-use-symlinks False
```

## Loading the Model

Now that our model is downloaded, we can load it in our framework by specifying the ```LlamaCPPInvocationLayer``` invocation layer.

```python
PrompterModel = PromptModel(
    model_name_or_path="models/marcoroni-7b-v3.Q4_K_M.gguf",
    invocation_layer_class=LlamaCPPInvocationLayer,
    model_kwargs=dict(
        max_new_tokens=100
    )
)
```
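Before wiring the model into a pipeline, a quick way to confirm the downloaded GGUF file loads correctly is to call llama-cpp-python directly. A minimal sketch (not part of the commit; assumes the download step above completed):

```python
# Hedged sanity check: load the downloaded GGUF with llama-cpp-python and
# generate a short completion.
from llama_cpp import Llama

llm = Llama(model_path="models/marcoroni-7b-v3.Q4_K_M.gguf")

# Ask for a short completion; the result is a dict with a "choices" list.
result = llm("Q: Name two benefits of quantized GGUF models. A:", max_tokens=32)
print(result["choices"][0]["text"])
```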

scripts/optimizations/README.md

Lines changed: 1 addition & 0 deletions
@@ -18,3 +18,4 @@ Reduction in bit count leads to a model that requires less memory storage, poten
 | [LLM Quantization](LLM-quantization.md) | `optimum-intel` | CPU |
 | [Bi-encoder Quantization](embedders/README.md) | `optimum-intel` | CPU |
 | [Cross-encoder Quantization](reranker_quantization/quantization.md) | `neural-compressor`, `ipex` | CPU |
+| [LlamaCPP LLMs](Llama_CPP.md) | `llama_cpp` | CPU |

setup.cfg

Lines changed: 3 additions & 0 deletions
@@ -73,6 +73,9 @@ intel =
     intel-extension-for-transformers
     optimum[neural-compressor]
 
+llama_cpp =
+    llama-cpp-python
+
 [flake8]
 ignore = E501
 max-line-length = 100
