Description
From commit 7c898d onwards, the output of any type of generation/completion on the GPU is just "#" repeated forever. For instance, using the example from README.md:
```python
from llama_cpp import Llama

llm = Llama(model_path='models/llama2-7b.q4_0.gguf', n_gpu_layers=100)
for s in llm('Building a website can be done in 10 simple steps:\nStep 1:', stream=True):
    print(s)
```
The output is the following, repeated:

```
{'id': 'cmpl-14ed3b80-49af-453d-99a4-c7925f5680f7', 'object': 'text_completion', 'created': 1705351368, 'model': 'models/llama2-7b.q4_0.gguf', 'choices': [{'text': '#', 'index': 0, 'logprobs': None, 'finish_reason': None}]}
```
Generation works fine on the CPU and for previous commits. It doesn't seem to be related to quantization or model type. Interestingly, generation also works using pure llama.cpp through the `main` interface for both CPU and GPU. I tested this for the current `master` and the commits around the above change (notably 76484fb and 1d11838). I also managed to get it working in llama-cpp-python using the low-level API, with just simple batching and `llama_decode` (see the sketch below).
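
In case it helps, this is roughly the low-level path that works for me on GPU. It is only a minimal sketch (single sequence, greedy sampling, same model path and prompt as above); exact signatures and struct fields may differ slightly between llama-cpp-python versions:

```python
import ctypes
import llama_cpp

llama_cpp.llama_backend_init(False)  # numa=False; newer versions take no argument

model_params = llama_cpp.llama_model_default_params()
model_params.n_gpu_layers = 100
model = llama_cpp.llama_load_model_from_file(b"models/llama2-7b.q4_0.gguf", model_params)

ctx_params = llama_cpp.llama_context_default_params()
ctx_params.n_ctx = 512
ctx = llama_cpp.llama_new_context_with_model(model, ctx_params)

# Tokenize the prompt (add_bos=True, special=False).
prompt = b"Building a website can be done in 10 simple steps:\nStep 1:"
tokens = (llama_cpp.llama_token * ctx_params.n_ctx)()
n_tokens = llama_cpp.llama_tokenize(model, prompt, len(prompt), tokens, ctx_params.n_ctx, True, False)

# Single-sequence batch: decode the prompt, then feed back one token at a time.
batch = llama_cpp.llama_batch_init(ctx_params.n_ctx, 0, 1)
n_vocab = llama_cpp.llama_n_vocab(model)
n_past = 0
cur = [tokens[i] for i in range(n_tokens)]

for _ in range(64):
    batch.n_tokens = len(cur)
    for i, tok in enumerate(cur):
        batch.token[i] = tok
        batch.pos[i] = n_past + i
        batch.n_seq_id[i] = 1
        batch.seq_id[i][0] = 0
        batch.logits[i] = 0
    batch.logits[batch.n_tokens - 1] = 1  # only need logits for the last position

    if llama_cpp.llama_decode(ctx, batch) != 0:
        break
    n_past += batch.n_tokens

    # Greedy sampling: pick the highest-logit token from the last position.
    logits = llama_cpp.llama_get_logits_ith(ctx, batch.n_tokens - 1)
    next_tok = max(range(n_vocab), key=lambda i: logits[i])
    if next_tok == llama_cpp.llama_token_eos(model):
        break

    buf = (ctypes.c_char * 32)()
    n = llama_cpp.llama_token_to_piece(model, next_tok, buf, len(buf))
    print(buf[:n].decode("utf-8", errors="ignore"), end="", flush=True)
    cur = [next_tok]

llama_cpp.llama_batch_free(batch)
llama_cpp.llama_free(ctx)
llama_cpp.llama_free_model(model)
llama_cpp.llama_backend_free()
```

With this, the GPU produces sensible text for the same prompt that yields only "#" through the high-level `Llama` wrapper.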
Environment info:
GPU: RTX A6000
OS: Linux 6.6.0-0.rc5
CUDA SDK: 12.2
CUDA Drivers: 535.113.01
Thanks!