
Commit 7c898d5 breaks generation on GPU #1089

Closed
@iamlemec

Description


From commit 7c898d5 onwards, the output of any type of generation/completion on the GPU is just "#" repeated forever. For instance, using the example from README.md:

from llama_cpp import Llama

# load the model with all layers offloaded to the GPU
llm = Llama(model_path='models/llama2-7b.q4_0.gguf', n_gpu_layers=100)

# stream a completion for the README prompt
for s in llm('Building a website can be done in 10 simple steps:\nStep 1:', stream=True):
    print(s)

The output is the following, repeated indefinitely:

{'id': 'cmpl-14ed3b80-49af-453d-99a4-c7925f5680f7', 'object': 'text_completion', 'created': 1705351368, 'model': 'models/llama2-7b.q4_0.gguf', 'choices': [{'text': '#', 'index': 0, 'logprobs': None, 'finish_reason': None}]}

Generation works fine on the CPU and with previous commits, and the problem doesn't seem to be related to quantization or model type. Interestingly, generation also works with pure llama.cpp through the main example, for both CPU and GPU; I tested this on the current master and on the commits around the change above (notably 76484fb and 1d11838). I also managed to get generation working in llama-cpp-python using the low-level API, with simple batching and llama_decode; a sketch of that path is below.
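
For reference, the low-level path looks roughly like the following sketch (greedy decoding, single sequence; the 256-token buffer and 64-token generation cap are arbitrary choices for illustration). The exact ctypes signatures in llama_cpp shift between releases, so treat this as an outline of the approach rather than a verified reproduction.

import ctypes
import llama_cpp

llama_cpp.llama_backend_init(False)

# load the model with all layers offloaded, mirroring the high-level example
mparams = llama_cpp.llama_model_default_params()
mparams.n_gpu_layers = 100
model = llama_cpp.llama_load_model_from_file(b'models/llama2-7b.q4_0.gguf', mparams)

cparams = llama_cpp.llama_context_default_params()
ctx = llama_cpp.llama_new_context_with_model(model, cparams)

# tokenize the prompt (add_bos=True, special=False); 256 is an arbitrary cap
prompt = b'Building a website can be done in 10 simple steps:\nStep 1:'
toks = (llama_cpp.llama_token * 256)()
n_tok = llama_cpp.llama_tokenize(model, prompt, len(prompt), toks, 256, True, False)

# one batch for the whole prompt, requesting logits only for the last position
batch = llama_cpp.llama_batch_init(256, 0, 1)
for i in range(n_tok):
    batch.token[i] = toks[i]
    batch.pos[i] = i
    batch.n_seq_id[i] = 1
    batch.seq_id[i][0] = 0
    batch.logits[i] = 0
batch.logits[n_tok - 1] = 1
batch.n_tokens = n_tok
llama_cpp.llama_decode(ctx, batch)

# greedy sampling loop over the raw logits; 64 new tokens is an arbitrary limit
n_vocab = llama_cpp.llama_n_vocab(model)
pos = n_tok
for _ in range(64):
    logits = llama_cpp.llama_get_logits_ith(ctx, batch.n_tokens - 1)
    next_tok = max(range(n_vocab), key=lambda j: logits[j])
    buf = ctypes.create_string_buffer(64)
    n = llama_cpp.llama_token_to_piece(model, next_tok, buf, len(buf))
    print(buf.raw[:n].decode('utf-8', errors='ignore'), end='', flush=True)
    # feed the sampled token back in as a single-token batch
    batch.n_tokens = 1
    batch.token[0] = next_tok
    batch.pos[0] = pos
    batch.n_seq_id[0] = 1
    batch.seq_id[0][0] = 0
    batch.logits[0] = 1
    llama_cpp.llama_decode(ctx, batch)
    pos += 1

llama_cpp.llama_batch_free(batch)
llama_cpp.llama_free(ctx)
llama_cpp.llama_free_model(model)
llama_cpp.llama_backend_free()

With this loop the output comes back as normal text instead of the repeated "#", which is why I suspect the regression is in the high-level completion path rather than in the CUDA backend itself.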

Environment info:

GPU: RTX A6000
OS: Linux 6.6.0-0.rc5
CUDA SDK: 12.2
CUDA Drivers: 535.113.01

Thanks!
