Token generation broken on CUDA when `offload_kqv` is `false`

Originally spotted by @iamlemec in https://github.com/abetlen/llama-cpp-python/issues/1089 and reproduced with llama.cpp by passing `--no_kv_offload` to `./main`. The bug causes the model to generate repeated `#` characters instead of a valid completion.
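
A rough reproduction sketch from the llama-cpp-python side, in case it helps with triage. The helper names `looks_degenerate` and `reproduce`, the prompt, the threshold, and `n_gpu_layers=-1` are my own assumptions for illustration and are not taken from the linked issue; the model path is whatever GGUF file you have locally.

```python
def looks_degenerate(text: str, char: str = "#", threshold: float = 0.9) -> bool:
    """Return True if the non-whitespace part of `text` is almost entirely `char`.

    Hypothetical helper: the 0.9 threshold is an arbitrary choice.
    """
    stripped = "".join(text.split())  # drop all whitespace
    if not stripped:
        return False
    return stripped.count(char) / len(stripped) >= threshold


def reproduce(model_path: str, prompt: str = "The capital of France is") -> bool:
    """Run one short completion with KV offload disabled and check the output.

    Needs a CUDA build of llama-cpp-python and a local model file.
    """
    # Imported lazily so looks_degenerate() works without llama-cpp-python installed.
    from llama_cpp import Llama

    llm = Llama(
        model_path=model_path,
        n_gpu_layers=-1,    # assumption: offload all layers to the GPU
        offload_kqv=False,  # the setting reported to trigger the bug
        verbose=False,
    )
    out = llm(prompt, max_tokens=32)
    text = out["choices"][0]["text"]
    print(repr(text))
    return looks_degenerate(text)
```

If the bug is present, `reproduce("path/to/model.gguf")` should return `True`; flipping `offload_kqv` to `True` in the same script is a quick way to confirm that the setting is what makes the difference.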