
Commit 6b64bb9

Merge pull request #147 from mohamednaji7/BitsAndBytes

Completing the "bitsandbytes" option, based on https://docs.vllm.ai/en/stable/quantization/bnb.html

2 parents: d77c53c + 0428824
File tree: 4 files changed (+9, -4 lines)

README.md (1 addition, 1 deletion)

@@ -125,7 +125,7 @@ Below is a summary of the available RunPod Worker images, categorized by image s
 | `MAX_NUM_SEQS` | 256 | `int` | Maximum number of sequences per iteration. |
 | `MAX_LOGPROBS` | 20 | `int` | Max number of log probs to return when logprobs is specified in SamplingParams. |
 | `DISABLE_LOG_STATS` | False | `bool` | Disable logging statistics. |
-| `QUANTIZATION` | None | ['awq', 'squeezellm', 'gptq'] | Method used to quantize the weights. |
+| `QUANTIZATION` | None | ['awq', 'squeezellm', 'gptq', 'bitsandbytes'] | Method used to quantize the weights. |
 | `ROPE_SCALING` | None | `dict` | RoPE scaling configuration in JSON format. |
 | `ROPE_THETA` | None | `float` | RoPE theta. Use with rope_scaling. |
 | `TOKENIZER_POOL_SIZE` | 0 | `int` | Size of tokenizer pool to use for asynchronous tokenization. |
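The README change adds 'bitsandbytes' to the values accepted by the `QUANTIZATION` environment variable. As a minimal sketch of how such an env var could be read and validated (the helper name and validation logic here are assumptions for illustration, not the worker's actual code):

```python
import os

# Methods the README table lists for QUANTIZATION after this commit.
ALLOWED_QUANTIZATION = {"awq", "squeezellm", "gptq", "bitsandbytes"}

def read_quantization(env=None):
    """Read and validate QUANTIZATION from the environment (hypothetical helper)."""
    env = os.environ if env is None else env
    value = env.get("QUANTIZATION")
    if value in (None, "", "None"):
        return None  # default from the table: no quantization
    if value.lower() not in ALLOWED_QUANTIZATION:
        raise ValueError(f"Unsupported quantization method: {value}")
    return value.lower()
```

For example, a worker started with `QUANTIZATION=bitsandbytes` would get `"bitsandbytes"` back, while an unset variable yields `None`.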

builder/requirements.txt (2 additions, 1 deletion)

@@ -4,8 +4,9 @@ pyarrow
 runpod~=1.7.7
 huggingface-hub
 packaging
-typing-extensions==4.7.1
+typing-extensions>=4.8.0
 pydantic
 pydantic-settings
 hf-transfer
 transformers
+bitsandbytes>=0.45.0

src/engine_args.py (3 additions, 0 deletions)

@@ -147,6 +147,9 @@ def get_engine_args():

     # Rename and match to vllm args
     args = match_vllm_args(args)
+
+    if args.get("load_format") == "bitsandbytes":
+        args["quantization"] = args["load_format"]

     # Set tensor parallel size and max parallel loading workers if more than 1 GPU is available
     num_gpus = device_count()
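The three added lines make `quantization` follow `load_format`, since per the linked vLLM bitsandbytes docs both options need to be set to 'bitsandbytes' for in-flight quantization. A standalone sketch of that rule, assuming the dict-shaped args seen in the diff (the function name is hypothetical):

```python
def apply_bitsandbytes_rule(args):
    """If load_format is 'bitsandbytes', force quantization to match (sketch of the diff's rule)."""
    args = dict(args)  # copy so the caller's dict is left untouched
    if args.get("load_format") == "bitsandbytes":
        # Mirrors the committed change: quantization is forced to the load format.
        args["quantization"] = args["load_format"]
    return args
```

Any other `load_format` value passes through unchanged, so the existing `QUANTIZATION` behavior is unaffected.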

worker-config.json (3 additions, 2 deletions)

@@ -802,14 +802,15 @@
       "env_var_name": "QUANTIZATION",
       "value": "",
       "title": "Quantization",
-      "description": "Method used to quantize the weights.",
+      "description": "Method used to quantize the weights.\nif the `Load Format` is 'bitsandbytes' then `Quantization` will be forced to 'bitsandbytes'",
       "required": false,
       "type": "select",
       "options": [
        { "value": "None", "label": "None" },
        { "value": "awq", "label": "AWQ" },
        { "value": "squeezellm", "label": "SqueezeLLM" },
-       { "value": "gptq", "label": "GPTQ" }
+       { "value": "gptq", "label": "GPTQ" },
+       { "value": "bitsandbytes", "label": "bitsandbytes" }
       ]
     },
     "ROPE_SCALING": {
