Commit 87f7ae2: Update README.md
1 parent b79fee8

File changed: infer/vllm/README.md (303 additions, 31 deletions)

Inference engine implementation using [vLLM](https://github.com/vllm-project/vllm).
## Usage examples

The vLLM integration currently has the following scripts:

- `runner`: Environment and experiment setup based on the execution `--mode`.
- `driver`: Implementation of the vLLM benchmark in direct (offline, static) mode.
- `client`: Client piece of server-mode benchmarking.
- `server`: Server piece of server-mode benchmarking.
- `process`: Process results.

The examples below cover direct and server modes on CUDA and Spyre.

### CUDA, direct mode

```bash
/path/to/fmwork/infer/vllm/runner \
--dir_work /path/to/workspace \
--mode direct \
--model_root /path/to/models \
--model_name meta-llama/Llama-3.1-8B-Instruct \
--env PYTHONUNBUFFERED=1 \
--env VLLM_USE_V1=1 \
-- \
driver \
--platform cuda \
--input_sizes 1024 \
--output_sizes 1,128 \
--batch_sizes 1 \
--tp_size 1 \
--reps 5 \
--engine:enable_prefix_caching@ False \
--engine:compilation_config:cudagraph_capture_sizes@ args.batch_sizes \
--engine:max_seq_len_to_capture@ 131072 \
--engine:max_num_seqs@ 64 \
--batch_multiplier 1
```
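Arguments before the first `--` separator configure the `runner` (workspace, mode, model location, and `--env` variables for the launched process); arguments after it are forwarded to the `driver` (or to `server`/`client` in server mode). Options of the form `--engine:<field>@ <value>` appear to map onto vLLM engine arguments, with `:` separating nested fields. As a rough illustration only, the sketch below shows approximately what the engine options above select when using vLLM's offline API directly; it is not the driver's actual code, and the model path is a placeholder.

```python
# Minimal sketch, assuming the engine options map to vLLM engine arguments as named.
# Not the driver's actual code; the model path is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/path/to/models/meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=1,                                # --tp_size 1
    enable_prefix_caching=False,                           # --engine:enable_prefix_caching@ False
    max_seq_len_to_capture=131072,                         # --engine:max_seq_len_to_capture@ 131072
    max_num_seqs=64,                                       # --engine:max_num_seqs@ 64
    compilation_config={"cudagraph_capture_sizes": [1]},   # capture sizes taken from --batch_sizes
)

# One request shaped like the benchmark sweep: 1024 dummy prompt tokens, 128 generated tokens.
params = SamplingParams(max_tokens=128, ignore_eos=True)
outputs = llm.generate([{"prompt_token_ids": [0] * 1024}], params)
print(len(outputs[0].outputs[0].token_ids))
```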
### CUDA, server mode

```bash
/path/to/fmwork/infer/vllm/runner \
--dir_work /path/to/workspace \
--mode server \
--model_root /path/to/models \
--model_name meta-llama/Llama-3.1-8B-Instruct \
-- \
server \
--env PYTHONUNBUFFERED=1 \
--env VLLM_USE_V1=1 \
--tensor-parallel-size 1 \
--no-enable-prefix-caching \
--max-num-seqs 1 \
-- \
client \
--env PYTHONUNBUFFERED=1 \
--dataset-name random \
--random-input-len 1024 \
--random-output-len 128 \
--num-prompts 128
```
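In server mode, the `server` block's flags mirror those of `vllm serve`, and the `client` block's flags mirror vLLM's serving benchmark with the random dataset. Run by hand, outside the runner, a roughly equivalent pair of commands might look like the sketch below (assuming a vLLM build that provides the `vllm bench serve` subcommand; paths are placeholders).

```bash
# Sketch only; the runner normally launches and coordinates both pieces.
# Terminal 1: OpenAI-compatible vLLM server.
vllm serve /path/to/models/meta-llama/Llama-3.1-8B-Instruct \
    --tensor-parallel-size 1 \
    --no-enable-prefix-caching \
    --max-num-seqs 1

# Terminal 2: load generator sending 128 random prompts of 1024 tokens, 128 output tokens each.
vllm bench serve \
    --model /path/to/models/meta-llama/Llama-3.1-8B-Instruct \
    --dataset-name random \
    --random-input-len 1024 \
    --random-output-len 128 \
    --num-prompts 128
```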

### Spyre, direct mode, CB disabled

```bash
/path/to/fmwork/infer/vllm/runner \
--dir_work /path/to/workspace \
--mode direct \
--model_root /path/to/models \
--model_name meta-llama/Llama-3.1-8B-Instruct \
--env PYTHONUNBUFFERED=1 \
--env DTLOG_LEVEL=error \
--env DT_DEEPRT_VERBOSE=-1 \
--env DTCOMPILER_KEEP_EXPORT=-1 \
--env TORCH_SENDNN_LOG=CRITICAL \
--env VLLM_USE_V1=1 \
--env VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
--env FLEX_RDMA_MODE_FULL=FALSE \
--env FLEX_HDMA_MODE_FULL=1 \
--env OMP_NUM_THREADS=32 \
--env VLLM_SPYRE_WARMUP_PROMPT_LENS=1024 \
--env VLLM_SPYRE_WARMUP_NEW_TOKENS=128 \
--env VLLM_SPYRE_WARMUP_BATCH_SIZES=1 \
-- \
driver \
--platform spyre \
--input_sizes 1024 \
--output_sizes 1,128 \
--batch_sizes 1 \
--tp_size 4 \
--engine:max_model_len@ 2048 \
--engine:max_num_seqs@ 1 \
--engine:enable_prefix_caching@ False \
--engine:compilation_config:cudagraph_capture_sizes@ args.batch_sizes \
--engine:max_seq_len_to_capture@ 131072 \
--batch_multiplier 1 \
--reps 5
```
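Here CB refers to the Spyre backend's continuous batching mode, toggled with `VLLM_SPYRE_USE_CB=1` in the next example. With CB disabled, the `VLLM_SPYRE_WARMUP_*` variables pre-compile fixed shapes, so they should mirror the swept `--input_sizes`, `--output_sizes`, and `--batch_sizes`; the CB-enabled example below drops them. Assuming the plugin accepts comma-separated shape lists (check the vllm-spyre documentation for your version), a sweep over two batch sizes could be warmed up as in the sketch below.

```bash
# Assumption: vllm-spyre reads comma-separated warmup shape lists, one entry per
# (prompt length, new tokens, batch size) combination to pre-compile.
# Pass these via the runner's --env flags, or export them directly as shown here.
export VLLM_SPYRE_WARMUP_PROMPT_LENS=1024,1024
export VLLM_SPYRE_WARMUP_NEW_TOKENS=128,128
export VLLM_SPYRE_WARMUP_BATCH_SIZES=1,2
```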

### Spyre, direct mode, CB enabled

```bash
/path/to/fmwork/infer/vllm/runner \
--dir_work /path/to/workspace \
--mode direct \
--model_root /path/to/models \
--model_name meta-llama/Llama-3.1-8B-Instruct \
--env PYTHONUNBUFFERED=1 \
--env DTLOG_LEVEL=error \
--env DT_DEEPRT_VERBOSE=-1 \
--env DTCOMPILER_KEEP_EXPORT=-1 \
--env TORCH_SENDNN_LOG=CRITICAL \
--env VLLM_USE_V1=1 \
--env VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
--env FLEX_RDMA_MODE_FULL=FALSE \
--env FLEX_HDMA_MODE_FULL=1 \
--env OMP_NUM_THREADS=32 \
--env VLLM_SPYRE_USE_CB=1 \
-- \
driver \
--platform spyre \
--input_sizes 1024 \
--output_sizes 1,128 \
--batch_sizes 1 \
--tp_size 4 \
--engine:max_model_len@ 2048 \
--engine:max_num_seqs@ 1 \
--engine:enable_prefix_caching@ False \
--engine:compilation_config:cudagraph_capture_sizes@ args.batch_sizes \
--engine:max_seq_len_to_capture@ 131072 \
--batch_multiplier 1 \
--reps 5
```

### Spyre, server mode, CB disabled

```bash
/path/to/fmwork/infer/vllm/runner \
--dir_work /path/to/workspace \
--dir_pref 20250813-tests/005 \
--mode server \
--model_root /path/to/models \
--model_name meta-llama/Llama-3.1-8B-Instruct \
-- \
server \
--env PYTHONUNBUFFERED=1 \
--env DTLOG_LEVEL=error \
--env DT_DEEPRT_VERBOSE=-1 \
--env DTCOMPILER_KEEP_EXPORT=-1 \
--env TORCH_SENDNN_LOG=CRITICAL \
--env VLLM_USE_V1=1 \
--env VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
--env FLEX_RDMA_MODE_FULL=FALSE \
--env FLEX_HDMA_MODE_FULL=1 \
--env OMP_NUM_THREADS=32 \
--env VLLM_SPYRE_WARMUP_PROMPT_LENS=1024 \
--env VLLM_SPYRE_WARMUP_NEW_TOKENS=128 \
--env VLLM_SPYRE_WARMUP_BATCH_SIZES=1 \
--no-enable-prefix-caching \
--max-model-len 2048 \
--max-num-seqs 1 \
--tensor-parallel-size 4 \
-- \
client \
--env PYTHONUNBUFFERED=1 \
--dataset-name random \
--random-input-len 1024 \
--random-output-len 128 \
--num-prompts 16
```

### Spyre, server mode, CB enabled

```bash
/path/to/fmwork/infer/vllm/runner \
--dir_work /path/to/workspace \
--mode server \
--model_root /path/to/models \
--model_name meta-llama/Llama-3.1-8B-Instruct \
-- \
server \
--env PYTHONUNBUFFERED=1 \
--env DTLOG_LEVEL=error \
--env DT_DEEPRT_VERBOSE=-1 \
--env DTCOMPILER_KEEP_EXPORT=-1 \
--env TORCH_SENDNN_LOG=CRITICAL \
--env VLLM_USE_V1=1 \
--env VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 \
--env FLEX_RDMA_MODE_FULL=FALSE \
--env FLEX_HDMA_MODE_FULL=1 \
--env OMP_NUM_THREADS=32 \
--env VLLM_SPYRE_USE_CB=1 \
--no-enable-prefix-caching \
--max-model-len 2048 \
--max-num-seqs 1 \
--tensor-parallel-size 4 \
-- \
client \
--env PYTHONUNBUFFERED=1 \
--dataset-name random \
--random-input-len 1024 \
--random-output-len 128 \
--num-prompts 16
```

## Processing results

Example of output from the first example above (executed on an NVIDIA H100):

```
FMWORK SETUP 65.255954

--------------------------------------------------------------------------------
RUN 1024 / 1 / 1
--------------------------------------------------------------------------------

/net/storage149/mnt/md0/nmg/projects/fmwork/github.com/IBM/dev/fmwork/infer/vllm/driver:159: DeprecationWarning: The keyword arguments {'prompt_token_ids'} are deprecated and will be removed in a future update. Please use the 'prompts' parameter instead.
timings = bench_combo_rep(
FMWORK REP 1 5 1755063804.569356785 1755063811.294698536 meta-llama/Llama-3.1-8B-Instruct/main 1024 1 1 1 6.725341751 6725.342 0.1
FMWORK REP 2 5 1755063811.295244652 1755063811.324200424 meta-llama/Llama-3.1-8B-Instruct/main 1024 1 1 1 0.028955772 28.956 34.5
FMWORK REP 3 5 1755063811.325461863 1755063811.352333914 meta-llama/Llama-3.1-8B-Instruct/main 1024 1 1 1 0.026872051 26.872 37.2
FMWORK REP 4 5 1755063811.353573998 1755063811.380560835 meta-llama/Llama-3.1-8B-Instruct/main 1024 1 1 1 0.026986837 26.987 37.1
FMWORK REP 5 5 1755063811.381792484 1755063811.409563334 meta-llama/Llama-3.1-8B-Instruct/main 1024 1 1 1 0.027770850 27.771 36.0

Timestamp start = 1755063804.569356785
Timestamp end = 1755063811.409563334
Model name = meta-llama/Llama-3.1-8B-Instruct/main
Input size = 1024
Output size = 1
Batch size = 1
Batch size multiplier = 1
Tensor parallel size = 1
Relative med. abs. dev. = 0.016
RES: Inference time (s) = 0.027
RES: Inter-token latency (ms) = 27.379
RES: Throughput (tok/s) = 36.5

FMWORK RES 1755063804.569356785 1755063811.409563334 meta-llama/Llama-3.1-8B-Instruct/main 1024 1 1 1 0.016 0.027 27.379 36.5

--------------------------------------------------------------------------------
RUN 1024 / 128 / 1
--------------------------------------------------------------------------------

FMWORK REP 1 5 1755063811.412279354 1755063812.404891550 meta-llama/Llama-3.1-8B-Instruct/main 1024 128 1 1 0.992612196 7.755 129.0
FMWORK REP 2 5 1755063812.406126905 1755063813.361175403 meta-llama/Llama-3.1-8B-Instruct/main 1024 128 1 1 0.955048498 7.461 134.0
FMWORK REP 3 5 1755063813.362408263 1755063814.279993320 meta-llama/Llama-3.1-8B-Instruct/main 1024 128 1 1 0.917585057 7.169 139.5
FMWORK REP 4 5 1755063814.281200124 1755063815.197474959 meta-llama/Llama-3.1-8B-Instruct/main 1024 128 1 1 0.916274835 7.158 139.7
FMWORK REP 5 5 1755063815.198693247 1755063816.115997318 meta-llama/Llama-3.1-8B-Instruct/main 1024 128 1 1 0.917304071 7.166 139.5

Timestamp start = 1755063811.412279354
Timestamp end = 1755063816.115997318
Model name = meta-llama/Llama-3.1-8B-Instruct/main
Input size = 1024
Output size = 128
Batch size = 1
Batch size multiplier = 1
Tensor parallel size = 1
Relative med. abs. dev. = 0.001
RES: Inference time (s) = 0.917
RES: Inter-token latency (ms) = 7.168
RES: Throughput (tok/s) = 139.5

FMWORK RES 1755063811.412279354 1755063816.115997318 meta-llama/Llama-3.1-8B-Instruct/main 1024 128 1 1 0.001 0.917 7.168 139.5

Timestamp start = 1755063811.412279354
Timestamp end = 1755063816.115997318
Model name = meta-llama/Llama-3.1-8B-Instruct/main
Input size = 1024
Output size = 128
Batch size = 1
Batch size multiplier = 1
Tensor parallel size = 1
Relative med. abs. dev. TTFT = 0.016
Relative med. abs. dev. INF = 0.001
GEN: [ INF ] Inference time (s) = 0.917
GEN: [ GEN ] Generation time (s) = 0.890
GEN: [ TTFT ] Time to first token (s) = 0.027
GEN: [ ITL ] Inter-token latency (ms) = 6.954
GEN: [ THP ] Throughput (tok/s) = 143.8

FMWORK GEN 1755063811.412279354 1755063816.115997318 meta-llama/Llama-3.1-8B-Instruct/main 1024 128 1 1 0.016 0.001 0.917 0.890 0.027 6.954 143.8
```
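The `FMWORK REP`, `FMWORK RES`, and `FMWORK GEN` lines are space-separated records, so metrics can be pulled straight out of a run log. For example (field positions inferred from the sample above, and the log path is a placeholder, so verify both against your own runs):

```bash
# Print input size, output size, batch size, and throughput (tok/s) from each FMWORK GEN line.
grep '^FMWORK GEN' /path/to/workspace/experiment.log | awk '{print $6, $7, $8, $16}'
```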

After running the `process` script (providing the path to the experiment folder):

```json
[
  {
    "timestamp": "1755063811.412279354",
    "metadata_id": null,
    "engine": "fmwork/infer/vllm",
    "model": "meta-llama/Llama-3.1-8B-Instruct/main",
    "precision": null,
    "input": 1024,
    "output": 128,
    "batch": 1,
    "tp": 1,
    "opts": [
      "--env PYTHONUNBUFFERED=1",
      "--env VLLM_USE_V1=1",
      "--batch_multiplier 1",
      "--batch_sizes 1",
      "--engine:compilation_config:cudagraph_capture_sizes@ args.batch_sizes",
      "--engine:enable_prefix_caching@ False",
      "--engine:max_num_seqs@ 64",
      "--engine:max_seq_len_to_capture@ 131072",
      "--input_sizes 1024",
      "--model_name meta-llama/Llama-3.1-8B-Instruct/main",
      "--model_root /net/storage149/autofs/css22/nmg/models/hf",
      "--output_sizes 1,128",
      "--platform cuda",
      "--reps 5",
      "--tp_size 1"
    ],
    "warmup": null,
    "setup": 65.255954,
    "ttft": 0.027,
    "itl": 6.954,
    "thp": 143.8
  }
]
```
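The resulting JSON is easy to post-process further. A hypothetical invocation and query follow (the exact `process` interface may differ; check its help output):

```bash
# Hypothetical: run process on the experiment folder created by the runner and save the JSON.
/path/to/fmwork/infer/vllm/process /path/to/workspace/<experiment-folder> > results.json

# Tabulate input, output, batch, tensor-parallel size, and throughput with jq.
jq -r '.[] | [.input, .output, .batch, .tp, .thp] | @tsv' results.json
```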
