
Commit fbfc1e9

some polishing (#10)
1 parent 1fe4ab0 commit fbfc1e9

7 files changed: 88 additions, 87 deletions

Dockerfile

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-FROM nvcr.io/nvidia/tritonserver:21.10-py3
+FROM nvcr.io/nvidia/tritonserver:21.11-py3
 
 # see .dockerignore to check what is transfered
 COPY . ./

README.md

Lines changed: 73 additions & 75 deletions
@@ -9,24 +9,24 @@
 #### Table of Contents
 
 * [🤔 why this tool?](#why-this-tool)
-* [🤓 1 command process](#single-command)
+* [🏗️ Installation](#installation)
+* [🤓 run (1 command)](#run-in-a-single-command)
+* [🐍 TensorRT usage in Python script](#tensorrt-usage-in-python-script)
 * [⏱ benchmarks](#benchmarks)
-* [🤗 end to end reproduction of Infinity Hugging Face demo](./demo/README.md)
-* [🏗️ build from sources](#install-from-sources)
-* [🐍 TensorRT usage in Python script](#usage-in-python-script)
+* [🤗 end to end reproduction of Infinity Hugging Face demo](./demo/README.md) (to replay [Medium article](https://towardsdatascience.com/hugging-face-transformer-inference-under-1-millisecond-latency-e1be0057a51c?source=friends_link&sk=cd880e05c501c7880f2b9454830b8915))
 
 #### Why this tool?
 
-🐢
-Most tutorials on transformer deployment in production are built over [`Pytorch`](https://pytorch.org/) + [`FastAPI`](https://fastapi.tiangolo.com/).
+[`Pytorch`](https://pytorch.org/) + [`FastAPI`](https://fastapi.tiangolo.com/) = 🐢
+Most tutorials on transformer deployment in production are built over Pytorch and FastAPI.
 Both are great tools but not very performant in inference.
 
-️🏃💨
-Then, if you spend some time, you can build something over [`Microsoft ONNX Runtime`](https://github.com/microsoft/onnxruntime/) + [`Nvidia Triton inference server`](https://github.com/triton-inference-server/server).
+[`Microsoft ONNX Runtime`](https://github.com/microsoft/onnxruntime/) + [`Nvidia Triton inference server`](https://github.com/triton-inference-server/server) = ️🏃💨
+Then, if you spend some time, you can build something over ONNX Runtime and Triton inference server.
 You will usually get from 2X to 4X faster inference compared to vanilla Pytorch. It's cool!
 
-⚡️🏃💨💨
-However, if you want the best in class performances on GPU, there is only a single choice: [`Nvidia TensorRT`](https://github.com/NVIDIA/TensorRT/) + [`Nvidia Triton inference server`](https://github.com/triton-inference-server/server).
+[`Nvidia TensorRT`](https://github.com/NVIDIA/TensorRT/) + [`Nvidia Triton inference server`](https://github.com/triton-inference-server/server) = ⚡️🏃💨💨
+However, if you want the best in class performances on GPU, there is only a single possible combination: Nvidia TensorRT and Triton.
 You will usually get 5X faster inference compared to vanilla Pytorch.
 Sometimes it can raises up to **10X faster inference**.
 Buuuuttt... TensorRT is not easy to use, even less with Transformer models, it requires specific tricks not easy to come with.
@@ -36,7 +36,40 @@ Buuuuttt... TensorRT is not easy to use, even less with Transformer models, it r
 > read [📕 Hugging Face Transformer inference UNDER 1 millisecond latency 📖](https://towardsdatascience.com/hugging-face-transformer-inference-under-1-millisecond-latency-e1be0057a51c?source=friends_link&sk=cd880e05c501c7880f2b9454830b8915)
 > <img src="resources/rabbit.jpg" width="120">
 
-## Single command
+## Installation
+
+<details><summary>Required dependencies</summary>
+
+To install this package locally, you need:
+
+**TensorRT GA build**
+* [TensorRT](https://developer.nvidia.com/nvidia-tensorrt-download) v8.2.0.6
+
+**System Packages**
+* [CUDA](https://developer.nvidia.com/cuda-toolkit)
+  * Recommended versions:
+  * cuda-11.4.x + cuDNN-8.2
+  * cuda-10.2 + cuDNN-8.2
+* [GNU make](https://ftp.gnu.org/gnu/make/) >= v4.1
+* [cmake](https://github.com/Kitware/CMake/releases) >= v3.13
+* [python](<https://www.python.org/downloads/>) >= v3.6.9
+* [pip](https://pypi.org/project/pip/#history) >= v19.0
+
+</details>
+
+```shell
+git clone git@github.com:ELS-RD/transformer-deploy.git
+cd transformer-deploy
+pip3 install .[GPU] -f https://download.pytorch.org/whl/cu113/torch_stable.html
+```
+
+To build your own version of the Docker image:
+
+```shell
+make docker_build
+```
+
+## Run in a single command
 
 With the single command below, you will:
 
@@ -49,14 +82,7 @@ With the single command below, you will:
 * **generate** configuration files for Triton inference server
 
 ```shell
-docker run -it --rm \
-  --gpus all \
-  -v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.1.0 \
-  bash -c "cd /project && \
-    convert_model -m roberta-large-mnli \
-    --backend tensorrt onnx pytorch \
-    --seq-len 16 128 128 \
-    --batch-size 1 32 32"
+convert_model -m roberta-large-mnli --backend tensorrt onnx pytorch --seq-len 16 128 128 --batch-size 1 32 32
 ```
 
 > **16 128 128** -> minimum, optimal, maximum sequence length, to help TensorRT better optimize your model
@@ -66,11 +92,16 @@ docker run -it --rm \
 
 ```shell
 docker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 256m \
-  -v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:21.10-py3 \
-  bash -c "pip install transformers && tritonserver --model-repository=/models"
+  -v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:21.11-py3 \
+  bash -c "pip install transformers sentencepiece && tritonserver --model-repository=/models"
 ```
 
-> As you can see we install Transformers and then launch the server itself. This is of course a bad practice, you should make your own 2 lines Dockerfile with Transformers inside.
+> As you can see we install Transformers and then launch the server itself.
+> This is of course a bad practice, you should make your own 2 lines Dockerfile with Transformers inside.
+
+Right now, only TensorRT 8.0.3 backend is available in Triton.
+Until the TensorRT 8.2 backend is available, we advise you to only use ONNX Runtime Triton backend.
+TensorRT 8.2 is already available in preview and should be released at the end of november 2021.
 
 * Query the inference server:
 
@@ -83,10 +114,28 @@ curl -X POST http://localhost:8000/v2/models/transformer_onnx_inference/version
 
 > check [`demo`](./demo) folder to discover more performant ways to query the server from Python or elsewhere.
 
+### TensorRT usage in Python script
+
+If you just want to perform inference inside your Python script (without any server) and still get the best TensorRT performance, check:
+
+* [convert.py](./src/transformer_deploy/convert.py)
+* [trt_utils.py](./src/transformer_deploy/backends/trt_utils.py)
+
+#### High level explanations
+
+* call `load_engine()` to parse an existing TensorRT engine
+* setup a stream (for async call), a TensorRT runtime and a context
+* load your profile(s)
+* call `infer_tensorrt()`
+
+... and you are done! 🎉
+
+> if you are looking for inspiration, check [onnx-tensorrt](https://github.com/onnx/onnx-tensorrt)
+
 ## Benchmarks
 
-Most transformer encoder based models are supported like Bert, Roberta, miniLM, Camembert, Albert, XLM-R, Distilbert, etc.
-Best results are obtained with TensorRT 8.2 (preview).
+Most transformer encoder based models are supported like Bert, Roberta, miniLM, Camembert, Albert, XLM-R, Distilbert, etc.
+**Best results are obtained with TensorRT 8.2 (preview).**
 Below examples are representative of the performance gain to expect from this library.
 Other improvements not shown here include GPU memory usage decrease, multi stream, etc.
@@ -231,54 +280,3 @@ latencies:
 ```
 
 </details>
-
-## Install from sources
-
-```shell
-git clone git@github.com:ELS-RD/triton_transformers.git
-cd triton_transformers
-pip3 install .[GPU] -f https://download.pytorch.org/whl/cu113/torch_stable.html
-```
-
-### Prerequisites
-
-To run this package locally, you need:
-
-**TensorRT GA build**
-* [TensorRT](https://developer.nvidia.com/nvidia-tensorrt-download) v8.2.0.6
-
-**System Packages**
-* [CUDA](https://developer.nvidia.com/cuda-toolkit)
-  * Recommended versions:
-  * cuda-11.4.x + cuDNN-8.2
-  * cuda-10.2 + cuDNN-8.2
-* [GNU make](https://ftp.gnu.org/gnu/make/) >= v4.1
-* [cmake](https://github.com/Kitware/CMake/releases) >= v3.13
-* [python](<https://www.python.org/downloads/>) >= v3.6.9
-* [pip](https://pypi.org/project/pip/#history) >= v19.0
-
-### Docker build
-
-You can also build your own version of the Docker image:
-
-```shell
-make docker_build
-```
-
-### Usage in Python script
-
-If you just want to perform inference inside your Python script (without any server) and still get the best TensorRT performance, check:
-
-* [convert.py](./src/transformer_deploy/convert.py)
-* [trt_utils.py](./src/transformer_deploy/backends/trt_utils.py)
-
-#### High level explanations
-
-* call `load_engine()` to parse an existing TensorRT engine
-* setup a stream (for async call), a TensorRT runtime and a context
-* load your profile(s)
-* call `infer_tensorrt()`
-
-... and you are done! 🎉
-
-> if you are looking for inspiration, check [onnx-tensorrt](https://github.com/onnx/onnx-tensorrt)
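The "High level explanations" steps added to the README above map roughly to the flow sketched below. This is a minimal sketch built on the public TensorRT and pycuda Python APIs, not the exact signatures of this repo's `load_engine()` / `infer_tensorrt()` helpers in `trt_utils.py`; the engine path and tensor names (`input_ids`, `attention_mask`) are assumptions to adapt to your exported model.

```python
# Minimal sketch: deserialize an engine, create a runtime/context/stream,
# set input shapes allowed by the optimization profile, run async inference.
import numpy as np
import pycuda.autoinit  # noqa: F401 -- creates a default CUDA context
import pycuda.driver as cuda
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)


def load_engine(engine_path: str) -> trt.ICudaEngine:
    # parse (deserialize) an engine previously built, e.g. by convert_model
    with open(engine_path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        return runtime.deserialize_cuda_engine(f.read())


def infer_tensorrt(context: trt.IExecutionContext, stream: cuda.Stream, inputs: dict):
    engine = context.engine
    bindings, outputs = [], []
    # 1) bind input shapes (they must fit inside the profile used at build time)
    for i in range(engine.num_bindings):
        if engine.binding_is_input(i):
            context.set_binding_shape(i, inputs[engine.get_binding_name(i)].shape)
    # 2) allocate host/device buffers and copy inputs to the GPU
    for i in range(engine.num_bindings):
        name = engine.get_binding_name(i)
        dtype = trt.nptype(engine.get_binding_dtype(i))
        host_mem = cuda.pagelocked_empty(tuple(context.get_binding_shape(i)), dtype)
        device_mem = cuda.mem_alloc(host_mem.nbytes)
        bindings.append(int(device_mem))
        if engine.binding_is_input(i):
            np.copyto(host_mem, inputs[name].astype(dtype))
            cuda.memcpy_htod_async(device_mem, host_mem, stream)
        else:
            outputs.append((host_mem, device_mem))
    # 3) run asynchronously on the stream, then copy outputs back
    context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    for host_mem, device_mem in outputs:
        cuda.memcpy_dtoh_async(host_mem, device_mem, stream)
    stream.synchronize()
    return [host for host, _ in outputs]


# engine path and tensor names below are assumptions, adapt them to your export
engine = load_engine("triton_models/transformer_tensorrt_model/1/model.plan")
context = engine.create_execution_context()
stream = cuda.Stream()
batch = {
    "input_ids": np.ones((1, 16), dtype=np.int32),
    "attention_mask": np.ones((1, 16), dtype=np.int32),
    # add "token_type_ids" here if your model expects it
}
print(infer_tensorrt(context, stream, batch))  # raw logits
```

In practice you would tokenize with a Hugging Face tokenizer first and reuse the buffers across calls instead of allocating per request.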

VERSION

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-0.1.0
+0.1.1-dev-1

demo/README.md

Lines changed: 4 additions & 4 deletions
@@ -74,7 +74,7 @@ Launch `Nvidia Triton inference server`:
 ```shell
 # add --shm-size 256m -> to have up to 4 Python backends (tokenizer) at the same time (64Mb per instance)
 docker run -it --rm --gpus all -p8000:8000 -p8001:8001 -p8002:8002 --shm-size 256m \
-  -v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:21.10-py3 \
+  -v $PWD/triton_models:/models nvcr.io/nvidia/tritonserver:21.11-py3 \
   bash -c "pip install transformers && tritonserver --model-repository=/models"
 ```
 
@@ -89,14 +89,14 @@ Measures:
 ```shell
 # need a local installation of the package
 # pip install .[GPU]
-ubuntu@ip-XXX:~/triton_transformers$ python3 demo/triton_client.py --length 16 --model tensorrt
+ubuntu@ip-XXX:~/transformer-deploy$ python3 demo/triton_client.py --length 16 --model tensorrt
 10/31/2021 12:09:34 INFO timing [triton transformers]: mean=1.53ms, sd=0.06ms, min=1.48ms, max=1.78ms, median=1.51ms, 95p=1.66ms, 99p=1.74ms
 [[-3.4355469 3.2753906]]
 ```
 
 * 128 tokens + TensorRT:
 ```shell
-ubuntu@ip-XXX:~/triton_transformers$ python3 demo/triton_client.py --length 128 --model tensorrt
+ubuntu@ip-XXX:~/transformer-deploy$ python3 demo/triton_client.py --length 128 --model tensorrt
 10/31/2021 12:12:00 INFO timing [triton transformers]: mean=1.96ms, sd=0.08ms, min=1.88ms, max=2.24ms, median=1.93ms, 95p=2.17ms, 99p=2.23ms
 [[-3.4589844 3.3027344]]
 ```
@@ -135,7 +135,7 @@ Model analyzer is a powerful tool to adjust the Triton server configuration.
 To run it:
 
 ```shell
-docker run -it --rm --gpus all -v $PWD:/project fast_transformer:0.1.0 \
+docker run -it --rm --gpus all -v $PWD:/project ghcr.io/els-rd/transformer-deploy:0.1.0 \
   bash -c "model-analyzer profile -f /project/demo/config_analyzer.yaml"
 ```
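For context on the `triton_client.py` measurements above, here is a hedged sketch of how such a client can query the Triton server over HTTP with `tritonclient` (already listed in `requirements_gpu.txt`). The model, input and output names (`transformer_tensorrt_inference`, `TEXT`, `output`) are assumptions; check the `config.pbtxt` files generated under `triton_models/` for the actual ones.

```python
# Hedged sketch: send one raw string to the Triton ensemble (tokenization happens
# server side in the Python backend) and read back the logits.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

text = "Some sentence to classify"
# Triton BYTES tensors travel as numpy object arrays of encoded strings
text_input = httpclient.InferInput(name="TEXT", shape=[1], datatype="BYTES")
text_input.set_data_from_numpy(np.asarray([text.encode("utf-8")], dtype=object))

wanted_output = httpclient.InferRequestedOutput(name="output", binary_data=False)
response = client.infer(
    model_name="transformer_tensorrt_inference",  # assumed ensemble name
    model_version="1",
    inputs=[text_input],
    outputs=[wanted_output],
)
print(response.as_numpy("output"))  # raw logits, e.g. [[-3.43  3.27]]
```

The gRPC client (`tritonclient.grpc`) exposes a near-identical interface if you prefer it.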

demo/config_analyzer.yaml

Lines changed: 7 additions & 4 deletions
@@ -6,7 +6,7 @@ client_protocol: 'http'
 triton_http_endpoint: 'localhost:8000'
 
 profile_models:
-  sts:
+  transformer_tensorrt_model:
     parameters:
       batch_sizes: 0
 
@@ -25,6 +25,9 @@ run_config_search_max_instance_count: 1
 perf_analyzer_flags:
   percentile: 95
   measurement-mode: time_windows
-  input-data: zero
-  # shape: |
-  #   input_ids:1,16 attention_mask:1,16
+  input-data:
+    - zero
+  shape:
+    - input_ids:1,16
+    - attention_mask:1,16
+    - token_type_ids:1,16

requirements_gpu.txt

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ nvidia-pyindex
 tritonclient[all]
 pycuda
 torch==1.10.0+cu113
-nvidia-tensorrt==8.0.3.4
+nvidia-tensorrt
 onnx_graphsurgeon
 polygraphy==0.33.0
 triton-model-analyzer

setup.py

Lines changed: 1 addition & 1 deletion
@@ -38,7 +38,7 @@
     description="Simple transformer model optimizer and deployment tool",
     long_description=open("README.md", "r", encoding="utf-8").read(),
     long_description_content_type="text/markdown",
-    url="https://github.com/ELS-RD/triton_transformers",
+    url="https://github.com/ELS-RD/transformer-deploy",
     package_dir={"": "src"},
     packages=find_packages(where="src"),
     install_requires=install_requires,
