#### Table of Contents

* [🤔 why this tool?](#why-this-tool)
* [🏗️ Installation](#installation)
* [🤓 run (1 command)](#run-in-a-single-command)
* [🐍 TensorRT usage in Python script](#tensorrt-usage-in-python-script)
* [⏱ benchmarks](#benchmarks)
* [🤗 end to end reproduction of Infinity Hugging Face demo](./demo/README.md) (to replay [Medium article](https://towardsdatascience.com/hugging-face-transformer-inference-under-1-millisecond-latency-e1be0057a51c?source=friends_link&sk=cd880e05c501c7880f2b9454830b8915))

#### Why this tool?

🐢
Most tutorials on transformer deployment in production are built over [`Pytorch`](https://pytorch.org/) and [`FastAPI`](https://fastapi.tiangolo.com/).
Both are great tools, but they are not very performant for inference.

️🏃💨
Then, if you spend some time, you can build something over [`Microsoft ONNX Runtime`](https://github.com/microsoft/onnxruntime/) and [`Nvidia Triton inference server`](https://github.com/triton-inference-server/server).
You will usually get 2X to 4X faster inference compared to vanilla Pytorch. It's cool!

⚡️🏃💨💨
However, if you want best-in-class performance on GPU, there is only a single choice: [`Nvidia TensorRT`](https://github.com/NVIDIA/TensorRT/) and [`Nvidia Triton inference server`](https://github.com/triton-inference-server/server).

> As you can see, we install Transformers and then launch the server itself.
> This is of course a bad practice; you should build your own two-line Dockerfile with Transformers inside.
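
A minimal sketch of such a Dockerfile is shown below; the base image tag is an assumption and should match the Triton release you actually deploy:

```dockerfile
# assumption: use the Triton image tag that matches the release you deploy
FROM nvcr.io/nvidia/tritonserver:21.10-py3
RUN pip3 install transformers
```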

Right now, only the TensorRT 8.0.3 backend is available in Triton.
Until the TensorRT 8.2 backend is available, we advise you to use only the ONNX Runtime Triton backend.
TensorRT 8.2 is already available in preview and should be released at the end of November 2021.

* Query the inference server:
> check [`demo`](./demo) folder to discover more performant ways to query the server from Python or elsewhere.
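
For instance, a hedged sketch of a Python client using the `tritonclient` package is shown below; the model name (`transformer_onnx_inference`) and the tensor names (`TEXT`, `output`) are assumptions and must match your generated Triton configuration:

```python
# Sketch only: query Triton over HTTP with the tritonclient package
# (pip install "tritonclient[http]").
# Assumption: the model is named "transformer_onnx_inference" and exposes a BYTES
# input "TEXT" and an output tensor "output"; adapt these to your Triton config.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="127.0.0.1:8000")

text = "The quick brown fox jumps over the lazy dog"
input_tensor = httpclient.InferInput(name="TEXT", shape=(1,), datatype="BYTES")
input_tensor.set_data_from_numpy(np.asarray([text.encode("utf-8")], dtype=object))

response = client.infer(
    model_name="transformer_onnx_inference",
    inputs=[input_tensor],
    outputs=[httpclient.InferRequestedOutput(name="output", binary_data=False)],
)
print(response.as_numpy("output"))  # model scores / logits
```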
### TensorRT usage in Python script
If you just want to perform inference inside your Python script (without any server) and still get the best TensorRT performance, check:
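
As a rough, generic illustration (not this repository's helper API), server-less TensorRT inference with the official `tensorrt` Python bindings and `pycuda` might look roughly like the sketch below; it assumes an engine already serialized to `model.plan` and a single dynamic-shape input at binding 0:

```python
# Generic sketch only: standalone TensorRT inference with the official Python bindings.
# Assumptions: TensorRT 8.x, pycuda installed, an engine serialized to "model.plan",
# and a single dynamic-shape input at binding index 0 (e.g. int32 token ids).
import numpy as np
import pycuda.autoinit  # noqa: F401  # creates a CUDA context on import
import pycuda.driver as cuda
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# deserialize an engine previously built from the ONNX model (e.g. with trtexec)
with open("model.plan", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

input_ids = np.random.randint(0, 1000, size=(1, 16)).astype(np.int32)
context.set_binding_shape(0, input_ids.shape)  # required with dynamic shapes

device_buffers, host_outputs = [], []
for i in range(engine.num_bindings):
    dtype = trt.nptype(engine.get_binding_dtype(i))
    if engine.binding_is_input(i):
        host_arr = np.ascontiguousarray(input_ids.astype(dtype))
    else:
        host_arr = np.empty(tuple(context.get_binding_shape(i)), dtype=dtype)
        host_outputs.append((i, host_arr))
    d_buf = cuda.mem_alloc(host_arr.nbytes)
    device_buffers.append(d_buf)
    if engine.binding_is_input(i):
        cuda.memcpy_htod(d_buf, host_arr)  # copy input to GPU

context.execute_v2([int(b) for b in device_buffers])  # synchronous inference

for i, host_arr in host_outputs:
    cuda.memcpy_dtoh(host_arr, device_buffers[i])  # copy outputs back to host
    print(host_arr.shape, host_arr.dtype)
```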