The inference system optimizer assigns GPU types to inference model servers and, for a given request traffic load and classes of service, decides the number of replicas and the batch size for each model (slides).
To build the container image:

    docker build -t inferno . --load

If running locally (not using an image), first install the prerequisites:

- lp_solve Mixed Integer Linear Programming (MILP) solver
- IBM CPLEX (optional): information and instructions on using IBM CPLEX as a solver
There are two ways to run the optimizer.

- Direct function calls: An example is provided in main.go (see the sketch after this list).

      cd demos/main
      go run main.go

- REST API server: The optimizer may run as a REST API server (steps).
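As a rough illustration of the direct-call path, the sketch below bundles the inputs this README names (a model, its class of service, and its request traffic load) and invokes a stand-in solve function. All type and function names here are hypothetical; the actual entry points are those exercised in demos/main/main.go.

```go
// Hypothetical sketch of calling the optimizer directly; names and
// signatures are illustrative only (see demos/main/main.go for the real
// example).
package main

import "fmt"

// serverSpec bundles the inputs named in this README: a model, its class
// of service, and its request traffic load. (Hypothetical type.)
type serverSpec struct {
	Model        string
	ServiceClass string
	ReqPerSec    float64
}

// optimize stands in for the optimizer's solve call; here it returns a
// fixed summary, whereas the real solver chooses the GPU type, number of
// replicas, and batch size for the given load and class of service.
func optimize(s serverSpec) string {
	return "gpuProfile=A100 numReplicas=2 batchSize=8" // placeholder decision
}

func main() {
	spec := serverSpec{Model: "llama-7b", ServiceClass: "premium", ReqPerSec: 20}
	fmt.Printf("%s/%s -> %s\n", spec.ServiceClass, spec.Model, optimize(spec))
}
```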
One may run the optimizer as part of an auto-scaling control system, in one of two ways.

- Kubernetes controller: Running in a Kubernetes cluster and using custom resources and a Kubernetes runtime controller, the optimizer may be exercised in reconciliation to updates to the Optimizer custom resource (reference).

- Optimization control loop: The control loop comprises (1) a Collector to get data about the inference servers through Prometheus and the server deployments, (2) an Optimizer to make decisions, (3) an Actuator to realize such decisions by updating the server deployments, and (4) a periodic Controller that has access to static and dynamic data. The control loop may run either externally or in a Kubernetes cluster (see the sketch below).
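A minimal sketch of such a control loop, assuming hypothetical Go interfaces for the four components (the repository's actual types and packages may differ):

```go
// Minimal sketch of the optimization control loop described above; the
// interfaces and type names are hypothetical stand-ins, not the
// repository's actual API.
package controlloop

import (
	"context"
	"log"
	"time"
)

// Collector gathers data about the inference servers (e.g. from Prometheus
// and the server deployments).
type Collector interface {
	Collect(ctx context.Context) (map[string]float64, error)
}

// Optimizer makes allocation decisions from the collected data.
type Optimizer interface {
	Optimize(metrics map[string]float64) (map[string]string, error)
}

// Actuator realizes decisions by updating the server deployments.
type Actuator interface {
	Apply(ctx context.Context, decisions map[string]string) error
}

// Controller drives the loop periodically and holds the components together
// with whatever static and dynamic data they need.
type Controller struct {
	Collector Collector
	Optimizer Optimizer
	Actuator  Actuator
	Period    time.Duration
}

// Run executes collect -> optimize -> actuate on every tick until the
// context is cancelled.
func (c *Controller) Run(ctx context.Context) {
	ticker := time.NewTicker(c.Period)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			metrics, err := c.Collector.Collect(ctx)
			if err != nil {
				log.Printf("collect: %v", err)
				continue
			}
			decisions, err := c.Optimizer.Optimize(metrics)
			if err != nil {
				log.Printf("optimize: %v", err)
				continue
			}
			if err := c.Actuator.Apply(ctx, decisions); err != nil {
				log.Printf("actuate: %v", err)
			}
		}
	}
}
```

Whether the loop runs externally or in a Kubernetes cluster, the difference would lie in the concrete implementations supplied for these components.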
The REST API specifications are documented.
Clone this repository and set the environment variable INFERNO_REPO to its path. Then run the server.

    cd $INFERNO_REPO/cmd/optimizer
    go run main.go [-F]

The default is to run the server in Stateless mode. Use the optional -F argument to run in Stateful mode (Description of modes). You may then curl API commands to http://localhost:8080.
- Deploy the optimizer as a deployment, along with a service on port 80, in namespace inferno in the cluster. (The deployment yaml file starts the server in a container with the -F flag.)

      cd $INFERNO_REPO/manifests/yamls
      kubectl apply -f deploy-optimizer.yaml

- Forward port to local host.

      kubectl port-forward service/inferno-optimizer -n inferno 8080:80

  You may then curl API commands (above) to http://localhost:8080.

- (Optional) Inspect logs.

      POD=$(kubectl get pod -l app=inferno-optimizer -n inferno -o jsonpath="{.items[0].metadata.name}")
      kubectl logs -f $POD -n inferno

- Cleanup.

      kubectl delete -f deploy-optimizer.yaml
Decision variables
For each pair of (class of service, model):
- gpuProfile: the GPU type allocated
- numReplicas: the number of replicas
- batchSize: the batch size, given continuous batching
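As a data-structure sketch (hypothetical Go types whose names mirror the list above, not the repository's actual code), the decisions can be viewed as a map from (class of service, model) pairs to these three variables:

```go
// Hypothetical representation of the decision variables above; names mirror
// this README rather than the repository's actual types.
package decisions

// Pair identifies a (class of service, model) combination.
type Pair struct {
	ServiceClass string
	Model        string
}

// Decision holds the three decision variables for one pair.
type Decision struct {
	GPUProfile  string // gpuProfile: the GPU type allocated
	NumReplicas int    // numReplicas: the number of replicas
	BatchSize   int    // batchSize: the batch size, given continuous batching
}

// Allocation maps each (class of service, model) pair to its decision.
type Allocation map[Pair]Decision
```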