The inference system optimizer assigns GPU types to inference model servers and, for a given request traffic load and classes of service, decides the number of replicas and the batch size for each model (slides).
To build the container image:

    docker build -t inferno . --load

If running locally (not using an image), first install the prerequisites:

- lp_solve Mixed Integer Linear Programming (MILP) solver
- IBM CPLEX (optional): information and instructions on using IBM CPLEX as a solver
There are two ways to run the optimizer.

- Direct function calls: An example is provided in main.go (see the sketch after this list).

      cd demos/main
      go run main.go

- REST API server: The optimizer may run as a REST API server (steps).
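As a rough illustration of the direct-call path, the sketch below bundles the inputs this README names (a model, its class of service, and its request traffic load) and invokes a stand-in solve function. All type and function names here are hypothetical; the actual entry points are those exercised in demos/main/main.go.

```go
// Hypothetical sketch of calling the optimizer directly; names and
// signatures are illustrative only (see demos/main/main.go for the real
// example).
package main

import "fmt"

// serverSpec bundles the inputs named in this README: a model, its class
// of service, and its request traffic load. (Hypothetical type.)
type serverSpec struct {
	Model        string
	ServiceClass string
	ReqPerSec    float64
}

// optimize stands in for the optimizer's solve call; here it returns a
// fixed summary, whereas the real solver chooses the GPU type, number of
// replicas, and batch size for the given load and class of service.
func optimize(s serverSpec) string {
	return "gpuProfile=A100 numReplicas=2 batchSize=8" // placeholder decision
}

func main() {
	spec := serverSpec{Model: "llama-7b", ServiceClass: "premium", ReqPerSec: 20}
	fmt.Printf("%s/%s -> %s\n", spec.ServiceClass, spec.Model, optimize(spec))
}
```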
One may run the optimizer as part of an auto-scaling control system, in one of two ways.

- Kubernetes controller: Running in a Kubernetes cluster and using custom resources and a Kubernetes runtime controller, the optimizer may be exercised in reconciliation to updates to the Optimizer custom resource (reference).

- Optimization control loop: The control loop comprises (1) a Collector to get data about the inference servers through Prometheus and the server deployments, (2) an Optimizer to make decisions, (3) an Actuator to realize such decisions by updating the server deployments, and (4) a periodic Controller that has access to static and dynamic data. The control loop may run either externally or in a Kubernetes cluster (see the sketch below).
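A minimal sketch of such a control loop, assuming hypothetical Go interfaces for the four components (the repository's actual types and packages may differ):

```go
// Minimal sketch of the optimization control loop described above; the
// interfaces and type names are hypothetical stand-ins, not the
// repository's actual API.
package controlloop

import (
	"context"
	"log"
	"time"
)

// Collector gathers data about the inference servers (e.g. from Prometheus
// and the server deployments).
type Collector interface {
	Collect(ctx context.Context) (map[string]float64, error)
}

// Optimizer makes allocation decisions from the collected data.
type Optimizer interface {
	Optimize(metrics map[string]float64) (map[string]string, error)
}

// Actuator realizes decisions by updating the server deployments.
type Actuator interface {
	Apply(ctx context.Context, decisions map[string]string) error
}

// Controller drives the loop periodically and holds the components together
// with whatever static and dynamic data they need.
type Controller struct {
	Collector Collector
	Optimizer Optimizer
	Actuator  Actuator
	Period    time.Duration
}

// Run executes collect -> optimize -> actuate on every tick until the
// context is cancelled.
func (c *Controller) Run(ctx context.Context) {
	ticker := time.NewTicker(c.Period)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			metrics, err := c.Collector.Collect(ctx)
			if err != nil {
				log.Printf("collect: %v", err)
				continue
			}
			decisions, err := c.Optimizer.Optimize(metrics)
			if err != nil {
				log.Printf("optimize: %v", err)
				continue
			}
			if err := c.Actuator.Apply(ctx, decisions); err != nil {
				log.Printf("actuate: %v", err)
			}
		}
	}
}
```

Whether the loop runs externally or in a Kubernetes cluster, the difference would lie in the concrete implementations supplied for these components.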
The REST API specifications are documented.
Clone this repository and set the environment variable INFERNO_REPO to its path. Then run the server.

    cd $INFERNO_REPO/cmd/optimizer
    go run main.go [-F]

The default is to run the server in Stateless mode. Use the optional -F argument to run in Stateful mode (Description of modes). You may then curl API commands to http://localhost:8080.
- Deploy the optimizer as a deployment, along with a service on port 80, in namespace inferno in the cluster. (The deployment yaml file starts the server in a container with the -F flag.)

      cd $INFERNO_REPO/manifests/yamls
      kubectl apply -f deploy-optimizer.yaml

- Forward port to local host.

      kubectl port-forward service/inferno-optimizer -n inferno 8080:80

  You may then curl API commands (above) to http://localhost:8080.

- (Optional) Inspect logs.

      POD=$(kubectl get pod -l app=inferno-optimizer -n inferno -o jsonpath="{.items[0].metadata.name}")
      kubectl logs -f $POD -n inferno

- Cleanup.

      kubectl delete -f deploy-optimizer.yaml
Decision variables
For each pair of (class of service, model):
- gpuProfile: the GPU type allocated
- numReplicas: the number of replicas
- batchSize: the batch size, given continuous batching
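As a data-structure sketch (hypothetical Go types whose names mirror the list above, not the repository's actual code), the decisions can be viewed as a map from (class of service, model) pairs to these three variables:

```go
// Hypothetical representation of the decision variables above; names mirror
// this README rather than the repository's actual types.
package decisions

// Pair identifies a (class of service, model) combination.
type Pair struct {
	ServiceClass string
	Model        string
}

// Decision holds the three decision variables for one pair.
type Decision struct {
	GPUProfile  string // gpuProfile: the GPU type allocated
	NumReplicas int    // numReplicas: the number of replicas
	BatchSize   int    // batchSize: the batch size, given continuous batching
}

// Allocation maps each (class of service, model) pair to its decision.
type Allocation map[Pair]Decision
```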