The control loop comprises (1) a Collector, which gathers data about the inference servers from Prometheus and the server deployments; (2) an Optimizer, which makes decisions; (3) an Actuator, which realizes those decisions by updating the server deployments; and (4) a periodic Controller, which has access to static and dynamic data. The control loop may run either externally or inside a Kubernetes cluster.

Following are the steps to run the optimization control loop external to a cluster.
- Create a Kubernetes cluster and make sure `$HOME/.kube/config` points to it.
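  For example, you can check which kubeconfig context is currently active with:

  ```
  # Prints the name of the active kubeconfig context.
  kubectl config current-context
  ```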
- Run the script to create terminals for the various components. You may need to install term and add terminal coloring support. (Hint: Change OSX Terminal Settings from Command Line)

  ```
  cd $REPO_BASE/scripts
  ./launch-terms.sh
  ```

  where `$REPO_BASE` is the path to this repository. In this demo there are five components: Collector, Optimizer, Actuator, Controller, and Load Emulator. The terminals for the Collector, Optimizer, Actuator, and Controller are light green, red, blue, and yellow, respectively; the Load Emulator terminal is orange. The green terminal is for interacting with the cluster through kubectl commands, and the beige terminal is for observing the currently running pods.
- Set the path to the (static and dynamic) data for the Controller (yellow). [Make sure `$REPO_BASE` is set.]

  ```
  export INFERNO_DATA_PATH=$REPO_BASE/sample-data/large/
  ```
- Set the environment in all of the (five) component terminals. [Make sure `$REPO_BASE` is set.]

  ```
  $REPO_BASE/scripts/setparms.sh
  ```
- Deploy sample deployments (green terminal) in namespace `infer`, representing three inference servers.

  ```
  kubectl apply -f ns.yaml
  kubectl apply -f dep1.yaml,dep2.yaml,dep3.yaml
  ```
- Observe (beige terminal) changes in the number of pods (replicas) for all inference servers (deployments).

  ```
  watch kubectl get pods -n infer
  ```
- Run the components.

  - Collector (light green), Optimizer (red), and Actuator (blue)

    ```
    go run main.go
    ```
  - Controller (yellow)

    ```
    go run main.go <controlPeriodInSec> <isDynamicMode>
    ```

    The control period dictates the frequency with which the Controller goes through a control loop (default 60). In addition, the Controller runs as a REST server with an endpoint `/invoke` for on-demand activation of the control loop; hence, periodic as well as aperiodic modes are supported simultaneously. Setting `controlPeriodInSec` to zero makes the Controller run in aperiodic mode only.

    ```
    curl http://$CONTROLLER_HOST:$CONTROLLER_PORT/invoke
    ```

    (The default is `localhost:3300`.)

    Further, there is an option to run the Controller in dynamic mode, meaning that the (static) data files are read at the beginning of every control cycle (default false). The Controller arguments may also be set through the environment variables `INFERNO_CONTROL_PERIOD` and `INFERNO_CONTROL_DYNAMIC`, respectively; command-line arguments override the environment variables.
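    For example, the Controller can be run in aperiodic mode only and triggered on demand, or configured through the environment variables instead of command-line arguments. This is a sketch with illustrative values, assuming booleans are given as `true`/`false`:

    ```
    # Aperiodic mode only: control period of zero, dynamic mode disabled (illustrative values).
    go run main.go 0 false

    # From another terminal, trigger a control cycle on demand.
    curl http://localhost:3300/invoke

    # Equivalent configuration through environment variables
    # (command-line arguments, if given, take precedence).
    export INFERNO_CONTROL_PERIOD=0
    export INFERNO_CONTROL_DYNAMIC=false
    go run main.go
    ```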
  - Load Emulator (orange)

    ```
    go run main.go <intervalInSec> <alpha (0,1)>
    ```

    The Load Emulator periodically, at an interval given by the argument `intervalInSec`, perturbs the request rate and the average number of tokens per request for all inference servers (deployments) in the cluster. The disturbance amount is normally distributed with zero mean and standard deviation `sigma`, where `sigma = alpha * originalUndisturbedValue`. (Default arguments are 60 and 0.5, respectively.)
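    For example, with the illustrative values below, the load is perturbed every 30 seconds with a standard deviation equal to 20% of each undisturbed value:

    ```
    # intervalInSec = 30, alpha = 0.2, so sigma = 0.2 * originalUndisturbedValue (illustrative values).
    go run main.go 30 0.2
    ```
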
- Cleanup

  - Stop all (five) components using Ctrl-C.

  - Delete sample deployments (green terminal).

    ```
    kubectl delete -f dep1.yaml,dep2.yaml,dep3.yaml
    kubectl delete -f ns.yaml
    ```
- To create a docker image for the control loop (excluding the Optimizer; instructions for the Optimizer are in the optimizer repository):

  ```
  docker build -t inferno-loop . --load
  ```

Following are the steps to run the optimization control loop within a cluster.

- Create or have access to a cluster.
- Clone this repository and set the environment variable `REPO_BASE` to its path.
- Create the namespace `inferno`, where all optimizer components will reside.

  ```
  cd $REPO_BASE/yamls/deploy
  kubectl apply -f ns.yaml
  ```
- Create a configmap populated with inferno static data, e.g. samples taken from the `large` directory.

  ```
  SAMPLE_DATA_PATH=$REPO_BASE/sample-data/large
  kubectl create configmap inferno-static-data -n inferno \
    --from-file=$SAMPLE_DATA_PATH/accelerator-data.json \
    --from-file=$SAMPLE_DATA_PATH/model-data.json \
    --from-file=$SAMPLE_DATA_PATH/serviceclass-data.json \
    --from-file=$SAMPLE_DATA_PATH/optimizer-data.json
  ```
- Create a configmap populated with inferno dynamic data (count of accelerator types).

  ```
  kubectl create configmap inferno-dynamic-data -n inferno --from-file=$SAMPLE_DATA_PATH/capacity-data.json
  ```
- Deploy inferno in the cluster.

  ```
  kubectl apply -f deploy-loop.yaml
  ```
- Get the inferno pod name.

  ```
  POD=$(kubectl get pod -l app=inferno -n inferno -o jsonpath="{.items[0].metadata.name}")
  ```
- Inspect the logs.

  ```
  kubectl logs -f $POD -n inferno -c controller
  kubectl logs -f $POD -n inferno -c collector
  kubectl logs -f $POD -n inferno -c optimizer
  kubectl logs -f $POD -n inferno -c actuator
  ```
- Create deployments representing inference servers in namespace `infer`.

  ```
  cd $REPO_BASE/yamls/workload
  kubectl apply -f ns.yaml
  kubectl apply -f dep1.yaml,dep2.yaml,dep3.yaml
  ```

  Note that each deployment should have the following labels set (a missing service class name defaults to Free):

  ```
  labels:
    inferno.server.managed: "true"
    inferno.server.name: vllm-001
    inferno.server.model: llama_13b
    inferno.server.class: Premium
    inferno.server.allocation.accelerator: MI250
  ```

  and some optional labels (if metrics are not available from Prometheus):

  ```
  labels:
    inferno.server.allocation.maxbatchsize: "8"
    inferno.server.load.rpm: "30"
    inferno.server.load.numtokens: "2048"
  ```
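  If the labels were not included in the manifests, they could also be applied to an existing deployment with `kubectl label`. This is a sketch that assumes a deployment named `dep1` and that the labels belong on the Deployment's metadata:

  ```
  # Hypothetical deployment name; adjust to match your manifests.
  kubectl label deployment dep1 -n infer \
    inferno.server.managed="true" \
    inferno.server.name=vllm-001 \
    inferno.server.model=llama_13b \
    inferno.server.class=Premium \
    inferno.server.allocation.accelerator=MI250
  ```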
- Observe changes in the number of pods (replicas) for all inference servers (deployments).

  ```
  watch kubectl get pods -n infer
  ```
- (Optional) Start a load emulator for the inference servers.

  ```
  cd $REPO_BASE/yamls/deploy
  kubectl apply -f load-emulator.yaml
  kubectl logs -f load-emulator -n inferno
  ```
- Invoke an inferno control loop.

  ```
  kubectl port-forward service/inferno -n inferno 8080:80
  curl http://localhost:8080/invoke
  ```
- Cleanup

  ```
  cd $REPO_BASE/yamls/deploy
  kubectl delete -f load-emulator.yaml
  kubectl delete -f deploy-loop.yaml
  kubectl delete configmap inferno-static-data inferno-dynamic-data -n inferno
  kubectl delete -f ns.yaml
  cd $REPO_BASE/yamls/workload
  kubectl delete -f dep1.yaml,dep2.yaml,dep3.yaml
  kubectl delete -f ns.yaml
  ```