Kubernetes AI Toolchain Operator (KAITO)

What is NEW!
Retrieval Augmented Generation (RAG) support is live! KAITO RAGEngine uses LlamaIndex and FAISS. Learn about it here!
Latest Release: Aug 7th, 2025. KAITO v0.6.0
First Release: Nov 15th, 2023. KAITO v0.1.0.

KAITO is an operator that automates AI/ML model inference and tuning workloads in a Kubernetes cluster. The target models are popular open-source large models such as phi-4 and llama. KAITO has the following key differentiators compared to most mainstream model deployment methodologies built on top of virtual machine infrastructures:

Manage large model files using container images. An OpenAI-compatible server is provided to perform inference calls.
Provide preset configurations to avoid adjusting workload parameters based on GPU hardware.
Provide support for popular open-source inference runtimes: vLLM and transformers.
Auto-provision GPU nodes based on model requirements.
Host large model images in the public Microsoft Container Registry (MCR) if the license allows.

With KAITO, the workflow of onboarding large AI inference models in Kubernetes is greatly simplified.

Architecture

KAITO follows the classic Kubernetes Custom Resource Definition (CRD)/controller design pattern. The user manages a workspace custom resource that describes the GPU requirements and the inference or tuning specification. KAITO controllers automate the deployment by reconciling the workspace custom resource.

The above figure presents the KAITO architecture overview. Its major components consist of:

Workspace controller: It reconciles the workspace custom resource, creates NodeClaim (explained below) custom resources to trigger node auto provisioning, and creates the inference or tuning workload (deployment, statefulset or job) based on the model preset configurations.
Node provisioner controller: The controller is named gpu-provisioner in the gpu-provisioner Helm chart. It uses the NodeClaim CRD, which originates from Karpenter, to interact with the workspace controller. It integrates with Azure Resource Manager REST APIs to add new GPU nodes to the AKS or AKS Arc cluster.

Note: The gpu-provisioner is an open-source component. It can be replaced by other controllers if they support Karpenter-core APIs.
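
The reconcile pattern described in this section can be illustrated with a toy, in-memory sketch. This is not KAITO's controller code; the `Workspace` and `Cluster` classes, their fields (including `node_count`), and the `reconcile` function are hypothetical names made up for illustration. The sketch only shows that reconciliation compares the desired state with the observed state and creates whatever is missing, so running it repeatedly is safe.

```python
from dataclasses import dataclass, field


@dataclass
class Workspace:
    """Hypothetical stand-in for the workspace custom resource."""
    name: str
    instance_type: str
    preset: str
    node_count: int = 1


@dataclass
class Cluster:
    """Hypothetical in-memory view of what already exists in the cluster."""
    node_claims: dict = field(default_factory=dict)  # workspace name -> list of claim names
    workloads: dict = field(default_factory=dict)    # workspace name -> preset name


def reconcile(ws: Workspace, cluster: Cluster) -> list:
    """Create whatever is missing for ws and return the actions taken."""
    actions = []
    claims = cluster.node_claims.setdefault(ws.name, [])
    # Step 1: request GPU nodes until the desired count is reached.
    while len(claims) < ws.node_count:
        claim = f"{ws.name}-claim-{len(claims)}"
        claims.append(claim)
        actions.append(f"create NodeClaim {claim} ({ws.instance_type})")
    # Step 2: create the inference workload only if it does not exist yet.
    if ws.name not in cluster.workloads:
        cluster.workloads[ws.name] = ws.preset
        actions.append(f"create workload for preset {ws.preset}")
    return actions


if __name__ == "__main__":
    ws = Workspace("workspace-phi-3-5-mini", "Standard_NC24ads_A100_v4", "phi-3.5-mini-instruct")
    cluster = Cluster()
    print(reconcile(ws, cluster))  # first pass creates a NodeClaim and a workload
    print(reconcile(ws, cluster))  # second pass finds nothing to do
```

In the real system the two steps are split across controllers: the workspace controller creates NodeClaim resources and the node provisioner controller acts on them.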

NEW! Starting with version v0.5.0, KAITO releases a new operator, RAGEngine, which is used to streamline the process of managing a Retrieval Augmented Generation (RAG) service.

As illustrated in the above figure, the RAGEngine controller reconciles the ragengine custom resource and creates a RAGService deployment. The RAGService provides the following capabilities:

Orchestration: use LlamaIndex orchestrator.
Embedding: support both local and remote embedding services, to embed queries and documents in the vector database.
Vector database: support a built-in FAISS in-memory vector database. Remote vector database support will be added soon.
Backend inference: support any OpenAI-compatible inference service.
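
To make the four capabilities concrete, here is a minimal, dependency-free sketch of the retrieve-then-generate flow. It does not use LlamaIndex or FAISS and is not the RAGService implementation: `embed` is a toy bag-of-words stand-in for an embedding service, `InMemoryIndex` is a stand-in for the in-memory vector database, and `build_messages` only assembles the chat messages that would be sent to a backend inference service. All of these names are assumptions made for illustration.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector (a real service would call an embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class InMemoryIndex:
    """Stand-in for the in-memory vector database: stores (vector, document) pairs."""

    def __init__(self):
        self.items = []

    def add(self, doc: str) -> None:
        self.items.append((embed(doc), doc))

    def search(self, query: str, k: int = 1) -> list:
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[0]), reverse=True)
        return [doc for _, doc in ranked[:k]]


def build_messages(query: str, context: list) -> list:
    """Assemble chat messages that a backend inference service would receive."""
    joined = "\n".join(context)
    return [
        {"role": "system", "content": f"Answer using this context:\n{joined}"},
        {"role": "user", "content": query},
    ]


if __name__ == "__main__":
    index = InMemoryIndex()
    index.add("KAITO provisions GPU nodes for model inference.")
    index.add("FAISS is a library for vector similarity search.")
    question = "Which library does vector similarity search?"
    print(build_messages(question, index.search(question)))
```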

The details of the service APIs can be found in this document.

Installation

Workspace: Please check the installation guidance here for deployment using Helm and here for deployment using Terraform.
RAGEngine: Please check the installation guidance here.

Workspace quick start

After installing KAITO, one can try the following commands to start a phi-3.5-mini-instruct inference service.

$ cat examples/inference/kaito_workspace_phi_3.5-instruct.yaml
apiVersion: kaito.sh/v1beta1
kind: Workspace
metadata:
  name: workspace-phi-3-5-mini
resource:
  instanceType: "Standard_NC24ads_A100_v4"
  labelSelector:
    matchLabels:
      apps: phi-3-5
inference:
  preset:
    name: phi-3.5-mini-instruct

$ kubectl apply -f examples/inference/kaito_workspace_phi_3.5-instruct.yaml
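
When several workspaces differ only in name, instance type, or preset, the manifest can also be generated rather than hand-edited. The sketch below builds the same fields as the YAML example as a Python dictionary and prints it as JSON, which kubectl accepts in place of YAML. The `workspace_manifest` helper and the `gen.py` file name in the comment are hypothetical and not part of KAITO.

```python
import json


def workspace_manifest(name: str, instance_type: str, preset: str, label_value: str) -> dict:
    """Build a Workspace manifest with the same fields as the YAML example."""
    return {
        "apiVersion": "kaito.sh/v1beta1",
        "kind": "Workspace",
        "metadata": {"name": name},
        "resource": {
            "instanceType": instance_type,
            "labelSelector": {"matchLabels": {"apps": label_value}},
        },
        "inference": {"preset": {"name": preset}},
    }


if __name__ == "__main__":
    manifest = workspace_manifest(
        name="workspace-phi-3-5-mini",
        instance_type="Standard_NC24ads_A100_v4",
        preset="phi-3.5-mini-instruct",
        label_value="phi-3-5",
    )
    # The JSON can be piped to kubectl, e.g. `python gen.py | kubectl apply -f -`.
    print(json.dumps(manifest, indent=2))
```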

The workspace status can be tracked by running the following command. When the WORKSPACESUCCEEDED column becomes True, the model has been deployed successfully.

$ kubectl get workspace workspace-phi-3-5-mini
NAME                     INSTANCE                   RESOURCEREADY   INFERENCEREADY   JOBSTARTED   WORKSPACESUCCEEDED   AGE
workspace-phi-3-5-mini   Standard_NC24ads_A100_v4   True            True                          True                 4h15m
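
For scripting, the same check can be made against the JSON form of the resource instead of the table. The sketch below assumes the workspace follows the usual Kubernetes convention of a `status.conditions` list with `type` and `status` fields, and that the condition behind the WORKSPACESUCCEEDED column is named `WorkspaceSucceeded`; both are assumptions inferred from the table above, and the sample document is hand-written rather than real cluster output.

```python
import json


def condition_status(workspace: dict, cond_type: str) -> str:
    """Return the status string of the named condition, or "Unknown" if absent."""
    for cond in workspace.get("status", {}).get("conditions", []):
        if cond.get("type") == cond_type:
            return cond.get("status", "Unknown")
    return "Unknown"


def is_deployed(workspace: dict) -> bool:
    # "WorkspaceSucceeded" is assumed from the WORKSPACESUCCEEDED column name.
    return condition_status(workspace, "WorkspaceSucceeded") == "True"


if __name__ == "__main__":
    # Hand-written sample, shaped like `kubectl get workspace <name> -o json` output.
    sample = json.loads("""
    {"status": {"conditions": [
        {"type": "ResourceReady", "status": "True"},
        {"type": "InferenceReady", "status": "True"},
        {"type": "WorkspaceSucceeded", "status": "True"}
    ]}}
    """)
    print(is_deployed(sample))
```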

Next, one can find the inference service's cluster IP and use a temporary curl pod to test the service endpoint in the cluster.

# find service endpoint
$ kubectl get svc workspace-phi-3-5-mini
NAME                     TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)            AGE
workspace-phi-3-5-mini   ClusterIP   <CLUSTERIP>  <none>        80/TCP,29500/TCP   10m
$ export CLUSTERIP=$(kubectl get svc workspace-phi-3-5-mini -o jsonpath="{.spec.clusterIPs[0]}")

# find available models
$ kubectl run -it --rm --restart=Never curl --image=curlimages/curl -- curl -s  http://$CLUSTERIP/v1/models | jq
{
  "object": "list",
  "data": [
    {
      "id": "phi-3.5-mini-instruct",
      "object": "model",
      "created": 1733370094,
      "owned_by": "vllm",
      "root": "/workspace/vllm/weights",
      "parent": null,
      "max_model_len": 16384
    }
  ]
}
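
If the model id needs to be picked up programmatically rather than read off the `jq` output, a small parser is enough. The sketch below assumes the response has the shape shown above (a `data` list whose entries carry `id` and `object` fields); `model_ids` is a hypothetical helper name.

```python
import json


def model_ids(response_body: str) -> list:
    """Return the `id` of every entry in an OpenAI-style model list response."""
    payload = json.loads(response_body)
    return [entry["id"] for entry in payload.get("data", []) if entry.get("object") == "model"]


if __name__ == "__main__":
    # Abbreviated copy of the response shown above.
    body = """
    {"object": "list",
     "data": [{"id": "phi-3.5-mini-instruct", "object": "model", "owned_by": "vllm",
               "root": "/workspace/vllm/weights", "parent": null, "max_model_len": 16384}]}
    """
    print(model_ids(body))
```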

# make an inference call using the model id (phi-3.5-mini-instruct) from previous step
$ kubectl run -it --rm --restart=Never curl --image=curlimages/curl -- curl -X POST http://$CLUSTERIP/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi-3.5-mini-instruct",
    "messages": [{"role": "user", "content": "What is kubernetes?"}],
    "max_tokens": 50,
    "temperature": 0
  }'
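
The same call can be made from application code running inside the cluster. The sketch below uses only the Python standard library to build the request with the same body as the curl example and to read the reply from an OpenAI-style chat completion response. The helper names `chat_request` and `first_reply` and the `http://10.0.0.1` address are placeholders chosen for illustration; substitute the service's actual cluster IP.

```python
import json
import urllib.request


def chat_request(base_url: str, model: str, prompt: str, max_tokens: int = 50) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /v1/chat/completions endpoint."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_reply(response_body: str) -> str:
    """Extract the assistant text from an OpenAI-style chat completion response."""
    return json.loads(response_body)["choices"][0]["message"]["content"]


if __name__ == "__main__":
    req = chat_request("http://10.0.0.1", "phi-3.5-mini-instruct", "What is kubernetes?")
    print(req.get_method(), req.full_url)
    # Sending requires network access to the service, e.g. from a pod inside the cluster:
    # with urllib.request.urlopen(req) as resp:
    #     print(first_reply(resp.read().decode("utf-8")))
```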

Usage

Detailed usage for KAITO-supported models can be found HERE. Users who want to deploy their own containerized models can provide the pod template in the inference field of the workspace custom resource (please see API definitions for details).

Note: Currently the controller does NOT handle automatic model upgrade. It only creates inference workloads based on the preset configurations if the workloads do not exist.

The number of supported models in KAITO is growing! Please check this document to see how to add a new supported model. Refer to the tuning document, inference document, RAGEngine document, and FAQ for more information.

Contributing

Get Involved!

Join our KAITO Community Slack to discuss features in development and proposals.
We host a weekly community meeting for contributors on Tuesdays at 4:00pm PST. Please join here: meeting link.
Reference the weekly meeting notes in our KAITO community calls doc!

License

See Apache License 2.0.

Code of Conduct

KAITO has adopted the Cloud Native Computing Foundation Code of Conduct. For more information see the KAITO Code of Conduct.

Contact

Please send emails to "KAITO devs" [email protected] for any issues.
