Kubernetes AI Toolchain Operator (KAITO)


What is NEW!
All vLLM-supported models can now run in KAITO; check the latest release.
Latest Release: Feb 26th, 2026. KAITO v0.9.0.
First Release: Nov 15th, 2023. KAITO v0.1.0.

KAITO is an operator suite that automates LLM inference, fine-tuning, and RAG (Retrieval Augmented Generation) engine deployment in a Kubernetes cluster. KAITO offers the following key differentiators compared with other model deployment methodologies:

  • Simplifies the CRD API by removing detailed deployment parameters. The controller provides optimized preset configurations for key inference engine scheduling parameters such as pipeline parallelism (PP), data parallelism (DP), tensor parallelism (TP), and max model length.
  • Uses a node auto-provisioner (NAP) to provision GPU resources with accurate model memory estimation, enabling the controller to pick the optimal node count for distributed inference.
  • Leverages the GPU node's built-in local NVMe as model storage, so no extra storage is required for inference.
  • Supports any vLLM-supported HuggingFace model.

Architecture

KAITO follows the classic Kubernetes Custom Resource Definition (CRD)/controller design pattern for workload orchestration and integrates with Gateway API Inference Extension to support LLM-based routing.

KAITO architecture
  • Workspace: The CRD that serves as the basic building block for managing LLM inference/tuning workloads. The API greatly simplifies deploying an LLM in Kubernetes: the user provides the GPU instance type and the HuggingFace model ID, and the controller will:

    • Estimate the GPU memory requirement based on the GPU instance type and model metadata, and calculate the required GPU count;
    • Trigger GPU node auto-provisioning by integrating with Karpenter APIs (NodePool);
    • Configure the inference engine parameters for single node/multiple nodes inference with optimized scheduling based on the GPU hardware topology.

    Currently, only the vLLM engine is supported. LoRA adapters are supported, and KVCache offloading is enabled by default.

  • InferenceSet: The CRD that manages the number of Workspace replicas for the same model. It is primarily used to autoscale Workspaces based on inference request load, reacting to scale-up/down actions determined by a KEDA autoscaler that consumes vLLM metrics collected via a KEDA plugin.

  • InferencePool: KAITO integrates with the Gateway API Inference Extension by creating a corresponding InferencePool object and an EPP (Endpoint Picker, which enables KVCache-aware routing) per InferenceSet. It works with any external gateway that supports the inference extension.

Note: In this repo, an open-source gpu-provisioner is used in the E2E tests and is referred to in various documents. KAITO can work with any other node provisioner that supports the Karpenter-core APIs.
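As a concrete illustration, the Workspace described above reduces to a short manifest. The sketch below follows the shape of the upstream examples; the API version, instance type, and preset name are placeholders, and the exact schema may differ across KAITO versions, so consult the repo's sample manifests:

```yaml
apiVersion: kaito.sh/v1beta1          # placeholder; use the version your KAITO release serves
kind: Workspace
metadata:
  name: workspace-phi-4-mini          # placeholder name
resource:
  instanceType: "Standard_NC24ads_A100_v4"   # GPU SKU; the controller derives the node count from model memory
  labelSelector:
    matchLabels:
      apps: phi-4-mini
inference:
  preset:
    name: phi-4-mini-instruct         # placeholder preset/model ID
```

Note that no replica counts, parallelism settings, or storage volumes appear here: the controller fills in the optimized engine parameters and provisions nodes automatically.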

KAITO also provides a RAGEngine operator, which streamlines the deployment and management of a Retrieval Augmented Generation (RAG) service.

KAITO RAGEngine architecture
  • RAGEngine: The CRD that defines the components of a RAG service, including the LLM endpoint (optional), the embedding service and the vector DB. The controller will create all required components.
  • Vector database: Supports a built-in FAISS in-memory vector database (default), and Qdrant/Milvus persistent databases if specified.
  • Embedding: Supports both local and remote embedding services to embed documents in the vector database.
  • RAGService: The core service, which leverages LlamaIndex for orchestration. It supports commonly used APIs such as /index for indexing documents, /v1/chat/completion for intercepting LLM calls to append retrieved context automatically, and /retrieve for integrating with MCP servers. The /retrieve API uses the Reciprocal Rank Fusion (RRF) hybrid search algorithm to combine the results of BM25 sparse retrieval and vector dense retrieval.

The details of the service APIs can be found in this document.
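The RRF fusion used by /retrieve can be illustrated with a minimal sketch. This follows the standard RRF formulation (score = Σ 1/(k + rank) across rankers); the constant k=60 and the tie-breaking are assumptions from that formulation, not KAITO's actual implementation:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse multiple ranked result lists into one.

    rankings: lists of document IDs, each ordered best-first
    (e.g. one from BM25 sparse retrieval, one from dense vector search).
    k: smoothing constant; 60 is the value from the original RRF paper.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort documents by fused score, highest first.
    return sorted(scores, key=scores.get, reverse=True)

# Example: a doc ranked well by both retrievers rises to the top.
bm25 = ["d1", "d2", "d3"]
dense = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([bm25, dense])  # → ["d1", "d3", "d2", "d4"]
```

Because each ranker contributes only reciprocal ranks, no score normalization between BM25 and vector similarity is needed, which is why RRF is a popular choice for hybrid search.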

Getting Started

  • Installation: Please check the guidance here for installing core components (Workspace, InferenceSet) using helm and here for installation using Terraform.
  • Quick Start: Please check the quick start guidance here for running your first model using KAITO!
  • AutoScaling: Please check this doc for configuring KAITO and KEDA to enable autoscaling of inference workloads.
  • BYO models using HuggingFace runtime: If you plan to run any BYO models using the HuggingFace runtime, check this doc. Note: KAITO only supports BYO models hosted in HuggingFace.
  • CPU models: Please check this doc for running CPU models using aikit.
  • RAGEngine: Please check the installation guidance and usage documents here.

Contributing


This project welcomes contributions and suggestions. Contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit CLAs for CNCF.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the CLAs for CNCF; please electronically sign the CLA via https://easycla.lfx.linuxfoundation.org. If you encounter issues, you can submit a ticket to the Linux Foundation ID group through the Linux Foundation Support website.

Get Involved!

License

See Apache License 2.0.


Code of Conduct

KAITO has adopted the Cloud Native Computing Foundation Code of Conduct. For more information, see the KAITO Code of Conduct.

Contact
