Kubernetes AI Toolchain Operator (KAITO)


What is NEW!
All vLLM-supported models can now run in KAITO; check the latest release.
Latest Release: Feb 26th, 2026. KAITO v0.9.0.
First Release: Nov 15th, 2023. KAITO v0.1.0.

KAITO is an operator suite that automates LLM inference, fine-tuning, and RAG (Retrieval Augmented Generation) engine deployment in a Kubernetes cluster. KAITO offers the following key differentiators compared with other model deployment methodologies:

  • Simplifies the CRD API by removing detailed deployment parameters. The controller provides optimized preset configurations for key inference engine scheduling parameters such as pipeline parallelism (PP), data parallelism (DP), tensor parallelism (TP), and max model length.
  • Uses a node auto-provisioner (NAP) to provision GPU resources with accurate model memory estimation, enabling the controller to pick the optimal node count for distributed inference.
  • Leverages the GPU node's built-in local NVMe as model storage, so no extra storage is required for inference.
  • Supports any vLLM-supported HuggingFace model.

Architecture

KAITO follows the classic Kubernetes Custom Resource Definition (CRD)/controller design pattern for workload orchestration and integrates with Gateway API Inference Extension to support LLM-based routing.

KAITO architecture
  • Workspace: The CRD that serves as the basic building block for managing LLM inference/tuning workloads. The API greatly simplifies deploying an LLM in Kubernetes: the user provides the GPU instance type and the HuggingFace model ID, and the controller will:

    • Estimate the GPU memory requirement based on the GPU instance type and model metadata, and calculate the required GPU count;
    • Trigger GPU node auto-provisioning by integrating with Karpenter APIs (NodePool);
    • Configure the inference engine parameters for single node/multiple nodes inference with optimized scheduling based on the GPU hardware topology.

    Currently, only the vLLM engine is supported. LoRA adapters are supported, and KVCache offloading is enabled by default.

  • InferenceSet: The CRD that manages the number of Workspace replicas for the same model. It is primarily used to autoscale Workspaces based on inference request load, reacting to scale-up/down actions determined by a KEDA autoscaler that consumes vLLM metrics collected via a KEDA plugin.

  • InferencePool: KAITO integrates with the Gateway API Inference Extension by creating a corresponding InferencePool object and an EPP (Endpoint Picker, which enables KVCache-aware routing) per InferenceSet. It works with any external gateway that supports the inference extension.

Note: In this repo, an open-source gpu-provisioner is used in the E2E tests and is referred to in various documents. KAITO can work with any other node provisioner that supports the Karpenter-core APIs.
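As a concrete illustration, the Workspace described above reduces to a short manifest. The sketch below follows the shape of the upstream examples; the API version, instance type, and preset name are placeholders, and the exact schema may differ across KAITO versions, so consult the repo's sample manifests:

```yaml
apiVersion: kaito.sh/v1beta1          # placeholder; use the version your KAITO release serves
kind: Workspace
metadata:
  name: workspace-phi-4-mini          # placeholder name
resource:
  instanceType: "Standard_NC24ads_A100_v4"   # GPU SKU; the controller derives the node count from model memory
  labelSelector:
    matchLabels:
      apps: phi-4-mini
inference:
  preset:
    name: phi-4-mini-instruct         # placeholder preset/model ID
```

Note that no replica counts, parallelism settings, or storage volumes appear here: the controller fills in the optimized engine parameters and provisions nodes automatically.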

KAITO also provides a RAGEngine operator, which streamlines the deployment and management of a Retrieval Augmented Generation (RAG) service.

KAITO RAGEngine architecture
  • RAGEngine: The CRD that defines the components of a RAG service, including the LLM endpoint (optional), the embedding service and the vector DB. The controller will create all required components.
  • Vector database: Supports a built-in FAISS in-memory vector database (default), and Qdrant/Milvus persistent databases if specified.
  • Embedding: Supports both local and remote embedding services to embed documents in the vector database.
  • RAGService: The core service, which leverages LlamaIndex for orchestration. It supports commonly used APIs such as /index for indexing documents, /v1/chat/completion for intercepting LLM calls to append retrieved context automatically, and /retrieve for integrating with MCP servers. The /retrieve API uses the Reciprocal Rank Fusion (RRF) hybrid search algorithm to combine the results of BM25 sparse retrieval and vector dense retrieval.

The details of the service APIs can be found in this document.
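The RRF fusion used by /retrieve can be illustrated with a minimal sketch. This follows the standard RRF formulation (score = Σ 1/(k + rank) across rankers); the constant k=60 and the tie-breaking are assumptions from that formulation, not KAITO's actual implementation:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse multiple ranked result lists into one.

    rankings: lists of document IDs, each ordered best-first
    (e.g. one from BM25 sparse retrieval, one from dense vector search).
    k: smoothing constant; 60 is the value from the original RRF paper.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort documents by fused score, highest first.
    return sorted(scores, key=scores.get, reverse=True)

# Example: a doc ranked well by both retrievers rises to the top.
bm25 = ["d1", "d2", "d3"]
dense = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([bm25, dense])  # → ["d1", "d3", "d2", "d4"]
```

Because each ranker contributes only reciprocal ranks, no score normalization between BM25 and vector similarity is needed, which is why RRF is a popular choice for hybrid search.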

Getting Started

  • Installation: Please check the guidance here for installing core components (Workspace, InferenceSet) using helm and here for installation using Terraform.
  • Quick Start: Please check the quick start guidance here for running your first model using KAITO!
  • AutoScaling: Please check this doc for configuring KAITO and KEDA to enable autoscaling of inference workloads.
  • BYO models using HuggingFace runtime: If you plan to run any BYO models using the HuggingFace runtime, check this doc. Note: KAITO only supports BYO models hosted in HuggingFace.
  • CPU models: Please check this doc for running CPU models using aikit.
  • RAGEngine: Please check the installation guidance and usage documents here.

Contributing


This project welcomes contributions and suggestions. Contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit CLAs for CNCF.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the CLAs for CNCF; please electronically sign the CLA via https://easycla.lfx.linuxfoundation.org. If you encounter issues, you can submit a ticket to the Linux Foundation ID group through the Linux Foundation Support website.

Get Involved!

License

See Apache License 2.0.


Code of Conduct

KAITO has adopted the Cloud Native Computing Foundation Code of Conduct. For more information, see the KAITO Code of Conduct.

Contact
