Post-training#
What is MaxText post-training?#
MaxText provides performant, scalable LLM and VLM post-training across a variety of techniques, such as SFT and GRPO.
We’re investing in performance, scale, algorithms, models, reliability, and ease of use to provide the most competitive OSS solution available.
The MaxText stack#
MaxText was co-designed with key Google-led innovations to provide a unified post-training experience:
MaxText model library of JAX LLMs, highly optimized for TPUs
Tunix for the latest algorithms and post-training techniques
vLLM on TPU for high-performance sampling (inference) for Reinforcement Learning (RL)
Pathways for multi-host inference (sampling) and highly efficient weight transfer

Supported techniques & models#
SFT (Supervised Fine-Tuning)
Multimodal SFT
Reinforcement Learning (RL)
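To give a concrete flavor of one supported technique, GRPO (Group Relative Policy Optimization) scores each sampled completion against the other completions for the same prompt, normalizing rewards into group-relative advantages. The sketch below is illustrative only, assuming plain NumPy; it is not the MaxText/Tunix implementation.

```python
import numpy as np

def grpo_advantages(group_rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantages, the core of GRPO: each completion's reward
    is standardized against the group sampled for the same prompt."""
    mean = group_rewards.mean(axis=-1, keepdims=True)
    std = group_rewards.std(axis=-1, keepdims=True)
    return (group_rewards - mean) / (std + eps)

# One prompt, four sampled completions with scalar rewards.
rewards = np.array([[1.0, 0.0, 0.5, 0.5]])
adv = grpo_advantages(rewards)
# Above-average completions get positive advantage, below-average negative.
```

Because the baseline comes from the group itself, GRPO avoids training a separate value model, which is one reason it is popular for large-scale RL.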
Step-by-step RL#
Making powerful RL accessible is at the core of the MaxText mission.
Here is an example of the steps you might go through to run a Reinforcement Learning (RL) job:
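The overall shape of such a job can be sketched as a single-controller loop: sample completions, score them, take a training step, then sync the sampler's weights. The class and function names below are hypothetical stand-ins, not the real MaxText/Tunix APIs; the comments map each step to the component that would perform it in practice.

```python
# Toy single-controller RL loop (hypothetical names; real APIs differ).

class ToySampler:
    """Stand-in for an inference engine (vLLM on TPU in the MaxText stack)."""
    def __init__(self):
        self.weights = 0.0
    def generate(self, prompts):
        return [f"{p} -> completion" for p in prompts]
    def load_weights(self, weights):
        self.weights = weights

class ToyTrainer:
    """Stand-in for the MaxText trainer."""
    def __init__(self):
        self.weights = 0.0
    def update(self, completions, rewards):
        # Pretend policy update: nudge weights by the mean reward.
        self.weights += sum(rewards) / len(rewards)

def rl_step(sampler, trainer, prompts, reward_fn):
    completions = sampler.generate(prompts)        # sampling (inference)
    rewards = [reward_fn(c) for c in completions]  # e.g. a verifier or reward model
    trainer.update(completions, rewards)           # training step
    sampler.load_weights(trainer.weights)          # weight transfer (Pathways)
    return rewards

sampler, trainer = ToySampler(), ToyTrainer()
rewards = rl_step(sampler, trainer, ["Q1", "Q2"], reward_fn=lambda c: 1.0)
```

The key property this loop illustrates is that one Python program drives both the trainer and the sampler, which is exactly what Pathways enables at multi-host scale.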

What is Pathways and why is it key for RL?#
Pathways is a single-controller JAX runtime that was designed and pressure-tested internally at Google DeepMind over many years. Now available on Google Cloud, it is designed to coordinate distributed computations across thousands of accelerators from a single Python program. It efficiently performs data transfers between accelerators both within a slice using ICI (Inter-chip Interconnect) and across slices over DCN (Data Center Network).
Pathways allows for fine-grained resource allocation (a subslice of a physical slice) and scheduling. This allows JAX developers to explore novel model architectures in an easy-to-develop, single-controller programming environment.
Pathways supercharges RL with:
Multi-host Model Support: Easily manages models that span multiple hosts.
Unified Orchestration: Controls both trainers and samplers from a single Python process.
Efficient Data Transfer: Optimally moves data between training and inference devices, utilizing ICI or DCN as needed. JAX Reshard primitives simplify integration.
Flexible Resource Allocation: Enables dedicating different numbers of accelerators to inference and training within the same job, adapting to workload bottlenecks (disaggregated setup).
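The "Efficient Data Transfer" point above rests on standard JAX sharding primitives: placing an array onto a mesh with a named sharding is a single `jax.device_put` call, and the runtime handles the movement. A minimal sketch, runnable even on a single CPU device (on Pathways the mesh would span many hosts):

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Build a mesh over whatever devices are available; under Pathways this
# mesh could cover thousands of accelerators across slices.
devices = np.array(jax.devices())
mesh = Mesh(devices, ("data",))

# Reshard an array onto the mesh along the "data" axis. Moving data
# between training and inference meshes follows the same pattern.
x = jnp.arange(8.0)
sharded = jax.device_put(x, NamedSharding(mesh, P("data")))
```

The same `device_put`-with-a-`NamedSharding` pattern is what makes moving weights between a training mesh and an inference mesh a one-liner from the controller's point of view.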
Getting started#
Start your post-training journey through quick experimentation with Python notebooks or our production-level tutorials for SFT and RL.