## Description

This issue outlines the current status of GPT-OSS features that need to be implemented in Megatron Core, leveraging Transformer Engine (TE).

✅ **UPDATE:** All core GPT-OSS functionality is now available in Megatron Core (training) and Megatron Bridge (checkpoint conversion).
## MoE Layer

### Enabled Bias
## Attention Mechanisms

### Alternating Sliding-Window Attention Pattern

- Status: ✅ Supported; infrastructure exists for per-layer attention patterns and sliding-window attention using TE

### Attention Sinks
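The alternating pattern can be illustrated as a set of per-layer causal masks in which every other layer restricts attention to a local window. This is a minimal NumPy sketch; the window size, and the assumption that even-indexed layers are the windowed ones, are illustrative, not taken from the GPT-OSS configuration.

```python
import numpy as np

def build_layer_masks(seq_len: int, num_layers: int, window: int):
    """Per-layer causal masks where every other layer uses sliding-window
    attention. Illustrative only: which layers are windowed, and the
    window size, are assumptions for this sketch."""
    i = np.arange(seq_len)[:, None]  # query positions
    j = np.arange(seq_len)[None, :]  # key positions
    causal = j <= i
    windowed = causal & (i - j < window)
    # even layers: sliding window; odd layers: full causal (assumed order)
    return [windowed if layer % 2 == 0 else causal
            for layer in range(num_layers)]
```

In Megatron Core the same effect is achieved through per-layer attention configuration rather than explicit dense masks, which TE's fused attention kernels handle natively.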
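An attention sink can be sketched as an extra learned logit that participates in the softmax normalization but contributes no value vector, letting heads dump probability mass somewhere harmless. This is a hedged illustration of the general technique; the exact GPT-OSS parameterization (per-head learned sinks) may differ in detail.

```python
import numpy as np

def softmax_with_sink(logits: np.ndarray, sink_logit: float) -> np.ndarray:
    """Softmax over attention logits with one extra 'sink' logit.
    The sink joins the denominator but is dropped from the output, so the
    weights over real keys sum to less than 1. The scalar learned
    sink_logit is an assumption of this sketch."""
    m = max(logits.max(), sink_logit)            # numerical stabilization
    exp = np.exp(logits - m)
    denom = exp.sum() + np.exp(sink_logit - m)   # sink absorbs mass here
    return exp / denom
```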
## Activation Functions

### Custom SwiGLU with Clamping

- Status: ✅ Supported
- Implementation: Megatron Core added a partially fused version as "custom quick GeGLU"; an FP8-aware fused kernel has been merged into Transformer Engine
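Functionally, the clamped SwiGLU variant clips the gate and linear inputs before combining them with a sigmoid-based "quick GELU" gate. The sketch below follows the public GPT-OSS reference code; the constants `alpha=1.702` and `limit=7.0` should be treated as illustrative rather than authoritative.

```python
import numpy as np

def clamped_swiglu(x_glu: np.ndarray, x_linear: np.ndarray,
                   alpha: float = 1.702, limit: float = 7.0) -> np.ndarray:
    """Clamped SwiGLU sketch: gate input clipped from above, linear input
    clipped to [-limit, limit]. Constants follow the public GPT-OSS
    reference code but are assumptions of this illustration."""
    x_glu = np.clip(x_glu, None, limit)
    x_linear = np.clip(x_linear, -limit, limit)
    gate = x_glu / (1.0 + np.exp(-alpha * x_glu))  # sigmoid-gated "quick GELU"
    return gate * (x_linear + 1.0)
```

The clamping bounds the activation range, which is what makes an FP8-aware fused kernel practical: intermediate values stay within a predictable dynamic range.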
## Positional Encodings

### YaRN RoPE Scaling

- Status: ✅ Fully Supported
- Implementation:
- Usage: `--position-embedding-type yarn` with YaRN configuration parameters
- Reference: arXiv:2309.00071
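The core of YaRN (arXiv:2309.00071) is "NTK-by-parts" interpolation of the rotary frequencies: high-frequency dimensions keep their original rotation rates, low-frequency dimensions are divided by the context-extension factor, and dimensions in between are linearly blended. A hedged NumPy sketch, with all parameter values illustrative:

```python
import numpy as np

def yarn_inv_freq(dim: int, base: float = 10000.0, scale: float = 8.0,
                  orig_ctx: int = 4096, beta_fast: float = 32.0,
                  beta_slow: float = 1.0) -> np.ndarray:
    """NTK-by-parts frequency blending from the YaRN paper. Parameter
    values here (scale, context length, betas) are illustrative defaults,
    not the GPT-OSS configuration."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    inv_freq_interp = inv_freq / scale  # fully position-interpolated copy
    # how many full rotations each dim completes over the original context
    rotations = orig_ctx * inv_freq / (2 * np.pi)
    ramp = np.clip((rotations - beta_slow) / (beta_fast - beta_slow), 0.0, 1.0)
    # ramp=1 -> keep original frequency; ramp=0 -> fully interpolated
    return inv_freq_interp * (1.0 - ramp) + inv_freq * ramp
```

The full method also rescales attention temperature as the context grows, which is omitted here for brevity.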
## Megatron Bridge Support

Megatron Bridge provides full GPT-OSS integration:

- ✅ Checkpoint Conversion: Hugging Face ↔ Megatron format
- ✅ Pre-configured Providers: `GPTOSSProvider20B` and `GPTOSSProvider120B`
- ✅ Quantization Support: handles MXFP4 weight dequantization
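MXFP4 dequantization itself is simple: per the OCP Microscaling format, 4-bit FP4 (E2M1) codes come in blocks of 32 elements that share one power-of-two scale. The sketch below assumes one scale per row of 32 codes and takes the scale exponent directly as an integer; the real on-disk layout and E8M0 scale encoding differ in detail.

```python
import numpy as np

# FP4 E2M1 magnitudes: 2 exponent bits + 1 mantissa bit give 8 code points
# (the sign lives in the 4th bit and is handled separately below).
FP4_E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def dequant_mxfp4(codes: np.ndarray, scale_exp: np.ndarray) -> np.ndarray:
    """Dequantize MXFP4 blocks: codes has shape (blocks, 32), scale_exp
    holds one power-of-two exponent per block. Layout and direct exponent
    input are simplifying assumptions of this sketch."""
    sign = np.where(codes & 0x8, -1.0, 1.0)   # bit 3 is the sign
    mag = FP4_E2M1[codes & 0x7]               # low 3 bits index the magnitude
    return sign * mag * (2.0 ** scale_exp[:, None])
```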
## Megatron Bridge + Megatron-LM Example

PR #2383 provides end-to-end example scripts covering checkpoint conversion (`convert_mcore_bf16_checkpoint_from_hf.py`) and training/fine-tuning (`training_gptoss_20b_h100_bf16_fp8.sh`).

Credits: @cuichenx for the core implementation, @yiakwy-xpu-ml-framework-team for the example scripts.