by Huang Huang*, Fangchen Liu*, Letian Fu*, Tingfan Wu, Mustafa Mukadam, Jitendra Malik, Ken Goldberg, Pieter Abbeel at UC Berkeley and Meta (*equal contribution).
[Paper] | [Project Page]
This repo contains the official implementation of Otter: A Vision-Language-Action Model with Text-Aware Feature Extraction. We have also released a PyTorch implementation.
For further information, please contact Huang Huang, Fangchen Liu, or Letian Fu, or open an issue on GitHub.
- 2025-03-05: Initial code release.
- WIP: instructions on training, inference.
- WIP: release pretrained models.
python scripts/train.py --config.save_dir=<...>
Experimental code and training/eval scripts should go in experiments/<your_name>. To change files outside your experiments directory, please open a pull request.
To enable code checks and auto-formatting, please install pre-commit hooks:
pre-commit install
conda create -n otter_jax python=3.10
conda activate otter_jax
pip install -e .
pip install -r requirements.txt
conda install -c conda-forge cudatoolkit=11.8
conda install -c conda-forge cudnn=8.9
For GPU:
pip install --upgrade "jax[cuda11_pip]==0.4.20" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
For TPU:
pip install --upgrade "jax[tpu]==0.4.20" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
See the JAX GitHub page for more details on installing JAX.
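After installation, a quick check can confirm which backend JAX picked up. The snippet below is a minimal sketch and is not part of this repo: the helper name `describe_jax_backend` is hypothetical, and the only value taken from the steps above is the pinned version 0.4.20.

```python
# Sanity check for the JAX install (illustrative helper, not shipped with this repo).
PINNED_JAX_VERSION = "0.4.20"  # version pinned in the pip commands above


def describe_jax_backend():
    """Return a dict describing the JAX install, or None if JAX cannot be imported."""
    try:
        import jax
    except ImportError:
        return None
    return {
        "version": jax.__version__,
        "backend": jax.default_backend(),  # "cpu", "gpu", or "tpu"
        "devices": [str(d) for d in jax.devices()],
    }


info = describe_jax_backend()
if info is None:
    print("JAX is not installed in this environment.")
else:
    print(f"JAX {info['version']} on backend '{info['backend']}'")
    print("Devices:", ", ".join(info["devices"]))
    if info["version"] != PINNED_JAX_VERSION:
        print(f"Note: the install steps pin jax=={PINNED_JAX_VERSION}.")
    if info["backend"] == "cpu":
        print("Only CPU devices were found; check the CUDA/cuDNN or TPU setup.")
```

If the reported backend is `cpu` on a machine with an accelerator, the accelerator-specific `pip install` step most likely did not take effect in the active conda environment.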
This project is under the Apache 2.0 license. See LICENSE for details.
We thank the authors of Octo for providing an easy-to-use codebase for training vision-language-action models.
Please give us a star 🌟 on GitHub to support us!
If you find our work inspiring or use our code, please cite:
@article{huang2025otter,
title={Otter: A Vision-Language-Action Model with Text-Aware Feature Extraction},
author={Huang Huang and Fangchen Liu and Letian Fu and Tingfan Wu and Mustafa Mukadam and Jitendra Malik and Ken Goldberg and Pieter Abbeel},
journal={arXiv preprint arXiv:2503.15980},
year={2025}
}