HT-Step: Aligning Instructional Articles with How-To Videos.

Fundamental AI Research (FAIR), Meta
HT-Step overview

Abstract

We introduce HT-Step, a large-scale dataset containing temporal annotations of instructional article steps in cooking videos. It includes 116k segment-level annotations over 20k narrated videos (approximately 2.1k hours) of the HowTo100M dataset. Each annotation provides a temporal interval and a categorical step label drawn from a taxonomy of 4,958 unique steps automatically mined from wikiHow articles, which include rich descriptions of each step. Our dataset significantly surpasses existing labeled step datasets in terms of scale, number of tasks, and richness of natural language step descriptions. Based on these annotations, we introduce a strongly supervised benchmark for aligning instructional articles with how-to videos and present a comprehensive evaluation of baseline methods for this task. By publicly releasing these annotations and defining rigorous evaluation protocols and metrics, we hope to significantly accelerate research in the field of procedural activity understanding.
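To make the annotation structure concrete, here is a minimal sketch of what a single segment-level record could look like when loaded in Python. The field names (`video_id`, `step_label`, `start_sec`, `end_sec`) are illustrative assumptions for exposition, not the released schema.

```python
from dataclasses import dataclass

@dataclass
class StepAnnotation:
    """Hypothetical sketch of one HT-Step segment annotation.

    Field names are illustrative assumptions, not the released schema.
    """
    video_id: str    # identifier of the HowTo100M video
    step_label: str  # one of the ~4,958 steps mined from wikiHow articles
    start_sec: float # start of the temporal interval, in seconds
    end_sec: float   # end of the temporal interval, in seconds

    def duration(self) -> float:
        # length of the annotated segment in seconds
        return self.end_sec - self.start_sec

# Example usage with made-up values
ann = StepAnnotation("abc123", "Whisk the eggs", 12.5, 20.0)
print(ann.duration())  # 7.5
```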

Annotation examples

BibTeX


      @inproceedings{afouras2023htstep,
          author    = {Afouras, Triantafyllos and Mavroudi, Effrosyni and Nagarajan, Tushar and Wang, Huiyu and Torresani, Lorenzo},
          title     = {HT-Step: Aligning Instructional Articles with How-To Videos},
          booktitle = {Neural Information Processing Systems},
          month     = {December},
          year      = {2023},
      }

We thank Mandy Toh, Yale Song, Gene Byrne, Fu-Jen Chu, Austin Miller, and Jiabo Hu for helpful discussions and invaluable engineering support. The website template is borrowed from Nerfies. The instructional videos are from the HowTo100M dataset.