HT-Step: Aligning Instructional Articles with How-To Videos.

Fundamental AI Research (FAIR), Meta
HT-Step overview

Abstract

We introduce HT-Step, a large-scale dataset containing temporal annotations of instructional article steps in cooking videos. It includes 116k segment-level annotations over 20k narrated videos (approximately 2.1k hours) of the HowTo100M dataset. Each annotation provides a temporal interval and a categorical step label drawn from a taxonomy of 4,958 unique steps automatically mined from wikiHow articles, which include rich descriptions of each step. Our dataset significantly surpasses existing labeled step datasets in terms of scale, number of tasks, and richness of natural language step descriptions. Based on these annotations, we introduce a strongly supervised benchmark for aligning instructional articles with how-to videos and present a comprehensive evaluation of baseline methods for this task. By publicly releasing these annotations and defining rigorous evaluation protocols and metrics, we hope to significantly accelerate research in the field of procedural activity understanding.
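To make the annotation structure concrete, here is a minimal sketch of what a single segment-level record could look like when loaded in Python. The field names (`video_id`, `step_label`, `start_sec`, `end_sec`) are illustrative assumptions for exposition, not the released schema.

```python
from dataclasses import dataclass

@dataclass
class StepAnnotation:
    """Hypothetical sketch of one HT-Step segment annotation.

    Field names are illustrative assumptions, not the released schema.
    """
    video_id: str    # identifier of the HowTo100M video
    step_label: str  # one of the ~4,958 steps mined from wikiHow articles
    start_sec: float # start of the temporal interval, in seconds
    end_sec: float   # end of the temporal interval, in seconds

    def duration(self) -> float:
        # length of the annotated segment in seconds
        return self.end_sec - self.start_sec

# Example usage with made-up values
ann = StepAnnotation("abc123", "Whisk the eggs", 12.5, 20.0)
print(ann.duration())  # 7.5
```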

Annotation examples

BibTeX


      @inproceedings{afouras2023htstep,
          author    = {Afouras, Triantafyllos and Mavroudi, Effrosyni and Nagarajan, Tushar and Wang, Huiyu and Torresani, Lorenzo},
          title     = {HT-Step: Aligning Instructional Articles with How-To Videos},
          booktitle = {Neural Information Processing Systems},
          month     = {December},
          year      = {2023},
      }

We thank Mandy Toh, Yale Song, Gene Byrne, Fu-Jen Chu, Austin Miller, and Jiabo Hu for helpful discussions and invaluable engineering support. The website template is borrowed from Nerfies. The instructional videos are from the HowTo100M dataset.