COM Kitchens: An Unedited Overhead-view Procedural Videos Dataset as a Vision-Language Benchmark

Koki Maeda, Tosho Hirasawa, Atsushi Hashimoto, Jun Harashima, Leszek Rybicki, Yusuke Fukasawa, Yoshitaka Ushiku

Proceedings of the 18th European Conference on Computer Vision (ECCV 2024) · September 2024

Introducing a new vision-language dataset based on unedited overhead-view procedural cooking videos.

BibTeX

@inproceedings{maeda2024comkitchens,
  title = {COM Kitchens: An Unedited Overhead-view Procedural Videos Dataset as a Vision-Language Benchmark},
  author = {Koki Maeda and Tosho Hirasawa and Atsushi Hashimoto and Jun Harashima and Leszek Rybicki and Yusuke Fukasawa and Yoshitaka Ushiku},
  booktitle = {Proceedings of the 18th European Conference on Computer Vision (ECCV 2024)},
  year = {2024},
  address = {Milan, Italy},
  publisher = {Springer}
}

Abstract

COM Kitchens introduces a novel vision-language dataset that addresses the limitations of existing procedural video datasets, which are typically sourced from the web or captured from egocentric views. The dataset comprises unedited, overhead-view cooking videos recorded with modern smartphones, providing environmental diversity and detailed annotations. On top of it, we introduce two novel vision-language tasks: Online Recipe Retrieval (OnRR) and Dense Video Captioning on unedited Overhead-View videos (DVC-OV). Our benchmarks show that the dataset exposes limitations of existing models, paving the way for future research.