COM Kitchens: An Unedited Overhead-view Procedural Videos Dataset as a Vision-Language Benchmark
Proceedings of The 18th European Conference on Computer Vision (ECCV 2024) · September 2024
Introducing a new vision-language dataset based on unedited overhead-view procedural cooking videos.
BibTeX
@inproceedings{maeda2024comkitchens,
  title     = {COM Kitchens: An Unedited Overhead-view Procedural Videos Dataset as a Vision-Language Benchmark},
  author    = {Koki Maeda and Tosho Hirasawa and Atsushi Hashimoto and Jun Harashima and Leszek Rybicki and Yusuke Fukasawa and Yoshitaka Ushiku},
  booktitle = {Proceedings of The 18th European Conference on Computer Vision (ECCV 2024)},
  year      = {2024},
  address   = {Milan, Italy},
  publisher = {ECCV}
}
Abstract
COM Kitchens is a novel vision-language dataset aimed at overcoming the limitations of current procedural video datasets, which are typically sourced from the web or recorded from egocentric viewpoints. The dataset comprises unedited, overhead-view cooking videos captured using modern smartphones, providing environmental diversity and detailed annotations. We introduce two novel vision-language tasks: Online Recipe Retrieval (OnRR) and Dense Video Captioning on unedited Overhead-View videos (DVC-OV). Our benchmarks demonstrate the dataset’s capacity to highlight limitations in existing models, thus paving the way for future research.