Query-based Image Captioning from Multi-context 360-degree Images
Findings of the Association for Computational Linguistics: EMNLP 2023 · December 2023
A novel image captioning approach that leverages queries and multi-context 360-degree imagery.
BibTeX
@inproceedings{maeda2023quic360,
title = {Query-based Image Captioning from Multi-context 360-degree Images},
author = {Koki Maeda and Shuhei Kurita and Taiki Miyanishi and Naoaki Okazaki},
booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2023},
pages = {6940--6954},
year = {2023},
address = {Singapore},
publisher = {Association for Computational Linguistics}
}
Abstract
This paper introduces Query-based Image Captioning (QuIC) for 360-degree images, where a query (words or short phrases) specifies the context to describe. This task is more challenging than conventional image captioning, as it requires fine-grained scene understanding to select content consistent with the user’s intent based on the query. We constructed a dataset comprising 3,940 360-degree images and 18,459 pairs of queries and captions annotated manually. Experiments demonstrate that image captioning models fine-tuned on our dataset generate more diverse and controllable captions for the multiple contexts of 360-degree images.
Methodology
The QuIC task requires generating a caption for a 360-degree image conditioned on a user-specified query. To support it, we constructed a dataset of manually annotated query-caption pairs and fine-tune existing image captioning models on this dataset so that they produce captions relevant to the context indicated by the query.
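To make the query-conditioning concrete, below is a minimal sketch of generating a caption from a perspective view of a 360-degree image with the query supplied as a text prefix. It uses a BLIP-style captioning model from Hugging Face Transformers purely for illustration; the model choice, the prompt format, and the assumption that a perspective crop has already been extracted from the panorama are ours, not the paper's exact pipeline.

from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Illustrative model choice; the paper fine-tunes captioning models on its own
# query-caption dataset, which is not reproduced here.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def caption_with_query(image_path: str, query: str) -> str:
    """Generate a caption conditioned on a short query phrase."""
    image = Image.open(image_path).convert("RGB")
    # BLIP supports conditional captioning: the text argument acts as a decoder
    # prefix, used here to inject the query context into generation.
    inputs = processor(image, text=query, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=40)
    return processor.decode(output_ids[0], skip_special_tokens=True)

# Hypothetical file name and query; the image should be a perspective view
# rendered from an equirectangular 360-degree panorama.
print(caption_with_query("living_room_view.jpg", "a sofa near the window"))

A model fine-tuned on query-caption pairs would learn to describe the region of the panorama relevant to the query, rather than merely continuing the text prefix.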
Experimental Results
Models fine-tuned on our dataset generate more diverse and controllable captions for 360-degree images, indicating that QuIC effectively captures the multiple contexts present in panoramic imagery in line with user-specified queries.
Key Contributions
- Introduction of QuIC Task: We propose the novel task of Query-based Image Captioning for 360-degree images, addressing the challenge of describing specific contexts within panoramic imagery.
- Dataset Construction: We created a dataset consisting of 3,940 360-degree images and 18,459 manually annotated query-caption pairs to support research in this area (an illustrative record layout is sketched after this list).
- Model Fine-tuning: We demonstrate that fine-tuning image captioning models on our dataset enables the generation of diverse and contextually relevant captions based on user queries.
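For illustration only, one way the query-caption pairs could be organized per panorama is sketched below in Python; the field names and values are hypothetical and do not reflect the released dataset's actual schema.

# Hypothetical layout for query-caption annotations attached to one 360-degree image.
# Field names and values are illustrative, not the dataset's actual schema.
annotation = {
    "image_id": "pano_000123",  # identifier of the equirectangular panorama
    "pairs": [
        {"query": "a sofa near the window",
         "caption": "A gray sofa sits beside a large window overlooking a garden."},
        {"query": "kitchen counter",
         "caption": "A tiled kitchen counter with a kettle and two mugs on it."},
    ],
}

Because each image carries several such pairs (18,459 pairs over 3,940 images, roughly 4-5 per panorama), a single panorama supplies training signal for multiple distinct contexts.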
Conclusion and Future Work
The QuIC task presents a promising direction for enhancing image captioning models to handle user-specified contexts within 360-degree images. Future work may explore the application of QuIC to other types of imagery and investigate more sophisticated models capable of understanding and generating captions for complex scenes.