Query-based Image Captioning from Multi-context 360-degree Images

Koki Maeda, Shuhei Kurita, Taiki Miyanishi, Naoaki Okazaki

Findings of the Association for Computational Linguistics: EMNLP 2023 · December 2023

A query-based image captioning task and dataset in which a query (words or a short phrase) specifies which context of a 360-degree image to describe.

BibTeX

@inproceedings{maeda2023quic360,
  title = {Query-based Image Captioning from Multi-context 360-degree Images},
  author = {Koki Maeda and Shuhei Kurita and Taiki Miyanishi and Naoaki Okazaki},
  booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2023},
  pages = {6940--6954},
  year = {2023},
  address = {Singapore},
  publisher = {Association for Computational Linguistics}
}


Abstract

This paper introduces Query-based Image Captioning (QuIC) for 360-degree images, where a query (words or short phrases) specifies the context to describe. This task is more challenging than conventional image captioning, as it requires fine-grained scene understanding to select content consistent with the user’s intent based on the query. We constructed a dataset comprising 3,940 360-degree images and 18,459 pairs of queries and captions annotated manually. Experiments demonstrate that fine-tuning image captioning models on our dataset can generate more diverse and controllable captions from multiple contexts of 360-degree images.

Methodology

The proposed QuIC task involves generating captions for 360-degree images conditioned on user-specified queries. To facilitate this task, we developed a dataset of manually annotated query-caption pairs and fine-tune existing image captioning models on it so that they produce captions relevant to the context specified by the query.
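
As a rough illustration of query-conditioned captioning, the sketch below prompts an off-the-shelf BLIP captioning model from Hugging Face Transformers with the query text. The checkpoint, prompt format, and the handling of the panorama are assumptions made for this example, not the exact models or preprocessing used in the paper.

# Minimal sketch: caption a 360-degree image conditioned on a query.
# Assumes a BLIP captioning checkpoint; the paper's actual models and
# preprocessing may differ.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# A 360-degree image is typically stored as an equirectangular panorama.
# Here it is loaded as-is; a real pipeline might first extract a
# perspective view around the region the query refers to.
panorama = Image.open("example_360.jpg").convert("RGB")

# The query (a word or short phrase) is passed as a text prompt so the
# decoder is conditioned on the context the user wants described.
query = "the bookshelf"
inputs = processor(images=panorama, text=query, return_tensors="pt")

output_ids = model.generate(**inputs, max_new_tokens=40)
caption = processor.decode(output_ids[0], skip_special_tokens=True)
print(caption)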

Experimental Results

Experiments show that models fine-tuned on our dataset can generate more diverse and controllable captions for 360-degree images. The results indicate that the QuIC approach effectively captures multiple contexts within 360-degree imagery, aligning with user-specified queries.
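
Caption diversity is commonly summarized with a distinct-n statistic, the ratio of unique n-grams to total n-grams across generated captions. The sketch below illustrates that idea; it is a generic diversity proxy, not necessarily the metric set reported in the paper.

# Minimal sketch of a distinct-n diversity score over generated captions.
from typing import List

def distinct_n(captions: List[str], n: int = 2) -> float:
    """Ratio of unique n-grams to total n-grams across all captions."""
    total, unique = 0, set()
    for caption in captions:
        tokens = caption.lower().split()
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total > 0 else 0.0

captions = [
    "a wooden bookshelf filled with novels",
    "a desk with a laptop near the window",
]
print(f"distinct-2: {distinct_n(captions, n=2):.3f}")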

Key Contributions

  • Introduction of QuIC Task: We propose the novel task of Query-based Image Captioning for 360-degree images, addressing the challenge of describing specific contexts within panoramic imagery.

  • Dataset Construction: We created a dataset consisting of 3,940 360-degree images and 18,459 manually annotated query-caption pairs to support research in this area.

  • Model Fine-tuning: We demonstrate that fine-tuning image captioning models on our dataset enables the generation of diverse and contextually relevant captions based on user queries (an illustrative data-loading sketch follows this list).
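
To make the fine-tuning input concrete, the following sketch wraps query-caption annotations as a PyTorch dataset. The JSON Lines layout and the field names (image, query, caption) are hypothetical and do not reflect the released dataset's actual schema.

# Illustrative sketch: wrapping query-caption annotations as a PyTorch dataset.
# The JSON Lines layout and field names are hypothetical, not the released format.
import json
from pathlib import Path
from PIL import Image
from torch.utils.data import Dataset

class QueryCaptionDataset(Dataset):
    def __init__(self, annotation_file: str, image_dir: str):
        # Each line: {"image": "scene_0001.jpg", "query": "...", "caption": "..."}
        lines = Path(annotation_file).read_text().splitlines()
        self.records = [json.loads(line) for line in lines if line.strip()]
        self.image_dir = Path(image_dir)

    def __len__(self) -> int:
        return len(self.records)

    def __getitem__(self, idx: int):
        record = self.records[idx]
        panorama = Image.open(self.image_dir / record["image"]).convert("RGB")
        return panorama, record["query"], record["caption"]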

Conclusion and Future Work

The QuIC task presents a promising direction for enhancing image captioning models to handle user-specified contexts within 360-degree images. Future work may explore the application of QuIC to other types of imagery and investigate more sophisticated models capable of understanding and generating captions for complex scenes.