Vision Language Model-based Caption Evaluation Method Leveraging Visual Context Extraction

Koki Maeda, Shuhei Kurita, Taiki Miyanishi, Naoaki Okazaki

arXiv preprint · February 2024

VisCE^2 is a vision-language-model-based caption evaluation method that replaces human-written references with structured visual context extracted from images.

BibTeX

@misc{maeda2024vision,
  title={Vision Language Model-based Caption Evaluation Method Leveraging Visual Context Extraction},
  author={Koki Maeda and Shuhei Kurita and Taiki Miyanishi and Naoaki Okazaki},
  year={2024},
  eprint={2402.17969},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2402.17969}
}

Abstract

Given the accelerating progress of vision and language modeling, accurate evaluation of machine-generated image captions remains critical. Conventional metrics often rely on lexical overlap with references or on embedding similarity, and they struggle to capture the finer quality differences that human judges perceive.

This paper presents VisCE^2, a vision-language-model-based caption evaluation method that leverages visual context extraction. By structuring image content such as objects, attributes, and relationships, the method replaces human-written references with richer contextual cues and improves caption assessment.
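The sketch below illustrates this reference-free, two-stage flow: first query a VLM for structured visual context (objects, attributes, relationships), then score the candidate caption conditioned on that context rather than on human-written references. The function names, prompts, and 1-5 scale are illustrative placeholders, not the authors' exact prompts or implementation.

```python
# Minimal sketch of VLM-based, reference-free caption evaluation with
# visual context extraction. All names and prompts here are assumptions
# for illustration; back query_vlm with whatever VLM you actually use.
import re


def query_vlm(image_path: str, prompt: str) -> str:
    """Placeholder: send an image plus a text prompt to a VLM, return its reply."""
    raise NotImplementedError


def extract_visual_context(image_path: str) -> str:
    # Stage 1: ask the VLM to describe the image as structured visual context
    # (objects, their attributes, and relationships between them).
    prompt = (
        "List the objects in this image, their attributes, and the "
        "relationships between them, as short bullet points."
    )
    return query_vlm(image_path, prompt)


def evaluate_caption(image_path: str, candidate_caption: str) -> int:
    # Stage 2: score the candidate caption conditioned on the image and the
    # extracted visual context instead of human-written references.
    context = extract_visual_context(image_path)
    prompt = (
        "Visual context:\n" + context + "\n\n"
        "Candidate caption: " + candidate_caption + "\n\n"
        "Rate how accurately and completely the caption describes the image "
        "on a scale from 1 (poor) to 5 (excellent). Reply with a single number."
    )
    reply = query_vlm(image_path, prompt)
    match = re.search(r"[1-5]", reply)  # pull the first in-range digit from the reply
    return int(match.group()) if match else 1
```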

Results

The paper reports that VisCE^2 outperforms conventional pretrained caption-evaluation metrics on multiple meta-evaluation datasets and shows stronger consistency with human preferences.