Vision Language Model-based Caption Evaluation Method Leveraging Visual Context Extraction
arXiv preprint · February 2024
VisCE^2, a vision-language-model-based caption evaluation method that replaces human-written references with structured visual context extracted from images.
BibTeX
@misc{maeda2024vision,
title={Vision Language Model-based Caption Evaluation Method Leveraging Visual Context Extraction},
author={Koki Maeda and Shuhei Kurita and Taiki Miyanishi and Naoaki Okazaki},
year={2024},
eprint={2402.17969},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2402.17969}
}
Abstract
Given the accelerating progress of vision and language modeling, accurate evaluation of machine-generated image captions remains critical. Conventional metrics often rely on lexical overlap or embedding similarity and fail to capture fine-grained quality differences that align with human judgment.
This paper presents VisCE^2, a vision-language-model-based caption evaluation method that leverages visual context extraction. By extracting structured visual context from the image, covering its objects, attributes, and relationships, the method replaces human-written references with richer contextual cues and improves caption assessment.
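The extract-then-evaluate idea can be sketched roughly as follows. This is an illustrative sketch only, not the paper's implementation: the context schema, the prompt template, and all names (`VisualContext`, `build_evaluation_prompt`) are assumptions, and the actual VLM calls for extraction and scoring are omitted.

```python
# Illustrative sketch of a VisCE^2-style reference-free evaluation pipeline.
# The context fields and prompt wording are assumptions for illustration;
# in the real method a VLM would both extract the context and score the caption.

from dataclasses import dataclass, field


@dataclass
class VisualContext:
    """Structured visual context assumed to be extracted from an image by a VLM."""
    objects: list[str] = field(default_factory=list)
    attributes: dict[str, list[str]] = field(default_factory=dict)
    relationships: list[str] = field(default_factory=list)

    def to_prompt_section(self) -> str:
        # Flatten the structured context into plain text for the prompt.
        lines = ["Objects: " + ", ".join(self.objects)]
        for obj, attrs in self.attributes.items():
            lines.append(f"Attributes of {obj}: " + ", ".join(attrs))
        lines.append("Relationships: " + "; ".join(self.relationships))
        return "\n".join(lines)


def build_evaluation_prompt(context: VisualContext, caption: str) -> str:
    """Compose a scoring prompt from the visual context and candidate caption.

    The extracted context stands in for human-written reference captions.
    """
    return (
        "You are evaluating an image caption.\n"
        "Visual context extracted from the image:\n"
        f"{context.to_prompt_section()}\n\n"
        f"Candidate caption: {caption}\n"
        "Rate the caption's accuracy and completeness from 1 to 5."
    )


# Hypothetical example context for a single image.
context = VisualContext(
    objects=["dog", "frisbee"],
    attributes={"dog": ["brown", "running"], "frisbee": ["red"]},
    relationships=["dog is catching the frisbee"],
)
prompt = build_evaluation_prompt(context, "A brown dog catches a red frisbee.")
```

The prompt produced here would then be sent to a VLM alongside the image; the key design point is that no human-written reference caption appears anywhere in the input.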
Results
The paper reports that VisCE^2 outperforms conventional pretrained caption-evaluation metrics on multiple meta-evaluation datasets and correlates more closely with human preferences.