From Correspondence to Actions: Human-Like Multi-Image Spatial Reasoning in Multi-modal Large Language Models
arXiv preprint · February 2026
Proposes a framework for evaluating human-like multi-image spatial reasoning in multi-modal large language models, analyzing how models establish correspondence across images and translate spatial understanding into actions.
BibTeX
@misc{ohi2026hatch,
  title={From Correspondence to Actions: Human-Like Multi-Image Spatial Reasoning in Multi-modal Large Language Models},
  author={Masanari Ohi and Koki Maeda and Ryuto Koike and Daisuke Oba and Nakamasa Inoue and Naoaki Okazaki},
  year={2026},
  eprint={2602.08735},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}