From Correspondence to Actions: Human-Like Multi-Image Spatial Reasoning in Multi-modal Large Language Models

Masanari Ohi, Koki Maeda, Ryuto Koike, Daisuke Oba, Nakamasa Inoue, Naoaki Okazaki

International Conference on Machine Learning (ICML 2026) · February 2026

Proposes a framework for evaluating human-like multi-image spatial reasoning in multi-modal large language models, analyzing how models establish correspondence across images and translate spatial understanding into actions.

BibTeX

@inproceedings{ohi2026hatch,
  title={From Correspondence to Actions: Human-Like Multi-Image Spatial Reasoning in Multi-modal Large Language Models},
  author={Masanari Ohi and Koki Maeda and Ryuto Koike and Daisuke Oba and Nakamasa Inoue and Naoaki Okazaki},
  booktitle={International Conference on Machine Learning (ICML 2026)},
  year={2026},
  eprint={2602.08735},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2602.08735}
}