Research on vision-language models (VLMs) is advancing rapidly, but evaluation frameworks for Japanese vision-language (V&L) tasks remain inadequate. This paper introduces llm-jp-eval-mm, a toolkit for systematically evaluating Japanese multimodal tasks. It unifies six existing Japanese multimodal tasks into a single framework, enabling consistent benchmarking across multiple metrics. The toolkit is publicly available and aims to facilitate continuous improvement and evaluation of Japanese VLMs.
llm-jp-eval-mm standardizes input/output formats across diverse datasets and separates inference from evaluation. Its modular class structure (Task and Scorer) allows new tasks and models to be incorporated with little effort. Evaluation setup is simplified by removing YAML-based configuration, improving maintainability and usability.
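To illustrate the Task/Scorer separation described above, the following is a minimal sketch. The class names Task and Scorer come from the paper; the Example container and all method names and signatures are illustrative assumptions, not the toolkit's actual API.

```python
# Illustrative sketch only: Task and Scorer follow the paper's design,
# but the method names and signatures below are assumptions, not
# llm-jp-eval-mm's real interface.
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Example:
    """A single benchmark instance in a unified input/output format."""
    images: list      # one or more images (multi-image tasks supported)
    question: str     # prompt text in Japanese
    reference: str    # gold answer consumed by the scorer


class Task(ABC):
    """Wraps one dataset and converts it to the unified Example format."""

    @abstractmethod
    def load_examples(self) -> list[Example]:
        ...


class Scorer(ABC):
    """Computes a metric over model outputs, independent of inference."""

    @abstractmethod
    def score(self, predictions: list[str], examples: list[Example]) -> float:
        ...


class ExactMatchScorer(Scorer):
    """Exact-match accuracy as a simple example metric."""

    def score(self, predictions, examples):
        correct = sum(p.strip() == e.reference.strip()
                      for p, e in zip(predictions, examples))
        return correct / len(examples)


def evaluate(task: Task, predictions: list[str], scorer: Scorer) -> float:
    """Inference runs elsewhere; evaluation only consumes its outputs."""
    return scorer.score(predictions, task.load_examples())
```

Keeping inference outside the evaluate step means cached model outputs can be re-scored with new metrics, and a new benchmark or metric only requires subclassing Task or Scorer.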
Evaluations were performed on 13 publicly available Japanese and multilingual VLMs. Among Japanese-specialized models, llm-jp-3 VILA achieved the best performance on most tasks. Among multilingual models, Qwen2-VL showed the strongest results, particularly on multi-image tasks, indicating the advantages of its training strategy.
llm-jp-eval-mm is the first framework for systematic evaluation of Japanese VLMs, and its results reveal both the progress of current Japanese VLMs and the gaps that remain relative to large-scale commercial models such as GPT-4o. Future work includes expanding dataset diversity (e.g., specialized domains, image generation) and extending evaluation to other modalities such as 3D vision, audio, video, and Vision-Language-Action (VLA).