Constructing Multimodal Datasets from Scratch for Rapid Development of a Japanese Visual Language Model
Proceedings of the 2025 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: System Demonstrations) · June 2025
Methodology for quickly constructing multimodal datasets tailored for Japanese vision-language models.
BibTeX
@inproceedings{sasagawa2025llmjp3vila,
title = {Constructing Multimodal Datasets from Scratch for Rapid Development of a Japanese Visual Language Model},
author = {Keito Sasagawa and Koki Maeda and Issa Sugiura and Shuhei Kurita and Naoaki Okazaki and Daisuke Kawahara},
booktitle = {Proceedings of the 2025 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: System Demonstrations)},
year = {2025},
address = {Albuquerque, USA},
publisher = {Association for Computational Linguistics}
}
Abstract
This study addresses the scarcity of multimodal datasets for non-English languages, focusing specifically on Japanese, for training visual language models (VLMs). We present an efficient method for rapidly constructing comprehensive Japanese multimodal datasets: extracting Japanese image-text pairs from web archives and generating instruction data directly from images using existing VLMs. Our datasets exhibit significantly better alignment between visual and textual content than machine-translated alternatives. Experimental evaluations demonstrate that VLMs trained on our datasets achieve superior accuracy, promoting regional localization and cultural fidelity in multimodal tasks.
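To make the pair-extraction step in the abstract more concrete, the following is a minimal, hypothetical sketch: scan HTML documents (e.g., pages pulled from a web archive dump) for image tags, pair each image URL with its alt text, and keep only pairs whose text is predominantly Japanese. The paper's actual pipeline, filters, and thresholds are not described here; the script-based language heuristic, the `min_ratio` and `min_len` parameters, and the helper names below are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of extracting Japanese image-text pairs from HTML.
# Heuristics and thresholds are assumptions for illustration only.
import re
from bs4 import BeautifulSoup

# Hiragana, katakana, and CJK unified ideograph ranges.
JP_CHARS = re.compile(r"[\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF]")

def japanese_ratio(text: str) -> float:
    """Fraction of characters in Japanese scripts (rough language filter)."""
    if not text:
        return 0.0
    return len(JP_CHARS.findall(text)) / len(text)

def extract_pairs(html: str, min_ratio: float = 0.3, min_len: int = 5):
    """Yield (image_url, caption) pairs whose caption looks Japanese."""
    soup = BeautifulSoup(html, "html.parser")
    for img in soup.find_all("img"):
        url = img.get("src")
        caption = (img.get("alt") or "").strip()
        if url and len(caption) >= min_len and japanese_ratio(caption) >= min_ratio:
            yield url, caption

if __name__ == "__main__":
    sample = '<img src="https://example.com/fuji.jpg" alt="富士山と桜の写真">'
    print(list(extract_pairs(sample)))  # [('https://example.com/fuji.jpg', '富士山と桜の写真')]
```

In practice, such raw pairs would still need deduplication, image download and validation, and quality filtering before the instruction-generation stage the abstract mentions.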