Publication Detail

Constructing Multimodal Datasets from Scratch for Rapid Development of a Japanese Visual Language Model

Proceedings of the 2025 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: System Demonstrations)
Methodology for quickly constructing multimodal datasets tailored for Japanese vision-language models.
international
@inproceedings{sasagawa2025llmjp3vila,
  title = {Constructing Multimodal Datasets from Scratch for Rapid Development of a Japanese Visual Language Model},
  author = {Keito Sasagawa and Koki Maeda and Issa Sugiura and Shuhei Kurita and Naoaki Okazaki and Daisuke Kawahara},
  booktitle = {Proceedings of the 2025 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: System Demonstrations)},
  year = {2025},
  address = {Albuquerque, USA},
  publisher = {Association for Computational Linguistics}
}

Abstract

This study addresses the scarcity of multimodal datasets for training visual language models (VLMs) in non-English languages, focusing on Japanese. We present an efficient method for rapidly creating comprehensive Japanese multimodal datasets: we extract Japanese image-text pairs from web archives and generate instruction data directly from images using established VLMs. Compared with machine-translated alternatives, our datasets significantly improve the alignment between visual and textual content. Experimental evaluations demonstrate that VLMs trained on our datasets achieve superior accuracy, promoting regional localization and cultural accuracy in multimodal tasks.

Methodology

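We introduce a streamlined pipeline for constructing multimodal datasets from scratch. It extracts Japanese image-text pairs from web archives and generates instruction data directly from images using established VLMs.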
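
The extraction of Japanese image-text pairs from web archives can be illustrated with a small language filter. The sketch below is not the authors' implementation: the `ImageTextPair` record, the function names, and the 0.5 ratio threshold are all assumptions chosen for illustration, and the character-range heuristic stands in for whatever language identification the actual pipeline uses. It keeps a pair only if its text contains kana and consists mostly of Japanese script.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ImageTextPair:
    """Hypothetical record for one crawled image and its nearby text."""
    image_url: str
    text: str


def _is_kana(ch: str) -> bool:
    # Hiragana (U+3040-U+309F) and Katakana (U+30A0-U+30FF).
    return "\u3040" <= ch <= "\u30ff"


def _is_kanji(ch: str) -> bool:
    # CJK Unified Ideographs (U+4E00-U+9FFF), shared with Chinese.
    return "\u4e00" <= ch <= "\u9fff"


def japanese_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are kana or kanji."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for ch in chars if _is_kana(ch) or _is_kanji(ch))
    return hits / len(chars)


def looks_japanese(text: str, min_ratio: float = 0.5) -> bool:
    """Assumed heuristic: require kana so Chinese-only text is rejected."""
    has_kana = any(_is_kana(ch) for ch in text)
    return has_kana and japanese_ratio(text) >= min_ratio


def filter_japanese_pairs(
    pairs: Iterable[ImageTextPair], min_ratio: float = 0.5
) -> List[ImageTextPair]:
    """Keep only pairs whose text passes the Japanese-script heuristic."""
    return [p for p in pairs if looks_japanese(p.text, min_ratio)]


if __name__ == "__main__":
    sample = [
        ImageTextPair("https://example.com/a.jpg", "富士山の写真です"),
        ImageTextPair("https://example.com/b.jpg", "A photo of Mount Fuji"),
        ImageTextPair("https://example.com/c.jpg", "这是一张照片"),
    ]
    for pair in filter_japanese_pairs(sample):
        print(pair.image_url, pair.text)
```

Requiring at least one kana character is a deliberate design choice in this sketch: kanji alone cannot distinguish Japanese from Chinese text, so a ratio test on CJK characters by itself would admit many non-Japanese pairs. A production pipeline would more likely rely on a trained language identifier.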

Experimental Results

llm-jp-3 VILA 14B achieved state-of-the-art performance on Japanese benchmarks, including Heron-Bench, JA-VLM-Bench-In-the-Wild, and JA-VG-VQA-500.

Key Contributions

Conclusion and Future Work

Our methodology significantly enriches resources for Japanese VLMs, addressing the critical gap in non-English multimodal datasets. Future research includes expanding dataset diversity and enhancing dataset quality through advanced filtering and synthesis techniques.