This study addresses the scarcity of multimodal datasets for non-English languages, focusing on Japanese, for vision-language models (VLMs). We present an efficient method for rapidly constructing comprehensive Japanese multimodal datasets: extracting Japanese image-text pairs from web archives and generating instruction data directly from images with existing VLMs. The resulting datasets align visual and textual content far more closely than machine-translated alternatives. Experimental evaluations show that VLMs trained on our datasets achieve higher accuracy and better reflect Japanese regional and cultural context in multimodal tasks.
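To make the first step concrete, the following is a minimal sketch of harvesting Japanese image-text pairs from web-archive (WARC) records. The choice of warcio and BeautifulSoup, and the character-ratio heuristic for detecting Japanese text, are illustrative assumptions rather than the exact tooling behind the datasets.

```python
# Sketch: extract (image URL, Japanese alt text) pairs from a WARC file.
# warcio, BeautifulSoup, and the Japanese-text heuristic are assumptions.
import re
from warcio.archiveiterator import ArchiveIterator
from bs4 import BeautifulSoup

JA_CHARS = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")  # hiragana, katakana, kanji

def looks_japanese(text: str, min_ratio: float = 0.3) -> bool:
    """Cheap heuristic: enough Japanese characters relative to total length."""
    if not text:
        return False
    return len(JA_CHARS.findall(text)) / len(text) >= min_ratio

def extract_pairs(warc_path: str):
    """Yield (image_url, alt_text) pairs whose alt text looks Japanese."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            ctype = record.http_headers.get_header("Content-Type") or ""
            if "text/html" not in ctype:
                continue
            soup = BeautifulSoup(record.content_stream().read(), "html.parser")
            for img in soup.find_all("img"):
                src, alt = img.get("src"), (img.get("alt") or "").strip()
                if src and looks_japanese(alt):
                    yield src, alt

if __name__ == "__main__":
    for url, caption in extract_pairs("example.warc.gz"):
        print(url, caption)
```

In practice this stage would also deduplicate, filter by image quality, and resolve relative URLs, which are omitted here for brevity.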
We introduce a streamlined pipeline for constructing multimodal datasets from scratch: harvesting Japanese image-text pairs from web archives, then generating instruction-tuning data directly from the collected images with existing VLMs; a sketch of this second stage is shown below.
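The sketch below illustrates the instruction-generation stage by asking a VLM to produce Japanese instruction-response pairs for a single image. It assumes an OpenAI-compatible chat endpoint (as exposed by many open VLM servers); the server URL, model name, and prompt are placeholders, not the setup actually used to build the datasets.

```python
# Sketch: generate Japanese instruction data from an image via a VLM.
# The endpoint, model name, and prompt are illustrative placeholders.
import base64
import json
from pathlib import Path

from openai import OpenAI  # works against any OpenAI-compatible VLM server

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # assumed local server

PROMPT = (
    "Look at the image and write three Japanese instruction-response pairs "
    "that an assistant could answer about it. Return a JSON list of "
    '{"instruction": ..., "response": ...} objects.'
)

def encode_image(path: str) -> str:
    """Base64-encode an image so it can be sent inline to the VLM."""
    return base64.b64encode(Path(path).read_bytes()).decode("utf-8")

def generate_instructions(image_path: str, model: str = "example-open-vlm") -> list[dict]:
    """Query the VLM once per image and parse its JSON answer."""
    reply = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{encode_image(image_path)}"}},
            ],
        }],
    )
    # A production pipeline would validate or repair malformed JSON here.
    return json.loads(reply.choices[0].message.content)

if __name__ == "__main__":
    print(generate_instructions("sample.jpg"))
```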
Our model, llm-jp-3 VILA 14B, trained on these datasets, demonstrated state-of-the-art performance on Japanese benchmarks including Heron-Bench, JA-VLM-Bench-In-the-Wild, and JA-VG-VQA-500.
Our methodology substantially enriches the resources available for Japanese VLMs, addressing a critical gap in non-English multimodal data. Future work includes broadening dataset diversity and further improving quality through more advanced filtering and synthesis techniques.