DueT: Image-Text Contrastive Transfer Learning with Dual-adapter Tuning

Taku Hasegawa, Kyosuke Nishida, Koki Maeda, Kuniko Saito

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023) · December 2023

DueT is a parameter-efficient transfer learning method for building CLIP-style vision-language models with contrastive learning. The image and text encoders are initialized from models pre-trained on uni-modal corpora and then frozen; only adapters inserted into each encoder are trained. Unlike conventional adapters, DueT's adapters include a gating mechanism that transfers and connects the knowledge of the frozen uni-modal encoders while preventing catastrophic forgetting. On zero-shot image and text retrieval in both English and Japanese domains, DueT outperforms simple fine-tuning, a locked-image baseline that trains only the text encoder, and a LoRA-based adapter method in both accuracy and parameter efficiency.

BibTeX

@inproceedings{hasegawa2023duet,
  title = {DueT: Image-Text Contrastive Transfer Learning with Dual-adapter Tuning},
  author = {Taku Hasegawa and Kyosuke Nishida and Koki Maeda and Kuniko Saito},
  booktitle = {Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023)},
  pages = {13607--13624},
  year = {2023},
  address = {Singapore},
  publisher = {Association for Computational Linguistics}
}

Abstract

This paper presents DueT, a novel transfer learning method for vision and language models built by contrastive learning. In DueT, adapters are inserted into the image and text encoders, which have been initialized using models pre-trained on uni-modal corpora and then frozen. By training only these adapters, DueT enables efficient learning with a reduced number of trainable parameters. Moreover, unlike traditional adapters, those in DueT are equipped with a gating mechanism, enabling effective transfer and connection of knowledge acquired from pre-trained uni-modal encoders while preventing catastrophic forgetting. We report that DueT outperformed simple fine-tuning, the conventional method fixing only the image encoder and training only the text encoder, and the LoRA-based adapter method in accuracy and parameter efficiency for 0-shot image and text retrieval in both English and Japanese domains.

Methodology

DueT employs a dual-adapter tuning approach where adapters with gating mechanisms are inserted into both the image and text encoders. These encoders are initialized with models pre-trained on uni-modal corpora and kept frozen during training. Only the adapters are trained, which allows for efficient learning with fewer parameters. The gating mechanism facilitates the effective transfer of knowledge from the pre-trained encoders while preventing catastrophic forgetting.
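To make the idea concrete, the PyTorch sketch below shows a minimal gated bottleneck adapter attached to a frozen encoder. The hidden size, bottleneck width, the zero-initialized scalar gate, and the add_adapters helper are illustrative assumptions for this sketch, not the exact design used in DueT.

import torch
import torch.nn as nn

class GatedAdapter(nn.Module):
    """Bottleneck adapter with a learnable gate (illustrative sketch).

    The gate scales the adapter branch before it is added back to the
    frozen block's output. Starting the gate at zero means the adapted
    encoder initially reproduces the pre-trained behaviour; the exact
    gating used in DueT may differ.
    """
    def __init__(self, hidden_dim: int = 768, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()
        self.gate = nn.Parameter(torch.zeros(1))  # zero-init: no change at start

    def forward(self, frozen_out: torch.Tensor) -> torch.Tensor:
        adapted = self.up(self.act(self.down(frozen_out)))
        return frozen_out + self.gate * adapted


def add_adapters(encoder: nn.Module, num_layers: int, hidden_dim: int) -> nn.ModuleList:
    """Freeze every encoder parameter and create one adapter per layer.

    How the adapter outputs are wired into the forward pass depends on the
    encoder implementation; this helper only sets up the trainable modules.
    """
    for p in encoder.parameters():
        p.requires_grad = False  # only the adapters will receive gradients
    return nn.ModuleList(GatedAdapter(hidden_dim) for _ in range(num_layers))

The same construction is applied to both the image and the text encoder, which is what "dual-adapter" refers to; the frozen backbones keep their uni-modal knowledge while the gated adapters learn the cross-modal alignment.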

Experimental Results

Experiments in both English and Japanese domains showed that DueT outperforms several baselines on zero-shot image and text retrieval. Specifically, DueT achieved higher accuracy and better parameter efficiency than simple fine-tuning, the conventional approach of freezing the image encoder and training only the text encoder, and a LoRA-based adapter method.
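As a concrete illustration of the zero-shot retrieval evaluation, the sketch below scores paired image and text embeddings by cosine similarity and computes Recall@k, the standard retrieval metric. The embedding dimension and the random toy inputs are placeholders; in practice the embeddings would come from the adapted encoders.

import torch
import torch.nn.functional as F

def retrieval_recall_at_k(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          k: int = 1) -> float:
    """Text-to-image Recall@k for paired embeddings.

    Row i of image_emb and text_emb describe the same image-caption pair.
    Embeddings are L2-normalized so the dot product is cosine similarity,
    as in CLIP-style contrastive models.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    sim = text_emb @ image_emb.t()                  # (num_texts, num_images)
    topk = sim.topk(k, dim=-1).indices              # best-matching images per text
    targets = torch.arange(sim.size(0)).unsqueeze(-1)
    return (topk == targets).any(dim=-1).float().mean().item()

# Toy usage with random features standing in for encoder outputs.
img = torch.randn(100, 512)
txt = torch.randn(100, 512)
print(f"R@5 = {retrieval_recall_at_k(img, txt, k=5):.3f}")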

Key Contributions

  • Transferability: DueT effectively transfers and connects knowledge acquired by pre-trained uni-modal encoders without catastrophic forgetting.
  • Parameter Efficiency: The method achieves superior performance in image-text pre-training tasks while utilizing fewer trainable parameters.

Conclusion and Future Work

DueT presents a promising approach to transfer learning in vision and language models by leveraging dual-adapter tuning with gating mechanisms. Future work may explore the application of DueT to other languages and domains, as well as its integration with different model architectures and training objectives.