UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All
CVPR 2024

Abstract

We present UniBind, a flexible and efficient approach that learns a unified representation space for seven diverse modalities: image, text, audio, point cloud, thermal, video, and event data. Existing works, e.g., ImageBind, treat the image as the central modality and build an image-centered representation space; however, this space may be sub-optimal, as it leads to an unbalanced representation space among all modalities. Moreover, category names are directly used to extract text embeddings for the downstream tasks, making it hardly possible to represent the semantics of multi-modal data. The "out-of-the-box" insight of our UniBind is to make the alignment centers modality-agnostic and further learn a unified and balanced representation space, empowered by large language models (LLMs). UniBind is superior in its flexible application to all CLIP-style models and delivers remarkable performance boosts. To make this possible, we 1) construct a knowledge base of text with the help of LLMs and multi-modal LLMs; 2) adaptively build LLM-augmented class-wise embedding centers on top of the knowledge base and encoded visual embeddings; 3) align all the embeddings to the LLM-augmented embedding centers via contrastive learning to achieve a unified and balanced representation space. UniBind shows strong zero-shot recognition performance gains over prior arts by an average of 6.36%. Finally, we achieve new state-of-the-art performance, e.g., a 6.75% gain on ImageNet, in the multi-modal fine-tuning setting while reducing 90% of the learnable parameters.
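Step 3 above aligns every modality's embeddings to class-wise embedding centers rather than to a single central modality. A minimal numpy sketch of such a center-based contrastive objective is below; the function name, temperature value, and exact loss form are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def contrastive_to_centers(embeddings, labels, centers, temperature=0.07):
    """InfoNCE-style loss pulling each embedding toward its class-wise
    embedding center (a sketch; UniBind's exact loss and temperature
    are not specified here).

    embeddings: (N, D) L2-normalized multi-modal embeddings
    labels:     (N,)   class index of each embedding
    centers:    (C, D) L2-normalized class-wise embedding centers
    """
    # Cosine similarities between embeddings and all centers, scaled.
    logits = embeddings @ centers.T / temperature        # (N, C)
    # Subtract the row-wise max for numerical stability of the softmax.
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Negative log-likelihood of each embedding's own class center.
    return -log_probs[np.arange(len(labels)), labels].mean()
```

Because the loss treats all modalities symmetrically (any embedding is pulled to a modality-agnostic center), no single modality anchors the space, which is the "balanced" property the abstract describes.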

Overall framework of our UniBind

An overview of our UniBind. First, we construct the knowledge base; then we learn a unified representation space via LLM-augmented contrastive learning, powered by the knowledge base. Finally, we use the embedding centers localized by the knowledge base to obtain the predictions.
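The final inference step amounts to assigning each test embedding the class of its most similar embedding center. A short sketch of that nearest-center prediction, assuming cosine similarity and a hypothetical helper name:

```python
import numpy as np

def predict_by_centers(embeddings, centers):
    """Zero-shot prediction sketch: each embedding gets the class index
    of its most cosine-similar LLM-augmented embedding center."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    c = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    return (e @ c.T).argmax(axis=1)   # (N,) predicted class indices
```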

The results of the emergent Zero-shot and Fine-tuning Recognition on six modalities.

The t-SNE visualization of representation space.

The t-SNE visualization of embedding centers.

BibTeX

             
@article{lyu2024unibind,
  title={UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All},
  author={Lyu, Yuanhuiyi and Zheng, Xu and Zhou, Jiazhou and Wang, Lin},
  journal={arXiv preprint arXiv:2403.12532},
  year={2024}
}