UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All
CVPR 2024
Yuanhuiyi Lyu*
AI Thrust, HKUST(GZ)
Xu Zheng*
AI Thrust, HKUST(GZ)
Jiazhou Zhou
AI Thrust, HKUST(GZ)
Addison Lin Wang
AI & CMA Thrust, HKUST(GZ)
Dept. of CSE, HKUST
Abstract
We present UniBind, a flexible and efficient approach that learns a unified representation space for seven diverse modalities: image, text, audio, point cloud, thermal, video, and event data. Existing works, e.g., ImageBind, treat the image as the central modality and build an image-centered representation space; however, this space may be sub-optimal as it leads to an unbalanced representation space among all modalities. Moreover, the category names are directly used to extract text embeddings for the downstream tasks, making it hardly possible to represent the semantics of multi-modal data. The "out-of-the-box" insight of our UniBind is to make the alignment centers modality-agnostic and further learn a unified and balanced representation space, empowered by large language models (LLMs). UniBind is superior in its flexible application to all CLIP-style models and delivers remarkable performance boosts. To make this possible, we 1) construct a knowledge base of text with the help of LLMs and multi-modal LLMs; 2) adaptively build LLM-augmented class-wise embedding centers on top of the knowledge base and encoded visual embeddings; 3) align all the embeddings to the LLM-augmented embedding centers via contrastive learning to achieve a unified and balanced representation space. UniBind shows strong zero-shot recognition performance gains over prior arts by an average of 6.36%. Finally, we achieve new state-of-the-art performance, e.g., a 6.75% gain on ImageNet, in the multi-modal fine-tuning setting while reducing the learnable parameters by 90%.
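As a rough illustration of steps 1)-3) above, the sketch below pools LLM-generated knowledge-base text embeddings with encoded visual embeddings into class-wise centers and aligns modality embeddings to those centers with an InfoNCE-style loss. The function names, the simple mean pooling, and the temperature value are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch of LLM-augmented center construction and alignment
# (hypothetical names/shapes; assumes L2-normalized input embeddings).
import torch
import torch.nn.functional as F

def build_class_centers(text_embeds, visual_embeds):
    """Build one LLM-augmented, modality-agnostic center per class.

    text_embeds / visual_embeds: dicts mapping a class name to an (N, D)
    tensor of L2-normalized embeddings -- knowledge-base text descriptions
    generated by (multi-modal) LLMs and encoded visual samples, respectively.
    """
    centers = {}
    for cls in text_embeds:
        # Pool knowledge-base text and visual embeddings, then average
        # them into a single center for this class.
        pooled = torch.cat([text_embeds[cls], visual_embeds[cls]], dim=0)
        centers[cls] = F.normalize(pooled.mean(dim=0), dim=-1)  # (D,)
    return centers

def center_contrastive_loss(embeds, labels, center_matrix, temperature=0.07):
    """InfoNCE-style loss pulling each embedding toward its class center.

    embeds: (B, D) L2-normalized embeddings from any modality encoder.
    labels: (B,) integer class indices into center_matrix.
    center_matrix: (C, D) stacked class centers.
    """
    logits = embeds @ center_matrix.t() / temperature  # (B, C) cosine similarities
    return F.cross_entropy(logits, labels)
```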
Overall framework of our UniBind
An overview of our UniBind. First, we construct the knowledge base; then we learn a unified representation space via LLM-augmented contrastive learning, powered by the knowledge base. Lastly, we utilize the embedding centers localized by the knowledge base to obtain the predictions.
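The prediction step described above can be pictured as a nearest-center lookup. This is a minimal sketch with hypothetical names, assuming L2-normalized embeddings and a stacked matrix of the LLM-augmented class centers.

```python
# Hedged sketch of zero-shot prediction against class centers
# (illustrative only; assumes L2-normalized inputs).
import torch

@torch.no_grad()
def zero_shot_predict(embeds, center_matrix):
    """Assign each embedding to its most similar class center.

    embeds: (B, D) L2-normalized embeddings from any modality encoder.
    center_matrix: (C, D) stacked LLM-augmented class centers.
    Returns: (B,) predicted class indices.
    """
    similarity = embeds @ center_matrix.t()  # (B, C) cosine similarities
    return similarity.argmax(dim=-1)
```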
Results of emergent zero-shot and fine-tuning recognition on six modalities.
The t-SNE visualization of the representation space.
The t-SNE visualization of the embedding centers.
BibTeX
@article{lyu2024unibind,
  title={UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All},
  author={Lyu, Yuanhuiyi and Zheng, Xu and Zhou, Jiazhou and Wang, Lin},
  journal={arXiv preprint arXiv:2403.12532},
  year={2024}
}