Publications
Can text-to-image model assist multi-modal learning for visual recognition with visual modality missing?
Abstract
Multi-modal learning has emerged as an increasingly promising avenue in visual recognition, driving innovations across diverse domains. Despite its success, the robustness of multi-modal learning for visual recognition is often challenged by the unavailability of a subset of modalities, especially the visual modality. Conventional approaches to mitigating missing modalities in multi-modal learning rely heavily on modality fusion schemes. In contrast, this paper explores the use of text-to-image models to assist multi-modal learning. Specifically, we propose and explore GTI-MM, a simple but effective multi-modal learning framework that enhances data efficiency and model robustness against a missing visual modality by imputing the missing data with generative models. Using multiple multi-modal datasets with visual recognition tasks, we present a comprehensive analysis of diverse conditions involving missing visual …
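
The imputation idea in the abstract is straightforward to sketch: when a sample's image is missing, a text-to-image model generates a surrogate image from the paired text before the sample reaches a standard multi-modal classifier. Below is a minimal sketch assuming a Stable Diffusion checkpoint served through the Hugging Face diffusers library; the checkpoint choice, the `impute_missing_visual` helper, and the prompt construction are illustrative assumptions, not GTI-MM's actual recipe.

```python
# Minimal sketch of generative imputation for a missing visual modality.
# Assumptions: Stable Diffusion via Hugging Face diffusers as the
# text-to-image model; GTI-MM's actual prompting, sampling, and training
# details are described in the paper.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # illustrative checkpoint choice
    torch_dtype=torch.float16,
).to("cuda")

def impute_missing_visual(sample: dict) -> dict:
    """If a sample lacks its image, synthesize one from the paired text."""
    if sample.get("image") is None:
        # Condition generation on the available text modality so the
        # downstream classifier always sees a complete (text, image) pair.
        sample["image"] = pipe(sample["text"]).images[0]
    return sample

# Usage: complete each sample before fusion, so no missing-modality-aware
# fusion scheme is required downstream.
sample = impute_missing_visual({"text": "a dog catching a frisbee", "image": None})
```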
Metadata
- publication: Proceedings of the 26th International Conference on Multimodal Interaction, 2024
- year: 2024
- publication date: 2024/11/4
- authors: Tiantian Feng, Daniel Yang, Digbalay Bose, Shrikanth Narayanan
- link: https://dl.acm.org/doi/abs/10.1145/3678957.3685725
- resource_link: https://dl.acm.org/doi/pdf/10.1145/3678957.3685725
- book: Proceedings of the 26th International Conference on Multimodal Interaction
- pages: 124-133