Publications

Can text-to-image model assist multi-modal learning for visual recognition with visual modality missing?

Abstract

Multi-modal learning has emerged as an increasingly promising avenue in visual recognition, driving innovations across diverse domains. Despite its success, the robustness of multi-modal learning for visual recognition is often challenged by the unavailability of a subset of modalities, especially the visual modality. Conventional approaches to mitigating missing modalities in multi-modal learning rely heavily on modality fusion schemes. In contrast, this paper explores the use of text-to-image models to assist multi-modal learning. Specifically, we propose and explore a simple but effective multi-modal learning framework, GTI-MM, that enhances data efficiency and model robustness against a missing visual modality by imputing the missing data with generative models. Using multiple multi-modal datasets with visual recognition tasks, we present a comprehensive analysis of diverse conditions involving missing visual …
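
As a rough illustration of the idea the abstract describes (synthesizing a surrogate image from the available text when the visual modality is missing), the sketch below uses the Hugging Face diffusers library with a Stable Diffusion checkpoint. The `sample` dictionary, its field names, and the checkpoint choice are assumptions made for illustration; this is not the paper's GTI-MM implementation.

```python
# Minimal sketch of text-to-image imputation for a missing visual modality.
# Assumes: `diffusers` + `torch` installed, a CUDA device, and samples stored
# as dicts with "text" and (possibly absent) "image" fields.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # illustrative checkpoint choice
    torch_dtype=torch.float16,
).to("cuda")

def impute_visual(sample: dict) -> dict:
    """If a sample lacks its image, generate one conditioned on its text."""
    if sample.get("image") is None:
        # Synthesize a surrogate image from the available text modality.
        sample["image"] = pipe(sample["text"]).images[0]
        sample["image_is_synthetic"] = True  # flag for downstream training
    return sample

# Example: a text-only sample gets an imputed image before multi-modal training.
sample = {"text": "a golden retriever playing in a park", "image": None}
sample = impute_visual(sample)
```

The imputed samples can then be mixed with fully observed ones when training the multi-modal classifier, which is the general strategy the abstract outlines.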

Metadata

publication
Proceedings of the 26th International Conference on Multimodal Interaction, 2024
year
2024
publication date
2024/11/4
authors
Tiantian Feng, Daniel Yang, Digbalay Bose, Shrikanth Narayanan
link
https://dl.acm.org/doi/abs/10.1145/3678957.3685725
resource_link
https://dl.acm.org/doi/pdf/10.1145/3678957.3685725
book
Proceedings of the 26th International Conference on Multimodal Interaction
pages
124–133