Publications
Can Multimodal Foundation Models Help Analyze Child-Inclusive Autism Diagnostic Videos?
Abstract
Multimodal foundation models have paved the way for a paradigm shift in long video understanding. The extent to which these models can help analyze verbal and non-verbal behaviors in the context of human interactions is underexplored, particularly in the challenging settings of clinical diagnosis and treatment. We investigate the use of foundation models across speech, video, and text modalities to analyze child-focused interactions in the context of autism diagnosis. We evaluate model performance on two related tasks: activity understanding and atypical behavior detection. We further propose a unified methodology for merging information from audio and video streams by leveraging large language models as reasoning agents. Our experiments reveal that, while models perform relatively well on coarse-grained tasks such as activity recognition and over-activity identification, they fail to generalize to fine-grained tasks such as anxiety detection and activity segmentation.
- Date: September 7, 2025
- Authors: Aditya Kommineni, Digbalay Bose, Tiantian Feng, So Hyun Kim, Helen Tager-Flusberg, Somer Bishop, Catherine Lord, Sudarsana Kadiri, Shrikanth Narayanan
- Conference: Proc. Interspeech 2025
- Pages: 3050-3054