Publications
Can Multimodal Foundation Models Help Analyze Child-Inclusive Autism Diagnostic Videos?
Abstract
Multimodal foundation models have paved the way for a paradigm shift in long video understanding. The extent to which these models can help analyze verbal and non-verbal behaviors in the context of human interactions is underexplored, particularly in the challenging settings of clinical diagnosis and treatment. We investigate the use of foundation models across speech, video, and text modalities to analyze child-focused interactions in the context of autism diagnosis. We evaluate model performance on two related tasks: activity understanding and atypical behavior detection. We further propose a unified methodology for merging information from audio and video streams by leveraging large language models as reasoning agents. Our experiments reveal that, while models perform relatively well on coarse-grained tasks such as activity recognition and over-activity identification, they fail to generalize to fine-grained tasks such as anxiety detection and activity segmentation.
- Date: September 7, 2025
- Authors: Aditya Kommineni, Digbalay Bose, Tiantian Feng, So Hyun Kim, Helen Tager-Flusberg, Somer Bishop, Catherine Lord, Sudarsana Kadiri, Shrikanth Narayanan
- Conference: Proc. Interspeech 2025
- Pages: 3050-3054