LLAVIDAL: A Large LAnguage VIsion Model for Daily Activities of Living

1 UNC Charlotte 2 Salesforce AI Research 3 Inria 4 Université Côte d'Azur
Our paper's teaser image

Comparison of a web-video-trained LLVM and our proposed LLAVIDAL for understanding Activities of Daily Living. In real-world scenarios, web-video-trained models struggle to capture the fine-grained details and human-object interactions present in Activities of Daily Living. In contrast, LLAVIDAL is trained on a curated ADL dataset called ADL-X and incorporates specialized modalities (3D human skeletons and human-object interaction cues) into its training, enabling more accurate interpretation of daily activities.

Abstract

Current Large Language Vision Models (LLVMs) trained on web videos perform well in general video understanding but struggle with the fine-grained details, complex human-object interactions (HOI), and view-invariant representation learning essential for Activities of Daily Living (ADL). This limitation stems from a lack of specialized ADL video instruction-tuning datasets and insufficient modality integration to capture discriminative action representations. To address this, we propose a semi-automated framework for curating ADL datasets, creating ADL-X, a multiview, multimodal RGBS instruction-tuning dataset. Additionally, we introduce LLAVIDAL, an LLVM integrating videos, 3D skeletons, and HOIs to model the complex spatiotemporal relationships in ADL. For training LLAVIDAL, a simple joint alignment of all modalities yields suboptimal results; thus, we propose a Multimodal Progressive (MMPro) training strategy that incorporates modalities in stages following a curriculum. We also establish ADL MCQ and video description benchmarks to assess LLVM performance on ADL tasks. Trained on ADL-X, LLAVIDAL achieves state-of-the-art performance across ADL benchmarks.
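To make the MMPro idea concrete, the sketch below shows one way a staged, curriculum-style schedule could be wired up: each stage adds a modality (video, then skeleton, then HOI) and only the projectors of the active modalities are optimized. All module names, the sum-based fusion, and the stage ordering are illustrative assumptions, not the paper's released implementation.

# Illustrative sketch of a staged (curriculum-style) multimodal training loop.
# Names, dimensions, and the fusion scheme are hypothetical placeholders.
import torch
import torch.nn as nn

class DummyProjector(nn.Module):
    """Maps a modality feature into the LLM token space (placeholder)."""
    def __init__(self, in_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, llm_dim)

    def forward(self, x):
        return self.proj(x)

# Hypothetical feature dimensions for each modality.
LLM_DIM = 64
projectors = {
    "video": DummyProjector(32, LLM_DIM),
    "skeleton": DummyProjector(16, LLM_DIM),
    "hoi": DummyProjector(24, LLM_DIM),
}

# Progressive curriculum: each stage adds one modality on top of the previous ones.
stages = [
    ["video"],
    ["video", "skeleton"],
    ["video", "skeleton", "hoi"],
]

def fake_batch():
    """Random features standing in for pre-extracted modality tokens."""
    return {
        "video": torch.randn(4, 32),
        "skeleton": torch.randn(4, 16),
        "hoi": torch.randn(4, 24),
        "target": torch.randn(4, LLM_DIM),
    }

for stage_idx, active in enumerate(stages, start=1):
    # Only the projectors of the active modalities receive gradients in this stage.
    params = [p for m in active for p in projectors[m].parameters()]
    optimizer = torch.optim.AdamW(params, lr=1e-4)
    for step in range(5):  # a handful of toy steps per stage
        batch = fake_batch()
        # Fuse active modalities by summing their projected tokens (placeholder fusion).
        fused = sum(projectors[m](batch[m]) for m in active)
        loss = nn.functional.mse_loss(fused, batch["target"])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f"stage {stage_idx}: trained with modalities {active}")

In this toy setup the curriculum is expressed purely by which parameters the optimizer sees at each stage; the frozen LLM and the instruction-tuning objective of the actual model are omitted for brevity.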

Quantitative Results

Impact of ADL-X Training

ADL MCQ - Action Recognition

ADL MCQ - Temporal Completion

Effect of Introducing Skeleton and Object Cues on ADL MCQ - Action Recognition

Effect of Introducing Skeleton and Object Cues on ADL MCQ - Temporal Completion

Effect of Introducing Skeleton and Object Cues on Toyota Smarthome Untrimmed Video Descriptions

Qualitative Results

BibTeX


@article{llavidal2024,
  title={LLAVIDAL: A Large LAnguage VIsion Model for Daily Activities of Living},
  author={Dominick Reilly and Rajatsubhra Chakraborty and Arkaprava Sinha and Manish Kumar Govind and Pu Wang and Francois Bremond and Le Xue and Srijan Das},
  journal={arXiv preprint arXiv:2406.09390},
  year={2024}
}
      

Usage License

The dataset is released under the Creative Commons Attribution (CC-BY) license, which allows users to distribute, remix, adapt, and build upon the material in any medium or format, provided the creators are attributed. The license also permits commercial use of ADL-X. As the authors of this manuscript and collectors of this dataset, we reserve the right to distribute the data.