LLAVIDAL: Benchmarking Large Language Vision Models for Daily Activities of Living

♦UNC Charlotte, †Inria, †Université Côte d'Azur

*Indicates Equal Contribution

Comparison of LLVM vs LLAVIDAL: In real-world scenarios, web-video-trained models struggle to understand Activities of Daily Living due to the subtle nuances in the video, whereas our ADL-X-trained LLAVIDAL model excels at understanding complex human-object interactions.

Abstract

Large Language Vision Models (LLVMs) have demonstrated effectiveness in processing internet videos, yet they struggle with the visually perplexing dynamics present in Activities of Daily Living (ADL) due to limited pertinent datasets and models tailored to relevant cues. To this end, we propose a framework for curating ADL multiview datasets to fine-tune LLVMs, resulting in the creation of ADL-X, comprising 100K RGB video-instruction pairs, language descriptions, 3D skeletons, and action-conditioned object trajectories. We introduce LLAVIDAL, an LLVM capable of incorporating 3D poses and relevant object trajectories to understand the intricate spatiotemporal relationships within ADLs. Furthermore, we present a novel benchmark, ADLMCQ, for quantifying LLVM effectiveness in ADL scenarios. When trained on ADL-X, LLAVIDAL consistently achieves state-of-the-art performance across all ADL evaluation metrics. Qualitative analysis reveals LLAVIDAL's temporal reasoning capabilities in understanding ADL.
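Concretely, LLAVIDAL injects pose and object cues into the language model alongside video features. Below is a minimal, hypothetical PyTorch sketch of such multi-cue conditioning: each cue stream is linearly projected into the LLM token space and the resulting tokens are concatenated ahead of the text prompt. Module names, feature dimensions, and the projection design are our illustrative assumptions, not the released implementation.

import torch
import torch.nn as nn

class MultiCueProjector(nn.Module):
    """Illustrative sketch: project video, 3D-pose, and object-trajectory
    features into a shared LLM embedding space (dimensions are assumed)."""

    def __init__(self, d_video=1024, d_pose=216, d_obj=512, d_llm=4096):
        super().__init__()
        self.video_proj = nn.Linear(d_video, d_llm)
        self.pose_proj = nn.Linear(d_pose, d_llm)
        self.obj_proj = nn.Linear(d_obj, d_llm)

    def forward(self, video_feats, pose_feats, obj_feats):
        # video_feats: (B, T_v, d_video) frame-level visual features
        # pose_feats:  (B, T_p, d_pose)  3D skeleton features
        # obj_feats:   (B, T_o, d_obj)   action-conditioned object tracks
        cue_tokens = torch.cat(
            [self.video_proj(video_feats),
             self.pose_proj(pose_feats),
             self.obj_proj(obj_feats)],
            dim=1,
        )
        # (B, T_v + T_p + T_o, d_llm): prepended to the text-prompt tokens
        return cue_tokens

# Toy usage with random features
proj = MultiCueProjector()
tokens = proj(torch.randn(1, 100, 1024),
              torch.randn(1, 100, 216),
              torch.randn(1, 100, 512))
print(tokens.shape)  # torch.Size([1, 300, 4096])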

Quantitative Results

Figure: Impact of ADL-X Training
Figure: ADLMCQ - Action Recognition
Figure: ADLMCQ - Action Forecasting
Figure: Effect of Introducing Pose and Object Cues on ADLMCQ Action Recognition
Figure: Effect of Introducing Pose and Object Cues on ADLMCQ Action Forecasting
Figure: Effect of Introducing Pose and Object Cues on ADLMCQ Action Description
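ADLMCQ frames each task as a multiple-choice question about a video clip. The sketch below shows one way such an evaluation loop could look; the prompt format, the model.generate interface, and the answer-matching rule are assumptions for illustration, not the benchmark's actual code.

# Hypothetical ADLMCQ-style evaluation loop; prompt format and model
# interface are assumptions, not the released benchmark code.

def format_mcq(question, options):
    letters = "ABCD"
    lines = [question]
    lines += [f"{letters[i]}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with the letter of the correct option.")
    return "\n".join(lines)

def evaluate(model, samples):
    # samples: dicts with 'video', 'question', 'options', 'answer' keys
    correct = 0
    for s in samples:
        prompt = format_mcq(s["question"], s["options"])
        pred = model.generate(video=s["video"], prompt=prompt).strip()
        gold = "ABCD"[s["options"].index(s["answer"])]
        # Accept either the bare letter or the full option text.
        if pred.upper().startswith(gold) or s["answer"].lower() in pred.lower():
            correct += 1
    return correct / len(samples)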

Qualitative Results

BibTeX


@misc{chakraborty2024llavidal,
  title={LLAVIDAL: Benchmarking Large Language Vision Models for Daily Activities of Living},
  author={Rajatsubhra Chakraborty and Arkaprava Sinha and Dominick Reilly and Manish Kumar Govind and Pu Wang and Francois Bremond and Srijan Das},
  year={2024},
  eprint={2406.09390},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

Usage License

The dataset is released under the Creative Commons CC BY license, which allows users to distribute, remix, adapt, and build upon the material in any medium or format, provided the creator is attributed. The license also permits commercial use of ADL-X. As the authors of this manuscript and collectors of this dataset, we reserve the right to distribute the data.