What Should Multimodal LLMs Look At in Video? Selective Perception in Video Understanding
Junho Kim
■ Abstract
In this talk, I will discuss a central question in video-based multimodal large language models (MLLMs): what should models actually look at? Despite recent progress, most video models rely on dense visual tokenization, processing large numbers of frames that are often redundant or irrelevant. This leads to inefficient computation and limited understanding of long-form video content. I introduce the concept of selective perception as a guiding principle for addressing this challenge. I will present three key directions: retrieving meaningful temporal segments, learning to select informative frames through reinforcement learning, and enabling streaming-based video understanding for scalable long-context reasoning. These approaches collectively shift the paradigm from processing everything to understanding what matters, providing a foundation for more effective and adaptable video-based multimodal systems.
■ Bio
Junho Kim is a Postdoctoral Researcher at the University of Illinois Urbana-Champaign, working with Prof. James M. Rehg. He received his Ph.D. from KAIST under the supervision of Prof. Yong Man Ro. His research focuses on large multimodal models, particularly understanding how vision-language systems perceive, reason, and fail in real-world settings. He is interested in improving the reliability, robustness, and interpretability of multimodal AI systems, with applications to video understanding and human-centered AI.