What Should Multimodal LLMs Look At in Video? Selective Perception in Video Understanding
Junho Kim
■ Abstract
In this talk, I will discuss a central question in video-based multimodal large language models (MLLMs): what should models actually look at? Despite recent progress, most video models rely on dense visual tokenization, processing large numbers of frames that are often redundant or irrelevant. This leads to inefficient computation and limited understanding of long-form video content. I introduce the concept of selective perception as a guiding principle for addressing this challenge. I will present three key directions: retrieving meaningful temporal segments, learning to select informative frames through reinforcement learning, and enabling streaming-based video understanding for scalable long-context reasoning. These approaches collectively shift the paradigm from processing everything to understanding what matters, providing a foundation for more effective and adaptable video-based multimodal systems.
■ Bio
Junho Kim is a Postdoctoral Researcher at the University of Illinois Urbana-Champaign, working with Prof. James M. Rehg. He received his Ph.D. from KAIST under the supervision of Prof. Yong Man Ro. His research focuses on large multimodal models, particularly understanding how vision-language systems perceive, reason, and fail in real-world settings. He is interested in improving the reliability, robustness, and interpretability of multimodal AI systems, with applications to video understanding and human-centered AI.