From Spatial AI to Physical AI
Sunghwan Hong
■ Abstract
Building machines that perceive, understand, and interact with the 3D physical world remains a defining challenge of artificial intelligence. In this lecture, I trace a research arc from Spatial AI — systems that reconstruct and interpret 3D environments from visual input — toward Physical AI, where intelligent agents reason about and act within the physical world. I begin with the problem of visual correspondence, showing how transformer-based cost aggregation established a principled foundation for dense matching across viewpoints and domains. I then describe how this machinery enabled a shift toward unified, pose-free 3D reconstruction: jointly optimizing correspondence, camera pose estimation, and novel view synthesis to build generalizable systems that operate from sparse, uncalibrated imagery without per-scene optimization. Moving beyond reconstruction, I discuss recent work on compact 3D scene representations and language-grounded 3D understanding that equip agents not just with geometric maps, but with structured, semantically meaningful models of their surroundings. Finally, I outline the path ahead — persistent 3D world models, generative 4D simulation, and closed-loop embodied systems — arguing that the transition from spatial perception to physical intelligence requires tightly coupling reconstruction, understanding, and action within a single framework.
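As a rough illustration of the cost-aggregation idea mentioned above, the sketch below shows one minimal way a transformer can refine a dense cost volume between two views: compute all-pairs feature similarities, treat each source pixel's row of match scores as a token, and let self-attention aggregate evidence across match hypotheses. This is a hedged toy, not the method presented in the lecture; the module name CostAggregator, the layer sizes, and all tensor shapes are illustrative assumptions.

    # Minimal sketch of transformer-based cost aggregation for dense matching.
    # Not the lecture's actual model; names and shapes are illustrative.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CostAggregator(nn.Module):
        def __init__(self, num_matches: int, dim: int = 128, heads: int = 4):
            super().__init__()
            # Project each pixel's row of match scores into a token embedding.
            # num_matches must equal H * W of the feature maps passed to forward().
            self.embed = nn.Linear(num_matches, dim)
            layer = nn.TransformerEncoderLayer(
                d_model=dim, nhead=heads, batch_first=True
            )
            self.encoder = nn.TransformerEncoder(layer, num_layers=2)
            self.out = nn.Linear(dim, num_matches)

        def forward(self, feat_src: torch.Tensor, feat_tgt: torch.Tensor):
            # feat_*: (B, C, H, W) dense features from the two views.
            B, C, H, W = feat_src.shape
            src = feat_src.flatten(2).transpose(1, 2)   # (B, HW, C)
            tgt = feat_tgt.flatten(2)                   # (B, C, HW)
            # Raw cost volume: similarity of every source pixel to every
            # target pixel, scaled for numerical stability.
            cost = torch.einsum("bnc,bcm->bnm", src, tgt) / C ** 0.5  # (B, HW, HW)
            tokens = self.embed(cost)                   # (B, HW, dim)
            tokens = self.encoder(tokens)               # aggregate match evidence globally
            refined = self.out(tokens)                  # (B, HW, HW)
            # Softmax over target positions gives a match distribution per pixel.
            return F.softmax(refined, dim=-1)

    # Usage on dummy features (1 image pair, 32x32 feature maps):
    agg = CostAggregator(num_matches=32 * 32)
    f1 = torch.randn(1, 128, 32, 32)
    f2 = torch.randn(1, 128, 32, 32)
    probs = agg(f1, f2)  # (1, 1024, 1024) match distributions

Note that attending over the full HW x HW volume is quadratic in image area; practical aggregation schemes factorize or pyramid the cost volume, which this sketch deliberately omits.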
■ Bio
Sunghwan Hong is a postdoctoral researcher at ETH Zurich, where he works on computer vision and Spatial AI. His research focuses on 3D reconstruction, scene understanding, visual correspondence, and multimodal learning. He is particularly interested in building reliable perception systems that connect visual understanding with real-world reasoning and interaction. His long-term goal is to advance next-generation AI for embodied and physically grounded intelligence.