HSR: Holistic 3D Human-Scene Reconstruction from Monocular Videos

ETH Zürich, Department of Computer Science
ECCV 2024


HSR jointly reconstructs dynamic humans and static scenes from monocular RGB videos.

Abstract

An overarching goal for computer-aided perception systems is the holistic understanding of the human-centric 3D world, including faithful reconstructions of humans, scenes, and their global spatial relationships. While recent progress in monocular 3D reconstruction has been made for footage of either humans or scenes alone, the joint reconstruction of both humans and scenes, along with their global spatial information, remains an unsolved challenge. To address this, we introduce a novel and unified framework that simultaneously achieves temporally and spatially coherent 3D reconstruction of static scenes with dynamic humans from monocular RGB videos. Specifically, we parameterize temporally consistent canonical human models and static scene representations using two neural fields in a shared 3D space. Additionally, we develop a global optimization framework that considers physical constraints imposed by potential human-scene interpenetration and occlusion. Compared to separate reconstructions, our framework enables detailed and holistic geometry reconstructions of both humans and scenes. Furthermore, we introduce a synthetic dataset for quantitative evaluations. Extensive experiments and ablation studies on both real-world and synthetic videos demonstrate the efficacy of our framework in monocular human-scene reconstruction.
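The abstract describes parameterizing a canonical human model and a static scene as two neural fields queried in a shared 3D space. The following is a minimal, hypothetical sketch of that idea (not the authors' code): two small random-weight coordinate MLPs stand in for the learned fields, the human field is evaluated after a placeholder canonical warp, and the two densities are composited per query point.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_mlp(in_dim=3, hidden=32, out_dim=2):
    """Random-weight 2-layer MLP: 3D point -> (raw density, feature).
    Stands in for a trained neural field; weights are illustrative only."""
    W1 = rng.standard_normal((in_dim, hidden)) * 0.1
    b1 = np.zeros(hidden)
    W2 = rng.standard_normal((hidden, out_dim)) * 0.1
    b2 = np.zeros(out_dim)
    def mlp(x):
        h = np.maximum(x @ W1 + b1, 0.0)  # ReLU hidden layer
        return h @ W2 + b2
    return mlp

human_field = make_mlp()   # queried in the canonical human space
scene_field = make_mlp()   # queried directly in the shared world space

def warp_to_canonical(x_world):
    """Placeholder for the learned human deformation (e.g. skinning);
    identity here purely for illustration."""
    return x_world

def composite_density(x_world):
    """Query both fields at the same world points; non-negative
    densities from human and scene add in the shared 3D space."""
    h = human_field(warp_to_canonical(x_world))
    s = scene_field(x_world)
    return np.exp(h[:, 0]) + np.exp(s[:, 0])

pts = rng.standard_normal((4, 3))
density = composite_density(pts)  # one positive density per query point
```

In the actual framework these fields would be optimized jointly, with the global optimization additionally penalizing human-scene interpenetration; the sketch only illustrates the shared-space composition.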

Video

Dataset

Please fill out the data request form to access the dataset containing human subjects.

BibTeX

@inproceedings{xue2024hsr,
      author={Xue, Lixin and Guo, Chen and Zheng, Chengwei and Wang, Fangjinhua and Jiang, Tianjian and Ho, Hsuan-I and Kaufmann, Manuel and Song, Jie and Hilliges, Otmar},
      title={{HSR:} Holistic 3D Human-Scene Reconstruction from Monocular Videos},
      booktitle={European Conference on Computer Vision (ECCV)},
      year={2024}
  }