Abstract
We introduce ViSER, a method for recovering articulated 3D shapes and dense 3D trajectories from monocular videos.
Previous work on high-quality reconstruction of dynamic 3D shapes typically relies on multiple camera views,
strong category-specific priors, or 2D keypoint supervision.
We show that none of these are required if one can reliably estimate long-range 2D point correspondences, using
only 2D object masks and two-frame optical flow as inputs.
ViSER infers correspondences by matching 2D pixels to a canonical, deformable 3D mesh via video-specific
surface embeddings that capture the pixel appearance of each surface point.
These embeddings behave as a continuous set of keypoint descriptors defined over the mesh surface, which can be
used to establish dense long-range correspondences across pixels.
The surface embeddings are implemented via coordinate-based MLPs that are fit to each video via contrastive
reconstruction losses.
Experimental results show that ViSER compares favorably against prior work on challenging videos of humans with
loose clothing and unusual poses, as well as animal videos from DAVIS and YTVOS.
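To make the surface-embedding idea concrete, below is a minimal sketch (not the authors' code) of a coordinate-based MLP that maps canonical 3D surface points to unit-norm descriptors, plus a softmax step that softly matches per-pixel descriptors to surface vertices. It assumes PyTorch; the class and function names, network sizes, and temperature value are illustrative assumptions, not details from the paper.
```python
# Minimal sketch, assuming PyTorch: coordinate-based MLP surface embeddings
# and soft pixel-to-surface matching. All names and sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SurfaceEmbeddingMLP(nn.Module):
    """Maps canonical 3D surface coordinates to D-dimensional descriptors."""
    def __init__(self, embed_dim=16, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, embed_dim),
        )

    def forward(self, xyz):
        # xyz: (V, 3) points on the canonical mesh surface
        return F.normalize(self.net(xyz), dim=-1)  # (V, D) unit-norm embeddings

def pixel_to_surface_match(pixel_feats, surf_embeds, temperature=0.1):
    """Soft assignment of each pixel descriptor to surface vertices.

    pixel_feats: (N, D) per-pixel descriptors (e.g., from an image encoder)
    surf_embeds: (V, D) per-vertex surface embeddings
    returns:     (N, V) matching probabilities
    """
    logits = pixel_feats @ surf_embeds.t() / temperature
    return logits.softmax(dim=-1)

# Toy usage: match 4 random pixel descriptors against 100 surface samples.
mlp = SurfaceEmbeddingMLP()
verts = torch.rand(100, 3)                       # canonical surface samples
surf_embeds = mlp(verts)
pixel_feats = F.normalize(torch.randn(4, 16), dim=-1)
probs = pixel_to_surface_match(pixel_feats, surf_embeds)
print(probs.shape)                               # torch.Size([4, 100])
```
In the paper, such embeddings are fit per video with contrastive and reconstruction losses; the sketch only shows the matching mechanism, not the training procedure.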
Bibtex
@inproceedings{yang2021viser,
  title={ViSER: Video-Specific Surface Embeddings for Articulated 3D Shape Reconstruction},
  author={Yang, Gengshan and Sun, Deqing and Jampani, Varun and Vlasic, Daniel and Cole, Forrester and Liu, Ce and Ramanan, Deva},
  booktitle={NeurIPS},
  year={2021}
}
Acknowledgments
This work was supported by Google Cloud Platform (GCP) awards received from Google and the CMU Argo AI Center for
Autonomous Vehicle Research. We thank William T. Freeman and many others from CMU and Google for providing
valuable feedback.