ViSER: Video-Specific Surface Embeddings for Articulated 3D Shape Reconstruction

NeurIPS 2021

Gengshan Yang1 Deqing Sun2 Varun Jampani2 Daniel Vlasic2 Forrester Cole2
Ce Liu4 Deva Ramanan1,3
1Carnegie Mellon University 2Google Research 3Argo AI 4Microsoft Azure AI


Given a long video or multiple short videos, we jointly learn articulated 3D shapes and a pixel-surface embedding that establishes dense correspondences across video frames. As a result, accurate shape, long-term trajectories, and meaningful part segmentations can be recovered without using a pre-defined shape template.

Abstract

We introduce ViSER, a method for recovering articulated 3D shapes and dense 3D trajectories from monocular videos. Previous work on high-quality reconstruction of dynamic 3D shapes typically relies on multiple camera views, strong category-specific priors, or 2D keypoint supervision. We show that none of these are required if one can reliably estimate long-range 2D point correspondences, making use of only 2D object masks and two-frame optical flow as inputs. ViSER infers correspondences by matching 2D pixels to a canonical, deformable 3D mesh via video-specific surface embeddings that capture the pixel appearance of each surface point. These embeddings behave as a continuous set of keypoint descriptors defined over the mesh surface, which can be used to establish dense long-range correspondences across pixels. The surface embeddings are implemented via coordinate-based MLPs that are fit to each video via contrastive reconstruction losses. Experimental results show that ViSER compares favorably against prior work on challenging videos of humans with loose clothing and unusual poses, as well as animal videos from DAVIS and YTVOS.
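As a rough illustration of the matching step described above, the sketch below embeds canonical surface points with a small coordinate-based MLP and soft-matches pixel embeddings to the surface via a temperature-scaled softmax, yielding a soft 3D correspondence per pixel. All sizes, weights, and the toy pixel embeddings are hypothetical placeholders (in the paper, pixel embeddings come from a learned image encoder and the MLP is fit with contrastive reconstruction losses); this is a minimal NumPy sketch, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_embed(xyz, W1, b1, W2, b2):
    # Coordinate-based MLP: maps 3D surface points to unit-norm embeddings.
    h = np.tanh(xyz @ W1 + b1)
    e = h @ W2 + b2
    return e / np.linalg.norm(e, axis=-1, keepdims=True)

# Hypothetical sizes: 500 surface samples, 16-dim embeddings.
n_surf, d_hidden, d_embed = 500, 32, 16
W1 = rng.normal(0, 0.5, (3, d_hidden)); b1 = np.zeros(d_hidden)
W2 = rng.normal(0, 0.5, (d_hidden, d_embed)); b2 = np.zeros(d_embed)

surf_xyz = rng.normal(0, 1, (n_surf, 3))        # canonical surface samples
surf_emb = mlp_embed(surf_xyz, W1, b1, W2, b2)  # (n_surf, d_embed)

# Toy pixel embeddings: in practice these are predicted from the image.
n_pix = 200
pix_emb = surf_emb[rng.integers(0, n_surf, n_pix)]

# Softmax matching over the surface gives each pixel a probability
# distribution over canonical surface points (the dense correspondence).
tau = 0.1                                       # softmax temperature
logits = pix_emb @ surf_emb.T / tau             # (n_pix, n_surf)
prob = np.exp(logits - logits.max(axis=1, keepdims=True))
prob /= prob.sum(axis=1, keepdims=True)

# Expected 3D surface location for each pixel (soft correspondence).
pix_xyz = prob @ surf_xyz                       # (n_pix, 3)
```

Chaining such pixel-to-surface matches across frames is what produces the long-range correspondences and 3D trajectories the abstract refers to.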

[Paper] [Code]

Bibtex

@inproceedings{yang2021viser, title={ViSER: Video-Specific Surface Embeddings for Articulated 3D Shape Reconstruction}, author={Yang, Gengshan and Sun, Deqing and Jampani, Varun and Vlasic, Daniel and Cole, Forrester and Liu, Ce and Ramanan, Deva}, booktitle = {NeurIPS}, year={2021} }

Short video

Long video

Results

Comparisons [More]

DAVIS-dance-twirl (90 frames). From left to right: reference video, results of LASR, results of VIBE+SMPLify, and results of ViSER. Top: reconstructed 3D shape. Bottom: reconstructed 3D shape at the 1st frame.

Results on athletic human [More]

DAVIS-breakdance-flare. Top left: reference video. Top middle: reconstructed nonrigid 3D shape. Top right: reconstructed 3D shape at the 1st frame. Bottom left: trajectory. Bottom middle: part segmentation. Bottom right: textured mesh.

Results on YTVOS-elephants [More]

Elephant-1. Top left: reference video. Top middle: reconstructed nonrigid 3D shape. Top right: reconstructed 3D shape at the 1st frame. Bottom left: trajectory. Bottom middle: part segmentation. Bottom right: textured mesh.

Related projects

LASR: Learning Articulated Shape Reconstruction from a Monocular Video. CVPR 2021.
Continuous Surface Embeddings. NeurIPS 2020.
DOVE: Learning Deformable 3D Objects by Watching Videos. arXiv preprint.
Self-supervised Single-view 3D Reconstruction via Semantic Consistency. ECCV 2020.
Shape and Viewpoints without Keypoints. ECCV 2020.
Articulation Aware Canonical Surface Mapping. CVPR 2020.
Learning Category-Specific Mesh Reconstruction from Image Collections. ECCV 2018.

Acknowledgments

This work was supported by Google Cloud Platform (GCP) awards received from Google and the CMU Argo AI Center for Autonomous Vehicle Research. We thank William T. Freeman and many others from CMU and Google for providing valuable feedback.

Webpage design borrowed from Peiyun Hu