ArtEq: Generalizing Neural Human Fitting to Unseen Poses
With Articulated SE(3) Equivariance
Haiwen Feng1, Peter Kulits1, Shichen Liu2, Michael J. Black1 and Victoria Abrevaya1
1 Max Planck Institute for Intelligent Systems, Tübingen, Germany
2 University of Southern California
TL;DR ArtEq (pron: Artique) is a carefully designed and principled method that extends SE(3) equivariance to articulated structures, enabling the direct regression of SMPL parameters from a 3D point cloud. It (1) generalizes zero-shot to unseen poses (45~60% better V2V & MPJPE), (2) is 1000x faster at inference than competing methods, and (3) uses 30x fewer network parameters than the SOTA.
Abstract
We address the problem of fitting a parametric human body model (SMPL) to point cloud data. Optimization-based methods require careful initialization and are prone to becoming trapped in local optima. Learning-based methods address this but do not generalize well when the input pose is far from those seen during training. For rigid point clouds, remarkable generalization has been achieved by leveraging SE(3)-equivariant networks, but these methods do not work on articulated objects. In this work we extend this idea to human bodies and propose ArtEq, a novel part-based SE(3)-equivariant neural architecture for SMPL model estimation from point clouds. Specifically, we learn a part detection network by leveraging local SO(3) invariance, and regress shape and pose using articulated SE(3) shape-invariant and pose-equivariant networks, all trained end-to-end. Our novel equivariant pose regression module leverages the permutation-equivariant property of self-attention layers to preserve rotational equivariance. Experimental results show that ArtEq can generalize to poses not seen during training, outperforming state-of-the-art methods by 45~60%, without requiring an optimization refinement step. Further, compared with competing works, our method is more than three orders of magnitude faster during inference and has 97.3% fewer parameters.
The code and model will be released for research purposes at our official repo.
Overview
We first obtain point-wise equivariant features using a small equivariant point network, which outputs a C-dimensional feature vector per point and per group element (e.g., the 60 rotational symmetries of the icosahedron). We then convert these into point-wise invariant features by pooling over the rotation group, from which we obtain a part segmentation of the point cloud. Using this segmentation, we softly aggregate the point-wise features into part-based equivariant features, which a self-attention layer processes efficiently while preserving equivariance. We cast pose regression as a weight-prediction task: the network predicts the weights of a weighted average over the rotation-group elements. Finally, we transform the part-based features into invariant ones to estimate the shape.
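The pooling step above relies on a standard property of group-equivariant features: rotating the input by a group element only permutes the feature channels along the group axis, so any symmetric pooling over that axis is invariant. A minimal toy sketch of this idea (illustration only, not the ArtEq implementation; the group size and channel count are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
G = 60   # e.g. the 60 rotational symmetries of the icosahedron
C = 16   # hypothetical number of feature channels per group element

# Per-point equivariant features f(g), one C-dim vector per group element.
feats = rng.normal(size=(G, C))

# Acting with a group element h maps f(g) -> f(h^-1 g), i.e. it permutes
# the group axis. We model that action with an arbitrary permutation.
perm = rng.permutation(G)
feats_rotated = feats[perm]

# Pooling over the group axis is symmetric under permutations,
# so the pooled feature is invariant to the input rotation.
inv = feats.max(axis=0)
inv_rotated = feats_rotated.max(axis=0)

assert np.allclose(inv, inv_rotated)
```

Because the pooled feature is the same no matter how the input is rotated, a part-segmentation head built on it labels points consistently across poses.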
Please refer to our arXiv paper for more details :)
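The pose-regression step above can be sketched as follows. This is an assumed reconstruction, not the paper's code: the network's predicted per-group-element scores become softmax weights, the part rotation is the weighted average of the group's rotation matrices, and an SVD projects the (generally non-orthogonal) average back onto SO(3). The random stand-in group and weights are hypothetical.

```python
import numpy as np

def random_rotation(rng):
    # QR-based sampling of a rotation matrix (illustration only).
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] *= -1.0
    return q

def project_to_so3(m):
    # Closest rotation matrix to m in the Frobenius-norm sense.
    u, _, vt = np.linalg.svd(m)
    d = np.sign(np.linalg.det(u @ vt))
    return u @ np.diag([1.0, 1.0, d]) @ vt

rng = np.random.default_rng(0)
group = [random_rotation(rng) for _ in range(60)]  # stand-in for the icosahedral group
logits = rng.normal(size=60)                       # hypothetical predicted scores
w = np.exp(logits) / np.exp(logits).sum()          # softmax weights

avg = sum(wi * Ri for wi, Ri in zip(w, group))     # weighted average (not in SO(3))
R = project_to_so3(avg)                            # final rotation estimate

assert np.allclose(R @ R.T, np.eye(3), atol=1e-8)
assert np.isclose(np.linalg.det(R), 1.0)
```

Predicting weights over a fixed set of rotations keeps the output tied to the equivariant feature channels, rather than regressing rotation parameters directly.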
Qualitative comparison