PreF3R: Pose-Free Feed-Forward 3D Gaussian Splatting from Variable-length Image Sequence

Abstract

We present PreF3R, Pose-Free Feed-forward 3D Reconstruction from an image sequence of variable length. Unlike previous approaches, PreF3R removes the need for camera calibration and reconstructs the 3D Gaussian field within a canonical coordinate frame directly from a sequence of unposed images, enabling efficient novel-view rendering.

We leverage DUSt3R's capability for pairwise 3D structure reconstruction and extend it to sequential multi-view input via a spatial memory network, eliminating the need for optimization-based global alignment. Additionally, PreF3R incorporates a dense Gaussian parameter prediction head, which enables subsequent novel-view synthesis with differentiable rasterization. This allows us to supervise the model with a combination of a photometric loss and a pointmap regression loss, enhancing both photorealism and structural accuracy.
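
As a rough illustration of the combined supervision described above, the sketch below adds a photometric term on the rendered image to a regression term on the predicted pointmap. The specific choices here (mean squared error for the photometric term, mean Euclidean distance for the pointmap term, and the weights `lambda_photo` and `lambda_point`) are assumptions made for illustration; this page does not state the exact loss formulation.

```python
import numpy as np


def photometric_loss(rendered, target):
    """Mean squared error between a rendered image and its ground truth.

    MSE is an assumed stand-in for the photometric term.
    """
    return float(np.mean((rendered - target) ** 2))


def pointmap_regression_loss(pred_points, gt_points, valid_mask):
    """Mean Euclidean distance between predicted and ground-truth 3D points.

    pred_points, gt_points: (H, W, 3) arrays of per-pixel 3D points.
    valid_mask: (H, W) boolean array marking pixels with ground truth.
    """
    if not valid_mask.any():
        return 0.0  # no supervised pixels in this frame
    dist = np.linalg.norm(pred_points - gt_points, axis=-1)  # (H, W)
    return float(dist[valid_mask].mean())


def combined_loss(rendered, target, pred_points, gt_points, valid_mask,
                  lambda_photo=1.0, lambda_point=1.0):
    """Weighted sum of the two terms; the weights are hypothetical."""
    return (lambda_photo * photometric_loss(rendered, target)
            + lambda_point * pointmap_regression_loss(
                pred_points, gt_points, valid_mask))
```

The photometric term depends on the rendered image, so it can only be back-propagated because the rasterizer is differentiable; the pointmap term supervises geometry directly.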

Given a sequence of ordered images, PreF3R incrementally reconstructs the 3D Gaussian field at 20 FPS, thereby enabling real-time novel-view rendering. Experiments demonstrate that PreF3R is an effective solution for the challenging task of pose-free feed-forward novel-view synthesis, while also exhibiting robust generalization to unseen scenes.

Examples

Left: Frames rendered by PreF3R. Right: Ground-truth video.

Method Overview

PreF3R's overall architecture. An ordered set of unposed images is fed into PreF3R sequentially. At time step t, the input frame is first encoded by a ViT encoder and then decoded into the query feature by the Target Decoder. The Target Decoder is intertwined with the Reference Decoder through cross-attention. Simultaneously, the query feature of the previous frame queries the memory bank to produce the fused feature, which the Reference Decoder decodes into the output feature. The output feature is then processed by the Gaussian Head and the Point Head to produce pixel-aligned Gaussian primitives. The output from each frame is accumulated into global Gaussian primitives, enabling fast novel-view synthesis through rasterization.
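
The memory query step in the overview can be pictured as an attention readout: query tokens attend over stored memory keys, and the weighted memory values are fused back into the query. The sketch below is a minimal single-head version with no learned projections and a residual fusion. All of these are simplifying assumptions, and the function names `read_memory` and `write_memory` are hypothetical, not taken from the PreF3R code.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def read_memory(query, mem_keys, mem_values):
    """Fuse query tokens with a memory bank via scaled dot-product attention.

    query:      (N, D) query tokens for one frame.
    mem_keys:   (M, D) keys stored from earlier frames.
    mem_values: (M, D) values stored from earlier frames.
    Returns the fused feature of shape (N, D).
    """
    if mem_keys.shape[0] == 0:
        return query.copy()  # empty memory: nothing to fuse
    scores = query @ mem_keys.T / np.sqrt(query.shape[-1])  # (N, M)
    weights = softmax(scores, axis=-1)                      # rows sum to 1
    return query + weights @ mem_values                     # residual fusion


def write_memory(mem_keys, mem_values, new_keys, new_values):
    """Append one frame's keys and values to the memory bank."""
    return (np.concatenate([mem_keys, new_keys], axis=0),
            np.concatenate([mem_values, new_values], axis=0))
```

In this simplified form the memory grows by one frame's tokens per step, so the cost of each readout scales with the number of frames seen so far; a practical memory network would typically bound or prune the bank.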

Novel-view synthesis

PreF3R operates at 20 FPS on a single H100 GPU, enabling real-time novel-view synthesis from numerous input images through differentiable rasterization.

Incremental reconstruction

PreF3R performs incremental Gaussian reconstruction in real time. Left: in-domain scene reconstruction from ScanNet++. Right: out-of-domain scene reconstruction from Tanks and Temples.
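
Because every frame's Gaussians are predicted in the same canonical coordinate frame, incremental reconstruction can be sketched as flattening each frame's pixel-aligned predictions and appending them to a growing global set, with no per-frame alignment step. The parameter layout below (means, opacity, scale, rotation quaternion, color) follows the usual 3D Gaussian Splatting parameterization and is an assumption; the helper names are hypothetical.

```python
import numpy as np


def frame_to_gaussians(pointmap, opacity, scale, rotation, color):
    """Flatten per-pixel predictions into one Gaussian per pixel.

    pointmap: (H, W, 3) Gaussian centers in the canonical frame.
    opacity:  (H, W, 1), scale: (H, W, 3),
    rotation: (H, W, 4) quaternions, color: (H, W, 3).
    """
    h, w, _ = pointmap.shape
    n = h * w
    return {
        "means": pointmap.reshape(n, 3),
        "opacity": opacity.reshape(n, 1),
        "scale": scale.reshape(n, 3),
        "rotation": rotation.reshape(n, 4),
        "color": color.reshape(n, 3),
    }


def accumulate(global_gaussians, frame_gaussians):
    """Append one frame's Gaussians to the global set.

    Pass None as global_gaussians for the first frame.
    """
    if global_gaussians is None:
        return {k: v.copy() for k, v in frame_gaussians.items()}
    return {
        k: np.concatenate([global_gaussians[k], frame_gaussians[k]], axis=0)
        for k in global_gaussians
    }
```

Calling `accumulate` once per incoming frame yields a set of H x W x T primitives after T frames, which a rasterizer can then render from any novel viewpoint.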

BibTeX

@misc{chen2024pref3rposefreefeedforward3d,
      title={PreF3R: Pose-Free Feed-Forward 3D Gaussian Splatting from Variable-length Image Sequence}, 
      author={Zequn Chen and Jiezhi Yang and Heng Yang},
      year={2024},
      eprint={2411.16877},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.16877}, 
}