VideoSPatS: Video SPatiotemporal Splines for Disentangled Occlusion, Appearance and Motion Modeling and Editing

Juan Luis Gonzalez1, Xu Yao1, Alex Whelan1, Kyle Olszewski1, Hyeongwoo Kim2, Pablo Garrido1

1Flawless AI   2Imperial College London

CVPR 2025

Fig. 1-a. Our VideoSPatS implicit video representation. From left to right: Canonical Foreground, Canonical Background, Foreground, Background, Composited, GT (Ground Truth).

Fig. 1-b. Editing with our VideoSPatS. From left to right: Edited canonical spaces, Propagated Foreground and Background, and Composited.

Fig. 2. Our Video SPatiotemporal Spline model, referred to as VideoSPatS, disentangles occlusions, motion, and appearance into editable layers. Using independent branches for the regions of interest, it learns Neural Spline Fields (Sec. 3.1) separating the foreground f (top) and background b (bottom) into editable, canonical representations.
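For intuition, the layer separation above amounts to alpha compositing the two branches' outputs per pixel. Below is a simplified sketch of that compositing step, assuming the foreground branch predicts an opacity alongside its color; the function and tensor names are illustrative and not our exact formulation.

```python
# Minimal sketch of the two-branch layer separation suggested by Fig. 2
# (assumed per-pixel alpha compositing; names and shapes are illustrative).
import torch


def composite(rgb_fg: torch.Tensor,    # (N, 3) foreground branch color
              alpha_fg: torch.Tensor,  # (N, 1) foreground opacity in [0, 1]
              rgb_bg: torch.Tensor     # (N, 3) background branch color
              ) -> torch.Tensor:
    """Blend the foreground layer over the background layer per pixel."""
    return alpha_fg * rgb_fg + (1.0 - alpha_fg) * rgb_bg
```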

Abstract

We present an implicit video representation for disentangling occlusions, appearance, and motion from monocular videos, which we call Video SPatiotemporal Splines (VideoSPatS). Unlike previous methods that map time and coordinates to deformation and canonical colors, our VideoSPatS maps input coordinates into Spatial and Color Spline deformation fields, Ds and Dc, which disentangle motion and appearance in videos. With its spline-based parametrization, our method naturally generates temporally consistent flow and guarantees long-term temporal consistency, which is crucial for convincing video editing. Using multiple prediction branches, our VideoSPatS model also performs layer separation between the latent video and the selected occluder. By disentangling occlusions, appearance, and motion, our method enables better spatiotemporal modeling and editing of diverse videos, including in-the-wild talking-head videos with challenging occlusions, shadows, and specularities, while maintaining an appropriate canonical space for editing. We also present general video modeling results on the DAVIS and CoDeF datasets, as well as on our own talking-head video dataset collected from open-source web videos. Extensive ablations show that the combination of Ds and Dc under neural splines can overcome motion and appearance ambiguities, paving the way for more advanced video editing models.
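To make the spline parametrization concrete, the sketch below illustrates one plausible form of a spatial spline deformation field Ds: a small MLP predicts, for each pixel coordinate, the control points of a temporal spline, and evaluating that spline at time t yields a smooth offset into the canonical space. The network sizes, the Catmull-Rom interpolant, and all names are simplifications for illustration only, not our exact architecture (see Sec. 3.1).

```python
# Simplified sketch of a spatial spline deformation field D_s:
# an MLP maps a pixel coordinate to K temporal control points, and a
# C1-continuous spline evaluated at time t gives a smooth 2D offset.
import torch
import torch.nn as nn


def eval_catmull_rom(ctrl: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Evaluate a uniform Catmull-Rom spline.

    ctrl: (N, K, D) control points, assumed uniformly spaced over t in [0, 1].
    t:    (N,) query times in [0, 1].
    Returns (N, D) spline values, C1-continuous in t (temporally smooth).
    """
    N, K, D = ctrl.shape
    # Central-difference tangents, one-sided at the endpoints.
    pad = torch.cat([ctrl[:, :1], ctrl, ctrl[:, -1:]], dim=1)   # (N, K+2, D)
    tangents = 0.5 * (pad[:, 2:] - pad[:, :-2])                 # (N, K, D)

    # Locate the segment [i, i+1] containing each query time.
    seg = (t.clamp(0, 1) * (K - 1)).clamp(max=K - 1 - 1e-6)
    i = seg.floor().long()                                      # (N,)
    u = (seg - i.float()).unsqueeze(-1)                         # (N, 1)

    idx = torch.arange(N, device=ctrl.device)
    p0, p1 = ctrl[idx, i], ctrl[idx, i + 1]
    m0, m1 = tangents[idx, i], tangents[idx, i + 1]

    u2, u3 = u * u, u * u * u                                   # Hermite basis
    return ((2 * u3 - 3 * u2 + 1) * p0 + (u3 - 2 * u2 + u) * m0
            + (-2 * u3 + 3 * u2) * p1 + (u3 - u2) * m1)


class SplineDeformationField(nn.Module):
    """Maps a pixel coordinate to K temporal control points of a 2D offset."""

    def __init__(self, num_ctrl: int = 16, hidden: int = 128):
        super().__init__()
        self.num_ctrl = num_ctrl
        self.mlp = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_ctrl * 2),
        )

    def forward(self, xy: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        ctrl = self.mlp(xy).view(-1, self.num_ctrl, 2)  # (N, K, 2) control pts
        offset = eval_catmull_rom(ctrl, t)              # (N, 2), smooth in t
        return xy + offset                              # canonical coordinates
```

The color spline field Dc can be sketched analogously, with the MLP predicting control points of a per-pixel color offset instead of a 2D displacement, so that time-dependent appearance (e.g., shadows and specularities) is absorbed by Dc rather than baked into the canonical image.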

Results

Editing with Time-Dependent Appearance

Original Learned Implicit Video Representation

Fig. 3. Editing with our VideoSPatS. From left to right: Propagated Edited Foreground, Edited Foreground Canonical Space, Background Canonical Space, Background, Composited. Time-dependent appearance is preserved after editing.

Video Reconstruction Results

Fig. 4. Reconstruction results. From left to right: GT, Layered Atlases, Deformable Sprites, CoDeF, and our VideoSPatS.



Fig. 5. Reconstruction and editing results by our VideoSPatS on the CoDeF dataset with our temporally varying appearance (see third column). Reconstruction samples, from left to right: Canonical Foreground, Canonical Background, Foreground, Background, Composited, GT. Editing samples, from left to right: Edited canonical spaces, Propagated Foreground and Background, and Composited.

Editing Results

Fig. 6. From left to right: CoDeF, our VideoSPatS, and the reference video.

Motion Editing Results

Motion Editing Results (cont.)

Fig. 7. From left to right: Foreground Canonical, Background Canonical, Edited Foreground, Background, Composited, Reference. By modifying the precomputed control points, we can perform motion editing. Thanks to our spline deformation fields, the foreground does not instantly "teleport" to the offset location in the rendered frames; instead, the motion-edited video transitions smoothly, without discontinuities.
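As a rough illustration of this edit, the snippet below shifts a span of the foreground's precomputed control points with a smoothstep ramp; because the per-frame trajectory is interpolated from these control points, the offset blends in gradually rather than appearing as a jump. Shapes and names are assumptions for illustration only, not our editing interface.

```python
# Illustrative control-point motion edit: rather than shifting rendered
# frames, add a displacement to the spline control points with a smooth ramp.
import torch


def offset_control_points(ctrl: torch.Tensor,    # (K, 2) spline control points
                          offset: torch.Tensor,  # (2,)   target displacement
                          start: int,            # first edited control point
                          end: int               # point where offset is full
                          ) -> torch.Tensor:
    k = torch.arange(ctrl.shape[0], dtype=ctrl.dtype)
    s = ((k - start) / max(end - start, 1)).clamp(0.0, 1.0)
    w = 3 * s**2 - 2 * s**3                      # smoothstep ramp from 0 to 1
    return ctrl + w.unsqueeze(-1) * offset       # (K, 2) edited control points
```

Evaluating the edited control points with a C1-continuous spline (e.g., the Catmull-Rom evaluator sketched earlier) then yields a foreground path with no discontinuities between frames.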

Long Video Reconstructions

Fig. 8. Results on longer video sequences (10 seconds, 240 frames).

Failure Cases

Fig. 9. Failure cases. Top row: the dynamic region in the background image (the face region) is too small. Bottom row: overly large foreground motion and self-occlusion (opposite sides of the brush) cause a double brush to appear in the foreground canonical space.