Lift4D: Harmonizing Single-View 3D Estimation for 4D Reconstruction In-the-Wild

Core Objective: Lift4D aims to recover the complete geometry, visual appearance, and temporal deformation of dynamic objects from a single, unconstrained monocular video—even for parts of the object that the camera never actually sees.

🧩 The Challenge of 4D Reconstruction

Reconstructing dynamic 3D objects from a single video is difficult because it requires a delicate balance between direct visual evidence and learned data priors. Current methodologies generally fall into two flawed categories:

Approach	Mechanism	Primary Weakness
Direct Prediction	Predicts 3D representations per-frame from pixels.	Limited by a lack of diverse 4D training datasets.
Deformable Refinement	Initializes a 3D shape, then warps it based on video.	Priors are only used at the start; fails during heavy occlusion or extreme motion.

🛠️ The Lift4D Methodology

Lift4D introduces a test-time optimization framework designed to bridge the gap between per-frame estimation and global 4D consistency.

1. Temporally Consistent Initialization

The process begins with an Image-to-3D DiT (Diffusion Transformer). To prevent the "jitter" common in per-frame predictions, Lift4D employs causal latent propagation.

The Process: For each frame $s$, the 3D latent is initialized by blending fresh noise with the latent already denoised for the preceding frame.
The Output: The resulting latents are decoded into an independent set of Gaussian splats for each frame.
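
The propagation step can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the mixing weight `blend`, the variance-preserving form of the blend, and the `denoise` callback (a stand-in for the Image-to-3D DiT) are all assumptions made for illustration.

```python
import numpy as np

def init_latent(prev_denoised, rng, blend=0.7):
    """Blend the previous frame's denoised latent with fresh Gaussian noise.

    `blend` is a hypothetical weight in [0, 1]: 0 gives pure noise
    (independent per-frame sampling), 1 copies the previous latent.
    """
    if not 0.0 <= blend <= 1.0:
        raise ValueError("blend must lie in [0, 1]")
    noise = rng.standard_normal(prev_denoised.shape)
    # Assumed variance-preserving mix: the squared weights sum to 1.
    return np.sqrt(blend) * prev_denoised + np.sqrt(1.0 - blend) * noise

def propagate(num_frames, latent_shape, denoise, seed=0, blend=0.7):
    """Run a causal pass: each frame starts from the previous frame's result."""
    rng = np.random.default_rng(seed)
    latents, prev = [], None
    for s in range(num_frames):
        if prev is None:
            z_init = rng.standard_normal(latent_shape)  # first frame: pure noise
        else:
            z_init = init_latent(prev, rng, blend)
        # `denoise` is a placeholder for the diffusion model conditioned on frame s.
        prev = denoise(z_init, s)
        latents.append(prev)
    return latents

# Toy usage with a dummy denoiser that simply shrinks the latent.
latents = propagate(num_frames=3, latent_shape=(4, 8), denoise=lambda z, s: 0.5 * z)
```

Because each frame's starting point depends only on earlier frames, the pass is causal; a larger `blend` ties consecutive latents together more strongly at the cost of sample diversity.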

2. The 4D Representation

Instead of treating frames as isolated events, Lift4D consolidates them into a unified 4D model:

Canonical Gaussians: A base 3D representation.
Deformation Nodes: Two sets of sparse nodes that animate the canonical shape.
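
To make the role of sparse deformation nodes concrete, here is a hedged sketch of node-driven deformation applied to Gaussian centers. The summary does not specify how nodes influence the Gaussians, so the Gaussian radial-basis weighting, the translation-only motion model, and the `sigma` parameter are assumptions; the same function could be applied to either of the two node sets.

```python
import numpy as np

def skinning_weights(points, nodes, sigma=0.5):
    """Return (N, K) weights tying N points to K nodes; each row sums to 1."""
    d2 = ((points[:, None, :] - nodes[None, :, :]) ** 2).sum(axis=-1)
    logits = -d2 / (2.0 * sigma**2)
    # Subtract the row maximum so exp() never underflows to an all-zero row.
    logits -= logits.max(axis=1, keepdims=True)
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

def deform(points, nodes, node_translations, sigma=0.5):
    """Move canonical points by a weighted blend of per-node translations."""
    w = skinning_weights(points, nodes, sigma)
    return points + w @ node_translations

# Toy usage: 10 canonical Gaussian centers driven by 3 nodes for one frame.
rng = np.random.default_rng(0)
centers = rng.standard_normal((10, 3))
nodes = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
translations = rng.standard_normal((3, 3)) * 0.1
deformed = deform(centers, nodes, translations)
```

Only the per-frame node translations change over time in this sketch, so the number of time-varying parameters scales with the number of nodes rather than the number of Gaussians.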

3. The Optimization Pipeline

The framework uses a dual-loss system to "sculpt" the final object:

The Loss Functions:

Geometry Loss ($\mathcal{L}_{rec}$): Fits the first set of deformation nodes to the per-frame predicted geometry.
Appearance Loss ($\mathcal{L}_{app}$): Refines colors and a second set of "fine appearance" nodes.
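
The exact form of the two losses is not given in this summary. As one plausible instantiation, the sketch below uses a symmetric Chamfer distance for the geometry term and a mean squared photometric error for the appearance term. Both choices, and the function names, are assumptions rather than the paper's definitions.

```python
import numpy as np

def chamfer(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    # Nearest-neighbor squared distance in both directions, averaged per set.
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

def photometric(rendered, observed):
    """Mean squared error between a rendered image and the video frame."""
    return float(np.mean((rendered - observed) ** 2))

def losses(deformed_centers, predicted_centers, rendered, observed):
    """Return the two terms separately (stand-ins for L_rec and L_app)."""
    l_rec = chamfer(deformed_centers, predicted_centers)
    l_app = photometric(rendered, observed)
    return l_rec, l_app

# Toy usage: one point shifted by one unit, and an exactly matching image.
a = np.array([[0.0, 0.0, 0.0]])
b = np.array([[1.0, 0.0, 0.0]])
img = np.zeros((4, 4, 3))
l_rec, l_app = losses(a, b, img, img)
```

The terms are returned separately because the summary does not say whether they are summed with weights or optimized in stages.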

The "Hallucination" Trick: To fill in the blanks (occluded areas), the system:

Renders the 4D model from random novel views.
Adds noise to these renders.
Uses a novel-view diffusion prior to denoise them, conditioned on frames where occlusions have been inpainted using the per-frame 3D data.
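
The three steps above can be sketched as a single function. This is an illustrative outline only: `render_fn` and `denoise_fn` are hypothetical callbacks standing in for the 4D renderer and the novel-view diffusion prior, the camera sampling ranges and the noise level `alpha_bar` are arbitrary, and using the denoised image as an L2 target is an assumption about how the prior supervises the model.

```python
import numpy as np

def add_noise(image, alpha_bar, rng):
    """Standard diffusion forward step: sqrt(a) * x0 + sqrt(1 - a) * eps."""
    eps = rng.standard_normal(image.shape)
    return np.sqrt(alpha_bar) * image + np.sqrt(1.0 - alpha_bar) * eps

def hallucination_step(render_fn, denoise_fn, cond_frames, rng, alpha_bar=0.5):
    """Render a random novel view, noise it, denoise it, and score the render."""
    # 1. Sample a random camera (ranges are placeholders) and render the model.
    azimuth = rng.uniform(0.0, 360.0)
    elevation = rng.uniform(-30.0, 30.0)
    render = render_fn(azimuth, elevation)
    # 2. Perturb the render with Gaussian noise.
    noisy = add_noise(render, alpha_bar, rng)
    # 3. Denoise with the prior, conditioned on the inpainted frames.
    target = denoise_fn(noisy, cond_frames, alpha_bar)
    # Assumed supervision: pull the render toward the denoised target.
    loss = float(np.mean((render - target) ** 2))
    return target, loss

# Toy usage with a constant render and an identity "denoiser".
rng = np.random.default_rng(0)
image = np.full((4, 4, 3), 0.5)
target, loss = hallucination_step(
    render_fn=lambda az, el: image,
    denoise_fn=lambda x, cond, a: x,
    cond_frames=None,
    rng=rng,
    alpha_bar=1.0,  # no noise is added at alpha_bar = 1
)
```

In a real pipeline the loss would be backpropagated through a differentiable renderer into the Gaussians and deformation nodes; that part is omitted here.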

🚀 Results & Performance

The authors report that Lift4D outperforms previous baselines. It is particularly effective at:

Maintaining temporal coherence.
Producing sharper textures.
Handling non-rigid motion and severe occlusions.
Successfully hallucinating unobserved regions.

📚 Reference

@article{litman2026lift4d, 
 author = {Litman, Yehonathan and Ma, Xiaoxuan and Shah, Manan and Ugrinovic, Nicol\'{a}s and Kitani, Kris and De la Torre, Fernando and Tulsiani, Shubham}, 
 title = {Lift4D: Harmonizing Single-View 3D Estimation for 4D Reconstruction In-the-Wild}, 
 journal = {arXiv preprint arXiv:2606.23688}, 
 year = {2026}, 
}