Lift4D: Harmonizing Single-View 3D Estimation for 4D Reconstruction In-the-Wild
Core Objective: Lift4D aims to recover the complete geometry, visual appearance, and temporal deformation of dynamic objects from a single, unconstrained monocular video, including parts of the object that the camera never sees.
🧩 The Challenge of 4D Reconstruction
Reconstructing dynamic 3D objects from a single video is difficult because it requires a delicate balance between direct visual evidence and learned data priors. Current methodologies generally fall into two flawed categories:
| Approach | Mechanism | Primary Weakness |
|---|---|---|
| Direct Prediction | Predicts 3D representations per-frame from pixels. | Temporal "jitter" across the per-frame predictions. |
| Deformable Refinement | Initializes a 3D shape, then warps it based on video. | |
🛠️ The Lift4D Methodology
Lift4D introduces a test-time optimization framework designed to bridge the gap between per-frame estimation and global 4D consistency.
1. Temporally Consistent Initialization
The process begins with an Image-to-3D DiT (Diffusion Transformer). To prevent the "jitter" common in per-frame predictions, Lift4D employs causal latent propagation.
- The Process: For each frame after the first, the 3D latent is initialized by blending fresh noise with the denoised latent of the previous frame.
- The Output: The denoised latents are decoded into an independent set of Gaussian splats for each frame.
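The propagation step above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `propagate_latents`, the mixing weight `blend`, the variance-preserving blend formula, and the `denoise` callable (a stand-in for the Image-to-3D DiT sampler) are all assumptions made for the example.

```python
import numpy as np

def propagate_latents(num_frames, shape, denoise, blend=0.5, seed=0):
    """Causally initialize each frame's latent from the previous result.

    `denoise` is a hypothetical stand-in for the diffusion sampler.
    `blend` is an assumed mixing weight: 1.0 starts every frame from
    pure noise, 0.0 reuses the previous denoised latent unchanged.
    """
    rng = np.random.default_rng(seed)
    latents = []
    prev = None
    for _ in range(num_frames):
        noise = rng.standard_normal(shape)
        if prev is None:
            # First frame: nothing to propagate, start from pure noise.
            start = noise
        else:
            # Assumed variance-preserving blend of new noise and the
            # previously denoised latent.
            start = np.sqrt(blend) * noise + np.sqrt(1.0 - blend) * prev
        prev = denoise(start)
        latents.append(prev)
    return latents

# Toy usage: a "denoiser" that simply shrinks its input.
frames = propagate_latents(4, (8,), denoise=lambda z: 0.5 * z)
```

A lower `blend` ties consecutive frames together more strongly, which is the intuition behind reducing jitter; the right value and schedule are not stated in this summary.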
2. The 4D Representation
Instead of treating frames as isolated events, Lift4D consolidates them into a unified 4D model:
- Canonical Gaussians: A base 3D representation.
- Deformation Nodes: Two sets of sparse nodes that animate the canonical shape, one fit to geometry and one used for fine appearance.
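To make the canonical-plus-nodes idea concrete, here is a rough sketch of how sparse nodes might move canonical Gaussian centers at one time step. It assumes translation-only nodes and Gaussian distance weights; `node_weights`, `deform_centers`, and `sigma` are hypothetical names, and the actual parameterization in Lift4D (rotations, node count, how the two node sets combine) is not specified in this summary.

```python
import numpy as np

def node_weights(centers, node_pos, sigma=0.5):
    """Soft assignment of each Gaussian center to nearby nodes.

    centers:  (N, 3) canonical Gaussian centers
    node_pos: (K, 3) node positions in canonical space
    Returns an (N, K) matrix whose rows sum to 1.
    """
    d2 = ((centers[:, None, :] - node_pos[None, :, :]) ** 2).sum(-1)
    # Subtract the row minimum so exp() cannot underflow to all zeros.
    d2 = d2 - d2.min(axis=1, keepdims=True)
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    return w / w.sum(axis=1, keepdims=True)

def deform_centers(centers, node_pos, node_offsets_t, sigma=0.5):
    """Move canonical centers by a weighted blend of node translations.

    node_offsets_t: (K, 3) per-node translation at a single time step
    (an assumed, translation-only deformation model).
    """
    w = node_weights(centers, node_pos, sigma)
    return centers + w @ node_offsets_t

# Toy usage: 3 Gaussians driven by 2 nodes, only the second node moves.
centers = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
nodes = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
offsets = np.array([[0.0, 0.0, 0.0], [0.0, 0.5, 0.0]])
moved = deform_centers(centers, nodes, offsets)
```

Because only the sparse node offsets change over time, the number of per-frame parameters stays small compared with moving every Gaussian independently.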
3. The Optimization Pipeline
The framework uses a dual-loss system to "sculpt" the final object:
The Loss Functions:
- Geometry Loss: Fits the first set of deformation nodes to the per-frame predicted geometry.
- Appearance Loss: Refines colors and a second set of "fine appearance" nodes.
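The exact form of the two losses is not given in this summary. As an illustration only, the sketch below uses a symmetric Chamfer distance as a stand-in geometry term and a mean absolute pixel difference as a stand-in appearance term; both choices, and the function names, are assumptions rather than the paper's definitions.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3).

    Assumed stand-in for a geometry term: it compares the deformed
    canonical centers with the per-frame predicted centers without
    requiring point-to-point correspondences.
    """
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

def photometric_l1(rendered, observed):
    """Mean absolute difference between a rendered and an observed image.

    Assumed stand-in for an appearance term.
    """
    return np.abs(rendered - observed).mean()

# Toy usage with random data.
rng = np.random.default_rng(0)
deformed = rng.standard_normal((100, 3))
predicted = deformed + 0.01 * rng.standard_normal((100, 3))
geometry_term = chamfer_distance(deformed, predicted)
appearance_term = photometric_l1(np.zeros((8, 8, 3)), np.full((8, 8, 3), 0.2))
```

In a real pipeline these terms would be written in an autodiff framework so gradients can flow to the node parameters and colors; NumPy is used here only to keep the example self-contained.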
The "Hallucination" Trick: To fill in the blanks (occluded areas), the system:
- Renders the 4D model from random novel views.
- Adds noise to these renders.
- Uses a novel-view diffusion prior to denoise them, conditioned on frames where occlusions have been inpainted using the per-frame 3D data.
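The three steps above can be sketched as a single function. Everything here is hypothetical scaffolding: `render` stands in for the 4D Gaussian renderer, `denoise` for the novel-view diffusion prior, `condition` for the inpainted conditioning frame, and the camera sampling ranges and noise formula are assumed for illustration.

```python
import numpy as np

def random_camera(rng, radius=2.0):
    """Sample a camera position on a sphere around the object (assumed)."""
    azimuth = rng.uniform(0.0, 2.0 * np.pi)
    elevation = rng.uniform(-np.pi / 6.0, np.pi / 3.0)
    return radius * np.array([
        np.cos(elevation) * np.cos(azimuth),
        np.cos(elevation) * np.sin(azimuth),
        np.sin(elevation),
    ])

def add_noise(image, noise_level, rng):
    """Assumed variance-preserving noising with noise_level in [0, 1]."""
    noise = rng.standard_normal(image.shape)
    return np.sqrt(1.0 - noise_level) * image + np.sqrt(noise_level) * noise

def novel_view_target(render, denoise, condition, rng, noise_level=0.5):
    """Render a random view, noise it, and let the prior clean it up.

    render(cam) -> image and denoise(noisy, cam, condition) -> image
    are hypothetical callables supplied by the caller.
    """
    cam = random_camera(rng)
    rendered = render(cam)
    noisy = add_noise(rendered, noise_level, rng)
    target = denoise(noisy, cam, condition)
    return cam, rendered, target

# Toy usage with stand-in callables.
rng = np.random.default_rng(0)
condition = np.full((4, 4, 3), 0.5)
cam, rendered, target = novel_view_target(
    render=lambda cam: np.zeros((4, 4, 3)),
    denoise=lambda noisy, cam, cond: cond,
    condition=condition,
    rng=rng,
)
```

One plausible use of `target` is as a pseudo ground truth for an image-space loss against `rendered`, so that unobserved regions are pulled toward what the prior considers plausible; how Lift4D applies it exactly is not described in this summary.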
🚀 Results & Performance
Compared with previous baselines, Lift4D is reported to be particularly effective at:
- Maintaining temporal coherence.
- Producing sharper textures.
- Handling non-rigid motion and severe occlusions.
- Successfully hallucinating unobserved regions.
📚 Reference
@article{litman2026lift4d,
author = {Litman, Yehonathan and Ma, Xiaoxuan and Shah, Manan and Ugrinovic, Nicol\'{a}s and Kitani, Kris and De la Torre, Fernando and Tulsiani, Shubham},
title = {Lift4D: Harmonizing Single-View 3D Estimation for 4D Reconstruction In-the-Wild},
journal = {arXiv preprint arXiv:2606.23688},
year = {2026},
}