Moebius: 0.2B image inpainting model with 10B-level performance
Moebius: Achieving 10B-Level Image Inpainting with a 0.2B Model
Authors: Kangsheng Duan, Ziyang Xu, Wenyu Liu, Xiaohu Ruan, Xiaoxin Chen, Xinggang Wang
Affiliations: Huazhong University of Science and Technology | VIVO AI Lab
Status: In submission
Abstract: While industrial foundation models with 10B+ parameters have redefined the limits of image inpainting, their massive computational requirements make real-world deployment nearly impossible. Training a task-specific specialist is a viable path, but extreme compression usually creates a "representation bottleneck." To solve this, we introduce Moebius, a lightweight framework that rebuilds the diffusion backbone around the Local-λ Mix Interaction (LMI) block. By condensing spatial and semantic information into fixed-size linear matrices, Moebius preserves complex interactions while sharply reducing the parameter count. This is paired with an adaptive multi-granularity distillation strategy in the latent space to ensure high-fidelity alignment. Moebius matches or beats the 11.9B FLUX.1-Fill-Dev model while using roughly 1.9% of its parameters (226M vs. 11.9B) and running faster at inference.
🚀 Core Highlights
- Extreme Parametric Efficiency: Operates with only 0.22B (226M) parameters, compared to the 11.9B of FLUX.1-Fill-Dev.
- Blazing Inference Speed: Latency of just 26.01 ms/step on a single GPU, for a faster total runtime than FLUX.1-Fill-Dev.
- Elite Generation Quality: Performs on par with or exceeds SOTA generalists (FLUX.1-Fill-Dev, SD3.5 Large-Inpainting) across 6 benchmarks, including Places2 (natural scenes) and CelebA-HQ/FFHQ (portraits).
- Architectural Innovation: Replaces the quadratic overhead of traditional attention with the LMI block.
- Smart Distillation: Uses a gradient-norm adaptive loss to bridge the gap between a "teacher" (PixelHacker) and the "student" (Moebius).
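As a quick sanity check on the efficiency figures above, the parameter ratio follows directly from the two reported counts. The sketch below is illustrative arithmetic only; the helper names are made up for this example, and `num_steps=20` is a hypothetical sampling-step count, not a setting reported here.

```python
# Reported figures from the highlights above.
MOEBIUS_PARAMS = 226e6        # 226M parameters
FLUX_FILL_PARAMS = 11.9e9     # 11.9B parameters
MS_PER_STEP = 26.01           # Moebius latency per denoising step


def param_ratio(student: float, teacher: float) -> float:
    """Fraction of the larger model's parameters used by the smaller one."""
    return student / teacher


def total_latency_ms(ms_per_step: float, num_steps: int) -> float:
    """Denoising-loop latency only; VAE encode/decode overhead is ignored."""
    return ms_per_step * num_steps


ratio = param_ratio(MOEBIUS_PARAMS, FLUX_FILL_PARAMS)
print(f"Moebius uses {ratio:.1%} of the parameters ({1 / ratio:.0f}x smaller)")

# num_steps=20 is a hypothetical setting chosen only for illustration.
print(f"{total_latency_ms(MS_PER_STEP, num_steps=20):.1f} ms for 20 steps")
```

The ratio works out to roughly 1.9%, or about a 53x reduction in parameter count.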
🛠️ Technical Architecture
Moebius utilizes a Latent Diffusion Model (LDM) framework integrated with Latent Categories Guidance (LCG). The primary innovation lies in the systematic restructuring of the denoising U-Net.
1. The LMI Block
The Local-λ Mix Interaction block is designed to bypass the computational heavy-lifting of traditional attention. It consists of two primary components:
- Local-λ Module: Summarizes spatial contexts.
- Interactive-λ Module: Summarizes global semantic priors.
These modules compress information into fixed-size linear matrices, ensuring that the model retains the ability to handle complex latent interactions without the parameter bloat.
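One way to picture "compressing information into a fixed-size linear matrix" is the minimal pure-Python sketch below. It is not the Moebius implementation: it assumes a lambda-layer / linear-attention style summary, in which n context tokens are folded into a k x v matrix whose size does not depend on n, and each query is then transformed by that matrix. All function names and dimensions are hypothetical.

```python
import math


def softmax_over_positions(keys):
    """Normalize each key channel across the n context positions."""
    n, k = len(keys), len(keys[0])
    out = [[0.0] * k for _ in range(n)]
    for c in range(k):
        col = [keys[i][c] for i in range(n)]
        m = max(col)  # subtract the max for numerical stability
        exps = [math.exp(x - m) for x in col]
        total = sum(exps)
        for i in range(n):
            out[i][c] = exps[i] / total
    return out


def summarize_context(keys, values):
    """Fold n context tokens into a fixed-size k x v matrix (K^T V)."""
    n, k, v = len(keys), len(keys[0]), len(values[0])
    weights = softmax_over_positions(keys)
    return [[sum(weights[i][a] * values[i][b] for i in range(n))
             for b in range(v)] for a in range(k)]


def apply_summary(queries, summary):
    """Each query (length k) uses the summary as a linear map to length v."""
    k, v = len(summary), len(summary[0])
    return [[sum(q[a] * summary[a][b] for a in range(k)) for b in range(v)]
            for q in queries]


# Toy example: 4 context tokens, key dim k=2, value dim v=3.
keys = [[0.1, 0.4], [0.3, -0.2], [0.0, 0.5], [-0.1, 0.2]]
values = [[1.0, 0.0, 2.0], [0.5, 1.0, 0.0], [0.0, 2.0, 1.0], [1.5, 0.5, 0.5]]
queries = [[1.0, 0.0], [0.5, 0.5]]

summary = summarize_context(keys, values)   # always 2 x 3, whatever n is
outputs = apply_summary(queries, summary)   # one length-3 vector per query
print(len(summary), len(summary[0]), len(outputs), len(outputs[0]))
```

In this formulation, building the summary costs O(n·k·v), which is linear in the number of tokens, whereas pairwise attention compares every token with every other token and therefore scales as O(n²).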
2. Adaptive Multi-Granularity Distillation
To prevent the "capacity drop" typical of extreme compression, Moebius employs a sophisticated distillation process:
- Latent Space Operation: All distillation occurs within the latent space to avoid the high cost of pixel-space decoding.
- Multi-Level Supervision: Aligns everything from microscopic intermediate features to macroscopic diffusion trajectories.
- Dynamic Balancing: Uses a gradient-based adaptive weighting mechanism to optimize the loss function.
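The exact weighting rule is not specified here, so the following is only a sketch of one common form of gradient-norm balancing: each loss term is weighted inversely to the norm of its gradient, so that neither the feature-level nor the trajectory-level term dominates the student update. The names `feature_grad` and `trajectory_grad`, the toy gradient values, and the choice to normalize the weights to sum to 1 are all assumptions.

```python
import math


def l2_norm(grad):
    """Euclidean norm of a flat gradient vector."""
    return math.sqrt(sum(g * g for g in grad))


def adaptive_weights(grads, eps=1e-8):
    """Weight each loss inversely to its gradient norm; weights sum to 1."""
    inv = [1.0 / (l2_norm(g) + eps) for g in grads]
    total = sum(inv)
    return [x / total for x in inv]


def combine_gradients(grads, weights):
    """Weighted sum of per-loss gradients used for the student update."""
    return [sum(w * g[j] for w, g in zip(weights, grads))
            for j in range(len(grads[0]))]


# Hypothetical gradients of two distillation losses w.r.t. shared parameters.
feature_grad = [3.0, 4.0]        # norm 5.0 (intermediate-feature alignment)
trajectory_grad = [0.3, 0.4]     # norm 0.5 (diffusion-trajectory alignment)

weights = adaptive_weights([feature_grad, trajectory_grad])
update = combine_gradients([feature_grad, trajectory_grad], weights)
print(weights, update)
```

With these toy values the term with the ten-times-larger gradient receives a ten-times-smaller weight, so both terms contribute gradients of equal magnitude to the combined update.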
📊 Performance Comparison
Moebius challenges the heavy-compute narrative by showing that a task-specific specialist can match or outperform a far larger generalist.
| Metric | FLUX.1-Fill-Dev (Generalist) | Moebius (Specialist) | Improvement |
|---|---|---|---|
| Parameters | 11.9B | 0.22B (226M) | ~53× smaller |
| Inference Speed | Baseline | 26.01 ms/step | Faster |
| Param Ratio | 100% | ~1.9% | Extreme Efficiency |
| Quality | High | High | On-par / Surpasses |
🖼️ Visual Validation
The model was tested across diverse datasets to ensure robustness:
- Natural Scenes: Evaluated on Places2.
- Portrait Scenes: Evaluated on CelebA-HQ and FFHQ.
Figure 1: Comparison of Moebius vs. SOTA on natural landscapes.
Figure 2: Comparison of facial plausibility and texture detail.
💡 Final Philosophy: Specialist vs. Generalist
Moebius challenges the industry trend of "blind scaling." It asks: Can a model be smarter and faster if the task is explicitly defined? By mapping the synergy frontier between compact architecture and distillation, Moebius proves that a highly optimized specialist can liberate AI object removal and inpainting from parameter bloat, making it viable for consumer-grade and edge devices.
📚 Reference
@misc{DuanAndXu2026Moebius,
title={Moebius: 0.2B Lightweight Image Inpainting Framework with 10B-Level Performance},
author={Kangsheng Duan and Ziyang Xu and Wenyu Liu and Xiaohu Ruan and Xiaoxin Chen and Xinggang Wang},
year={2026},
eprint={2606.19195},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2606.19195},
}