Moebius: Achieving 10B-Level Image Inpainting with a 0.2B Model

Authors: Kangsheng Duan $^{1,*}$, Ziyang Xu $^{1,*,\dagger}$, Wenyu Liu $^1$, Xiaohu Ruan $^2$, Xiaoxin Chen $^2$, Xinggang Wang $^{1,\text{📧}}$

Affiliations: $^1$ Huazhong University of Science and Technology | $^2$ VIVO AI Lab

Status: In submission

Abstract: While industrial foundation models with 10B+ parameters have redefined the limits of image inpainting, their massive computational requirements make real-world deployment nearly impossible. Creating a task-specific specialist is a viable path, but extreme compression usually creates a "representation bottleneck." To solve this, we introduce Moebius, a lightweight framework that reconstructs the diffusion backbone via the Local- $\lambda$ Mix Interaction (L $\lambda$ MI) block. By condensing spatial and semantic data into fixed-size linear matrices, Moebius maintains complex interactions while slashing parameter counts. This is paired with an adaptive multi-granularity distillation strategy in the latent space to ensure high-fidelity alignment. Moebius matches or beats the 11.9B FLUX.1-Fill-Dev model while using $<2\%$ of the parameters and offering a $>15\times$ speedup.

🚀 Core Highlights

Extreme Parametric Efficiency ( $\approx 2\%$ ): Operates with only 0.22B (226M) parameters compared to the 11.9B of FLUX.1-Fill-Dev.
Blazing Inference Speed: Latency of just 26.01 ms/step on a single GPU, resulting in a total runtime acceleration of over $15\times$ .
Elite Generation Quality: Performs on par with or exceeds SOTA generalists (FLUX.1-Fill-Dev, SD3.5 Large-Inpainting) across 6 benchmarks, including Places2 (natural) and CelebA-HQ/FFHQ (portraits).
Architectural Innovation: Replaces the quadratic overhead of standard attention with the L $\lambda$ MI block.
Smart Distillation: Uses a gradient-norm adaptive loss to bridge the gap between a "teacher" (PixelHacker) and the "student" (Moebius).

🛠️ Technical Architecture

Moebius utilizes a Latent Diffusion Model (LDM) framework integrated with Latent Categories Guidance (LCG). The primary innovation lies in the systematic restructuring of the denoising U-Net.

1. The L $\lambda$ MI Block

The Local- $\lambda$ Mix Interaction (L $\lambda$ MI) block is designed to avoid the quadratic cost of traditional attention. It consists of two components:

Local- $\lambda$ Module: Summarizes spatial contexts.
Interactive- $\lambda$ Module: Summarizes global semantic priors.

These modules compress information into fixed-size linear matrices, ensuring that the model retains the ability to handle complex latent interactions without the parameter bloat.
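The idea of condensing a context into a fixed-size linear matrix can be illustrated with a generic lambda-style summary. The sketch below is not the L $\lambda$ MI block itself, whose internals are not described here. The function names, tensor shapes, and the softmax normalisation over context positions are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def lambda_summary(queries, keys, values):
    """Hypothetical lambda-style interaction (illustrative only).

    queries: (n, k) array, one row per query position
    keys:    (m, k) array, one row per context position
    values:  (m, v) array, one row per context position

    The m context positions are condensed into a (k, v) matrix whose
    size does not depend on m, then applied to every query as a plain
    linear map. No (n, m) attention map is ever formed.
    """
    weights = softmax(keys, axis=0)   # normalise over the m context positions
    lam = weights.T @ values          # (k, v) fixed-size summary
    return queries @ lam, lam         # outputs: (n, v)

rng = np.random.default_rng(0)
q = rng.normal(size=(16, 8))
k = rng.normal(size=(64, 8))
v = rng.normal(size=(64, 12))
out, lam = lambda_summary(q, k, v)
print(out.shape, lam.shape)
```

In this sketch the summary matrix keeps the same shape whether the context has 64 or 4096 positions, so the cost grows linearly with the number of positions rather than with the product of query and context lengths.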

2. Adaptive Multi-Granularity Distillation

To prevent the "capacity drop" typical of extreme compression, Moebius employs a sophisticated distillation process:

Latent Space Operation: All distillation occurs within the latent space to avoid the high cost of pixel-space decoding.
Multi-Level Supervision: Aligns everything from microscopic intermediate features to macroscopic diffusion trajectories.
Dynamic Balancing: Uses a gradient-based adaptive weighting mechanism to optimize the loss function.
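One common way to realise gradient-based adaptive weighting is to scale each loss term so that its gradient contributes a comparable magnitude. The sketch below assumes that rule; the exact weighting used by Moebius is not specified here, and `adaptive_weights`, `combine_losses`, and the `eps` constant are hypothetical names chosen for illustration.

```python
import numpy as np

def adaptive_weights(grad_norms, eps=1e-8):
    """Assumed rule: weight each loss inversely to its gradient norm.

    After weighting, every term's gradient has (approximately) the same
    norm, equal to the mean of the original norms.
    """
    g = np.asarray(grad_norms, dtype=np.float64)
    return g.mean() / (g + eps)

def combine_losses(losses, grad_norms):
    """Return the weighted total loss and the weights that produced it."""
    w = adaptive_weights(grad_norms)
    total = float(np.dot(w, np.asarray(losses, dtype=np.float64)))
    return total, w

# Example: a feature-level, an output-level, and a trajectory-level term
# whose gradient norms differ by orders of magnitude (made-up numbers).
losses = [0.8, 0.3, 1.5]
grad_norms = [0.1, 1.0, 10.0]
total, w = combine_losses(losses, grad_norms)
print(total, w)
```

In a training loop the gradient norms would be measured per term at each step (or smoothed over several steps), and the weights would be treated as constants when backpropagating the combined loss.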

📊 Performance Comparison

Moebius challenges the heavy-compute narrative by showing that a task-specific specialist can match or outperform a far larger generalist.

Metric	FLUX.1-Fill-Dev (Generalist)	Moebius (Specialist)	Improvement
Parameters	$11.9\text{B}$	$0.22\text{B}$	$\approx 54\times$ smaller
Inference Speed	Baseline	$26.01\text{ ms/step}$	$>15\times$ faster
Param Ratio	$100\%$	$\frac{0.22}{11.9} \approx 1.85\%$	Extreme Efficiency
Quality	High	$\ge$ High	On-par / Surpasses
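The size figures in the table follow directly from the two parameter counts. A minimal check, using the rounded 0.22B and 11.9B values from the table (the helper name `param_stats` is hypothetical):

```python
def param_stats(large_b, small_b):
    """Return (how many times smaller, share of parameters in percent)."""
    times_smaller = large_b / small_b
    share_percent = small_b / large_b * 100.0
    return times_smaller, share_percent

times_smaller, share_percent = param_stats(11.9, 0.22)
print(f"{times_smaller:.1f}x smaller, {share_percent:.2f}% of the parameters")
```

Using the unrounded 226M count instead of 0.22B shifts both figures slightly but keeps the parameter share under 2%.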

🖼️ Visual Validation

The model was tested across diverse datasets to ensure robustness:

Natural Scenes: Evaluated on Places2.
Portrait Scenes: Evaluated on CelebA-HQ and FFHQ.

Figure 1: Comparison of Moebius vs. SOTA on natural landscapes.

Figure 2: Comparison of facial plausibility and texture detail.

💡 Final Philosophy: Specialist vs. Generalist

Moebius challenges the industry trend of "blind scaling." It asks: Can a model be smarter and faster if the task is explicitly defined? By mapping the synergy frontier between compact architecture and distillation, Moebius proves that a highly optimized specialist can liberate AI object removal and inpainting from parameter bloat, making it viable for consumer-grade and edge devices.

📚 Reference

@misc{DuanAndXu2026Moebius,
  title={Moebius: 0.2B Lightweight Image Inpainting Framework with 10B-Level Performance},
  author={Kangsheng Duan and Ziyang Xu and Wenyu Liu and Xiaohu Ruan and Xiaoxin Chen and Xinggang Wang},
  year={2026},
  eprint={2606.19195},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2606.19195},
}