
Moebius: 0.2B image inpainting model with 10B-level performance

hustvl.github.io|116 points|29 comments|by DSemba|Jun 22, 2026

Moebius: Achieving 10B-Level Image Inpainting with a 0.2B Model

Authors: Kangsheng Duan (1,*), Ziyang Xu (1,*,†), Wenyu Liu (1), Xiaohu Ruan (2), Xiaoxin Chen (2), Xinggang Wang (1,📧)
Affiliations: (1) Huazhong University of Science and Technology | (2) VIVO AI Lab
Status: In submission


Abstract: While industrial foundation models with 10B+ parameters have redefined the limits of image inpainting, their massive computational requirements make real-world deployment nearly impossible. Creating a task-specific specialist is a viable path, but extreme compression usually creates a "representation bottleneck." To solve this, we introduce Moebius, a lightweight framework that reconstructs the diffusion backbone via the Local-λ Mix Interaction (LλMI) block. By condensing spatial and semantic data into fixed-size linear matrices, Moebius maintains complex interactions while slashing parameter counts. This is paired with an adaptive multi-granularity distillation strategy in the latent space to ensure high-fidelity alignment. Moebius matches or beats the 11.9B FLUX.1-Fill-Dev model while using <2% of the parameters and offering a >15× speedup.


🚀 Core Highlights

  • Extreme Parametric Efficiency (≈2%): Operates with only 0.22B (226M) parameters compared to the 11.9B of FLUX.1-Fill-Dev.
  • Blazing Inference Speed: Latency of just 26.01 ms/step on a single GPU, resulting in a total runtime acceleration of over 15×.
  • Elite Generation Quality: Performs on par with or exceeds SOTA generalists (FLUX.1-Fill-Dev, SD3.5 Large-Inpainting) across 6 benchmarks, including Places2 (natural) and CelebA-HQ/FFHQ (portraits).
  • Architectural Innovation: Replaces quadratic overhead with the LλMI block.
  • Smart Distillation: Uses a gradient-norm adaptive loss to bridge the gap between a "teacher" (PixelHacker) and the "student" (Moebius).

🛠️ Technical Architecture

Moebius utilizes a Latent Diffusion Model (LDM) framework integrated with Latent Categories Guidance (LCG). The primary innovation lies in the systematic restructuring of the denoising U-Net.

1. The LλMI Block

The Local-λ Mix Interaction (LλMI) block is designed to avoid the quadratic cost of traditional attention. It consists of two primary components:

  • Local-λ Module: Summarizes spatial contexts.
  • Interactive-λ Module: Summarizes global semantic priors.

These modules compress information into fixed-size linear matrices, ensuring that the model retains the ability to handle complex latent interactions without the parameter bloat.
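
To illustrate what condensing context into a fixed-size linear matrix can look like, here is a minimal pure-Python sketch in the style of a lambda layer. The block's name suggests that family of layers, but the summary above does not confirm it. The function names, the tensor shapes, and the softmax normalization are all assumptions made for illustration, not the authors' implementation.

```python
import math
import random


def softmax_over_positions(keys):
    """Normalize each key channel across context positions (column-wise softmax)."""
    n_ctx, k_dim = len(keys), len(keys[0])
    out = [[0.0] * k_dim for _ in range(n_ctx)]
    for j in range(k_dim):
        col = [keys[i][j] for i in range(n_ctx)]
        peak = max(col)  # subtract the max for numerical stability
        exps = [math.exp(c - peak) for c in col]
        total = sum(exps)
        for i in range(n_ctx):
            out[i][j] = exps[i] / total
    return out


def summarize_context(keys, values):
    """Condense n_ctx context positions into one k_dim x v_dim matrix.

    The result has the same size no matter how many positions are summarized.
    """
    weights = softmax_over_positions(keys)
    n_ctx, k_dim, v_dim = len(keys), len(keys[0]), len(values[0])
    return [
        [sum(weights[i][a] * values[i][b] for i in range(n_ctx)) for b in range(v_dim)]
        for a in range(k_dim)
    ]


def apply_summary(queries, summary):
    """Each query reads the summary with a single vector-matrix product."""
    v_dim = len(summary[0])
    return [
        [sum(q[a] * summary[a][b] for a in range(len(q))) for b in range(v_dim)]
        for q in queries
    ]


# Hypothetical sizes: 2 key channels, 3 value channels, varying context length.
random.seed(0)
for n_ctx in (4, 64):
    keys = [[random.gauss(0, 1) for _ in range(2)] for _ in range(n_ctx)]
    values = [[random.gauss(0, 1) for _ in range(3)] for _ in range(n_ctx)]
    summary = summarize_context(keys, values)
    print(n_ctx, "positions ->", len(summary), "x", len(summary[0]), "summary")
```

Both context lengths produce a 2 x 3 summary, so the per-query cost depends on the channel sizes rather than on the number of context positions. That is the property that avoids the pairwise position-to-position comparison of standard attention.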

2. Adaptive Multi-Granularity Distillation

To prevent the "capacity drop" typical of extreme compression, Moebius employs a sophisticated distillation process:

  • Latent Space Operation: All distillation occurs within the latent space to avoid the high cost of pixel-space decoding.
  • Multi-Level Supervision: Aligns everything from microscopic intermediate features to macroscopic diffusion trajectories.
  • Dynamic Balancing: Uses a gradient-based adaptive weighting mechanism to optimize the loss function.
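
The summary does not give the weighting formula, so the following is only a sketch of one common way to do gradient-norm balancing: weight each loss term inversely to the norm of its gradient, so that every term contributes a comparable gradient magnitude. The function names, the three supervision levels, and the numeric values are hypothetical.

```python
def adaptive_weights(grad_norms, eps=1e-8):
    """Weight each loss term inversely to its gradient norm.

    Weights are rescaled to sum to the number of terms, so the overall
    loss scale stays comparable to an unweighted sum.
    """
    inverse = [1.0 / (g + eps) for g in grad_norms]
    scale = len(grad_norms) / sum(inverse)
    return [scale * inv for inv in inverse]


def combine_losses(losses, grad_norms):
    """Return the weighted total loss and the weights that produced it."""
    weights = adaptive_weights(grad_norms)
    total = sum(w * loss for w, loss in zip(weights, losses))
    return total, weights


# Hypothetical values for three supervision levels
# (intermediate features, denoiser output, diffusion trajectory).
losses = [0.8, 0.3, 0.05]
grad_norms = [4.0, 1.0, 0.25]
total, weights = combine_losses(losses, grad_norms)
print("weights:", weights)
print("weighted gradient norms:", [w * g for w, g in zip(weights, grad_norms)])
```

After weighting, the three gradient norms come out nearly equal, so no single granularity dominates the update. In a real training loop the weights would normally be treated as constants (not differentiated through) and recomputed periodically.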

📊 Performance Comparison

Moebius challenges the heavy-compute narrative: a task-specific specialist can match or outperform a far larger generalist.

Metric          | FLUX.1-Fill-Dev (Generalist) | Moebius (Specialist) | Improvement
Parameters      | 11.9B                        | 0.22B                | ≈54× smaller
Inference Speed | Baseline                     | 26.01 ms/step        | >15× faster
Param Ratio     | 100%                         | 0.22/11.9 ≈ 1.85%    | Extreme Efficiency
Quality         | High                         | ≥ High               | On-par / Surpasses
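
The parameter figures above can be checked with a few lines of arithmetic. This sketch uses only the counts quoted in this summary (11.9B for FLUX.1-Fill-Dev, and 0.22B or 226M for Moebius); the function and variable names are illustrative.

```python
def compression_stats(small_params_b, large_params_b):
    """Return (small/large ratio in percent, large/small size factor)."""
    ratio_pct = 100.0 * small_params_b / large_params_b
    factor = large_params_b / small_params_b
    return ratio_pct, factor


FLUX_PARAMS_B = 11.9  # FLUX.1-Fill-Dev, in billions of parameters

for label, moebius_b in (("0.22B (rounded)", 0.22), ("226M", 0.226)):
    ratio_pct, factor = compression_stats(moebius_b, FLUX_PARAMS_B)
    print(f"{label}: {ratio_pct:.2f}% of the parameters, {factor:.1f}x smaller")
```

With the rounded 0.22B figure this gives 1.85% and about 54.1×. With the 226M figure it gives 1.90% and about 52.7×. Either way the model stays under the 2% quoted in the abstract.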

🖼️ Visual Validation

The model was tested across diverse datasets to ensure robustness:

  • Natural Scenes: Evaluated on Places2.
  • Portrait Scenes: Evaluated on CelebA-HQ and FFHQ.

Figure 1 (Natural Scenes Comparison): Comparison of Moebius vs. SOTA on natural landscapes.

Figure 2 (Portrait Scenes Comparison): Comparison of facial plausibility and texture detail.


💡 Final Philosophy: Specialist vs. Generalist

Moebius challenges the industry trend of "blind scaling." It asks: Can a model be smarter and faster if the task is explicitly defined? By mapping the synergy frontier between compact architecture and distillation, Moebius proves that a highly optimized specialist can liberate AI object removal and inpainting from parameter bloat, making it viable for consumer-grade and edge devices.


📚 Reference

@misc{DuanAndXu2026Moebius,
  title={Moebius: 0.2B Lightweight Image Inpainting Framework with 10B-Level Performance},
  author={Kangsheng Duan and Ziyang Xu and Wenyu Liu and Xiaohu Ruan and Xiaoxin Chen and Xinggang Wang},
  year={2026},
  eprint={2606.19195},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2606.19195},
}