 
Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding
[Technical Report] • [Project Page (Demo & Benchmark)] • [Code]
ΒΉShanghai AI Laboratory, Β²Shanghai Innovation Institute, Β³Shanghai Jiao Tong University
β΄Nanjing University, β΅The University of Sydney
βΆThe Chinese University of Hong Kong, β·Tsinghua University
 
Introduction
We introduce Lumina-DiMOO, an omni foundational model for seamless multimodal generation and understanding. Lumina-DiMOO is distinguished by four key innovations:
- Unified Discrete Diffusion Architecture: Unlike prior unified models, Lumina-DiMOO uses fully discrete diffusion modeling to handle inputs and outputs across modalities.
- Versatile Multimodal Capabilities: Lumina-DiMOO supports a broad range of multimodal tasks, including text-to-image generation (at arbitrary and high resolutions), image-to-image generation (e.g., image editing, subject-driven generation, and image inpainting), and advanced image understanding.
- Higher Sampling Efficiency: Lumina-DiMOO offers markedly higher sampling efficiency than previous autoregressive (AR) and hybrid AR-diffusion paradigms. In addition, we design a bespoke caching method that speeds up sampling by a further 2x.
- Superior Performance: Lumina-DiMOO achieves state-of-the-art performance on multiple benchmarks, surpassing existing open-source unified multimodal models.
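To give a rough sense of what decoding with discrete diffusion looks like in general, the sketch below runs a MaskGIT-style iterative unmasking loop: every position starts as a mask token, and each step commits the most confident predictions. This is not Lumina-DiMOO's implementation. `toy_predict` is a random stand-in for the model, and `MASK`, the cosine schedule, and all parameter values are assumptions chosen purely for illustration.

```python
import math
import random

MASK = -1  # hypothetical id standing in for the mask token


def toy_predict(tokens, vocab_size, rng):
    """Random stand-in for the model: one (token, confidence) pair per position."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def cosine_num_masked(step, total_steps, seq_len):
    """How many tokens stay masked after `step` out of `total_steps` (assumed schedule)."""
    ratio = math.cos(math.pi / 2 * step / total_steps)
    return max(0, math.floor(ratio * seq_len))


def sample(seq_len=16, total_steps=4, vocab_size=100, seed=0):
    rng = random.Random(seed)
    tokens = [MASK] * seq_len  # start from a fully masked sequence
    for step in range(1, total_steps + 1):
        preds = toy_predict(tokens, vocab_size, rng)
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        n_unmask = len(masked) - cosine_num_masked(step, total_steps, seq_len)
        # Commit the most confident predictions among the still-masked positions.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:n_unmask]:
            tokens[i] = preds[i][0]
    return tokens


if __name__ == "__main__":
    print(sample())
```

Because many positions are committed in each step, the number of model calls is set by the step count rather than by the sequence length, which is the usual source of the speed advantage over token-by-token AR decoding.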
 
Qualitative Results
Below we compare Lumina-DiMOO's generation results with those of other models. For additional visualizations, please see our Project Page.
Text-to-Image Comparison
 
Image Editing Comparison
 
Controllable & Subject-Driven Generation Comparison
 
Image Inpainting & Extrapolation
 
Quantitative Performance
GenEval Benchmark
 
DPG Benchmark
 
OneIG-EN Benchmark
 
TIIF Benchmark
 
Image-to-Image Benchmark
 
Image Understanding Benchmark
 
Sampling Speed Analysis
- Text generation is performed block by block, whereas image generation uses a single global decoding step, so text generation speed depends on both the number of blocks and the number of steps. As a result, the speedup for image understanding is smaller than the speedup for image generation.
- Lumina-DiMOO Settings: For image generation, we use 64 sampling steps. For image understanding, we set the block length to 256 and the number of sampling steps to 128.
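The dependence on both blocks and steps can be made concrete with a little arithmetic. The sketch below counts model calls under the assumption that each block is refined for a fixed number of steps before decoding moves on to the next block. How the 128 steps in the settings above are distributed across blocks is not stated here, so `steps_per_block` is a free parameter of the sketch, and the token counts and the value 32 are hypothetical.

```python
import math


def num_forward_passes(num_tokens, block_length, steps_per_block):
    """Count model calls when tokens are decoded block by block (assumed scheme)."""
    num_blocks = math.ceil(num_tokens / block_length)
    return num_blocks * steps_per_block


# Global decoding: the whole sequence is treated as one block.
# 4096 tokens is a hypothetical image size; 64 steps matches the image setting above.
print(num_forward_passes(num_tokens=4096, block_length=4096, steps_per_block=64))  # 64

# Block-wise decoding with block length 256, as in the image understanding setting.
# 1024 tokens and 32 steps per block are hypothetical values.
print(num_forward_passes(num_tokens=1024, block_length=256, steps_per_block=32))  # 128
```

Under this assumption, the cost of global decoding is fixed by the step count alone, while the cost of block-wise decoding grows with the length of the generated text.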
Sampling Speed Comparison
 
Discussion
You can reach us with this WeChat QR code!
  
 
Acknowledgements
This work was also supported and implemented by MindSpeed MM, an open-source training framework for large-scale multimodal models, developed and maintained by Huawei's Computing Product Line. Optimized specifically for Huawei's Ascend AI chips, MindSpeed MM offers comprehensive support for distributed training and is tailored to a wide range of multimodal tasks.
BibTeX
@article{xin2025lumina,
  title={Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding},
  author={Xin, Yi and Qin, Qi and Luo, Siqi and Zhu, Kaiwen and Yan, Juncheng and Tai, Yan and Lei, Jiayi and Cao, Yuewen and Wang, Keqi and Wang, Yibin and others},
  journal={arXiv preprint arXiv:2510.06308},
  year={2025}
}
