Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding

[📑 Technical Report]   [💜 Project Page (Demo & Benchmark)]   [🌐 Code]

¹Shanghai AI Laboratory, ²Shanghai Innovation Institute, ³Shanghai Jiao Tong University

⁴Nanjing University, ⁵The University of Sydney

⁶The Chinese University of Hong Kong, ⁷Tsinghua University

📚 Introduction

We introduce Lumina-DiMOO, an omni foundational model for seamless multimodal generation and understanding. Lumina-DiMOO is distinguished by four key innovations:

  • Unified Discrete Diffusion Architecture: Lumina-DiMOO sets itself apart from prior unified models by using fully discrete diffusion modeling to handle inputs and outputs across all modalities.

  • Versatile Multimodal Capabilities: Lumina-DiMOO supports a broad spectrum of multimodal tasks, including text-to-image generation (at arbitrary and high resolutions), image-to-image generation (e.g., image editing, subject-driven generation, and image inpainting), and advanced image understanding.

  • Higher Sampling Efficiency: Compared to previous autoregressive (AR) or hybrid AR-diffusion paradigms, Lumina-DiMOO demonstrates remarkable sampling efficiency. Additionally, we design a bespoke caching method that speeds up sampling by a further 2x.

  • Superior Performance: Lumina-DiMOO achieves state-of-the-art performance on multiple benchmarks, surpassing existing open-source unified multimodal models and setting a new standard in the field.
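To make the idea of discrete diffusion decoding more concrete, here is a minimal, illustrative sketch of masked discrete-diffusion sampling in general. It is not Lumina-DiMOO's actual implementation: `toy_predict` is a hypothetical stand-in for a real model, `MASK` is a hypothetical mask-token id, and the confidence-based unmasking rule with an even per-step schedule is only one common choice for this family of samplers.

```python
import random

MASK = -1  # hypothetical id for the mask token


def toy_predict(tokens, vocab_size, rng):
    """Hypothetical stand-in for a model.

    Returns one (predicted_token, confidence) pair per position.
    A real model would condition on the prompt and the unmasked tokens.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def unmask_schedule(num_masked, steps):
    """Split num_masked tokens as evenly as possible over `steps` steps."""
    base, extra = divmod(num_masked, steps)
    return [base + (1 if i < extra else 0) for i in range(steps)]


def decode(length, steps, vocab_size=16, seed=0):
    """Start fully masked and commit the most confident tokens each step."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for k in unmask_schedule(length, steps):
        preds = toy_predict(tokens, vocab_size, rng)
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        # Commit the k masked positions with the highest confidence.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:k]:
            tokens[i] = preds[i][0]
    return tokens
```

Calling `decode(16, 4)` fills all 16 positions in 4 steps, several tokens per step, which is where the efficiency gain over one-token-at-a-time autoregressive decoding comes from in this style of sampler.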

πŸ“½οΈ Qualitative Results

Here we present some comparative generation results with other models. For additional visualization results, please see our Project Page.

Text-to-Image Comparison
Image Editing Comparison
Controllable & Subject-Driven Generation Comparison
Image Inpainting & Extrapolation

📊 Quantitative Performance

GenEval Benchmark
DPG Benchmark
OneIG-EN Benchmark
TIIF Benchmark
Image-to-Image Benchmark
Image Understanding Benchmark

🚀 Sampling Speed Analysis

  • Text generation is performed block by block, whereas image generation uses a single global decoding stage, so text generation speed depends on both the number of blocks and the number of steps. As a result, the speedup for image understanding is less pronounced than for image generation.

  • Lumina-DiMOO Settings: For image generation, we sample 64 steps. For image understanding, we set the block length to 256 and the number of sampling steps to 128.
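The dependence on block count can be illustrated with a rough count of decoding steps. This is only a back-of-the-envelope sketch, not a measurement: it assumes the step count applies to each block (the settings above do not say whether 128 is per block or in total), the helper names are hypothetical, and the response length of 512 tokens in the usage below is an invented example value.

```python
import math


def num_blocks(gen_length, block_length):
    """Number of blocks needed to cover gen_length tokens."""
    return math.ceil(gen_length / block_length)


def blockwise_steps(gen_length, block_length, steps_per_block):
    """Total decoding steps when every block runs its own step loop.

    Assumption: steps_per_block applies to each block separately.
    """
    return num_blocks(gen_length, block_length) * steps_per_block


def global_steps(steps):
    """Total decoding steps when the whole output is decoded at once."""
    return steps


# Image generation: 64 steps over the whole image.
image_total = global_steps(64)

# Image understanding: block length 256, 128 steps per block (assumed),
# with a hypothetical 512-token response.
text_total = blockwise_steps(512, 256, 128)
```

Under these assumptions the text path runs 2 blocks of 128 steps (256 steps in total) against 64 for an image, and the total grows with response length, which is consistent with the smaller speedup reported for image understanding.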

Sampling Speed Comparison

💬 Discussion

You can reach us with this WeChat QR code!


📜 Acknowledgements

This work was also supported and implemented by MindSpeed MM, an open-source training framework for large-scale multimodal models, developed and maintained by Huawei's Computing Product Line. Specifically optimized for Huawei's Ascend AI chips, MindSpeed MM offers comprehensive support for distributed training and is tailored for a wide range of multimodal tasks.

📖 BibTeX

@article{xin2025lumina,
  title={Lumina-DiMOO: An Omni Diffusion Large Language Model for Multi-Modal Generation and Understanding},
  author={Xin, Yi and Qin, Qi and Luo, Siqi and Zhu, Kaiwen and Yan, Juncheng and Tai, Yan and Lei, Jiayi and Cao, Yuewen and Wang, Keqi and Wang, Yibin and others},
  journal={arXiv preprint arXiv:2510.06308},
  year={2025}
}