junzhin nielsr HF Staff committed on
Commit 33ea823 · verified · 1 Parent(s): e8c3b95

Enhance model card with detailed information and `library_name: transformers` (#1)


- Enhance model card with detailed information and `library_name: transformers` (ab056b997d8ed0723b1f67b15195690a9a16b168)


Co-authored-by: Niels Rogge <[email protected]>

Files changed (1)
  1. README.md +296 -6
README.md CHANGED
@@ -1,13 +1,14 @@
1
  ---
2
- license: apache-2.0
3
- pipeline_tag: any-to-any
4
  language:
5
  - en
6
  - zh
 
7
  metrics:
8
  - accuracy
9
- base_model:
10
- - ByteDance-Seed/BAGEL-7B-MoT
11
  tags:
12
  - medical
13
  - vision-language
@@ -22,6 +23,10 @@ tags:
22
  - modality-transfer
23
  ---
25
  <div align="center">
26
  <a href="https://uni-medical.github.io/UniMedVL_Web/" target="_blank">
27
  <img alt="Project Page" src="https://img.shields.io/badge/🌐_Project-Page-blue" />
@@ -62,6 +67,67 @@ A unified medical foundation model enabling both understanding and generation ca
62
  <img src="./assets/teaser.png" width="95%"/>
63
  </div>
65
  ## Model Details
66
 
67
  ### Model Description
@@ -92,6 +158,230 @@ The model can be directly used for:
92
 
93
  - **Clinical Decision Making**: This model is for research purposes only and should NOT be used for actual clinical diagnosis or treatment decisions
94
 
95
- ## Acknowledgments
96
 
97
- We sincerely thank the [Bagel](https://arxiv.org/abs/2505.14683) project for providing the foundational framework upon which our code and model training are built.
1
  ---
2
+ base_model:
3
+ - ByteDance-Seed/BAGEL-7B-MoT
4
  language:
5
  - en
6
  - zh
7
+ license: apache-2.0
8
  metrics:
9
  - accuracy
10
+ pipeline_tag: any-to-any
11
+ library_name: transformers
12
  tags:
13
  - medical
14
  - vision-language
 
23
  - modality-transfer
24
  ---
25
 
26
+ <p align="center">
27
+ <img src="./images/logo.png" width="50px" height="50px"/>
28
+ </p>
29
+
30
  <div align="center">
31
  <a href="https://uni-medical.github.io/UniMedVL_Web/" target="_blank">
32
  <img alt="Project Page" src="https://img.shields.io/badge/🌐_Project-Page-blue" />
 
67
  <img src="./assets/teaser.png" width="95%"/>
68
  </div>
69
 
70
+ ## Paper Abstract
71
+
72
+ Medical diagnostic applications require models that can process multimodal medical inputs (images, patient histories, lab results) and generate diverse outputs including both textual reports and visual content (annotations, segmentation masks, and images). Despite this need, existing medical AI systems disrupt this unified process: medical image understanding models interpret images but cannot generate visual outputs, while medical image generation models synthesize images but cannot provide textual explanations. This leads to gaps in data representation, feature integration, and task-level multimodal capabilities. To this end, we propose a multi-level framework that draws inspiration from diagnostic workflows through the Observation-Knowledge-Analysis (OKA) paradigm. Specifically, at the observation level, we construct UniMed-5M, a dataset comprising over 5.6M samples that reformat diverse unimodal data into multimodal pairs for foundational observation. At the knowledge level, we propose Progressive Curriculum Learning that systematically introduces medical multimodal knowledge. At the analysis level, we introduce UniMedVL, the first medical unified multimodal model for the simultaneous analysis of image understanding and generation tasks within a single architecture. UniMedVL achieves superior performance on five medical image understanding benchmarks, while matching specialized models in generation quality across eight medical imaging modalities. Crucially, our unified architecture enables bidirectional knowledge sharing: generation tasks enhance visual understanding features, demonstrating that integrating traditionally separate capabilities within a single medical framework unlocks improvements across diverse medical vision-language tasks. Code is available at this https URL .
73
+
74
+ ## 📚 Introduction
75
+
76
+ We introduce **UniMedVL**, a unified medical foundation model for seamless multimodal understanding and generation. Four key innovations distinguish UniMedVL:
77
+
78
+ - **Unified Observation-Knowledge-Analysis Architecture:** UniMedVL sets itself apart from prior medical AI models by following a clinically-inspired three-level framework that mirrors how physicians process medical information, enabling both understanding and generation within a single architecture.
79
+
80
+ - **Versatile Medical Multimodal Capabilities:** UniMedVL supports a broad spectrum of medical tasks, including visual question answering, medical report generation, text-to-medical-image synthesis, cross-modal translation, and virtual staining across 9 imaging modalities.
81
+
82
+ - **Large-Scale Medical Dataset:** We present UniMed-5M, a comprehensive medical multimodal dataset containing 5.6M+ high-quality samples with three-stage quality verification and expert validation, covering understanding, generation, and interleaved tasks.
83
+
84
+ - **Superior Performance:** UniMedVL achieves state-of-the-art performance on multiple evaluation datasets, with 75.4% accuracy on SLAKE VQA, 53.5% on PathVQA, and competitive generation quality (96.29 average gFID), setting a new standard in unified medical AI.
85
+
86
+ <div align="center">
87
+ <img src="images/overview_ver3.png" alt="UniMedVL Architecture" width="100%">
88
+ </div>
89
+
90
+ ## 🔬 Methodology
91
+
92
+ ### 📋 OKA Framework: Observation-Knowledge-Analysis
93
+
94
+ UniMedVL follows a workflow-guided three-level framework that mirrors how physicians process medical information:
95
+
96
+ ```mermaid
97
+ flowchart TD
98
+ A[Observation Level] --> B[Knowledge Level] --> C[Analysis Level]
99
+
100
+ A1[UniMed-5M Dataset<br/>5.6M samples<br/>8 imaging modalities] --> A
101
+ A --> A2[Quality Control<br/>Three-stage verification<br/>Expert validation]
102
+
103
+ B1[Progressive Curriculum<br/>Foundation → Instruction → Unified] --> B
104
+ B --> B2[Cross-modal Knowledge Fusion<br/>Understanding ↔ Generation]
105
+
106
+ C1[Unified Architecture<br/>Dual encoders + MOT] --> C
107
+ C --> C2[Multimodal Outputs<br/>Reports + Images + Annotations]
108
+ ```
109
+
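To make the Analysis-level design above more concrete, here is a purely illustrative PyTorch skeleton of a dual-encoder unified model with separate text and image heads. All class names, dimensions, and wiring are hypothetical simplifications for intuition only; they are not the released UniMedVL implementation (see the `codes/` directory for that).

```python
import torch
import torch.nn as nn

class UnifiedMedSketch(nn.Module):
    """Toy stand-in for a dual-encoder unified model (NOT the UniMedVL implementation)."""
    def __init__(self, d_model=512, vit_dim=1024, latent_dim=16, vocab_size=32000):
        super().__init__()
        self.und_proj = nn.Linear(vit_dim, d_model)      # understanding branch: semantic image features
        self.gen_proj = nn.Linear(latent_dim, d_model)   # generation branch: VAE-style image latents
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)  # stands in for the shared backbone
        self.text_head = nn.Linear(d_model, vocab_size)  # report / answer tokens
        self.image_head = nn.Linear(d_model, latent_dim) # predicted image latents

    def forward(self, vit_feats, vae_latents):
        # Concatenate both token streams, process jointly, then split per output head.
        tokens = torch.cat([self.und_proj(vit_feats), self.gen_proj(vae_latents)], dim=1)
        h = self.backbone(tokens)
        n_und = vit_feats.shape[1]
        return self.text_head(h[:, :n_und]), self.image_head(h[:, n_und:])

model = UnifiedMedSketch()
text_logits, image_latents = model(torch.randn(1, 196, 1024), torch.randn(1, 64, 16))
print(text_logits.shape, image_latents.shape)
```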
110
+ ### 🎯 Training Strategy
111
+
112
+ **Three-Stage Progressive Curriculum Learning** (an illustrative data-mixture sketch follows the list):
113
+
114
+ 1. **🔧 Stage 1 - Foundation Training** (85K steps)
115
+ - Basic medical pattern recognition
116
+ - Visual-language alignment
117
+ - Data ratio: 75% I2T, 25% T2I
118
+
119
+ 2. **📚 Stage 2 - Instruction Tuning** (120K steps)
120
+ - Cross-modal understanding enhancement
121
+ - Medical expertise development
122
+ - Data ratio: 40% I2T, 45% T2I, 10% Interleaved
123
+
124
+ 3. **🚀 Stage 3 - Unified Training** (70K steps)
125
+ - Advanced multimodal synthesis
126
+ - Interleaved task mastery
127
+ - Data ratio: 37% I2T, 35% T2I, 25% Interleaved
128
+
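The stage-wise mixtures above can be read as a simple sampling schedule. The snippet below only illustrates those ratios; the stage keys and helper function are hypothetical and not part of the released training code.

```python
import random

# Stage names and this helper are illustrative; only the step counts and data
# ratios come from the model card above. Weights need not sum to exactly 1.
CURRICULUM = {
    "stage1_foundation":  {"steps": 85_000,  "mixture": {"I2T": 0.75, "T2I": 0.25}},
    "stage2_instruction": {"steps": 120_000, "mixture": {"I2T": 0.40, "T2I": 0.45, "Interleaved": 0.10}},
    "stage3_unified":     {"steps": 70_000,  "mixture": {"I2T": 0.37, "T2I": 0.35, "Interleaved": 0.25}},
}

def sample_task(stage: str) -> str:
    """Draw a task type (I2T / T2I / Interleaved) for one training example."""
    mix = CURRICULUM[stage]["mixture"]
    return random.choices(list(mix), weights=list(mix.values()), k=1)[0]

print(sample_task("stage3_unified"))
```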
129
+ ---
130
+
131
  ## Model Details
132
 
133
  ### Model Description
 
158
 
159
  - **Clinical Decision Making**: This model is for research purposes only and should NOT be used for actual clinical diagnosis or treatment decisions
160
 
161
+ ## 💬 Qualitative Results
162
+
163
+ Here we present representative visualization results demonstrating UniMedVL's capabilities. **For additional visualization results and comparisons, please see our [Project Page](https://uni-medical.github.io/UniMedVL_Web/).**
164
+
165
+ <details open>
166
+ <summary>Performance Across Training Stages</summary>
167
+ <div align="center">
168
+ <img src="images/topline_performance.png" alt="Performance Comparison" width="100%">
169
+ <p><em>Comprehensive performance comparison across training stages and modalities</em></p>
170
+ </div>
171
+ </details>
172
+
173
+ <details open>
174
+ <summary>Multimodal Tasks Demonstration</summary>
175
+ <div align="center">
176
+ <img src="images/fig_results_ver2.png" alt="Multimodal Task Results" width="100%">
177
+ <p><em>Comprehensive visualization of UniMedVL's multimodal capabilities across diverse medical tasks</em></p>
178
+ </div>
179
+ </details>
180
+
181
+ <details open>
182
+ <summary>Medical Visual Question Answering</summary>
183
+ <div align="center">
184
+ <img src="images/visual_question_answering.png" alt="Medical VQA Examples" width="60%">
185
+ <p><em>Medical Visual Question Answering examples showing model's diagnostic reasoning capabilities</em></p>
186
+ </div>
187
+ </details>
188
+
189
+ <details open>
190
+ <summary>Medical Report Generation</summary>
191
+ <div align="center">
192
+ <img src="images/reportgeneration.png" alt="Medical Report Generation" width="60%">
193
+ <p><em>Automated medical report generation examples across different imaging modalities</em></p>
194
+ </div>
195
+ </details>
196
+
197
+ <details open>
198
+ <summary>Text-to-Medical-Image Generation</summary>
199
+ <div align="center">
200
+ <img src="images/text2img1.png" alt="Text-to-Image Generation Examples 1" width="60%">
201
+ <p><em>Text-to-medical-image generation results showing high-quality synthesis</em></p>
202
+ </div>
203
+ <div align="center">
204
+ <img src="images/text2img2.png" alt="Text-to-Image Generation Examples 2" width="60%">
205
+ <p><em>Additional text-to-medical-image generation examples across modalities</em></p>
206
+ </div>
207
+ </details>
208
+
209
+ <details open>
210
+ <summary>Medical Image Generation Across 8 Modalities</summary>
211
+
212
+
213
+
214
+ ### Chest X-Ray (CXR)
215
+ <div align="center">
216
+ <img src="images/cxr.png" alt="Chest X-Ray" width="60%">
217
+ </div>
218
+
219
+ ### Computed Tomography (CT)
220
+ <div align="center">
221
+ <img src="images/ct.png" alt="CT Scan" width="60%">
222
+ </div>
223
+
224
+ ### Magnetic Resonance Imaging (MRI)
225
+ <div align="center">
226
+ <img src="images/mri.png" alt="MRI Scan" width="60%">
227
+ </div>
228
+
229
+ ### Ultrasound
230
+ <div align="center">
231
+ <img src="images/ultrasound.png" alt="Ultrasound" width="60%">
232
+ </div>
233
+
234
+ ### Histopathology (HIS)
235
+ <div align="center">
236
+ <img src="images/his.png" alt="Histopathology" width="60%">
237
+ </div>
238
+
239
+ ### Retinal Fundus Photography (CFP)
240
+ <div align="center">
241
+ <img src="images/retinal.png" alt="Retinal Fundus" width="60%">
242
+ </div>
243
+
244
+ ### Optical Coherence Tomography (OCT)
245
+ <div align="center">
246
+ <img src="images/oct.png" alt="OCT" width="60%">
247
+ </div>
248
+
249
+ ### Endoscopy
250
+ <div align="center">
251
+ <img src="images/endoscopy.png" alt="Endoscopy" width="60%">
252
+ </div>
253
+
254
+ </details>
255
+
256
+ ## 📊 Quantitative Performance
257
+
258
+ <details open>
259
+ <summary>Medical Visual Question Answering Performance</summary>
260
+
261
+ | Model | Params | Type | VQA-RAD | SLAKE | PathVQA | OmniMedVQA | GMAI-MMBench |
262
+ |-------|--------|------|---------|-------|---------|------------|--------------|
263
+ | GMAI-VL | 7B | Medical-specific | 66.3 | 72.9 | 39.8 | 88.5 | 61.74 |
264
+ | HuatuoGPT-Vision | 7B | Medical-specific | 53.0 | 49.1 | 32.0 | 50.0 | 50.22 |
265
+ | Bagel | 7B | Unified | 60.09 | 58.91 | 39.05 | 71.13 | 48.11 |
266
+ | HealthGPT-L14 | 14B | Unified | 58.3 | 64.5 | 44.4 | 74.4 | 43.1 |
267
+ | **UniMedVL** | **14B** | **Unified** | **61.9** | **75.4** | **53.5** | **85.8** | **60.75** |
268
+
269
+ </details>
270
+
271
+
272
+ <details open>
273
+ <summary>Medical Image Generation Performance</summary>
274
+
275
+ *Text-to-image generation performance across 8 medical imaging modalities. Metrics: gFID ↓ (lower is better) / BioMedCLIP Score ↑ (higher is better)*
276
+
277
+ | Model | CFP | CXR | CT | HIS | MRI | OCT | Ultrasound | Endoscopy | Average |
278
+ |-------|-----|-----|----|----|-----|-----|------------|-----------|---------|
279
+ | Bagel (7B) | 217.19/0.650 | 182.80/0.662 | 163.78/0.652 | 206.18/0.643 | 175.74/0.639 | 307.80/0.719 | 255.78/0.672 | 214.61/0.668 | 215.49/0.660 |
280
+ | **UniMedVL (14B)** | **53.20/0.708** | **73.04/0.702** | **73.04/0.696** | **149.01/0.704** | **90.36/0.706** | **99.27/0.721** | **95.38/0.706** | **133.11/0.707** | **96.29/0.706** |
281
+
282
+ </details>
283
+
284
+ <details open>
285
+ <summary>Interleaved Multimodal Tasks Performance</summary>
286
+
287
+ **Virtual Immunohistochemistry Staining (H&E → IHC)**
288
+
289
+ | Method | Type | PSNR ↑ | SSIM ↑ |
290
+ |--------|------|--------|--------|
291
+ | Pyramid Pix2pix | Specialized | 21.16 | 0.477 |
292
+ | HealthGPT-M3 | Unified | 15.81 | 0.242 |
293
+ | **UniMedVL** | **Unified** | **20.27** | **0.456** |
294
+
295
+ **MRI Super-Resolution (4× upsampling)**
296
+
297
+ | Method | Type | PSNR ↑ | SSIM ↑ |
298
+ |--------|------|--------|--------|
299
+ | AMIR | Specialized | 31.99 | 0.939 |
300
+ | HealthGPT-M3 | Unified | 18.37 | 0.580 |
301
+ | **UniMedVL** | **Unified** | **27.29** | **0.890** |
302
+
303
+ **Cross-Modal Synthesis (T2 ↔ FLAIR MRI)**
304
+
305
+ | Method | Type | Average PSNR ↑ | Average SSIM ↑ |
306
+ |--------|------|----------------|----------------|
307
+ | ResViT | Specialized | 25.38 | 0.889 |
308
+ | HealthGPT-M3 | Unified | 19.09 | 0.748 |
309
+ | **UniMedVL** | **Unified** | **25.07** | **0.882** |
310
+
311
+ </details>
312
+
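PSNR and SSIM, as reported in the interleaved-task tables above, are standard full-reference image-quality metrics. The snippet below shows a generic way to compute them with scikit-image on placeholder arrays; it is not necessarily the exact evaluation protocol used for these results.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# Placeholder arrays standing in for a ground-truth image and a model output.
rng = np.random.default_rng(0)
reference = rng.random((256, 256))
generated = np.clip(reference + 0.05 * rng.standard_normal((256, 256)), 0.0, 1.0)

psnr = peak_signal_noise_ratio(reference, generated, data_range=1.0)  # higher is better
ssim = structural_similarity(reference, generated, data_range=1.0)    # higher is better
print(f"PSNR: {psnr:.2f} dB, SSIM: {ssim:.3f}")
```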
313
+ <details open>
314
+ <summary>Counterfactual Medical Image Generation</summary>
315
+
316
+ *Performance on counterfactual chest X-ray generation with explanatory text. † indicates unified fine-tuning variant.*
317
+
318
+ | Method | gFID ↓ | AUROC ↑ | F1 ↑ | BLEU-3 ↑ | METEOR ↑ | ROUGE-L ↑ |
319
+ |--------|--------|---------|------|----------|----------|-----------|
320
+ | ProgEmu | 29.21 | 0.792 | 0.891 | 0.124 | 0.410 | 0.261 |
321
+ | **UniMedVL†** | **27.17** | **0.797** | **0.873** | **0.264** | **0.449** | **0.465** |
322
+
323
+ </details>
324
+
325
+ ---
326
+
327
+ ## 📝 Open-Source Plan
328
+
329
+ - [x] **📄 Paper & Evaluations** - Research documentation and evaluation results
330
+ - [x] **🖼️ Visualizations** - Result figures and model demonstrations
331
+ - [x] **💾 Model Checkpoints** - Pre-trained UniMedVL weights (14B parameters)
332
+ - [x] **🔧 Inference Code** - Model loading and inference examples
333
+ - [ ] **🏋️ Training Code** - Full training pipeline and configuration files
334
+ - [ ] **📁 UniMed-5M Dataset** - Training dataset with quality control
335
+
336
+ ## 🚀 Getting Started
337
+
338
+ ### Installation
339
+ ```bash
340
+ conda env create -f codes/environment.yaml
341
+ conda activate unimedvl
342
+ ```
343
+
344
+ ### Inference Scripts
345
+ Two interactive inference scripts are provided in the `codes/` directory:
346
+
347
+ 1. **Medical Visual Question Answering** (`interactive_vqa_inferencer.py`)
348
+
349
+ 2. **Medical Image Generation** (`interactive_image_generator.py`)
350
+
351
+ ### Quick Usage
352
+ 1. Download the UniMedVL checkpoint
353
+ 2. Set `model_path` and `ROOT` in the script configuration
354
+ 3. Run the script: `python codes/interactive_vqa_inferencer.py` or `python codes/interactive_image_generator.py`
355
+
356
+ ---
357
+
358
+ ## 📜 License
359
+
360
+ This project is licensed under the **Apache License 2.0**. See the [LICENSE](LICENSE) file for details.
361
+
362
+ ---
363
+
364
+ ## 📚 Citations
365
+
366
+ If you use this project in your research or work, please cite it as:
367
+
368
+ ```bibtex
369
+ @misc{ning2025unimedvlunifyingmedicalmultimodal,
370
+ title={Unimedvl: Unifying Medical Multimodal Understanding And Generation Through Observation-Knowledge-Analysis},
371
+ author={Junzhi Ning and Wei Li and Cheng Tang and Jiashi Lin and Chenglong Ma and Chaoyang Zhang and Jiyao Liu and Ying Chen and Shujian Gao and Lihao Liu and Yuandong Pu and Huihui Xu and Chenhui Gou and Ziyan Huang and Yi Xin and Qi Qin and Zhongying Deng and Diping Song and Bin Fu and Guang Yang and Yuanfeng Ji and Tianbin Li and Yanzhou Su and Jin Ye and Shixiang Tang and Ming Hu and Junjun He},
372
+ year={2025},
373
+ eprint={2510.15710},
374
+ archivePrefix={arXiv},
375
+ primaryClass={cs.CV},
376
+ url={https://arxiv.org/abs/2510.15710},
+ }
377
+ ```
378
+
379
+ ---
380
+
381
+ ## 🙏 Acknowledgments
382
+
383
+ We sincerely thank the following projects and their contributors for their invaluable open-source contributions that made this research possible:
384
 
385
+ - **[Bagel](https://github.com/ByteDance-Seed/Bagel)** - Foundation model architecture and training methodology inspiration
386
+ - **[HealthGPT](https://github.com/DCDmllm/HealthGPT)** - Medical domain adaptation and evaluation framework
387
+ - **[VLMEvalKit](https://github.com/open-compass/VLMEvalKit)** - Comprehensive evaluation toolkit for vision-language models