Improve model card: Add pipeline tag, library name, and essential links
This PR enhances the model card for UniFilter by:
- Adding `pipeline_tag: image-text-to-text` for better discoverability and categorization on the Hub.
- Adding `library_name: transformers` to enable the automated "how to use" widget, as the model is compatible with the `transformers` library.
- Updating the existing placeholder paper link to the official Hugging Face paper page.
- Including direct links to the project page and GitHub repository for comprehensive information.
- Adding the `Citation` section for proper attribution.
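The `pipeline_tag` and `library_name` changes live in the YAML front matter at the top of README.md. As a quick sanity check that the new keys are present, here is a minimal sketch; the parser below is a deliberately simplified stand-in (flat `key: value` lines only), not the Hub's actual front-matter handling.

```python
# Simplified check that the updated front matter carries the keys this PR adds.
# flat_keys is an illustrative helper, not part of any Hub library.

FRONT_MATTER = """\
base_model:
- Qwen/Qwen2.5-1.5B-Instruct
- google/siglip-so400m-patch14-384
datasets:
- weizhiwang/unifilter_train_data
license: mit
pipeline_tag: image-text-to-text
library_name: transformers
"""

def flat_keys(text):
    """Collect top-level 'key: value' pairs, skipping YAML list items."""
    pairs = {}
    for line in text.splitlines():
        if line.startswith("- ") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        pairs[key.strip()] = value.strip()
    return pairs

meta = flat_keys(FRONT_MATTER)
assert meta["pipeline_tag"] == "image-text-to-text"
assert meta["library_name"] == "transformers"
print("front matter OK:", meta["license"], meta["pipeline_tag"])
```

With both keys set, the Hub can categorize the model under image-text-to-text pipelines and render the automated `transformers` usage widget.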
README.md (CHANGED)

````diff
@@ -1,15 +1,21 @@
 ---
-license: mit
-datasets:
-- weizhiwang/unifilter_train_data
 base_model:
 - Qwen/Qwen2.5-1.5B-Instruct
 - google/siglip-so400m-patch14-384
+datasets:
+- weizhiwang/unifilter_train_data
+license: mit
+pipeline_tag: image-text-to-text
+library_name: transformers
 ---
+
 # UniFilter
 
-Official implementation of [Train a Unified Multimodal Data Quality Classifier with Synthetic Data]() accepted by EMNLP 2025 Findings.
+Official implementation of [Train a Unified Multimodal Data Quality Classifier with Synthetic Data](https://huggingface.co/papers/2510.15162), accepted to EMNLP 2025 Findings.
+
+- 📝 [Paper](https://huggingface.co/papers/2510.15162)
+- 🌐 [Project Page](https://victorwz.github.io/UniFilter)
+- 💻 [GitHub Repository](https://github.com/Victorwz/UniFilter)
 
 ## Release
 <!-- - [3/31/2025] 🔥 We released all pre-training data in webdataset format at [Open-Qwen2VL-Data](https://huggingface.co/datasets/weizhiwang/Open-Qwen2VL-Data).
@@ -42,10 +48,10 @@ The synthetic data generation scripts are:
 - [claude_sonnet_interleaved_data_generation.py](data_prepare/interleaved_data_scripts/claude_sonnet_interleaved_data_generation.py)
 
 ## Data Preparation for UniFilter Training
-UniFilter is trained a large-scale set of (multimodal data example, quality score) pairs, which contains both caption data and interleaved document data. The synthetic multimodal example-score paired data are available at [UniFilter-Post-Train-Data]().
+UniFilter is trained on a large-scale set of (multimodal data example, quality score) pairs, containing both caption data and interleaved document data. The synthetic multimodal example-score paired data are available at [UniFilter-Post-Train-Data](https://huggingface.co/datasets/weizhiwang/unifilter_train_data).
 
 ## UniFilter Training
-We develop the UniFilter training and scoring codebase based on [LLaVA-Unified]() repo, which is adapted from LLaVA with the support for recent LLMs and Vision Encoders.
+We develop the UniFilter training and scoring codebase based on the [LLaVA-Unified](https://github.com/Victorwz/LLaVA-Unified) repo, which is adapted from LLaVA with support for recent LLMs and vision encoders.
 <!-- An additional [LlavaPhi3Classifier](LLaVA/llava/model/language_model/llava_phi3.py#235) class is customized as the model class for UniFilter. -->
 
 The architectural design of UniFilter contains three modules: the vision encoder, the visual projector, and the LLM backbone. Unlike an MLLM, the LLM backbone does not have a language modeling head; we replace it with a score generation head. All these module parameters are specified with:
@@ -97,6 +103,17 @@ Parameters to note:
 - `--tar-file-path`: path to the webdataset image-text caption data or interleaved document data tars
 - `--tars-per-gpu`: the number of webdataset tars for a single GPU to run inference on
 
+## Citation
+
+Please cite our paper if you find this repository interesting or helpful:
+```bibtex
+@article{UniFilter,
+  title={Train a Unified Multimodal Data Quality Classifier with Synthetic Data},
+  author={Wang, Weizhi and Lin, Rongmei and Li, Shiyang and Lockard, Colin and Sarkhel, Ritesh and Lokegaonkar, Sanket and Shang, Jingbo and Yan, Xifeng and Zalmout, Nasser and Li, Xian},
+  journal={arXiv preprint arXiv:2510.15162},
+  year={2025}
+}
+```
 
 ## Acknowledgement
````
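The `--tars-per-gpu` flag in the scoring section distributes webdataset tar shards across GPUs. As a rough illustration of that kind of sharding (a hypothetical sketch; `shard_tars` and its signature are illustrative, not the repo's actual implementation):

```python
# Hypothetical sketch: assign each GPU rank a contiguous, disjoint slice of
# tar files, sized by tars_per_gpu. Illustrative only -- not UniFilter's code.

def shard_tars(tar_paths, tars_per_gpu, rank):
    """Return the slice of tar files that GPU `rank` should score."""
    start = rank * tars_per_gpu
    return tar_paths[start:start + tars_per_gpu]

tars = [f"caption-{i:05d}.tar" for i in range(8)]
# With tars_per_gpu=2 on 4 GPUs, each rank scores a disjoint pair of tars:
for rank in range(4):
    print(rank, shard_tars(tars, 2, rank))
```

Scoring runs on each shard independently, so the per-GPU workload is set entirely by how many tars each rank receives.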