Improve model card: Add pipeline tag, library name, and essential links
This PR enhances the model card for UniFilter by:
- Adding `pipeline_tag: image-text-to-text` for better discoverability and categorization on the Hub.
- Adding `library_name: transformers` to enable the automated "how to use" widget, as the model is compatible with the `transformers` library.
- Updating the existing placeholder paper link to the official Hugging Face paper page.
- Including direct links to the project page and GitHub repository for comprehensive information.
- Adding the `Citation` section for proper attribution.
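The `pipeline_tag` and `library_name` changes live in the YAML front matter at the top of README.md. As a quick sanity check that the new keys are present, here is a minimal sketch; the parser below is a deliberately simplified stand-in (flat `key: value` lines only), not the Hub's actual front-matter handling.

```python
# Simplified check that the updated front matter carries the keys this PR adds.
# flat_keys is an illustrative helper, not part of any Hub library.

FRONT_MATTER = """\
base_model:
- Qwen/Qwen2.5-1.5B-Instruct
- google/siglip-so400m-patch14-384
datasets:
- weizhiwang/unifilter_train_data
license: mit
pipeline_tag: image-text-to-text
library_name: transformers
"""

def flat_keys(text):
    """Collect top-level 'key: value' pairs, skipping YAML list items."""
    pairs = {}
    for line in text.splitlines():
        if line.startswith("- ") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        pairs[key.strip()] = value.strip()
    return pairs

meta = flat_keys(FRONT_MATTER)
assert meta["pipeline_tag"] == "image-text-to-text"
assert meta["library_name"] == "transformers"
print("front matter OK:", meta["license"], meta["pipeline_tag"])
```

With both keys set, the Hub can categorize the model under image-text-to-text pipelines and render the automated `transformers` usage widget.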
README.md (CHANGED)

````diff
@@ -1,15 +1,21 @@
 ---
-license: mit
-datasets:
-- weizhiwang/unifilter_train_data
 base_model:
 - Qwen/Qwen2.5-1.5B-Instruct
 - google/siglip-so400m-patch14-384
+datasets:
+- weizhiwang/unifilter_train_data
+license: mit
+pipeline_tag: image-text-to-text
+library_name: transformers
 ---
+
 # UniFilter
 
-Official implementation of [Train a Unified Multimodal Data Quality Classifier with Synthetic Data]() accepted by EMNLP 2025 Findings.
+Official implementation of [Train a Unified Multimodal Data Quality Classifier with Synthetic Data](https://huggingface.co/papers/2510.15162), accepted to EMNLP 2025 Findings.
+
+- 📝 [Paper](https://huggingface.co/papers/2510.15162)
+- 🌐 [Project Page](https://victorwz.github.io/UniFilter)
+- 💻 [GitHub Repository](https://github.com/Victorwz/UniFilter)
 
 ## Release
 <!-- - [3/31/2025] 🔥 We released all pre-training data in webdataset format at [Open-Qwen2VL-Data](https://huggingface.co/datasets/weizhiwang/Open-Qwen2VL-Data).
@@ -42,10 +48,10 @@ The synthetic data generation scripts are:
 - [claude_sonnet_interleaved_data_generation.py](data_prepare/interleaved_data_scripts/claude_sonnet_interleaved_data_generation.py)
 
 ## Data Preparation for UniFilter Training
-UniFilter is trained a large-scale set of (multimodal data example, quality score) pairs, which contains both caption data and interleaved document data. The synthetic multimodal example-score paired data are available at [UniFilter-Post-Train-Data]().
+UniFilter is trained on a large-scale set of (multimodal data example, quality score) pairs, containing both caption data and interleaved document data. The synthetic multimodal example-score paired data are available at [UniFilter-Post-Train-Data](https://huggingface.co/datasets/weizhiwang/unifilter_train_data).
 
 ## UniFilter Training
-We develop the UniFilter training and scoring codebase based on [LLaVA-Unified]() repo, which is adapted from LLaVA with the support for recent LLMs and Vision Encoders.
+We develop the UniFilter training and scoring codebase based on the [LLaVA-Unified](https://github.com/Victorwz/LLaVA-Unified) repo, which is adapted from LLaVA with support for recent LLMs and vision encoders.
 <!-- An additional [LlavaPhi3Classifier](LLaVA/llava/model/language_model/llava_phi3.py#235) class is customized as the model class for UniFilter. -->
 
 The architectural design of UniFilter contains three modules: the vision encoder, the visual projector, and the LLM backbone. Unlike an MLLM, the LLM backbone does not have a language modeling head; we replace it with a score generation head. All these module parameters are specified with:
@@ -97,6 +103,17 @@ Parameters to note:
 - `--tar-file-path`: path to the webdataset image-text caption data or interleaved document data tars
 - `--tars-per-gpu`: the number of webdataset tars for a single GPU to run inference on
 
+## Citation
+
+Please cite our paper if you find this repository interesting or helpful:
+```bibtex
+@article{UniFilter,
+  title={Train a Unified Multimodal Data Quality Classifier with Synthetic Data},
+  author={Wang, Weizhi and Lin, Rongmei and Li, Shiyang and Lockard, Colin and Sarkhel, Ritesh and Lokegaonkar, Sanket and Shang, Jingbo and Yan, Xifeng and Zalmout, Nasser and Li, Xian},
+  journal={arXiv preprint arXiv:2510.15162},
+  year={2025}
+}
+```
 
 ## Acknowledgement
````
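The `--tars-per-gpu` flag in the scoring section distributes webdataset tar shards across GPUs. As a rough illustration of that kind of sharding (a hypothetical sketch; `shard_tars` and its signature are illustrative, not the repo's actual implementation):

```python
# Hypothetical sketch: assign each GPU rank a contiguous, disjoint slice of
# tar files, sized by tars_per_gpu. Illustrative only -- not UniFilter's code.

def shard_tars(tar_paths, tars_per_gpu, rank):
    """Return the slice of tar files that GPU `rank` should score."""
    start = rank * tars_per_gpu
    return tar_paths[start:start + tars_per_gpu]

tars = [f"caption-{i:05d}.tar" for i in range(8)]
# With tars_per_gpu=2 on 4 GPUs, each rank scores a disjoint pair of tars:
for rank in range(4):
    print(rank, shard_tars(tars, 2, rank))
```

Scoring runs on each shard independently, so the per-GPU workload is set entirely by how many tars each rank receives.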