Image-Text-to-Text
Transformers
Safetensors
llava_qwen
conversational

Improve model card: Add pipeline tag, library name, and essential links

#1
by nielsr (HF Staff) - opened
Files changed (1)
  1. README.md +23 -6
README.md CHANGED
@@ -1,15 +1,21 @@
  ---
- license: mit
- datasets:
- - weizhiwang/unifilter_train_data
  base_model:
  - Qwen/Qwen2.5-1.5B-Instruct
  - google/siglip-so400m-patch14-384
+ datasets:
+ - weizhiwang/unifilter_train_data
+ license: mit
+ pipeline_tag: image-text-to-text
+ library_name: transformers
  ---
+
  # UniFilter

- Official implementation of [Train a Unified Multimodal Data Quality Classifier with Synthetic Data](), accepted to EMNLP 2025 Findings.
+ Official implementation of [Train a Unified Multimodal Data Quality Classifier with Synthetic Data](https://huggingface.co/papers/2510.15162), accepted to EMNLP 2025 Findings.

+ - πŸ“ [Paper](https://huggingface.co/papers/2510.15162)
+ - 🌐 [Project Page](https://victorwz.github.io/UniFilter)
+ - πŸ’» [GitHub Repository](https://github.com/Victorwz/UniFilter)

  ## Release
  <!-- - [3/31/2025] πŸ”₯ We released all pre-training data in webdataset format at [Open-Qwen2VL-Data](https://huggingface.co/datasets/weizhiwang/Open-Qwen2VL-Data).
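The two added metadata fields are what drive the Hub's task filtering and usage widgets: `pipeline_tag: image-text-to-text` files the model under that task, and `library_name: transformers` tells the Hub which library's loading snippet to surface. As a minimal, hedged sketch of fetching the checkpoint for use with the UniFilter codebase (the repo id below is a placeholder, not taken from this diff):

```python
# Minimal sketch: download the model checkpoint from the Hub so the UniFilter
# training/scoring codebase can load it locally. The repo id is a placeholder
# (assumption); substitute the id of the model this card belongs to.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="weizhiwang/UniFilter")  # hypothetical repo id
print(f"checkpoint files in: {local_dir}")
```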
@@ -42,10 +48,10 @@ The synthetic data generation scripts are:
  - [claude_sonnet_interleaved_data_generation.py](data_prepare/interleaved_data_scripts/claude_sonnet_interleaved_data_generation.py)

  ## Data Preparation for UniFilter Training
- UniFilter is trained on a large-scale set of (multimodal data example, quality score) pairs, which contains both caption data and interleaved document data. The synthetic multimodal example-score paired data are available at [UniFilter-Post-Train-Data]().
+ UniFilter is trained on a large-scale set of (multimodal data example, quality score) pairs, which contains both caption data and interleaved document data. The synthetic multimodal example-score paired data are available at [UniFilter-Post-Train-Data](https://huggingface.co/datasets/weizhiwang/unifilter_train_data).

  ## UniFilter Training
- We develop the UniFilter training and scoring codebase based on the [LLaVA-Unified]() repo, which is adapted from LLaVA with support for recent LLMs and vision encoders.
+ We develop the UniFilter training and scoring codebase based on the [LLaVA-Unified](https://github.com/Victorwz/LLaVA-Unified) repo, which is adapted from LLaVA with support for recent LLMs and vision encoders.
  <!-- An additional [LlavaPhi3Classifier](LLaVA/llava/model/language_model/llava_phi3.py#235) class is customized as the model class for UniFilter. -->

  The architectural design of UniFilter contains three modules: the vision encoder, the visual projector, and the LLM backbone. Unlike an MLLM, the LLM backbone does not have a language modeling head; we replace it with a score generation head. All these module parameters are specified with:
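Since the hunk above describes the score generation head only in prose, here is a minimal PyTorch sketch of the idea, not the repository's actual class: the LLM backbone still produces hidden states, but a small linear head maps them to a quality score instead of vocabulary logits. The class name and the last-token pooling are assumptions for illustration.

```python
# Hedged sketch (not the repo's implementation): a score generation head that
# replaces the language modeling head on top of the LLM backbone.
import torch
import torch.nn as nn

class QualityScoreHead(nn.Module):  # hypothetical name
    def __init__(self, hidden_size: int):
        super().__init__()
        # one scalar quality score instead of vocabulary logits
        self.score_head = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, last_hidden_state: torch.Tensor) -> torch.Tensor:
        # pool with the final token's hidden state (pooling choice is assumed)
        pooled = last_hidden_state[:, -1, :]        # (batch, hidden_size)
        return self.score_head(pooled).squeeze(-1)  # (batch,)
```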
@@ -97,6 +103,17 @@ Parameters to note:
  - `--tar-file-path`: path to the webdataset image-text caption data or interleaved document data tars
  - `--tars-per-gpu`: the number of webdataset tars for a single GPU to run inference on

+ ## Citation
+
+ Please cite our paper if you find this repository interesting or helpful:
+ ```bibtex
+ @article{UniFilter,
+   title={Train a Unified Multimodal Data Quality Classifier with Synthetic Data},
+   author={Wang, Weizhi and Lin, Rongmei and Li, Shiyang and Lockard, Colin and Sarkhel, Ritesh and Lokegaonkar, Sanket and Shang, Jingbo and Yan, Xifeng and Zalmout, Nasser and Li, Xian},
+   journal={arXiv preprint arXiv:2510.15162},
+   year={2025}
+ }
+ ```

  ## Acknowledgement
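To make the `--tar-file-path` and `--tars-per-gpu` flags in the hunk above concrete: together they imply that each GPU rank scores a fixed-size slice of the webdataset tar list. A small illustrative sketch of that partitioning, not the repository's actual logic:

```python
# Illustrative only: how --tar-file-path and --tars-per-gpu could shard
# webdataset tars across GPU ranks (not the repo's actual code).
import glob
import os

def shard_tars(tar_file_path, tars_per_gpu, rank):
    tars = sorted(glob.glob(os.path.join(tar_file_path, "*.tar")))
    start = rank * tars_per_gpu
    return tars[start:start + tars_per_gpu]

# e.g., rank 2 of a job with --tars-per-gpu 8 scores tars 16..23
print(shard_tars("data/caption_tars", tars_per_gpu=8, rank=2))
```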