SaulSantos committed · Commit 1585656 (verified) · 1 parent: efcfc2b

Update README.md

Files changed (1): README.md (+226, −3). The previous README contained only the YAML front matter `license: apache-2.0`; it is replaced by the full model card below.
---
library_name: transformers
tags:
- multimodal
- multilingual
- llm
- vision
- vlm
- translation
language:
- en
- de
- nl
- es
- fr
- pt
- uk
- hi
- zh
- ru
- cs
- ko
- ja
- it
- pl
- ro
- nb
- nn
base_model:
- Unbabel/Tower-Plus-2B
pipeline_tag: image-text-to-text
---

# Model Card for TowerVision

<p align="center">
  <img src="Tower.png" alt="TowerVision Logo" width="200">
</p>

TowerVision is a family of open-source multilingual vision-language models optimized for a variety of vision-language use cases, including image captioning, visual understanding, summarization, question answering, and more. **TowerVision excels particularly in multimodal multilingual translation benchmarks and culturally aware tasks**, demonstrating exceptional performance across **20 languages and dialects**.

This model card covers the TowerVision family, including the 2B and 9B parameter versions, in both their instruction-tuned (it) and pretrained (pt) variants; the latter have not undergone instruction tuning.

- **Point of Contact**: X (add some email here)
- **License**: Apache 2.0
- **Model Family**: TowerVision (2B, 9B variants)
- **Context length**: 8192 tokens
- **Languages**: 20+ languages spanning European, Asian, and other language families

<span style="font-size: 1.2em;"><strong>🌟 Try TowerVision</strong></span>: [Project Page](https://guilhermeviveiros.github.io/TowerVision.io/) | [Code Repository](https://github.com/GuilhermeViveiros/LLaVA-NeXT)

## Available Models

<p align="left">

| Model | Parameters | HF Link |
|-------|------------|---------|
| TowerVision-2B | 2B | [🤗 utter-project/TowerVision-2B](https://huggingface.co/utter-project/TowerVision-2B) |
| TowerVision-2B-pt | 2B | [🤗 utter-project/TowerVision-2B-pt](https://huggingface.co/utter-project/TowerVision-2B-pt) |
| TowerVision-9B | 9B | [🤗 utter-project/TowerVision-9B](https://huggingface.co/utter-project/TowerVision-9B) |
| TowerVision-9B-pt | 9B | [🤗 utter-project/TowerVision-9B-pt](https://huggingface.co/utter-project/TowerVision-9B-pt) |
| TowerVideo-2B | 2B | [🤗 utter-project/TowerVision-2B](https://huggingface.co/utter-project/TowerVision-2B) |
| TowerVideo-9B | 9B | [🤗 utter-project/TowerVision-9B](https://huggingface.co/utter-project/TowerVision-9B) |

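If you need the raw weights locally (for offline use or conversion), any checkpoint in the table can be fetched with `huggingface_hub`. A minimal sketch; the repo ID shown is just one of the variants above:

```python
# Minimal sketch: download one of the checkpoints listed above with huggingface_hub.
# The repo ID is illustrative; swap in any other variant from the table.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="utter-project/TowerVision-2B")
print(f"Checkpoint downloaded to: {local_dir}")
```
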
## How to Use TowerVision

### Quick Start with Transformers

<details open>
<summary>Click to expand/collapse code</summary>

```python
import torch
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

# Pick a checkpoint from the "Available Models" table above (the 2B variant is shown here)
model_id = "utter-project/TowerVision-2B"

# Load the model in bfloat16 (the recommended precision)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

processor = AutoProcessor.from_pretrained(model_id)

# Use your local video
video_path = "your_video_path.mp4"

# Conversation using the chat template
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": video_path},
            {"type": "text", "text": "\n<video>\nWhat is the video about?"},
        ],
    },
]

# Apply the chat template
inputs = processor.apply_chat_template(
    conversation,
    num_frames=8,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    add_special_tokens=True,  # ensures the <video> token is inserted
    return_tensors="pt"
).to(model.device, torch.bfloat16)

# Generate response
out = model.generate(**inputs, max_new_tokens=60)

# Decode output
decoded = processor.batch_decode(
    out,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True
)

print(decoded)
```

</details>

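### Image Inputs

TowerVision also accepts still images. The snippet below is a minimal sketch that mirrors the video example above, assuming the checkpoints expose the standard LLaVA-OneVision processor interface in `transformers`; the model ID and image path are illustrative.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

# Illustrative checkpoint; any variant from the "Available Models" table should work the same way
model_id = "utter-project/TowerVision-2B"

model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Load a local image
image = Image.open("your_image.jpg")

# Build the prompt with the chat template; the image placeholder is filled in by the processor
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image in Portuguese."},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

# Tokenize the text and preprocess the image together
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.bfloat16)

out = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```
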
## Model Details

**Input**: The model accepts text, images, and video as input.

**Output**: The model generates text in multiple languages.

**Model Architecture**: TowerVision pairs a multilingual language model backbone based on [Tower-Plus](https://huggingface.co/Unbabel/Tower-Plus-2B) (2B and 9B parameters) with the [SigLIP2-patch14-384](https://huggingface.co/google/siglip2-so400m-patch14-384) vision encoder through a multimodal adapter for vision-language understanding.

**Recommended Precision**: We recommend using `bfloat16` precision for optimal performance and memory efficiency when running TowerVision models.

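As a quick way to see this composition, you can inspect a checkpoint's configuration. A minimal sketch, assuming the repos ship a LLaVA-OneVision-style config with `vision_config` and `text_config` sub-configs (the model ID is illustrative):

```python
from transformers import AutoConfig

# Illustrative checkpoint; the same applies to the other variants
config = AutoConfig.from_pretrained("utter-project/TowerVision-2B")

# The vision encoder (SigLIP2) and the language backbone (Tower-Plus) appear as sub-configs
print(config.vision_config)  # vision encoder settings
print(config.text_config)    # language model settings, including context length
```
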
**Languages Covered**: The model has been trained on **20 languages and dialects**:
- **European languages**: English, German, Dutch, Spanish, French, Portuguese, Italian, Polish, Czech, Romanian, Norwegian (Bokmål & Nynorsk)
- **Asian languages**: Chinese (Simplified & Traditional), Japanese, Korean, Hindi
- **Other languages**: Russian, Ukrainian

**Key Strengths**:
- **🏆 Exceptional performance on culturally aware benchmarks**, with deep understanding of cultural contexts and visual nuances
- **📊 Strong cross-lingual transfer capabilities** across diverse vision-language tasks

## Training Data

TowerVision models are trained on **VisionBlocks**, a comprehensive multilingual vision-language dataset comprising **6.31M samples** across diverse categories:

| Dataset | Samples | HF Link | Status |
|---------|---------|---------|--------|
| VisionBlocks | 6.31M | [🤗 utter-project/VisionBlocks](https://huggingface.co/datasets/utter-project/VisionBlocks) | Coming Soon |

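Once the dataset is released, it should be loadable through the `datasets` library. A minimal sketch, assuming the repo linked above is public and uses a standard configuration (the split name is an assumption):

```python
from datasets import load_dataset

# Stream VisionBlocks once the repo above is public; the "train" split name is an assumption
ds = load_dataset("utter-project/VisionBlocks", split="train", streaming=True)
print(next(iter(ds)))  # inspect one sample
```
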
### Dataset Statistics
- **Total samples**: 6.31M
- **Created by our team**: 1.21M samples (~19%)
- **Human-collected/external**: 5.10M samples (~81%)

### Dataset Composition Overview

**VisionBlocks** contains samples across multiple categories, with both English-only (63.1%) and multilingual (36.9%) data:

- **Chart/Plot Reasoning**: DVQA, ChartQA, PlotQA, TabMWP (~405K samples)
- **General VQA**: VQAv2, RLAIF-4V (~488K samples)
- **Document VQA**: DocVQA, TextVQA, ST-VQA, PixMo-Docs (~46K samples)
- **Reasoning/Knowledge**: A-OKVQA, OKVQA, AI2D, ScienceQA (~29K samples)
- **Multilingual/Cultural**: Pangea-Cultural, Pangea-Multi, PixMo-Cap-Translated, CulturalGround datasets (~1.6M samples)
- **Specialized VQA**: IconQA, InfographicVQA, Stratos (~34K samples)
- **Counting/Math**: TallyQA, PixMo-Count (~107K samples)
- **Vision/Text**: VBlocks-PixMo collections, EuroBlocks-SFT (~2.2M samples)
- **Video/Text**: LLaVA-Video collections (~1.4M samples)

**Collection Types**: Human-annotated, synthetically generated, and professionally translated data, ensuring high quality and cultural diversity across 20+ languages.

## Evaluation

All evaluations were conducted using [lmms_eval](https://github.com/EvolvingLMMs-Lab/lmms-eval).

### General-Purpose Multimodal Benchmarks

TowerVision demonstrates strong performance across diverse multimodal evaluation benchmarks:

<img src="mc-eval1.png" alt="General-purpose multimodal benchmark results" width="600">

### Multimodal Multilingual Translation Tasks

TowerVision excels particularly in multimodal multilingual translation benchmarks, demonstrating state-of-the-art cross-lingual visual communication capabilities:

<img src="mc-eval2.png" alt="Multimodal multilingual translation results" width="600">

### Supported Languages Performance

✅ **Fully Supported**: English, German, Dutch, Spanish, French, Portuguese, Italian, Polish, Czech, Romanian, Norwegian, Chinese, Japanese, Korean, Hindi, Russian, Ukrainian

📊 **Benchmark Coverage**: Our models are evaluated across diverse multilingual vision-language tasks, demonstrating strong cross-lingual transfer capabilities and exceptional performance on culturally aware benchmarks.

## Citation

If you find TowerVision useful in your research, please consider citing the following paper:

```bibtex
@article{towervision2025,
  title={Understanding and Improving Multilinguality in Vision-Language Models},
  author={[Authors to be added]},
  journal={[Journal to be added]},
  year={2025},
  note={Paper in preparation}
}
```

## Model Card Contact

For errors or additional questions about details in this model card, contact the research team.

## Terms of Use

We hope that releasing the weights of highly performant multilingual vision-language models makes community-based research more accessible to researchers all over the world.

This model is governed by the Apache 2.0 License.

## Acknowledgments

TowerVision builds upon the excellent work of:
- **[LLaVA-NeXT](https://github.com/GuilhermeViveiros/LLaVA-NeXT)** for the foundational vision-language architecture
- **[Tower-Plus](https://huggingface.co/Unbabel/Tower-Plus-9B)** language models for multilingual capabilities
- **[SigLIP2](https://huggingface.co/google/siglip2-so400m-patch14-384)** for robust vision encoding
- The broader multilingual NLP and multimodal communities