---
language:
- en
library_name: transformers
license: apache-2.0
metrics:
- accuracy
tags:
- multimodal
pipeline_tag: video-text-to-text
base_model: Qwen/Qwen2.5-VL-7B-Instruct
---
# 💡 VideoChat-R1_5-7B
[\[📂 GitHub\]](https://github.com/OpenGVLab/VideoChat-R1)
[\[📜 Tech Report\]](https://arxiv.org/pdf/2509.21100v1)
## 🚀 How to use the model
We provide a simple installation example below:
```bash
pip install transformers
```
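The snippet below also needs PyTorch, and `flash-attn` if you keep `attn_implementation="flash_attention_2"` when loading the model; these extra installs are assumptions about your environment rather than hard requirements of this card:
```bash
pip install torch
# Optional: only needed when loading the model with attn_implementation="flash_attention_2";
# drop that argument instead if flash-attn is not available for your setup.
pip install flash-attn --no-build-isolation
```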
Video preprocessing relies on the `qwen_vl_utils` helpers provided in [`Videochat-R1.5/src_eval/my_vision_process.py`](https://github.com/OpenGVLab/VideoChat-R1/blob/main/Videochat-R1.5/src_eval/my_vision_process.py).
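One way to make the `from qwen_vl_utils import process_vision_info` line below resolve is to download that file and save it under the imported name; this is a sketch, and the raw URL is derived from the GitHub link above, so adjust it if the repository layout changes:
```bash
wget https://raw.githubusercontent.com/OpenGVLab/VideoChat-R1/main/Videochat-R1.5/src_eval/my_vision_process.py -O qwen_vl_utils.py
```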
Then you can run the model:
```python
import ast
import re

import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_path = "OpenGVLab/VideoChat-R1_5"
# default: Load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
model_path, torch_dtype="auto", device_map="auto",
attn_implementation="flash_attention_2"
)
# default processer
processor = AutoProcessor.from_pretrained(model_path)
video_path = "your_video.mp4"
question = "your_question"
num_perceptions = 3  # number of iterative perception rounds
QA_THINK_GLUE = """Answer the question: "[QUESTION]" according to the content of the video.
Output your think process within the <think> </think> tags.
Then, provide your answer within the <answer> </answer> tags, output the corresponding letter of the option. At the same time, in the <glue> </glue> tags, present the precise time period in seconds of the video clips on which you base your answer to this question in the format of [(s1, e1), (s2, e2), ...]. For example: <think>...</think><answer>A</answer><glue>[(5.2, 10.4)]</glue>.
"""
QA_THINK = """Answer the question: "[QUESTION]" according to the content of the video.
Output your think process within the <think> </think> tags.
Then, provide your answer within the <answer> </answer> tags, output the corresponding letter of the option. For example: <think>...</think><answer>A</answer>.
"""
def inference(video_path, prompt, model, processor, max_new_tokens=2048, device="cuda:0", client=None, pred_glue=None):
    messages = [
        {"role": "user", "content": [
            {"type": "video",
             "video": video_path,
             # Time spans predicted in the previous perception round ("glue"); None on the first round.
             "key_time": pred_glue,
             # Pixel budgets used by qwen_vl_utils when resizing and sampling the video.
             "total_pixels": 128 * 12 * 28 * 28,
             "min_pixels": 128 * 28 * 28,
             },
            {"type": "text", "text": prompt},
        ]},
    ]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True, client=client)
    fps_inputs = video_kwargs["fps"]
    inputs = processor(text=[text], images=image_inputs, videos=video_inputs, fps=fps_inputs, padding=True, return_tensors="pt")
    inputs = inputs.to(device)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, use_cache=True)
    generated_ids = [output_ids[i][len(inputs.input_ids[i]):] for i in range(len(output_ids))]
    output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    return output_text[0]
# Iterative perception: earlier rounds predict the relevant time spans ("glue"),
# which are fed back as key_time to refocus the video; the final round answers without glue.
device = "cuda:0"
client = None
pred_glue = None
answers = []
for perception in range(num_perceptions):
    if perception == num_perceptions - 1:
        example_prompt = QA_THINK.replace("[QUESTION]", question)
    else:
        example_prompt = QA_THINK_GLUE.replace("[QUESTION]", question)
    ans = inference(video_path, example_prompt, model, processor, device=device, client=client, pred_glue=pred_glue)
    # Parse the predicted time spans from the <glue> tag to guide the next round.
    pattern_glue = r"<glue>(.*?)</glue>"
    match_glue = re.search(pattern_glue, ans, re.DOTALL)
    answers.append(ans)
    pred_glue = None
    try:
        if match_glue:
            glue = match_glue.group(1)
            pred_glue = ast.literal_eval(glue)
    except Exception:
        pred_glue = None
    print(ans)
```
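After the loop, `answers[-1]` holds the response of the final round. The sketch below pulls the chosen option out of it, assuming the model follows the `<answer>` tag format requested in the prompt; the `extract_answer` helper is illustrative and not part of the original script:
```python
import re

def extract_answer(response: str) -> str:
    """Return the text inside <answer>...</answer>, falling back to the raw response."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    return match.group(1).strip() if match else response.strip()

final_answer = extract_answer(answers[-1])  # the last perception round gives the final choice
print(final_answer)
```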
## ✏️ Citation
If you find this project useful in your research, please consider citing:
```BibTeX
@article{li2025videochatr1,
title={VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning},
author={Li, Xinhao and Yan, Ziang and Meng, Desen and Dong, Lu and Zeng, Xiangyu and He, Yinan and Wang, Yali and Qiao, Yu and Wang, Yi and Wang, Limin},
journal={arXiv preprint arXiv:2504.06958},
year={2025}
}
@article{yan2025videochatr15,
title={VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception},
author={Yan, Ziang and Li, Xinhao and He, Yinan and Yue, Zhengrong and Zeng, Xiangyu and Wang, Yali and Qiao, Yu and Wang, Limin and Wang, Yi},
journal={arXiv preprint arXiv:2509.21100},
year={2025}
}
```