# VQ-Token · LLaVA-OneVision 0.5B (Extreme Token Reduction)
VQToken is a neural discrete token representation for video that enables extreme token reduction (~0.07% of dense tokens) while retaining strong downstream performance.
This repository hosts the 0.5B VQToken-enabled LLaVA-OneVision checkpoint.
## Model Summary
- Base backbone: LLaVA-OneVision 0.5B (lmms-lab/llava-onevision-qwen2-0.5b-ov)
- VQToken module: learns discrete video tokens; supports fixed / adaptive token budgets
- Goal: reduce the video token count dramatically while preserving video-LLM accuracy (see the back-of-the-envelope sketch below)
- Interface: works with lmms-eval (preferred), and the modified LLaVA-OneVision loader in the project repo
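
For intuition, here is a back-of-the-envelope sketch of what a ~0.07% token budget means. The frame count and per-frame token count below are illustrative assumptions, not the model's exact figures; see the paper for the measured budgets.

```python
# Hypothetical illustration of extreme token reduction (numbers are assumptions, not VQToken's exact figures).
frames = 32               # sampled video frames (assumed)
tokens_per_frame = 729    # dense visual tokens per frame (assumed, e.g. a 27x27 patch grid)

dense_tokens = frames * tokens_per_frame      # 23,328 dense video tokens
kept_fraction = 0.0007                        # ~0.07%, the reduction level reported for VQToken
reduced_tokens = round(dense_tokens * kept_fraction)

print(f"{dense_tokens} dense tokens -> ~{reduced_tokens} tokens for the whole clip")
```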
## How this checkpoint was trained
- Finetune script: `finetune_ov_all.sh`
- Dataset: `lmms-lab/LLaVA-Video-178K`
The VQToken adapter is integrated with OneVision-0.5B and finetuned on the above dataset. See the training script for full hyperparameters and pipeline details.
## Quick Test (CLI via lmms-eval)
We recommend testing with lmms-eval. The repo provides a ready-made script:

- Script: `test_vqtoken_0.5b.sh`
Or run the equivalent command directly:
```bash
# env (adjust as needed)
export HF_HOME="/path/to/your/hf/cache"
export HF_TOKEN="your_hf_token_here"
export HF_HUB_ENABLE_HF_TRANSFER=1
# if any eval calls OpenAI endpoints
# export OPENAI_API_KEY="your_openai_key_here"

# Helpful on some single-GPU setups
export NCCL_P2P_DISABLE="1"
export NCCL_IB_DISABLE="1"

PRETRAIN=haichaozhang/VQ-Token-llava-ov-0.5b

CUDA_VISIBLE_DEVICES=0 accelerate launch --num_processes=1 --main_process_port 29509 \
  -m lmms_eval \
  --model llava_onevision_vqtoken \
  --model_args pretrained=$PRETRAIN,conv_template=qwen_1_5,model_name=llava_qwen \
  --tasks activitynetqa --batch_size 1 \
  --log_samples \
  --log_samples_suffix llava_onevision \
  --output_path ./logs_vqtoken/
```
You can swap `--tasks` for other video-QA benchmarks supported by lmms-eval.
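
You can also sweep several benchmarks from Python by shelling out to the same CLI. This is only a sketch: the task names in the loop are examples and may differ in your lmms-eval install, so check its task registry before running.

```python
# Sketch: run the same VQToken checkpoint over several lmms-eval tasks (task names are assumptions).
import subprocess

PRETRAIN = "haichaozhang/VQ-Token-llava-ov-0.5b"
MODEL_ARGS = f"pretrained={PRETRAIN},conv_template=qwen_1_5,model_name=llava_qwen"

for task in ["activitynetqa", "videomme", "mlvu_dev"]:  # example task names, verify against your install
    subprocess.run(
        [
            "accelerate", "launch", "--num_processes=1", "--main_process_port", "29509",
            "-m", "lmms_eval",
            "--model", "llava_onevision_vqtoken",
            "--model_args", MODEL_ARGS,
            "--tasks", task,
            "--batch_size", "1",
            "--log_samples",
            "--log_samples_suffix", "llava_onevision",
            "--output_path", f"./logs_vqtoken/{task}/",
        ],
        check=True,
    )
```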
## Minimal Python Inference
```python
import copy, numpy as np, torch
from decord import VideoReader, cpu

from llava.model.builder import load_pretrained_model
from llava.mm_utils import tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates

pretrained = "haichaozhang/VQ-Token-llava-ov-0.5b"
tok, model, imgproc, _ = load_pretrained_model(
    pretrained, None, "llava_qwen",
    device_map="auto", attn_implementation="sdpa", multimodal=True
)
model.eval()

def frames(path, n=16):
    """Uniformly sample n frames from the video as a (T, H, W, C) uint8 array."""
    vr = VideoReader(path, ctx=cpu(0))
    idx = np.linspace(0, len(vr) - 1, n, dtype=int).tolist()
    return vr.get_batch(idx).asnumpy()

video = "sample/demo.mp4"
vid = frames(video, 16)
pix = imgproc.preprocess(vid, return_tensors="pt")["pixel_values"].half().cuda()
images = [pix]

# Build the qwen_1_5 conversation prompt with the image placeholder token.
conv = copy.deepcopy(conv_templates["qwen_1_5"])
q = f"{DEFAULT_IMAGE_TOKEN}\nDescribe what's happening in this video."
conv.append_message(conv.roles[0], q); conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
ids = tokenizer_image_token(prompt, tok, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).cuda()
sizes = [f.shape[:2] for f in vid]

with torch.no_grad():
    out = model.generate(
        ids, images=images, image_sizes=sizes,
        do_sample=False, temperature=0, max_new_tokens=512,
        modalities=["video"], vis=True
    )
print(tok.batch_decode(out, skip_special_tokens=True)[0])
```
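
To ask several questions or process multiple clips, the snippet above can be wrapped into a small helper. This sketch reuses the `tok`, `model`, `imgproc`, and `frames` objects defined above; the video path and question in the usage example are placeholders.

```python
def answer(video_path, question, num_frames=16, max_new_tokens=256):
    """Run one video and one question through the model loaded above."""
    vid = frames(video_path, num_frames)
    pix = imgproc.preprocess(vid, return_tensors="pt")["pixel_values"].half().cuda()

    conv = copy.deepcopy(conv_templates["qwen_1_5"])
    conv.append_message(conv.roles[0], f"{DEFAULT_IMAGE_TOKEN}\n{question}")
    conv.append_message(conv.roles[1], None)
    ids = tokenizer_image_token(conv.get_prompt(), tok, IMAGE_TOKEN_INDEX,
                                return_tensors="pt").unsqueeze(0).cuda()

    with torch.no_grad():
        out = model.generate(
            ids, images=[pix], image_sizes=[f.shape[:2] for f in vid],
            do_sample=False, temperature=0, max_new_tokens=max_new_tokens,
            modalities=["video"], vis=True,
        )
    return tok.batch_decode(out, skip_special_tokens=True)[0]

# Example usage (path and question are placeholders):
# print(answer("sample/demo.mp4", "How many people appear in the video?"))
```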
## Intended Use & Notes
- Use cases: video question answering, video captioning, and general video understanding where the token budget is tight.
- Strengths: extreme token reduction (~0.07% of dense tokens) with competitive accuracy; supports both fixed and adaptive token-budget regimes.
- Out-of-scope / caveats: model may hallucinate or be brittle on out-of-distribution content; always validate on your task.
## Evaluation
We evaluate through lmms-eval for consistent, reproducible benchmarking. See repo logs and the paper for details on datasets, metrics, and token budgets (fixed vs. adaptive).
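
If you want to post-process results programmatically, a minimal sketch is to scan whatever JSON files lmms-eval writes under the `--output_path` used above. The exact file layout and key names depend on your lmms-eval version, so this only lists what it finds.

```python
# Sketch: list the JSON result/log files that lmms-eval wrote under ./logs_vqtoken/.
# File layout and keys vary across lmms-eval versions, so we only print top-level structure.
import json
from pathlib import Path

for path in sorted(Path("./logs_vqtoken").rglob("*.json")):
    with open(path) as f:
        data = json.load(f)
    keys = list(data)[:5] if isinstance(data, dict) else f"list of {len(data)} entries"
    print(path, "->", keys)
```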
## Resources
- Paper (arXiv): https://arxiv.org/pdf/2503.16980
- Project Page: https://www.zhanghaichao.xyz/VQToken/
- Code: https://github.com/Hai-chao-Zhang/VQToken
- Dataset: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- Test Script: https://github.com/Hai-chao-Zhang/VQToken/blob/main/test_vqtoken_0.5b.sh
- Train Script: https://github.com/Hai-chao-Zhang/VQToken/blob/main/finetune_ov_all.sh
## Citation
```bibtex
@inproceedings{zhang2025vqtoken,
  title     = {VQToken: Neural Discrete Token Representation Learning for Extreme Token Reduction in Video Large Language Models},
  author    = {Haichao Zhang and Yun Fu},
  booktitle = {NeurIPS},
  year      = {2025}
}
```
## Acknowledgements
Thanks to the LLaVA-OneVision / LLaVA-NeXT and lmms-eval communities for open tooling and baselines.