# VQ-Token · LLaVA-OneVision 0.5B (Extreme Token Reduction)
VQToken is a neural discrete token representation for video that enables extreme token reduction (~0.07% of dense tokens) while retaining strong downstream performance.
This repository hosts the 0.5B VQToken-enabled LLaVA-OneVision checkpoint.
## Model Summary
- Base backbone: LLaVA-OneVision 0.5B (lmms-lab/llava-onevision-qwen2-0.5b-ov)
- VQToken module: learns discrete video tokens; supports fixed / adaptive token budgets
- Goal: reduce the video token count dramatically while preserving video-LLM accuracy (see the back-of-the-envelope sketch below)
- Interface: works with lmms-eval (preferred), and the modified LLaVA-OneVision loader in the project repo
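
For intuition, here is a back-of-the-envelope sketch of what a ~0.07% token budget means. The frame count and per-frame token count below are illustrative assumptions, not the model's exact figures; see the paper for the measured budgets.

```python
# Hypothetical illustration of extreme token reduction (numbers are assumptions, not VQToken's exact figures).
frames = 32               # sampled video frames (assumed)
tokens_per_frame = 729    # dense visual tokens per frame (assumed, e.g. a 27x27 patch grid)

dense_tokens = frames * tokens_per_frame      # 23,328 dense video tokens
kept_fraction = 0.0007                        # ~0.07%, the reduction level reported for VQToken
reduced_tokens = round(dense_tokens * kept_fraction)

print(f"{dense_tokens} dense tokens -> ~{reduced_tokens} tokens for the whole clip")
```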
## How this checkpoint was trained
- Finetune script: `finetune_ov_all.sh`
- Dataset: `lmms-lab/LLaVA-Video-178K`
The VQToken adapter is integrated with OneVision-0.5B and finetuned on the above dataset. See the training script for full hyperparameters and pipeline details.
## Quick Test (CLI via lmms-eval)
We recommend testing with lmms-eval. The repo provides a ready-made script:

- Script: `test_vqtoken_0.5b.sh`
Or run the equivalent command directly:
```bash
# env (adjust as needed)
export HF_HOME="/path/to/your/hf/cache"
export HF_TOKEN="your_hf_token_here"
export HF_HUB_ENABLE_HF_TRANSFER=1
# if any eval calls OpenAI endpoints
# export OPENAI_API_KEY="your_openai_key_here"

# Helpful on some single-GPU setups
export NCCL_P2P_DISABLE="1"
export NCCL_IB_DISABLE="1"

PRETRAIN=haichaozhang/VQ-Token-llava-ov-0.5b

CUDA_VISIBLE_DEVICES=0 accelerate launch --num_processes=1 --main_process_port 29509 \
  -m lmms_eval \
  --model llava_onevision_vqtoken \
  --model_args pretrained=$PRETRAIN,conv_template=qwen_1_5,model_name=llava_qwen \
  --tasks activitynetqa --batch_size 1 \
  --log_samples \
  --log_samples_suffix llava_onevision \
  --output_path ./logs_vqtoken/
```
You can swap `--tasks` for other video-QA benchmarks supported by lmms-eval.
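
You can also sweep several benchmarks from Python by shelling out to the same CLI. This is only a sketch: the task names in the loop are examples and may differ in your lmms-eval install, so check its task registry before running.

```python
# Sketch: run the same VQToken checkpoint over several lmms-eval tasks (task names are assumptions).
import subprocess

PRETRAIN = "haichaozhang/VQ-Token-llava-ov-0.5b"
MODEL_ARGS = f"pretrained={PRETRAIN},conv_template=qwen_1_5,model_name=llava_qwen"

for task in ["activitynetqa", "videomme", "mlvu_dev"]:  # example task names, verify against your install
    subprocess.run(
        [
            "accelerate", "launch", "--num_processes=1", "--main_process_port", "29509",
            "-m", "lmms_eval",
            "--model", "llava_onevision_vqtoken",
            "--model_args", MODEL_ARGS,
            "--tasks", task,
            "--batch_size", "1",
            "--log_samples",
            "--log_samples_suffix", "llava_onevision",
            "--output_path", f"./logs_vqtoken/{task}/",
        ],
        check=True,
    )
```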
## Minimal Python Inference
```python
import copy, numpy as np, torch
from decord import VideoReader, cpu

from llava.model.builder import load_pretrained_model
from llava.mm_utils import tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates

pretrained = "haichaozhang/VQ-Token-llava-ov-0.5b"
tok, model, imgproc, _ = load_pretrained_model(
    pretrained, None, "llava_qwen",
    device_map="auto", attn_implementation="sdpa", multimodal=True
)
model.eval()

def frames(path, n=16):
    """Uniformly sample n frames from the video as a (T, H, W, C) uint8 array."""
    vr = VideoReader(path, ctx=cpu(0))
    idx = np.linspace(0, len(vr) - 1, n, dtype=int).tolist()
    return vr.get_batch(idx).asnumpy()

video = "sample/demo.mp4"
vid = frames(video, 16)
pix = imgproc.preprocess(vid, return_tensors="pt")["pixel_values"].half().cuda()
images = [pix]

# Build the qwen_1_5 conversation prompt with the image placeholder token.
conv = copy.deepcopy(conv_templates["qwen_1_5"])
q = f"{DEFAULT_IMAGE_TOKEN}\nDescribe what's happening in this video."
conv.append_message(conv.roles[0], q); conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
ids = tokenizer_image_token(prompt, tok, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).cuda()
sizes = [f.shape[:2] for f in vid]

with torch.no_grad():
    out = model.generate(
        ids, images=images, image_sizes=sizes,
        do_sample=False, temperature=0, max_new_tokens=512,
        modalities=["video"], vis=True
    )
print(tok.batch_decode(out, skip_special_tokens=True)[0])
```
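
To ask several questions or process multiple clips, the snippet above can be wrapped into a small helper. This sketch reuses the `tok`, `model`, `imgproc`, and `frames` objects defined above; the video path and question in the usage example are placeholders.

```python
def answer(video_path, question, num_frames=16, max_new_tokens=256):
    """Run one video and one question through the model loaded above."""
    vid = frames(video_path, num_frames)
    pix = imgproc.preprocess(vid, return_tensors="pt")["pixel_values"].half().cuda()

    conv = copy.deepcopy(conv_templates["qwen_1_5"])
    conv.append_message(conv.roles[0], f"{DEFAULT_IMAGE_TOKEN}\n{question}")
    conv.append_message(conv.roles[1], None)
    ids = tokenizer_image_token(conv.get_prompt(), tok, IMAGE_TOKEN_INDEX,
                                return_tensors="pt").unsqueeze(0).cuda()

    with torch.no_grad():
        out = model.generate(
            ids, images=[pix], image_sizes=[f.shape[:2] for f in vid],
            do_sample=False, temperature=0, max_new_tokens=max_new_tokens,
            modalities=["video"], vis=True,
        )
    return tok.batch_decode(out, skip_special_tokens=True)[0]

# Example usage (path and question are placeholders):
# print(answer("sample/demo.mp4", "How many people appear in the video?"))
```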
## Intended Use & Notes
- Use cases: video question answering, video captioning, and general video understanding where the token budget is tight.
- Strengths: extreme token reduction (~0.07% of dense tokens) with competitive accuracy; supports both fixed and adaptive token-budget regimes.
- Out-of-scope / caveats: model may hallucinate or be brittle on out-of-distribution content; always validate on your task.
## Evaluation
We evaluate through lmms-eval for consistent, reproducible benchmarking. See repo logs and the paper for details on datasets, metrics, and token budgets (fixed vs. adaptive).
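
If you want to post-process results programmatically, a minimal sketch is to scan whatever JSON files lmms-eval writes under the `--output_path` used above. The exact file layout and key names depend on your lmms-eval version, so this only lists what it finds.

```python
# Sketch: list the JSON result/log files that lmms-eval wrote under ./logs_vqtoken/.
# File layout and keys vary across lmms-eval versions, so we only print top-level structure.
import json
from pathlib import Path

for path in sorted(Path("./logs_vqtoken").rglob("*.json")):
    with open(path) as f:
        data = json.load(f)
    keys = list(data)[:5] if isinstance(data, dict) else f"list of {len(data)} entries"
    print(path, "->", keys)
```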
## Resources
- Paper (arXiv): https://arxiv.org/pdf/2503.16980
- Project Page: https://www.zhanghaichao.xyz/VQToken/
- Code: https://github.com/Hai-chao-Zhang/VQToken
- Dataset: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- Test Script: https://github.com/Hai-chao-Zhang/VQToken/blob/main/test_vqtoken_0.5b.sh
- Train Script: https://github.com/Hai-chao-Zhang/VQToken/blob/main/finetune_ov_all.sh
## Citation
```bibtex
@inproceedings{zhang2025vqtoken,
  title     = {VQToken: Neural Discrete Token Representation Learning for Extreme Token Reduction in Video Large Language Models},
  author    = {Haichao Zhang and Yun Fu},
  booktitle = {NeurIPS},
  year      = {2025}
}
```
## Acknowledgements
Thanks to the LLaVA-OneVision / LLaVA-NeXT and lmms-eval communities for open tooling and baselines.