DisTime: Distribution-based Time Representation for Video Large Language Models
Abstract
Despite advances in general video understanding, Video Large Language Models (Video-LLMs) face challenges in precise temporal localization due to discrete time representations and limited temporally aware datasets. Existing methods for temporal expression either conflate time with text-based numerical values, add a series of dedicated temporal tokens, or regress time using specialized temporal grounding heads. To address these issues, we introduce DisTime, a lightweight framework designed to enhance temporal comprehension in Video-LLMs. DisTime employs a learnable token to create a continuous temporal embedding space and incorporates a Distribution-based Time Decoder that generates temporal probability distributions, effectively mitigating boundary ambiguities and maintaining temporal continuity. Additionally, the Distribution-based Time Encoder re-encodes timestamps to provide time markers for Video-LLMs. To overcome temporal granularity limitations in existing datasets, we propose an automated annotation paradigm that combines the captioning capabilities of Video-LLMs with the localization expertise of dedicated temporal models. This leads to the creation of InternVid-TG, a substantial dataset with 1.25M temporally grounded events across 179k videos, 55 times the number of events in ActivityNet-Caption. Extensive experiments demonstrate that DisTime achieves state-of-the-art performance across benchmarks in three time-sensitive tasks while maintaining competitive performance in Video QA tasks.
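For intuition, below is a minimal sketch of the distribution-based decoding idea described above. It is not the authors' implementation: the hidden state of the dedicated time token is projected to a probability distribution over discretized relative timestamps, and the expectation over bin centers recovers a continuous boundary. The class name, `num_bins`, and the uniform-bin discretization are illustrative assumptions; see the paper for the actual decoder and the matching time encoder.

import torch
import torch.nn as nn

class DistributionTimeDecoder(nn.Module):
    # Illustrative sketch (not the released implementation): map a time-token
    # embedding to per-boundary probability distributions over relative time.
    def __init__(self, hidden_size: int = 4096, num_bins: int = 100):
        super().__init__()
        self.num_bins = num_bins
        # One distribution per boundary (start, end).
        self.proj = nn.Linear(hidden_size, 2 * num_bins)
        # Bin centers spread uniformly over the normalized duration [0, 1].
        self.register_buffer("bin_centers", torch.linspace(0.0, 1.0, num_bins))

    def forward(self, time_token_hidden: torch.Tensor) -> torch.Tensor:
        # time_token_hidden: (batch, hidden_size)
        logits = self.proj(time_token_hidden).view(-1, 2, self.num_bins)
        probs = logits.softmax(dim=-1)  # soft distribution over time bins
        # Expectation over bin centers gives continuous (start, end) in [0, 1];
        # multiply by the video duration to obtain seconds.
        return (probs * self.bin_centers).sum(dim=-1)  # (batch, 2)

Taking the expectation rather than an argmax over bins is what keeps the predicted boundaries continuous and softens boundary ambiguity.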
Usage
You can load the model with the transformers library; the example below additionally requires decord for video decoding. The following example demonstrates how to run inference with DisTime:
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel, AutoProcessor
from decord import cpu, VideoReader
# Load the model and processor
tokenizer = AutoTokenizer.from_pretrained("UserJoseph/DisTime-8B", trust_remote_code=True)
model = AutoModel.from_pretrained("UserJoseph/DisTime-8B", trust_remote_code=True, torch_dtype=torch.bfloat16).cuda()
processor = AutoProcessor.from_pretrained("UserJoseph/DisTime-8B", trust_remote_code=True)
model.eval()
video_path = "./examples/video1.mp4" # Replace with your video path
qs = "Describe this video in detail"
# Load video frames
vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
fps = float(vr.get_avg_fps())
frame_indices = np.arange(0, len(vr), round(fps))  # Sample frames at ~1 fps
video = np.stack([vr[int(i)].asnumpy() for i in frame_indices])  # (T, H, W, C) uint8 array
# Prepare inputs
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": video},
            {"type": "text", "text": qs},
        ],
    }
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
video_inputs = processor.process_video(messages) # Process video frames
inputs = processor(text=[text], videos=video_inputs, padding=True, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}
# Generate output
with torch.inference_mode():
    output_ids = model.generate(
        **inputs,
        do_sample=False,  # greedy decoding; temperature only applies when do_sample=True
        max_new_tokens=128,
        use_cache=True,
    )
pred = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(pred)
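For the time-sensitive tasks the model targets (e.g., moment retrieval), you would replace the caption prompt with a grounding query. The question below is only an illustrative variation of the snippet above; the exact format of the predicted timestamps follows the model's time tokens as described in the paper.

qs = "When does the main event occur? Provide the start and end time."  # hypothetical grounding query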
Models and Data
Models
The DisTime-8B checkpoint used in the Usage example above is hosted at https://huggingface.co/UserJoseph/DisTime-8B.
InternVid-TG
In this paper, we propose an automated annotation paradigm that combines the captioning capabilities of Video-LLMs with the localization expertise of dedicated temporal models. Using this paradigm, we construct the InternVid-TG dataset, released at https://huggingface.co/datasets/yingsen/internvid-tg.
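As a hedged sketch of how the released data could be loaded (assuming the repository is readable by the Hugging Face datasets library; the split name and fields shown are assumptions, so consult the dataset card for the actual schema):

from datasets import load_dataset

# Load InternVid-TG from the Hub; split/field names are illustrative.
ds = load_dataset("yingsen/internvid-tg", split="train")
print(ds[0])  # expect per-event records: video id, event caption, start/end timestamps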
Citation
@article{zeng2025distime,
  title={DisTime: Distribution-based Time Representation for Video Large Language Models},
  author={Zeng, Yingsen and Huang, Zepeng and Zhong, Yujie and Feng, Chengjian and Hu, Jie and Ma, Lin and Liu, Yang},
  journal={arXiv preprint arXiv:2505.24329},
  year={2025}
}
Acknowledgement
DisTime is built on the codebases of InternVL and LLaVA-NeXT. We sincerely thank these open-source projects, which have greatly facilitated our research and exploration of time representation for video large language models.