Streaming Video Instruction Tuning

FIRST_AUTHOR_LAST, FIRST_AUTHOR_FIRST; SECOND_AUTHOR_LAST, SECOND_AUTHOR_FIRST

Streaming Video Instruction Tuning

Jiaer Xia^1*, Peixian Chen^2*, Mengdan Zhang², Xing Sun², Kaiyang Zhou¹

¹Hong Kong Baptist University
²Tencent Youtu Lab
^*Equal Contribution

Paper Code Hugging Face

Streamo is a real-time streaming video LLM that serves as a general-purpose interactive assistant.

Abstract

We present Streamo, a real-time streaming video LLM that serves as a general-purpose interactive assistant. Unlike existing online video models that focus narrowly on question answering or captioning, Streamo performs a broad spectrum of streaming video tasks, including real-time narration, action understanding, event captioning, temporal event grounding, and time-sensitive question answering. To develop such versatility, we construct Streamo-Instruct-465K, a large-scale instruction-following dataset tailored for streaming video understanding. The dataset covers diverse temporal contexts and multi-task supervision, enabling unified training across heterogeneous streaming tasks. After training end-to-end on the instruction-following dataset through a streamlined pipeline, Streamo exhibits strong temporal reasoning, responsive interaction, and broad generalization across a variety of streaming benchmarks. Extensive experiments show that Streamo bridges the gap between offline video perception models and real-time multimodal assistants, making a step toward unified, intelligent video understanding in continuous video streams.

Multi-task annotation in Streamo-Instruct-465K.

Streamo's architecture. Streaming video data is organized into an interleaved, multi-turn dialogue struc ture that directly integrates a response-state token into the data sequence, enabling end-to-end parallel training.

Comparison with state-of-the-art on OVO-Bench.

Caption Task Demo Video

Cooking Demo Video

Streamo Paper

BibTeX

@article{xia2025streaming,
  title={Streaming Video Instruction Tuning},
  author={Xia, Jiaer and Chen, Peixian and Zhang, Mengdan and Sun, Xing and Zhou, Kaiyang},
  journal={arXiv preprint arXiv:2512.21334},
  year={2025}
}

More Works from Our Lab

Bootstrapping Grounded Chain-of-Thought in Multimodal LLMs for Data-Efficient Model Adaptation

Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning

Mitigating Hallucination in Multimodal LLMs with Layer Contrastive Decoding

Streaming Video Instruction Tuning

Streamo is a real-time streaming video LLM that serves as a general-purpose interactive assistant.

Abstract

Multi-task annotation in Streamo-Instruct-465K.

Streamo's architecture. Streaming video data is organized into an interleaved, multi-turn dialogue struc ture that directly integrates a response-state token into the data sequence, enabling end-to-end parallel training.

Comparison with state-of-the-art on OVO-Bench.

Caption Task Demo Video

Cooking Demo Video

Streamo Paper

BibTeX