AVoCaDO: An AudioVisual Video Captioner Driven by Temporal Orchestration

Xinlong Chen²,³,¹*, Yue Ding²,³, Weihong Lin¹, Jingyun Hua¹, Linli Yao⁴, Yang Shi⁴, Bozhou Li⁴, Yuanxing Zhang¹, Qiang Liu²,³†, Pengfei Wan¹, Liang Wang²,³, Tieniu Tan²,³,⁵

¹Kling Team, Kuaishou Technology
²New Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences (CASIA)
³School of Artificial Intelligence, University of Chinese Academy of Sciences
⁴Peking University ⁵Nanjing University

*This work was conducted during the author's internship at Kling Team, Kuaishou Technology.
†Corresponding author: qiang.liu@nlpr.ia.ac.cn


Abstract

Audiovisual video captioning aims to generate semantically rich descriptions with temporal alignment between visual and auditory events, thereby benefiting both video understanding and generation. In this paper, we present AVoCaDO, a powerful AudioVisual video Captioner Driven by the temporal Orchestration between audio and visual modalities. We propose a two-stage post-training pipeline: (1) AVoCaDO SFT, which fine-tunes the model on a newly curated dataset of 107K high-quality, temporally-aligned audiovisual captions; and (2) AVoCaDO GRPO, which leverages tailored reward functions to further enhance temporal coherence and dialogue accuracy while regularizing caption length and reducing collapse. Experimental results demonstrate that AVoCaDO significantly outperforms existing open-source models across four audiovisual video captioning benchmarks, and also achieves competitive performance on the VDC and DREAM-1K benchmarks under visual-only settings.

Captioning Case of AVoCaDO

Full caption:
The audio begins with the loud, enthusiastic sound of a crowd cheering and clapping. A man with a deep, energetic voice exclaims, "Give it up for Marlon Wayans!" The crowd's applause and excited shouts continue in the background. The video opens with a medium shot of two men standing in the center of a crowd outdoors. The man on the left, who is Black and wearing a light blue button-down shirt and sunglasses, gestures with his hands as he speaks. The man on the right, who is white with a beard and sunglasses, wears a black polo shirt and gestures emphatically with his hands while looking at the other man. The crowd, a mix of men and women, surrounds them, clapping and watching with interest. The setting is a bright, sunny day in what appears to be an urban plaza, with palm trees and buildings visible in the background. A red banner with white text is partially visible on the left. The camera remains steady, capturing the interaction between the two main subjects and the engaged audience.

The man with the beard speaks with a friendly and slightly breathless tone, "Alright, guys. So, it's hot. I want to do a trick with some money. You got your wallet?" The man in the blue shirt replies calmly, "Yeah, I got my wallet." "Alright, let me see it," the bearded man requests. "Alright," the other man agrees. The bearded man continues in a confident, instructional voice, "Cuz usually anytime you you grab money or... Wow. Okay, yeah. Alright, let's just do this. Anytime you grab for money, the number one thing when a magician grabs a dollar bill, what do they say?" The man in the blue shirt responds with a slightly confused and humorous tone, "I don't know, bro. I know a brother said, 'Give me my money back.'" The bearded man laughs lightly and continues, "Okay. Well, I tell you what, a lot of people say, 'Can you take that one and turn it into a hundred?'" "Okay," the other man agrees. "I'm not that good," the bearded man concedes. "So what I tell people is if you can just cut your expectation in half, I'll blow you away. So when I say in half, I literally mean in half. And we can take that one and turn it into a fifty." The camera zooms in for a close-up on the hands of the man in the black shirt. He holds a small, folded fifty-dollar bill between his fingers, displaying it to the crowd. The faces of the onlookers, including a woman with long dark hair, are visible in the blurred background, looking on with curiosity. The man unfolds the bill, revealing it to be a single dollar bill. He then folds it in half, and then in half again, demonstrating the size difference. The camera angle shifts slightly, showing the man continuing to manipulate the dollar bill, folding it into an even smaller rectangle. The crowd remains in the background, their faces a mix of anticipation and excitement. "See what I mean?" "Wow, that works," the man in the blue shirt says, sounding impressed. "Well, but but you still want to see the hundred, right?" the bearded man asks, his voice full of energy. "Yes," the other man confirms. "Cuz everybody wants to see the hundred. So that's when you take it this way, and sure enough, my man, we can take the one into a fifty into a hundred. You and I need to go into business together, man. Let's do it." The camera pulls back to the original medium shot. The man in the black shirt, now holding the folded dollar bill, gestures towards the man in the blue shirt. The crowd erupts in cheers and applause, with many people raising their hands and clapping enthusiastically. The man in the blue shirt smiles and gestures with open hands, looking pleased. The bearded man exclaims, "Take a dollar! Take a dollar! Take a dollar to a hundred! We're gonna be rich! Take a dollar! We're gonna be rich!" The crowd's cheers and applause swell in response. The man in the black shirt then turns and moves through the cheering crowd, who are now clapping and celebrating. The man in the blue shirt remains in the center, smiling and gesturing towards the camera. The bearded man continues to interact with the audience, pointing and gesturing energetically. The scene is filled with the sounds of celebration and excitement.

The video transitions to a black screen with a central, framed video window. Inside the window, a bald man with glasses and a black shirt is shown sitting in front of a red curtain, smiling and speaking directly to the camera. To the left of the window, the text "EPISODES & CLIPS" and "WATCH MORE" appears. Below the window, the word "SUBSCRIBE" is displayed. The background is dark with subtle, sparkling light effects. The audio then transitions to a different man speaking in a friendly and upbeat voice, as if addressing an online audience. "Hey YouTube, thanks for watching. If you like this and you want to see a lot more, all you got to do is click right here to subscribe." The man in the video window gestures with his hand as he speaks, concluding his message. The on-screen text and graphics remain the same.

AVoCaDO SFT

Figure 1: The pipeline for generating high-quality temporally-aligned audiovisual video captions. For clarity, corresponding audio-visual events before and after fusion are marked with circled numbers and underlined for reference.

AVoCaDO GRPO

Figure 2: Illustration of the three rewards \(R_C\), \(R_D\), and \(R_L\), which are specifically designed for improving the quality of audiovisual video captioning.
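As a rough illustration of how per-caption rewards such as \(R_C\), \(R_D\), and \(R_L\) could feed a GRPO update, the sketch below computes group-relative advantages in the way GRPO is commonly described: each sampled caption's reward is normalized by the mean and standard deviation of its sampling group. The unweighted sum in `total_reward`, the function names, and the example reward values are assumptions made for illustration; the page does not state how the three rewards are combined or scaled.

```python
from statistics import mean, pstdev


def total_reward(r_c: float, r_d: float, r_l: float) -> float:
    """Combine the three rewards into one scalar.

    Assumption: an unweighted sum. The source does not specify the
    actual combination rule or any weights.
    """
    return r_c + r_d + r_l


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward by the mean and std of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    # eps avoids division by zero when all captions score the same
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for three captions sampled for the same video.
group = [
    total_reward(0.8, 0.6, 1.0),
    total_reward(0.5, 0.4, 1.0),
    total_reward(0.2, 0.1, 0.0),
]
advantages = group_relative_advantages(group)
```

Captions that score above the group mean receive positive advantages and are reinforced; those below the mean receive negative advantages. When every caption in a group scores identically, all advantages are zero and the group contributes no learning signal.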

Experimental Results

Table 1: Model performance on the audiovisual video captioning benchmarks. "A" and "V" refer to the audio and visual modalities, respectively. The results presented above are reproduced using the official code. Note that the video-SALMONN-2 test set originally employed GPT-3.5 as the judge model, which occasionally led to misjudgments. To ensure more reliable evaluation, we uniformly replaced it with GPT-4.1. *Works concurrent with ours.

Table 2: QA performance by Gemini-2.5-Pro based on textual captions. To mitigate answer guessing when the caption lacks necessary information, the model is instructed to refrain from answering such questions, which are then marked as incorrect samples.
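The scoring rule described in this caption can be sketched as follows: an answer counts as correct only if the judge actually answers and the answer matches the reference, so every abstention lowers accuracy. The `ABSTAIN` sentinel, the function name, and the multiple-choice answer format are hypothetical choices for this example, not details taken from the source.

```python
ABSTAIN = "ABSTAIN"  # hypothetical sentinel for "caption lacks the needed information"


def qa_accuracy(predictions: list[str], answers: list[str]) -> float:
    """Accuracy where abstentions are scored as incorrect."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    correct = sum(
        1
        for pred, gold in zip(predictions, answers)
        # An abstention never counts as correct, whatever the reference is.
        if pred != ABSTAIN and pred.strip().upper() == gold.strip().upper()
    )
    return correct / len(answers)


# Two of four questions are answered correctly; one is abstained, one is wrong.
score = qa_accuracy(["A", ABSTAIN, "C", "B"], ["A", "B", "C", "D"])
print(score)  # 0.5
```

Because abstentions stay in the denominator, a caption that omits information cannot gain credit through guessing by the judge.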

Table 3: Model performance on the VDC Detailed subset and DREAM-1K, which evaluate captions in visual-only settings.

Ablation Studies

Table 4: Ablation study on our post-training pipeline. "Dlg. F1" represents the metric of dialogue quality. "RepCol" indicates the ratio of generations exhibiting repetition collapse. AVoCaDO-SFT-2K* refers to the AVoCaDO-SFT model further fine-tuned on the same 2K samples employed during the GRPO phase.
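For intuition about a metric like "RepCol", the sketch below flags a caption as collapsed when some word n-gram occurs more than a fixed number of times, then reports the fraction of flagged captions. The detection heuristic, the thresholds `n` and `max_repeats`, and the function names are assumptions for illustration only; the page defines RepCol as a ratio but does not say how repetition collapse is detected.

```python
from collections import Counter


def is_collapsed(text: str, n: int = 4, max_repeats: int = 5) -> bool:
    """Heuristic: True if any word n-gram occurs more than max_repeats times.

    Assumption: this detector is illustrative and is not the authors' criterion.
    """
    tokens = text.split()
    ngrams = Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )
    return any(count > max_repeats for count in ngrams.values())


def repcol_ratio(captions: list[str], n: int = 4, max_repeats: int = 5) -> float:
    """Fraction of captions flagged as collapsed by the heuristic above."""
    if not captions:
        return 0.0
    flagged = sum(is_collapsed(c, n, max_repeats) for c in captions)
    return flagged / len(captions)


captions = [
    "the man claps " * 10,  # degenerate loop
    "A man speaks to the crowd while upbeat music plays in the background.",
]
print(repcol_ratio(captions))  # 0.5
```

A stricter or looser detector only changes `is_collapsed`; the ratio itself is computed the same way.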

Table 5: Ablation study of "Refrain from Answering" (R.A.) mechanism in QA-based caption evaluation (cf. Table 2). When the judge model is not required to refrain from answering in cases where the caption lacks sufficient information, it often falls back on its internal knowledge or resorts to random guessing, leading to artificially inflated scores.

Additional Cases

Figure 3: An illustration of a video caption generated by AVoCaDO, featuring both precise audiovisual temporal alignment and accurate dialogue rendering.

Figure 4: Qualitative comparison of AVoCaDO against two contemporary captioning models: video-SALMONN-2 and UGC-VideoCaptioner. Errors in baseline outputs are highlighted in red; the superior coverage and precision of AVoCaDO are highlighted in blue. Correct / incorrect audiovisual temporal alignment is bolded, while sound effect descriptions are underlined.

Figure 5: Qualitative comparison of AVoCaDO against two contemporary captioning models: video-SALMONN-2 and UGC-VideoCaptioner. Errors in baseline outputs are highlighted in red; the superior coverage and precision of AVoCaDO are highlighted in blue. Correct / incorrect audiovisual temporal alignment is bolded, while sound effect descriptions are underlined.

BibTeX

@article{chen2025avocado,
  title={AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration},
  author={Chen, Xinlong and Ding, Yue and Lin, Weihong and Hua, Jingyun and Yao, Linli and Shi, Yang and Li, Bozhou and Zhang, Yuanxing and Liu, Qiang and Wan, Pengfei and others},
  journal={arXiv preprint arXiv:2510.10395},
  year={2025}
}