DeCafNet: Delegate and Conquer for Efficient Temporal Grounding in Long Videos

Microsoft, Northeastern University
CVPR 2025

DeCafNet Teaser Image

Long Video Temporal Grounding (LVTG) aims to identify specific moments within long videos based on text queries. Existing approaches divide the video into clips and process each clip with a full-scale expert encoder, incurring prohibitive computational costs and making them challenging to scale.

We introduce DeCafNet, an approach that employs a delegate-and-conquer strategy to achieve computational efficiency without sacrificing grounding performance. DeCafNet introduces a sidekick encoder that performs dense feature extraction over all video clips in a resource-efficient manner, while generating a saliency map to identify the most relevant clips for full processing by the expert encoder. To effectively leverage features from the sidekick and expert encoders, which exist at different temporal resolutions, we introduce DeCaf-Grounder, which unifies and refines them via query-aware temporal aggregation and multi-scale temporal refinement for accurate grounding. Experiments on five LVTG benchmark datasets demonstrate that DeCafNet reduces computation by up to 47% while still outperforming existing methods.

Method Overview

DeCafNet Method Overview

Our delegate-and-conquer strategy achieves both efficiency and accuracy in Long Video Temporal Grounding.

Specifically, we introduce a sidekick encoder that extracts dense clip features at a substantially reduced computational cost. Simultaneously, a text encoder obtains features for the input text query. Next, we combine the dense clip features with the text features to create a saliency map over the video clips and identify the top-c% most salient clips for the input query. Lastly, we leverage a pretrained expert encoder to process only the salient clips, extracting sparse, salient features.
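To make the pipeline concrete, below is a minimal PyTorch-style sketch of the delegation step. The names (sidekick, expert, top_c), tensor shapes, and the cosine-similarity saliency are illustrative assumptions for this sketch, not our released implementation.

import torch
import torch.nn.functional as F

def delegate_and_conquer(clips, query_emb, sidekick, expert, top_c=0.3):
    """Illustrative delegation step: a cheap dense pass over all clips,
    then an expert pass on only the most query-relevant ones.
    `sidekick`, `expert`, and the shapes below are assumptions.

    clips:     (N, C, T, H, W) video clips
    query_emb: (D,) text query embedding
    """
    # Dense, low-cost features for every clip via the sidekick encoder.
    dense_feats = sidekick(clips)                      # (N, D)

    # Saliency: similarity between each clip feature and the query.
    saliency = F.cosine_similarity(
        dense_feats, query_emb.unsqueeze(0), dim=-1)   # (N,)

    # Keep only the top-c% most salient clips for the expert encoder.
    k = max(1, int(top_c * clips.shape[0]))
    salient_idx = saliency.topk(k).indices

    # Sparse, high-quality features from the full-scale expert encoder.
    sparse_feats = expert(clips[salient_idx])          # (k, D)

    return dense_feats, sparse_feats, salient_idx, saliency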

The dense features and the sparse salient features exist at different temporal resolutions. To ensure effective grounding, we introduce DeCaf-Grounder, which unifies the two sets of features along with the input query features via Query-aware Temporal Aggregation and refines them over varied temporal scales using Multi-Scale Temporal Refinement.
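Continuing from the sketch above, the following illustrates one way to unify the two temporal resolutions. The specific layers (multi-head cross-attention, strided temporal convolutions) and the simple scatter-and-add fusion are assumptions for illustration, simplified relative to the actual DeCaf-Grounder design described in the paper.

import torch
import torch.nn as nn

class GrounderSketch(nn.Module):
    """A hedged sketch of fusing dense (sidekick) and sparse (expert)
    features that live at different temporal resolutions. Layer choices
    are assumptions; `dim` must be divisible by num_heads (8 here).
    """

    def __init__(self, dim, num_scales=3):
        super().__init__()
        self.query_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        # One temporal conv per scale, each halving the resolution.
        self.scales = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=3, stride=2, padding=1)
            for _ in range(num_scales))

    def forward(self, dense_feats, sparse_feats, salient_idx, query_tokens):
        # (1) Place expert features back on the dense timeline so both
        #     streams share one temporal resolution, then sum them.
        fused = dense_feats.clone()                    # (N, D)
        fused[salient_idx] = fused[salient_idx] + sparse_feats
        fused = fused.unsqueeze(0)                     # (1, N, D)

        # (2) Query-aware temporal aggregation: attend from clip
        #     features to the query tokens of shape (1, L, D).
        fused, _ = self.query_attn(fused, query_tokens, query_tokens)

        # (3) Multi-scale temporal refinement: downsample repeatedly to
        #     build a feature pyramid for moments of different lengths.
        pyramid, x = [fused], fused.transpose(1, 2)    # (1, D, N)
        for conv in self.scales:
            x = conv(x)
            pyramid.append(x.transpose(1, 2))
        return pyramid                                 # list of (1, T_s, D)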

Experimental Results

DeCafNet sets a new SOTA on five temporal video grounding benchmarks: Ego4D-NLQ, Ego4D-Goalstep, MAD, Charades-STG, and TACoS. More importantly, it also significantly reduces computational cost. On Ego4D-NLQ, DeCafNet is close to the prior best when selecting the top 30% of salient clips, reducing TFLOPs by 66%. It surpasses previous works when selecting the top 50% of salient clips, while still reducing computation by 47%.
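As a back-of-envelope view of where the savings come from: the sidekick runs on every clip and the expert only on the top-c% of clips. The 3% sidekick-to-expert cost ratio below is an assumption chosen to roughly match the reported numbers, not a measured value.

def relative_cost(top_c, sidekick_cost=0.03):
    """Encoding cost relative to running the expert on all clips.

    Assumes the sidekick costs ~3% of the expert per clip (an assumed
    ratio, not a reported number): sidekick on all clips plus expert
    on the top-c% salient ones.
    """
    return sidekick_cost + top_c

print(f"saved at top-30%: {1 - relative_cost(0.3):.0%}")  # ~67%, close to the reported 66%
print(f"saved at top-50%: {1 - relative_cost(0.5):.0%}")  # 47%, matching the reported 47%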

Qualitative

DeCafNet Qualitative Results

This figure presents qualitative results from our model, with saliency maps shown below our predictions. "Ours w/o DCG" and "Ours w/ DCG" denote using a conventional grounder design and our DeCaf-Grounder design, respectively. We compare our results against SnAG. Notably, DeCafNet's saliency maps are accurate and consistently align with the ground truth. Even when considering only the top 30% of salient clips, these clips still capture the ground truth, highlighting the effectiveness of our dual-encoder design. As illustrated in the second row, predictions with the conventional grounder design can be inaccurate, as it does not handle inputs of varying temporal resolutions. DeCaf-Grounder effectively corrects these inaccuracies.

BibTeX

@inproceedings{Lu2025DeCafNet,
  title={DeCafNet: Delegate and Conquer for Efficient Temporal Grounding in Long Videos},
  author={Zijia Lu and A S M Iftekhar and Gaurav Mittal and Tianjian Meng and Xiawei Wang and Cheng Zhao and Rohith Kukkala and Ehsan Elhamifar and Mei Chen},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2025},
}