Method Overview

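Our delegate-and-conquer strategy targets both efficiency and accuracy in Long Video Temporal Grounding by delegating dense coverage of the video to a lightweight encoder and reserving an expensive encoder for the clips most relevant to the query.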
Specifically, we introduce a sidekick encoder that extracts dense clip features at a substantially reduced computational cost. In parallel, a text encoder obtains features for the input text query. Next, we combine the dense clip features with the text features to build a saliency map over the video clips and identify the top-c% most salient clips for the input query. Lastly, a pretrained expert encoder processes only these salient clips to extract sparse, salient features.
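
The clip selection step described above can be illustrated with a minimal sketch. It assumes that saliency is the cosine similarity between each dense clip feature and a single pooled query feature; the overview does not specify the scoring function, so this choice, along with the function name `select_salient_clips` and its parameters, is hypothetical and meant only for illustration.

```python
import numpy as np


def select_salient_clips(clip_feats, query_feat, c=10.0):
    """Return a saliency map and the indices of the top-c% salient clips.

    clip_feats: (T, D) dense clip features from the sidekick encoder.
    query_feat: (D,) pooled text query feature.
    c: percentage of clips to keep (assumed hyperparameter).
    """
    eps = 1e-8
    # Assumption: cosine similarity serves as the saliency score.
    clip_norm = clip_feats / (np.linalg.norm(clip_feats, axis=1, keepdims=True) + eps)
    query_norm = query_feat / (np.linalg.norm(query_feat) + eps)
    saliency = clip_norm @ query_norm  # shape (T,)

    # Keep at least one clip, rounding the top-c% budget up.
    k = max(1, int(np.ceil(len(saliency) * c / 100.0)))
    top = np.argsort(-saliency, kind="stable")[:k]

    # Restore temporal order so the expert encoder sees clips chronologically.
    return saliency, np.sort(top)
```

Only the clips at the returned indices would then be passed to the expert encoder, so its cost scales with roughly c% of the video length rather than the full length.
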
The dense features and the sparse salient features exist at different temporal resolutions. To ensure effective grounding, we introduce Decaf-Grounder, which unifies the two sets of features with the input query features via Query-aware Temporal Aggregation and refines them across multiple temporal scales using Multi-Scale Refinement.