Scaling Open-Vocabulary Action Detection

University of Central Florida

In-the-wild results of our model, SiA.

Abstract

In this work, we focus on scaling open-vocabulary action detection. Existing approaches for action detection are predominantly limited to closed-set scenarios and rely on complex, parameter-heavy architectures. Extending these models to the open-vocabulary setting poses two key challenges: (1) the lack of large-scale datasets with many action classes for robust training, and (2) the parameter-heavy adaptations needed to convert a pretrained vision-language contrastive model into a detector, which risk overfitting the added, non-pretrained parameters to the base action classes.

First, we introduce an encoder-only multimodal model for video action detection, reducing the reliance on parameter-heavy additions.

Second, we introduce a simple weakly supervised training strategy that exploits an existing closed-set action detection dataset for pretraining.

Finally, we depart from the ill-posed base-to-novel benchmark used by prior works in open-vocabulary action detection and devise a new benchmark that evaluates on existing closed-set action detection datasets without ever using them for training, reporting results that can serve as baselines for future work.

Model

Our model is SiA, a simple architecture for open-vocabulary action detection. Its key features are listed below, followed by a minimal sketch of the detection interface:

  • Multi-modal (video and text)
  • Lightweight, encoder-only design
  • End-to-end, single-stage detection
  • Open-vocabulary: detects any human action described in text
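
To make the list above concrete, here is a minimal, hypothetical sketch of what an encoder-only, single-stage, open-vocabulary detector interface can look like. All module names, sizes, and the query-based box head are illustrative assumptions, not the actual SiA implementation.

```python
# Hypothetical sketch of an encoder-only, single-stage open-vocabulary detector.
# Names (SiASketch, person_queries, etc.) are illustrative, not the real SiA code.
import torch
import torch.nn as nn

class SiASketch(nn.Module):
    def __init__(self, dim=256, num_queries=16, feat_dim=512):
        super().__init__()
        # A single transformer encoder shared by all modalities (no decoder).
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.person_queries = nn.Parameter(torch.randn(num_queries, dim))
        self.video_proj = nn.Linear(feat_dim, dim)   # project pretrained video features
        self.text_proj = nn.Linear(feat_dim, dim)    # project pretrained text features
        self.box_head = nn.Linear(dim, 4)            # per-query box regression

    def forward(self, video_tokens, text_embeds):
        # video_tokens: (B, Nv, feat_dim) from a pretrained video backbone
        # text_embeds:  (C, feat_dim), one embedding per action prompt
        B, Q = video_tokens.size(0), self.person_queries.size(0)
        v = self.video_proj(video_tokens)
        t = self.text_proj(text_embeds).unsqueeze(0).expand(B, -1, -1)
        q = self.person_queries.unsqueeze(0).expand(B, -1, -1)
        # One shared encoder pass jointly attends over queries, video, and text tokens.
        x = self.encoder(torch.cat([q, v, t], dim=1))
        q_out = x[:, :Q]                              # updated person queries
        t_out = x[:, Q + v.size(1):]                  # updated text tokens
        boxes = self.box_head(q_out).sigmoid()        # normalized (cx, cy, w, h)
        logits = torch.einsum("bqd,bcd->bqc", q_out, t_out)  # open-vocab action scores
        return boxes, logits

# Example shapes: boxes, logits = SiASketch()(torch.randn(2, 196, 512), torch.randn(80, 512))
```

Because action scores are similarities against text tokens rather than outputs of a fixed classification head, changing the prompt list changes the vocabulary without retraining.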

Training with Weak Supervision

Existing action detection datasets do not contain enough action classes for training: the largest, AVA/AVA-Kinetics, covers only 80 actions. Our weakly supervised training scheme exploits the Kinetics-700 videos in AVA-Kinetics, allowing more than 700 action classes to be used for training and improving the generalizability of our model.
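
As an illustration of how clip-level Kinetics labels can supervise a detector without box-level action annotations, the sketch below uses a multiple-instance-style pooling over person queries. This is a common weak-supervision pattern and only an assumed stand-in for the paper's actual training objective; the `logits` tensor matches the per-query action scores from the sketch above.

```python
# Hypothetical multiple-instance-style weak supervision: a clip-level label
# supervises only the best-matching person query, so no box-level action
# annotations are needed. Illustrative only, not the exact SiA loss.
import torch
import torch.nn.functional as F

def weak_clip_level_loss(logits, clip_labels):
    """
    logits:      (B, Q, C) per-query action scores over the full text vocabulary
    clip_labels: (B, C) multi-hot video-level labels (e.g. the Kinetics-700 class)
    """
    # Pool over queries: each class's clip-level evidence is carried by the
    # most confident person query (max over the query dimension).
    clip_logits = logits.max(dim=1).values          # (B, C)
    return F.binary_cross_entropy_with_logits(clip_logits, clip_labels.float())

# Example: loss = weak_clip_level_loss(torch.randn(4, 16, 780), torch.zeros(4, 780))
```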

Benchmarks

Open-vocabulary results

Since no prior work reports open-vocabulary results on all four downstream datasets, we compare our model against a simple training-free baseline that uses different video-language models as action classifiers (* denotes that only human action classes are used).
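
A hedged sketch of the classification step in such a training-free baseline is shown below, assuming person boxes and their crop embeddings are already available from some detector and video-language encoder; the function and tensor names are illustrative, not a specific model's API.

```python
# Sketch of a training-free baseline: embeddings of detected person crops are
# scored against action-name prompts with a pretrained video-language model.
# The inputs are assumed to come from whichever encoder is being evaluated.
import torch
import torch.nn.functional as F

def classify_person_boxes(crop_embeds, text_embeds):
    """
    crop_embeds: (N, D) one embedding per detected person crop/tube
    text_embeds: (C, D) one embedding per action prompt, e.g. "a person <action>"
    Returns (N, C) probabilities over the open vocabulary.
    """
    crop_embeds = F.normalize(crop_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    sims = crop_embeds @ text_embeds.t()        # cosine similarity
    return sims.softmax(dim=-1)                 # per-box action distribution
```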

Closed-set results

We also finetune our model on all four downstream datasets to show that it is sufficient for closed-set action detection as well.

Related Work

Open-vocabulary action detection is currently underdeveloped, primarily due to the lack of large-scale datasets covering a large number of human actions. Nevertheless, a few models exist in this area:

iCLIP is the first work to extend video action detection to the vision-language domain: it freezes the CLIP image and text encoders and introduces external modules to adapt them for human action detection.

OpenMixer extends the closed-set action detection model STMixer to the vision-language domain, using a frozen CLIP-VIP video backbone in the AdaMixer-style encoder-decoder architecture.

Both models are trained in a base-to-novel manner: UCF-101-24 or JHMDB is used, with its videos split into base action classes for training and novel action classes for evaluation. We believe this base-to-novel benchmark hinders generalizability, since fewer than 20 actions can be used for training in this manner. In contrast, our model has seen more than 700 actions during training.

BibTeX

@article{sia2025,
  author    = {Zhen Hao Sia and Yogesh Singh Rawat},
  title     = {Scaling Open-Vocabulary Action Detection},
  journal   = {arXiv preprint arXiv:2504.03096},
  year      = {2025},
}