Official PyTorch Implementation
🎉 EZ-CLIP has evolved into T2L: Efficient Zero-Shot Action Recognition with Temporal Token Learning and is now published in Transactions on Machine Learning Research (TMLR) 2025!
We’ve released a new, enhanced codebase for T2L, incorporating the latest advancements. Visit the new repository for the most up-to-date code and resources:
👉 T2L Repository 👈
This EZ-CLIP repository remains available for reference but may not receive further updates. Explore T2L for the cutting-edge implementation!
- 📦 Trained Models: Download pre-trained models from Google Drive.
- 📄 Published Paper: See details in the TMLR 2025 publication and new T2L repository.
EZ-CLIP is an innovative adaptation of CLIP tailored for zero-shot video action recognition. By leveraging temporal visual prompting, it seamlessly integrates temporal dynamics while preserving CLIP’s powerful generalization. A novel motion-focused learning objective enhances its ability to capture video motion, all without altering CLIP’s core architecture.
For the latest advancements, check out T2L: Efficient Zero-Shot Action Recognition with Temporal Token Learning in the T2L Repository.
EZ-CLIP tackles the challenge of adapting CLIP for zero-shot video action recognition with a lightweight and efficient approach. Through temporal visual prompting and a specialized learning objective, it captures motion dynamics effectively while retaining CLIP’s generalization capabilities. This makes EZ-CLIP both practical and powerful for video understanding tasks.
The work has been significantly advanced in our TMLR 2025 publication, T2L: Efficient Zero-Shot Action Recognition with Temporal Token Learning. Explore the T2L Repository for the latest developments.
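At a glance, temporal visual prompting can be thought of as a small set of learnable temporal tokens that are mixed with frozen per-frame CLIP embeddings so the model can reason across time. The snippet below is a conceptual sketch only, not the repository's code; module and parameter names such as `TemporalPromptPool` and `num_prompts` are hypothetical.

```python
# Conceptual sketch of temporal visual prompting -- NOT the repository's implementation.
# Learnable temporal tokens are attended together with per-frame CLIP embeddings,
# letting a frozen, frame-level encoder exchange information across time.
import torch
import torch.nn as nn


class TemporalPromptPool(nn.Module):
    def __init__(self, embed_dim: int = 512, num_prompts: int = 4):
        super().__init__()
        # Learnable temporal prompt tokens, shared across all videos.
        self.prompts = nn.Parameter(torch.randn(num_prompts, embed_dim) * 0.02)
        # Lightweight attention over the temporal axis (prompts + frames).
        self.temporal_attn = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, frame_features: torch.Tensor) -> torch.Tensor:
        # frame_features: (batch, num_frames, embed_dim) -- per-frame CLIP embeddings.
        b = frame_features.shape[0]
        prompts = self.prompts.unsqueeze(0).expand(b, -1, -1)        # (B, P, D)
        tokens = torch.cat([prompts, frame_features], dim=1)         # (B, P+T, D)
        attended, _ = self.temporal_attn(tokens, tokens, tokens)     # mix across time
        tokens = self.norm(tokens + attended)
        # Average-pool the frame positions into a single video-level embedding.
        return tokens[:, prompts.shape[1]:].mean(dim=1)              # (B, D)


# Example: 2 videos, 8 frames each, 512-d ViT-B/16 frame embeddings.
video_emb = TemporalPromptPool()(torch.randn(2, 8, 512))
print(video_emb.shape)  # torch.Size([2, 512])
```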
Set up the environment using the provided requirements.txt:
pip install -r requirements.txt
Note: All models are based on the publicly available CLIP ViT-B/16 model.
Models are trained on Kinetics-400 and evaluated zero-shot on the downstream datasets below; reported numbers are top-1 accuracy (%).
| Model | Input | HMDB-51 | UCF-101 | Kinetics-600 | Model Link |
|---|---|---|---|---|---|
| EZ-CLIP (ViT-B/16) | 8x224 | 52.9 | 79.1 | 70.1 | Link |
Datasets are split into base and novel classes; models are trained on the base classes and evaluated on both. HM is the harmonic mean of base and novel accuracy (see the quick check below the table).
| Dataset | Input | Base Acc. | Novel Acc. | HM | Model Link |
|---|---|---|---|---|---|
| K-400 | 8x224 | 73.1 | 60.6 | 66.3 | Link |
| HMDB-51 | 8x224 | 77.0 | 58.2 | 66.3 | Link |
| UCF-101 | 8x224 | 94.4 | 77.9 | 85.4 | Link |
| SSV2 | 8x224 | 16.6 | 13.3 | 14.8 | Link |
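A quick sanity check of the HM column against the UCF-101 row:

```python
def harmonic_mean(base_acc: float, novel_acc: float) -> float:
    """Harmonic mean of base- and novel-class accuracy, as reported in the HM column."""
    return 2 * base_acc * novel_acc / (base_acc + novel_acc)

print(round(harmonic_mean(94.4, 77.9), 1))  # 85.4 -> matches the UCF-101 row
```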
Extract videos into frames for efficient processing. See the Dataset_creation_scripts directory for instructions; a rough illustration of the extraction step is sketched below.
Supported datasets: Kinetics-400, Kinetics-600, HMDB-51, UCF-101, and Something-Something V2 (SSV2).
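For a rough idea of what frame extraction involves, here is a minimal OpenCV sketch. It is not the repository's script; the output layout (one folder of JPEGs per video) and the paths are assumptions for illustration only.

```python
# Minimal frame-extraction sketch using OpenCV (pip install opencv-python).
# The actual pipeline lives in Dataset_creation_scripts; the output layout and
# naming scheme below are assumptions for illustration, not the repo's format.
import os
import cv2


def extract_frames(video_path: str, out_dir: str) -> int:
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imwrite(os.path.join(out_dir, f"img_{idx:05d}.jpg"), frame)
        idx += 1
    cap.release()
    return idx


# Example (hypothetical paths):
# n = extract_frames("videos/abseiling/abc123.mp4", "frames/abseiling/abc123")
```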
Train EZ-CLIP with:
python train.py --config configs/K-400/k400_train.yaml
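The script is driven entirely by the YAML config passed via --config. If you want to inspect or tweak the options programmatically before launching, a generic PyYAML snippet works; the actual keys inside k400_train.yaml are repo-specific and not reproduced here.

```python
# Generic way to inspect a training config before launching (pip install pyyaml).
# The keys inside configs/K-400/k400_train.yaml are specific to this repo.
import yaml

with open("configs/K-400/k400_train.yaml") as f:
    cfg = yaml.safe_load(f)

print(sorted(cfg.keys()))  # list the top-level options before editing them
```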
Evaluate a trained model with:
python test.py --config configs/ucf101/UCF_zero_shot_testing.yaml
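Zero-shot scoring in CLIP-based models generally follows the same recipe: class-name prompts are embedded with the frozen text encoder and compared to the video embedding by cosine similarity. The snippet below is a conceptual sketch with stand-in embeddings, not the repository's test.py.

```python
# Conceptual sketch of CLIP-style zero-shot scoring, not the repo's test.py.
# The random tensors stand in for embeddings that, in EZ-CLIP, come from the
# temporally prompted visual encoder and the frozen CLIP text encoder.
import torch
import torch.nn.functional as F

num_classes, embed_dim = 101, 512                                     # e.g. UCF-101, ViT-B/16
text_emb = F.normalize(torch.randn(num_classes, embed_dim), dim=-1)   # one prompt per class
video_emb = F.normalize(torch.randn(4, embed_dim), dim=-1)            # a batch of 4 videos

logits = 100.0 * video_emb @ text_emb.t()   # scaled cosine similarities
pred = logits.argmax(dim=-1)                # zero-shot class prediction
print(pred.shape)  # torch.Size([4])
```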
If you find this code or models useful, please cite our work:
TMLR 2025 Publication:
@article{ahmad2025tl,
  title={T2L: Efficient Zero-Shot Action Recognition with Temporal Token Learning},
  author={Ahmad, Shahzad and Chanda, Sukalpa and Rawat, Yogesh S},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2025},
  url={https://openreview.net/forum?id=WvgoxpGpuU}
}
arXiv Preprint:
@article{ahmad2023ezclip,
title={EZ-CLIP: Efficient Zero-Shot Video Action Recognition},
author={Ahmad, Shahzad and Chanda, Sukalpa and Rawat, Yogesh S},
journal={arXiv preprint arXiv:2312.08010},
year={2023}
}
This codebase builds upon ActionCLIP. We express our gratitude to the authors for their foundational contributions.
For the latest updates, visit the T2L Repository.
Contact: For questions or issues, please open an issue on this repository or the T2L Repository.
Explore the Future of Zero-Shot Action Recognition with T2L!
👉 T2L Repository 👈