Official PyTorch Implementation
🎉 EZ-CLIP has evolved into T2L: Efficient Zero-Shot Action Recognition with Temporal Token Learning and is now published in Transactions on Machine Learning Research (TMLR) 2025!
We’ve released a new, enhanced codebase for T2L, incorporating the latest advancements. Visit the new repository for the most up-to-date code and resources:
👉 T2L Repository 👈
This EZ-CLIP repository remains available for reference but may not receive further updates. Explore T2L for the cutting-edge implementation!
- 📦 Trained Models: Download pre-trained models from Google Drive.
- 📄 Published Paper: See details in the TMLR 2025 publication and new T2L repository.
EZ-CLIP is an innovative adaptation of CLIP tailored for zero-shot video action recognition. By leveraging temporal visual prompting, it seamlessly integrates temporal dynamics while preserving CLIP’s powerful generalization. A novel motion-focused learning objective enhances its ability to capture video motion, all without altering CLIP’s core architecture.
For the latest advancements, check out T2L: Efficient Zero-Shot Action Recognition with Temporal Token Learning in the T2L Repository.
EZ-CLIP tackles the challenge of adapting CLIP for zero-shot video action recognition with a lightweight and efficient approach. Through temporal visual prompting and a specialized learning objective, it captures motion dynamics effectively while retaining CLIP’s generalization capabilities. This makes EZ-CLIP both practical and powerful for video understanding tasks.
The work has been significantly advanced in our TMLR 2025 publication, T2L: Efficient Zero-Shot Action Recognition with Temporal Token Learning. Explore the T2L Repository for the latest developments.
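At a glance, temporal visual prompting can be thought of as a small set of learnable temporal tokens that are mixed with frozen per-frame CLIP embeddings so the model can reason across time. The snippet below is a conceptual sketch only, not the repository's code; module and parameter names such as `TemporalPromptPool` and `num_prompts` are hypothetical.

```python
# Conceptual sketch of temporal visual prompting -- NOT the repository's implementation.
# Learnable temporal tokens are attended together with per-frame CLIP embeddings,
# letting a frozen, frame-level encoder exchange information across time.
import torch
import torch.nn as nn


class TemporalPromptPool(nn.Module):
    def __init__(self, embed_dim: int = 512, num_prompts: int = 4):
        super().__init__()
        # Learnable temporal prompt tokens, shared across all videos.
        self.prompts = nn.Parameter(torch.randn(num_prompts, embed_dim) * 0.02)
        # Lightweight attention over the temporal axis (prompts + frames).
        self.temporal_attn = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, frame_features: torch.Tensor) -> torch.Tensor:
        # frame_features: (batch, num_frames, embed_dim) -- per-frame CLIP embeddings.
        b = frame_features.shape[0]
        prompts = self.prompts.unsqueeze(0).expand(b, -1, -1)        # (B, P, D)
        tokens = torch.cat([prompts, frame_features], dim=1)         # (B, P+T, D)
        attended, _ = self.temporal_attn(tokens, tokens, tokens)     # mix across time
        tokens = self.norm(tokens + attended)
        # Average-pool the frame positions into a single video-level embedding.
        return tokens[:, prompts.shape[1]:].mean(dim=1)              # (B, D)


# Example: 2 videos, 8 frames each, 512-d ViT-B/16 frame embeddings.
video_emb = TemporalPromptPool()(torch.randn(2, 8, 512))
print(video_emb.shape)  # torch.Size([2, 512])
```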
Set up the environment using the provided requirements.txt:
pip install -r requirements.txt
Note: All models are based on the publicly available CLIP ViT-B/16 model.
Models are trained on Kinetics-400 and evaluated zero-shot on the downstream datasets below; reported numbers are top-1 accuracy (%).
| Model | Input | HMDB-51 | UCF-101 | Kinetics-600 | Model Link |
|---|---|---|---|---|---|
| EZ-CLIP (ViT-B/16) | 8x224 | 52.9 | 79.1 | 70.1 | Link |
Datasets are split into base and novel classes; models are trained on the base classes and evaluated on both. HM is the harmonic mean of base and novel accuracy (see the quick check below the table).
| Dataset | Input | Base Acc. | Novel Acc. | HM | Model Link |
|---|---|---|---|---|---|
| K-400 | 8x224 | 73.1 | 60.6 | 66.3 | Link |
| HMDB-51 | 8x224 | 77.0 | 58.2 | 66.3 | Link |
| UCF-101 | 8x224 | 94.4 | 77.9 | 85.4 | Link |
| SSV2 | 8x224 | 16.6 | 13.3 | 14.8 | Link |
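A quick sanity check of the HM column against the UCF-101 row:

```python
def harmonic_mean(base_acc: float, novel_acc: float) -> float:
    """Harmonic mean of base- and novel-class accuracy, as reported in the HM column."""
    return 2 * base_acc * novel_acc / (base_acc + novel_acc)

print(round(harmonic_mean(94.4, 77.9), 1))  # 85.4 -> matches the UCF-101 row
```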
Extract videos into frames for efficient processing. See the Dataset_creation_scripts directory for instructions; a rough illustration of the extraction step is sketched below.
Supported datasets: Kinetics-400, Kinetics-600, HMDB-51, UCF-101, and Something-Something V2 (SSV2).
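For a rough idea of what frame extraction involves, here is a minimal OpenCV sketch. It is not the repository's script; the output layout (one folder of JPEGs per video) and the paths are assumptions for illustration only.

```python
# Minimal frame-extraction sketch using OpenCV (pip install opencv-python).
# The actual pipeline lives in Dataset_creation_scripts; the output layout and
# naming scheme below are assumptions for illustration, not the repo's format.
import os
import cv2


def extract_frames(video_path: str, out_dir: str) -> int:
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imwrite(os.path.join(out_dir, f"img_{idx:05d}.jpg"), frame)
        idx += 1
    cap.release()
    return idx


# Example (hypothetical paths):
# n = extract_frames("videos/abseiling/abc123.mp4", "frames/abseiling/abc123")
```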
Train EZ-CLIP with:
python train.py --config configs/K-400/k400_train.yaml
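The script is driven entirely by the YAML config passed via --config. If you want to inspect or tweak the options programmatically before launching, a generic PyYAML snippet works; the actual keys inside k400_train.yaml are repo-specific and not reproduced here.

```python
# Generic way to inspect a training config before launching (pip install pyyaml).
# The keys inside configs/K-400/k400_train.yaml are specific to this repo.
import yaml

with open("configs/K-400/k400_train.yaml") as f:
    cfg = yaml.safe_load(f)

print(sorted(cfg.keys()))  # list the top-level options before editing them
```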
Evaluate a trained model with:
python test.py --config configs/ucf101/UCF_zero_shot_testing.yaml
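Zero-shot scoring in CLIP-based models generally follows the same recipe: class-name prompts are embedded with the frozen text encoder and compared to the video embedding by cosine similarity. The snippet below is a conceptual sketch with stand-in embeddings, not the repository's test.py.

```python
# Conceptual sketch of CLIP-style zero-shot scoring, not the repo's test.py.
# The random tensors stand in for embeddings that, in EZ-CLIP, come from the
# temporally prompted visual encoder and the frozen CLIP text encoder.
import torch
import torch.nn.functional as F

num_classes, embed_dim = 101, 512                                     # e.g. UCF-101, ViT-B/16
text_emb = F.normalize(torch.randn(num_classes, embed_dim), dim=-1)   # one prompt per class
video_emb = F.normalize(torch.randn(4, embed_dim), dim=-1)            # a batch of 4 videos

logits = 100.0 * video_emb @ text_emb.t()   # scaled cosine similarities
pred = logits.argmax(dim=-1)                # zero-shot class prediction
print(pred.shape)  # torch.Size([4])
```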
If you find this code or models useful, please cite our work:
TMLR 2025 Publication:
@article{ahmad2025tl,
  title={T2L: Efficient Zero-Shot Action Recognition with Temporal Token Learning},
  author={Ahmad, Shahzad and Chanda, Sukalpa and Rawat, Yogesh S},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2025},
  url={https://openreview.net/forum?id=WvgoxpGpuU}
}
arXiv Preprint:
@article{ahmad2023ezclip,
title={EZ-CLIP: Efficient Zero-Shot Video Action Recognition},
author={Ahmad, Shahzad and Chanda, Sukalpa and Rawat, Yogesh S},
journal={arXiv preprint arXiv:2312.08010},
year={2023}
}
This codebase builds upon ActionCLIP. We express our gratitude to the authors for their foundational contributions.
For the latest updates, visit the T2L Repository.
Contact: For questions or issues, please open an issue on this repository or the T2L Repository.
Explore the Future of Zero-Shot Action Recognition with T2L!
👉 T2L Repository 👈