Awesome Vision-Language-Action Models

A curated list of Vision-Language-Action (VLA) models, benchmarks, and datasets for robotic manipulation and embodied AI.

Table of Contents

  • VLA Models
  • Benchmarks
  • Datasets
  • Contributing
  • Contact

VLA Models

SpatialVLA

  • Paper: https://arxiv.org/abs/2501.15830
  • Status: ✅ Successfully reproduced the results in the paper
  • Notes: Code is very clean. Uses the PaliGemma-3B LLM with a binned-token action head (see the binning sketch below).
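
A "binned-token action head" here refers to the common scheme of discretizing each continuous action dimension into a fixed number of uniform bins and predicting the bin index as a token. A minimal sketch of that general idea (the bin count, action ranges, and function names below are illustrative assumptions, not SpatialVLA's actual code):

```python
import numpy as np

# Discretize continuous actions into N uniform bins so they can be emitted
# as tokens by the language model. Bin count and ranges are illustrative.
NUM_BINS = 256

def actions_to_bins(actions, low, high, num_bins=NUM_BINS):
    """Map continuous actions in [low, high] to integer bin indices."""
    actions = np.clip(actions, low, high)
    normalized = (actions - low) / (high - low)            # -> [0, 1]
    return np.minimum((normalized * num_bins).astype(int), num_bins - 1)

def bins_to_actions(bins, low, high, num_bins=NUM_BINS):
    """Map bin indices back to the continuous bin-center values."""
    return low + (bins + 0.5) / num_bins * (high - low)

# Example: a 7-DoF action (end-effector delta pose + gripper), hypothetical values.
low, high = -1.0, 1.0
action = np.array([0.12, -0.05, 0.30, 0.0, 0.0, -0.25, 1.0])
tokens = actions_to_bins(action, low, high)
recovered = bins_to_actions(tokens, low, high)
```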

OpenVLA-OFT

  • Website: https://openvla-oft.github.io
  • Status: ✅ Successfully reproduced the results in the paper
  • Notes: Uses the Llama 2 7B LLM with an MLP or diffusion action head.

Pi0

  • Achieves ~94% average success rate on the LIBERO benchmark (Reference).
  • Model details: Uses PaliGemma-3B as the LLM and DiT for the action head.

Real-world observations

  1. Fine-tuning on an A100 performs well with only 80 samples.
  2. Scales well with 3–4k high-quality samples. Successful fine-tuning of the
    Hugging Face checkpoint using:
    • bf16
    • batch size = 12
    • ~70 GB VRAM, 8× H100, ~15 hours
    • multi-machine setup
    • DeepSpeed ZeRO-2 (no offloading); a minimal config sketch follows this list
      Training from scratch fails when data is limited.
  3. The pi0-fast variant works effectively in this paper.
    Project site: Physical Intelligence – pi0-fast
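
For reference, a minimal DeepSpeed ZeRO-2 configuration sketch consistent with the settings above (bf16, stage 2, no offloading). The per-GPU micro-batch and gradient-accumulation values are assumptions, not the exact config used for this fine-tuning run:

```python
import json

# Sketch of a DeepSpeed ZeRO-2 config matching the notes above.
# Assumption: batch size 12 is the per-GPU micro-batch with no gradient
# accumulation; adjust to match your actual global batch size.
ds_config = {
    "train_micro_batch_size_per_gpu": 12,
    "gradient_accumulation_steps": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,
        "contiguous_gradients": True,
        # No "offload_optimizer" / "offload_param" entries -> no offloading.
    },
    "gradient_clipping": 1.0,
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```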

ACT


Diffusion Policy


SmolVLA

  • Paper: https://arxiv.org/abs/2506.01844
  • Successfully tested the 450M checkpoint on the LeRobot SO101 for a real-world fork-picking task. Training parameters: batch size 12, 4.1 GB VRAM usage; converges between 3,000 and 27,000 steps.

GR00T N1.5


UniVLA

Note: There are hundreds of VLA models available. This list focuses on models that I have personally tested or for which reproduction results have been reported somewhere.

Benchmarks

✅ Tested Benchmarks

🔄 Benchmarks to Try

Datasets

✅ Tested Datasets

Open X-Embodiment

Dataset Breakdown (disk usage per sub-dataset; a loading sketch follows the listing):

1.2T    ./fmb_dataset
126G    ./taco_play
128G    ./bc_z
124G    ./bridge_orig                                            # 2.1M samples
140G    ./furniture_bench_dataset_converted_externally_to_rlds
98G     ./fractal20220817_data                               # 3.78M samples
70G     ./kuka
22G     ./dobbe
20G     ./berkeley_autolab_ur5
16G     ./stanford_hydra_dataset_converted_externally_to_rlds
16G     ./utaustin_mutex
14G     ./austin_sailor_dataset_converted_externally_to_rlds
13G     ./nyu_franka_play_dataset_converted_externally_to_rlds
11G     ./toto
8.0G    ./austin_sirius_dataset_converted_externally_to_rlds
5.9G    ./iamlab_cmu_pickup_insert_converted_externally_to_rlds
4.5G    ./roboturk
3.3G    ./berkeley_cable_routing
3.2G    ./viola
3.0G    ./jaco_play
2.5G    ./berkeley_fanuc_manipulation
1.2G    ./austin_buds_dataset_converted_externally_to_rlds
510M    ./cmu_stretch
263M    ./dlr_edan_shared_control_converted_externally_to_rlds
110M    ./ucsd_kitchen_dataset_converted_externally_to_rlds
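
The directories above are RLDS datasets in TFDS format, so they can be inspected with TensorFlow Datasets. A minimal loading sketch (the version subdirectory and per-step keys are assumptions and vary by sub-dataset):

```python
import tensorflow_datasets as tfds

# Load one Open X-Embodiment sub-dataset from its local RLDS/TFDS directory.
# The path is an example; check builder.info before relying on specific keys.
builder = tfds.builder_from_directory("./bridge_orig/1.0.0")
print(builder.info)

ds = builder.as_dataset(split="train", shuffle_files=True)
for episode in ds.take(1):
    # RLDS stores each episode's timesteps as a nested dataset under "steps".
    for step in episode["steps"].take(3):
        print(step.keys())
```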

GR00T Teleop Simulation Dataset

Droid

CALVIN

RoboTwin

BEHAVIOR-1K

Contributing

We welcome contributions! Please feel free to:

  • Add new VLA models you've tested
  • Share benchmarks that are easy to use
  • Report dataset experiences
  • Submit pull requests or issues

Contact

Feel free to send pull requests, open issues, or email us to share your reproduction experience!
