Open3DVQA: A Benchmark for Embodied Spatial Concept Reasoning with Multimodal Large Language Model in Open Space
We present Open3DVQA, a novel benchmark for evaluating MLLMs' ability to reason about complex spatial relationships from an aerial perspective. The QAs are automatically generated from spatial relations extracted from both real-world and simulated aerial scenes.
- Aug-01-2025: Our paper is accepted by ACM MM 2025! 🔥
- Jun-03-2025: Open3DVQA v2 is released at Open3DVQA-v2! 🔥
- Mar-15-2025: The Open3DVQA preprint is released on arXiv! 🔥
- Feb-27-2025: The Open3DVQA code and dataset are released! 🔥
Open3DVQA is a novel benchmark evaluating MLLMs' ability to reason about complex spatial relationships from an aerial view. It contains 89k QA pairs across 7 spatial reasoning tasks, including multiple-choice, true/false, and short-answer formats, and supports both visual and point cloud data. Questions are automatically generated from spatial relations in real-world and simulated aerial scenes.
💡 Key highlights:
- Covers 4 spatial perspectives and 7 task types for a comprehensive evaluation of open-space 3D spatial reasoning.
- Introduces a scalable QA generation pipeline that extracts 3D spatial relationships and creates diverse QA formats from a single RGB image, with a multi-modal correction flow to ensure quality.
- Benchmarks mainstream MLLMs, revealing their current spatial reasoning limitations and sim-to-real generalization capabilities.
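For illustration only, a single QA entry can be thought of as pairing a task type, an answer format, a question, and a ground-truth answer. The sketch below is hypothetical; the field names are not taken from the released `merged_qa.json` schema.

```python
# Hypothetical illustration of one QA record (field names are illustrative only,
# not the actual schema of merged_qa.json).
qa_example = {
    "task": "Egocentric Distance Reasoning",
    "format": "short-answer",
    "question": "How far is the wooden pavilion with the blue metal pyramid roof from you?",
    "answer": "10 meters",
}
```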
| QA Tasks | Questions | Responses |
|---|---|---|
| Allocentric Size Reasoning | *(question image)* | **Gemini-2.5-Flash:** No. ✔️<br>**Qwen2-VL-Finetuned:** Incorrect, the modern building with vertical glass panels's width is not less than the curved white railing structure's. ✔️<br>**3D-LLM:** The modern building with vertical glass panels is thinner than the curved white railing structure. ❌ |
| Allocentric Distance Reasoning | *(question image)* | **Gemini-2.5-Flash:** Based only on the provided image, it is not possible to determine an exact numerical distance between the wooden pavilion with the blue metal pyramid roof and the gray stucco wall with the minimalist design and structure. The image does not contain the necessary depth or scale information to calculate this measurement. ❌<br>**Qwen2-VL-Finetuned:** 11.22 meters. ❌<br>**3D-LLM:** The wooden pavilion with blue metal pyramid roof is a few meters away from the gray stucco wall with minimalist design and structure. ❌ |
| Egocentric Direction Reasoning | *(question image)* | **Gemini-2.5-Flash:** Yes, the white modular buildings with windows are to the left of you from the viewer's perspective (around 9 o'clock to 11 o'clock). ✔️<br>**Qwen2-VL-Finetuned:** Correct, you'll find the white modular buildings with windows to the left. ✔️<br>**3D-LLM:** The white modular buildings with windows are to the left of the viewer's perspective. ✔️ |
| Egocentric Distance Reasoning | *(question image)* | **Gemini-2.5-Flash:** Based on the image, the distance between the wooden pavilion with blue metal pyramid roof and you is 10 meters. ✔️<br>**Qwen2-VL-Finetuned:** The wooden pavilion with blue metal pyramid roof is 11.51 meters away. ✔️<br>**3D-LLM:** The wooden pavilion with blue metal pyramid roof is approximately 3.05 meters tall. ❌ |
| Allocentric-Egocentric Transformation Direction Reasoning | *(question image)* | **Gemini-2.5-Flash:** If you are at the white building with irregular windows, you will find the area with dense green foliage at 12 o'clock. ❌<br>**Qwen2-VL-Finetuned:** White building with irregular windows will find area with dense green foliage around the 7 o'clock direction. ✔️<br>**3D-LLM:** The area with dense green foliage is near the white building with irregular windows. ❌ |
| Allocentric-Egocentric Transformation Distance Reasoning | *(question image)* | **Gemini-2.5-Flash:** -1. ❌<br>**Qwen2-VL-Finetuned:** The vertical distance of the small wooden structure on ground is 21.52 meters. ❌<br>**3D-LLM:** 0. ❌ |
| Object-Centric Size Reasoning | *(question image)* | **Gemini-2.5-Flash:** I cannot determine the exact horizontal dimensions from the image alone. ❌<br>**Qwen2-VL-Finetuned:** The dark stone lion sculpture with textured surface is 2.42 meters in width. ✔️<br>**3D-LLM:** The horizontal dimensions of the dark stone lion sculpture with textured surface are. ❌ |
Sample visualizations (six scenes), each showing RGB, Depth, Caption & Bounding Box, Mask, and PointCloud.
We've also made the QA generation pipeline available. Before running the code, make sure you complete the following three steps:
1. Set up the environment
Install all required Python packages and dependencies using the provided `requirements.txt`:
```bash
git clone https://github.com/WeichenZh/Open3DVQA.git
cd Open3DVQA
conda create -n o3dvqa python=3.10 -y
conda activate o3dvqa
pip install -r requirements.txt
```
2. Prepare GPT-4o API access

You need access to the GPT-4o model via OpenAI's API. Make sure your API key is set as an environment variable:

```bash
export OPENAI_API_KEY=your_api_key_here
```
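As a quick sanity check that the key is picked up, the following sketch (not part of this repository) makes a minimal GPT-4o call; it assumes the official `openai` Python package (v1 or later):

```python
import os
from openai import OpenAI  # pip install openai

# Fails early if the environment variable from the previous step is missing.
assert os.environ.get("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Reply with the single word: ok"}],
)
print(response.choices[0].message.content)
```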
3. Download dataset and models
Please download the Open3DVQA dataset and the CLIPSeg and SAM checkpoints.
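If you fetch the segmentation models from the Hugging Face Hub, a hedged sketch is shown below. The repo IDs (`CIDAS/clipseg-rd64-refined`, `facebook/sam-vit-huge`) and target directories are assumptions; use whichever checkpoints the pipeline actually expects.

```python
from huggingface_hub import snapshot_download  # pip install huggingface_hub

# Assumed checkpoints and locations; adjust them to match your setup.
snapshot_download(repo_id="CIDAS/clipseg-rd64-refined",
                  local_dir="vqasynth/models/clipseg")
snapshot_download(repo_id="facebook/sam-vit-huge",
                  local_dir="vqasynth/models/sam")
```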
Organize all code and resources according to the following directory structure:
```
Open3DVQA/
├── dataset/
│   ├── EmbodiedCity/
│   │   └── Wuhan/
│   │       ├── depth/
│   │       ├── pose/
│   │       ├── rgb/
│   │       ├── visible_objs/
│   │       ├── pointclouds/
│   │       ├── chunk_0.pkl
│   │       ├── ...
│   │       └── merged_qa.json
│   ├── RealworldUAV/
│   │   ├── Lab/
│   │   └── ...
│   ├── UrbanScene/
│   │   ├── Campus/
│   │   └── ...
│   └── WildUAV/
│       └── Wuhan/
├── vqasynth/
│   ├── models/
│   │   ├── clipseg/
│   │   └── sam/
│   └── ...
├── qa_pipeline.py
├── inference.py
├── evaluation.py
├── processor/
│   ├── process_caption.py
│   ├── process_depth.py
│   ├── process_segment.py
│   └── ...
└── requirements.txt
```
Open `qa_pipeline.py` and set the `data_dir` variable to the scene you want to process, for example `data_dir = "dataset/RealworldUAV"`. After saving your changes, run the script to start the QA generation process:

```bash
python qa_pipeline.py
```
The script processes the specified scene and generates QA pairs automatically. The inputs are `rgb/`, `depth/`, and `pose/`; the outputs include `pointclouds/`, `chunk_*.pkl`, and `merged_qa.json`.
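To sanity-check the generated outputs, a minimal sketch such as the following can be used. It only assumes that `chunk_*.pkl` files are standard pickles and that `merged_qa.json` is plain JSON; the internal schema of either file is not assumed.

```python
import glob
import json
import pickle

scene_dir = "dataset/EmbodiedCity/Wuhan"  # example scene directory

# Inspect the intermediate chunk files produced by the pipeline.
for path in sorted(glob.glob(f"{scene_dir}/chunk_*.pkl")):
    with open(path, "rb") as f:
        chunk = pickle.load(f)
    print(path, type(chunk))

# Inspect the merged QA pairs.
with open(f"{scene_dir}/merged_qa.json") as f:
    qa_pairs = json.load(f)
print(f"Loaded {len(qa_pairs)} QA entries")
```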
We also provide scripts for model inference and evaluation:

`inference.py`

This script performs QA with large language models (e.g., GPT-4o) via API. It takes the prepared multimodal inputs, sends prompts to the model, and collects the responses.

```bash
python inference.py
```
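The actual prompting logic lives in `inference.py`. For illustration only, the hedged sketch below shows one way to send an image plus a question to GPT-4o through the OpenAI API; the image path and question are placeholders, not values used by the script.

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

image_path = "example_rgb.png"  # placeholder: an RGB frame from the dataset
question = "How far is the wooden pavilion with blue metal pyramid roof from you?"

with open(image_path, "rb") as f:
    b64_image = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64_image}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```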
`evaluation.py`

This script evaluates the accuracy of the model-generated answers. It compares the predicted answers against the ground-truth answers to compute evaluation metrics such as accuracy.

```bash
python evaluation.py
```
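As a hedged illustration of the general idea (not the exact metric or file format used by `evaluation.py`), exact-match accuracy over predicted versus ground-truth answers could be computed like this:

```python
import json

def exact_match_accuracy(predictions: dict, ground_truth: dict) -> float:
    """Fraction of questions whose prediction matches the reference exactly,
    after lowercasing and stripping whitespace."""
    correct = sum(
        predictions.get(qid, "").strip().lower() == answer.strip().lower()
        for qid, answer in ground_truth.items()
    )
    return correct / max(len(ground_truth), 1)

# Hypothetical file names mapping question IDs to answer strings.
with open("predictions.json") as f:
    predictions = json.load(f)
with open("ground_truth.json") as f:
    ground_truth = json.load(f)

print(f"Exact-match accuracy: {exact_match_accuracy(predictions, ground_truth):.2%}")
```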
We have used code snippets from different repositories, especially from LLaVA, Qwen2-VL, and VQASynth. We would like to acknowledge and thank the authors of these repositories for their excellent work.