Open3DVQA: A Benchmark for Embodied Spatial Concept Reasoning with Multimodal Large Language Model in Open Space
We present Open3DVQA, a novel benchmark for evaluating MLLMs' ability to reason about complex spatial relationships from an aerial perspective. The QAs are automatically generated from spatial relations extracted from both real-world and simulated aerial scenes.
- Aug-01-2025: Our paper is accepted by ACM MM 2025! 🔥
- Jun-03-2025: Open3DVQA v2 is released at Open3DVQA-v2! 🔥
- Mar-15-2025: The Open3DVQA preprint is released on arXiv! 🔥
- Feb-27-2025: The Open3DVQA code and dataset are released! 🔥
Open3DVQA is a novel benchmark evaluating MLLMs' ability to reason about complex spatial relationships from an aerial view. It contains 89k QA pairs across 7 spatial reasoning tasks, including multiple-choice, true/false, and short-answer formats, and supports both visual and point cloud data. Questions are automatically generated from spatial relations in real-world and simulated aerial scenes.
💡 Key highlights:
- Covers 4 spatial perspectives and 7 task types for a comprehensive evaluation of open-space 3D spatial reasoning.
- Introduces a scalable QA generation pipeline that extracts 3D spatial relationships and creates diverse QA formats from a single RGB image, with a multi-modal correction flow to ensure quality.
- Benchmarks mainstream MLLMs, revealing their current spatial reasoning limitations and sim-to-real generalization capabilities.
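For illustration only, a single QA entry can be thought of as pairing a task type, an answer format, a question, and a ground-truth answer. The sketch below is hypothetical; the field names are not taken from the released `merged_qa.json` schema.

```python
# Hypothetical illustration of one QA record (field names are illustrative only,
# not the actual schema of merged_qa.json).
qa_example = {
    "task": "Egocentric Distance Reasoning",
    "format": "short-answer",
    "question": "How far is the wooden pavilion with the blue metal pyramid roof from you?",
    "answer": "10 meters",
}
```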
| QA Tasks | Questions | Responses |
|---|---|---|
| Allocentric Size Reasoning | *(question image)* | **Gemini-2.5-Flash:** No. ✔️<br>**Qwen2-VL-Finetuned:** Incorrect, the modern building with vertical glass panels's width is not less than the curved white railing structure's. ✔️<br>**3D-LLM:** The modern building with vertical glass panels is thinner than the curved white railing structure. ❌ |
| Allocentric Distance Reasoning | *(question image)* | **Gemini-2.5-Flash:** Based only on the provided image, it is not possible to determine an exact numerical distance between the wooden pavilion with the blue metal pyramid roof and the gray stucco wall with the minimalist design and structure. The image does not contain the necessary depth or scale information to calculate this measurement. ❌<br>**Qwen2-VL-Finetuned:** 11.22 meters. ❌<br>**3D-LLM:** The wooden pavilion with blue metal pyramid roof is a few meters away from the gray stucco wall with minimalist design and structure. ❌ |
| Egocentric Direction Reasoning | *(question image)* | **Gemini-2.5-Flash:** Yes, the white modular buildings with windows are to the left of you from the viewer's perspective (around 9 o'clock to 11 o'clock). ✔️<br>**Qwen2-VL-Finetuned:** Correct, you'll find the white modular buildings with windows to the left. ✔️<br>**3D-LLM:** The white modular buildings with windows are to the left of the viewer's perspective. ✔️ |
| Egocentric Distance Reasoning | *(question image)* | **Gemini-2.5-Flash:** Based on the image, the distance between the wooden pavilion with blue metal pyramid roof and you is 10 meters. ✔️<br>**Qwen2-VL-Finetuned:** The wooden pavilion with blue metal pyramid roof is 11.51 meters away. ✔️<br>**3D-LLM:** The wooden pavilion with blue metal pyramid roof is approximately 3.05 meters tall. ❌ |
| Allocentric-Egocentric Transformation Direction Reasoning | *(question image)* | **Gemini-2.5-Flash:** If you are at the white building with irregular windows, you will find the area with dense green foliage at 12 o'clock. ❌<br>**Qwen2-VL-Finetuned:** White building with irregular windows will find area with dense green foliage around the 7 o'clock direction. ✔️<br>**3D-LLM:** The area with dense green foliage is near the white building with irregular windows. ❌ |
| Allocentric-Egocentric Transformation Distance Reasoning | *(question image)* | **Gemini-2.5-Flash:** -1. ❌<br>**Qwen2-VL-Finetuned:** The vertical distance of the small wooden structure on ground is 21.52 meters. ❌<br>**3D-LLM:** 0. ❌ |
| Object-Centric Size Reasoning | *(question image)* | **Gemini-2.5-Flash:** I cannot determine the exact horizontal dimensions from the image alone. ❌<br>**Qwen2-VL-Finetuned:** The dark stone lion sculpture with textured surface is 2.42 meters in width. ✔️<br>**3D-LLM:** The horizontal dimensions of the dark stone lion sculpture with textured surface are. ❌ |
Sample visualizations (six scenes), each showing RGB, Depth, Caption & Bounding Box, Mask, and PointCloud.
We've also made the QA generation pipeline available. Before running the code, make sure you complete the following three steps:
1. Set up the environment
Install all required Python packages and dependencies using the provided `requirements.txt`:
```bash
git clone https://github.com/WeichenZh/Open3DVQA.git
cd Open3DVQA
conda create -n o3dvqa python=3.10 -y
conda activate o3dvqa
pip install -r requirements.txt
```
2. Prepare GPT-4o API access

You need access to the GPT-4o model via OpenAI's API. Make sure your API key is set as an environment variable:

```bash
export OPENAI_API_KEY=your_api_key_here
```
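As a quick sanity check that the key is picked up, the following sketch (not part of this repository) makes a minimal GPT-4o call; it assumes the official `openai` Python package (v1 or later):

```python
import os
from openai import OpenAI  # pip install openai

# Fails early if the environment variable from the previous step is missing.
assert os.environ.get("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Reply with the single word: ok"}],
)
print(response.choices[0].message.content)
```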
3. Download dataset and models
Please download the Open3DVQA dataset and the CLIPSeg and SAM checkpoints.
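If you fetch the segmentation models from the Hugging Face Hub, a hedged sketch is shown below. The repo IDs (`CIDAS/clipseg-rd64-refined`, `facebook/sam-vit-huge`) and target directories are assumptions; use whichever checkpoints the pipeline actually expects.

```python
from huggingface_hub import snapshot_download  # pip install huggingface_hub

# Assumed checkpoints and locations; adjust them to match your setup.
snapshot_download(repo_id="CIDAS/clipseg-rd64-refined",
                  local_dir="vqasynth/models/clipseg")
snapshot_download(repo_id="facebook/sam-vit-huge",
                  local_dir="vqasynth/models/sam")
```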
Organize all code and resources according to the following directory structure:
```
Open3DVQA/
├── dataset/
│   ├── EmbodiedCity/
│   │   └── Wuhan/
│   │       ├── depth/
│   │       ├── pose/
│   │       ├── rgb/
│   │       ├── visible_objs/
│   │       ├── pointclouds/
│   │       ├── chunk_0.pkl
│   │       ├── ...
│   │       └── merged_qa.json
│   ├── RealworldUAV/
│   │   ├── Lab/
│   │   └── ...
│   ├── UrbanScene/
│   │   ├── Campus/
│   │   └── ...
│   └── WildUAV/
│       └── Wuhan/
├── vqasynth/
│   ├── models/
│   │   ├── clipseg/
│   │   └── sam/
│   └── ...
├── qa_pipeline.py
├── inference.py
├── evaluation.py
├── processor/
│   ├── process_caption.py
│   ├── process_depth.py
│   ├── process_segment.py
│   └── ...
└── requirements.txt
```
Open `qa_pipeline.py` and set the `data_dir` variable to the scene you want to process, for example `data_dir = "dataset/RealworldUAV"`. After saving your changes, run the script to start the QA generation process:

```bash
python qa_pipeline.py
```
The script processes the specified scene and generates QA pairs automatically. The inputs are `rgb/`, `depth/`, and `pose/`; the outputs include `pointclouds/`, `chunk_*.pkl`, and `merged_qa.json`.
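To sanity-check the generated outputs, a minimal sketch such as the following can be used. It only assumes that `chunk_*.pkl` files are standard pickles and that `merged_qa.json` is plain JSON; the internal schema of either file is not assumed.

```python
import glob
import json
import pickle

scene_dir = "dataset/EmbodiedCity/Wuhan"  # example scene directory

# Inspect the intermediate chunk files produced by the pipeline.
for path in sorted(glob.glob(f"{scene_dir}/chunk_*.pkl")):
    with open(path, "rb") as f:
        chunk = pickle.load(f)
    print(path, type(chunk))

# Inspect the merged QA pairs.
with open(f"{scene_dir}/merged_qa.json") as f:
    qa_pairs = json.load(f)
print(f"Loaded {len(qa_pairs)} QA entries")
```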
We also provide scripts for model inference and evaluation:

`inference.py`

This script performs QA with large language models (e.g., GPT-4o) via API. It takes the prepared multimodal inputs, sends prompts to the model, and collects the responses.

```bash
python inference.py
```
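The actual prompting logic lives in `inference.py`. For illustration only, the hedged sketch below shows one way to send an image plus a question to GPT-4o through the OpenAI API; the image path and question are placeholders, not values used by the script.

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

image_path = "example_rgb.png"  # placeholder: an RGB frame from the dataset
question = "How far is the wooden pavilion with blue metal pyramid roof from you?"

with open(image_path, "rb") as f:
    b64_image = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64_image}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```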
`evaluation.py`

This script evaluates the accuracy of the model-generated answers. It compares the predicted answers against the ground-truth answers to compute evaluation metrics such as accuracy.

```bash
python evaluation.py
```
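As a hedged illustration of the general idea (not the exact metric or file format used by `evaluation.py`), exact-match accuracy over predicted versus ground-truth answers could be computed like this:

```python
import json

def exact_match_accuracy(predictions: dict, ground_truth: dict) -> float:
    """Fraction of questions whose prediction matches the reference exactly,
    after lowercasing and stripping whitespace."""
    correct = sum(
        predictions.get(qid, "").strip().lower() == answer.strip().lower()
        for qid, answer in ground_truth.items()
    )
    return correct / max(len(ground_truth), 1)

# Hypothetical file names mapping question IDs to answer strings.
with open("predictions.json") as f:
    predictions = json.load(f)
with open("ground_truth.json") as f:
    ground_truth = json.load(f)

print(f"Exact-match accuracy: {exact_match_accuracy(predictions, ground_truth):.2%}")
```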
We have used code snippets from different repositories, especially from LLaVA, Qwen2-VL, and VQASynth. We would like to acknowledge and thank the authors of these repositories for their excellent work.