✨ This is the official implementation of the paper Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks.
1. Embodied Task 2. Deep Reasoning Model 3. Multimodal Scene 4. Long-horizon Decision 5. Multi-turn Interaction
Hugging Face | arXiv | WebPage | Bilibili
Demo video: video.mp4
Paper Sharing: https://www.bilibili.com/video/BV1Cs7Hz4ETk?t=1623.2
- 2025.03: We release our paper and dataset.
In this paper, we present Embodied-Reasoner, a multimodal embodied model that extends o1-style deep-reasoning capabilities to embodied interactive tasks. It can perform complex tasks in AI2-THOR, such as searching for hidden objects and manipulating and transporting items, and it has several impressive features:
- Deep Reasoning abilities, e.g., analysis, spatial reasoning, reflection, and planning.
- Interleaved Multimodal Processing capabilities, especially handling long sequences of interleaved image-text context.
- Environmental Interaction abilities, enabling it to autonomously observe the environment, explore rooms, and find hidden objects.
- Open-source Models released in 7B/2B sizes.
- Open-source Dataset on Hugging Face: 9.3k interleaved observation-reasoning-action trajectories, including 64K images and 8M thought tokens.
Our contributions can be summarized as follows:
Task and Trajectory Engine: automatically synthesizes coherent Observation-Thought-Action trajectories spanning 107 diverse indoor scenes (e.g., kitchens and living rooms) and covering 2,100 interactive objects (e.g., eggs, laptops) and 2,600 containers (e.g., refrigerators, drawers), with 64K first-person images from interactions and 8M thought tokens.
Long CoT with Diverse Thinking Patterns: analysis, spatial reasoning, reflection, planning, and verification. These coherent, image-text interleaved trajectories boost the model's spatial and temporal reasoning capabilities.
Iterative Training Pipeline: a three-stage pipeline that combines imitation learning, self-exploration tuning, and self-correction tuning.
Interactive Evaluation Framework: 809 test cases across 12 novel scenarios, each defined as a triple <Instruction, Key Action, Final State>.
We compare the performance of Embodied-Reasoner against advanced VLMs and visual reasoning models.
- Success Rate (%) measures whether a task is successfully completed.
- Search Efficiency (%) evaluates task efficiency: more steps indicate lower efficiency.
- Task Completeness (%) computes the proportion of predicted actions that belong to the set of key actions (a sketch of how these metrics could be computed follows below).
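To make the three metrics concrete, the following minimal sketch shows one way they could be computed for a single episode. It is an illustrative assumption rather than the released evaluation code (see scripts/eval.sh for the actual evaluation), and the names predicted_actions, key_actions, and task_succeeded are hypothetical.

# Illustrative sketch only; not the released evaluation code.
def episode_metrics(predicted_actions, key_actions, task_succeeded):
    # Success Rate: 1 if the task reaches its final state, else 0 (averaged over episodes).
    success = 1.0 if task_succeeded else 0.0
    # Search Efficiency: one possible formulation of "more steps indicate lower efficiency",
    # i.e., the ratio of key actions to the number of steps actually taken.
    efficiency = len(key_actions) / max(len(predicted_actions), 1)
    # Task Completeness: proportion of predicted actions that belong to the key-action set.
    key_set = set(key_actions)
    completeness = sum(a in key_set for a in predicted_actions) / max(len(predicted_actions), 1)
    return success, efficiency, completeness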
Embodied-Reasoner exhibits spontaneous thinking behaviors, e.g., analyzing environmental states (#1,3), reflecting on missed details (#4), reasoning based on the latest observations (#5), and recalling cues for efficient planning (#9). These thoughts remain coherent and logically consistent despite spanning multiple rounds. In contrast, general VLMs that lack thinking abilities struggle with long-horizon interactive tasks and produce unreasonable actions, e.g., forgetting the task or searching repetitively.
To evaluate the generalization of our reasoning model, we design a real-world experiment. Our model rules out the countertop and dining table after two explorations (#1, #2), ultimately locating the coffee in the cabinet (#7) and placing it in the microwave for heating (#11). However, we observe that OpenAI o3-mini fails to formulate a reasonable plan, heading to the microwave first instead of searching for the coffee.
conda create -n llama-factory python=3.11
conda activate llama-factory
git clone -b embodied-reasoner https://github.com/iGangao/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"
pip install wandb accelerate deepspeed importlib-metadata
Please refer to data/README.md for details about the format of the dataset files.
Run the training scripts:
bash scripts/train.sh
conda create -n embodied-reasoner python=3.9
conda activate embodied-reasoner
pip install -r requirements.txt
Run the evaluation scripts:
bash scripts/eval.sh
You can navigate to the data_engine folder to synthesize tasks and trajectories. Below are the key files within the data_engine:
data_engine/
├── taskgenerate/                  # Item information and room metadata for task generation
│   ├── bathrooms/
│   ├── bedrooms/
│   ├── kitchens/
│   ├── living_rooms/
│   └── pick_up_and_put.json
├── TaskGenerate.py                # Task synthesis script
├── o1StyleGenerate.py             # Trajectory synthesis script
├── o1StyleGenerate_ordered.py     # Complex task trajectory synthesis script
├── vlmCall.py                     # Script to call the VLM
└── vlmCallapi_keys.py             # Please set your API keys here
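Before running the scripts that call a VLM, fill in vlmCallapi_keys.py. Assuming it simply exposes your keys as a module-level constant (an assumption; check vlmCall.py for the exact variable names it actually imports), it might look like this:

# vlmCallapi_keys.py -- illustrative sketch; the variable name expected by
# vlmCall.py may differ, so check that script before editing.
API_KEYS = [
    "sk-your-key-1",  # replace with your own key(s)
]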
TaskGenerate.py can synthesize task templates and corresponding key actions. The generated task-related data will be stored in the <tasktype>_metadata folder under data_engine.
You can run the following Python script to perform task generation; parameters such as the task type can be modified within this file.
python TaskGenerate.py
For example, one generated task data entry is shown below, where actions contains a list of key actions for the task.
{
"taskname": "Locate the Apple in the room.",
"tasktype": "single_search",
"metadatapath": "taskgenerate/kitchens/FloorPlan1/metadata.json",
"actions": [
{
"action": "navigate to",
"objectId": "CounterTop|-00.08|+01.15|00.00",
"objectType": "CounterTop",
"baseaction": "",
"reward": 1,
"relatedObject": [
"CounterTop|-00.08|+01.15|00.00",
"Apple|-00.47|+01.15|+00.48"
]
},
{
"action": "end",
"objectId": "",
"objectType": "",
"baseaction": "",
"reward": 1,
"relatedObject": [
"CounterTop|-00.08|+01.15|00.00",
"Apple|-00.47|+01.15|+00.48"
]
}
],
"totalreward": 2
}
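As a quick sanity check, a generated task entry of this form can be inspected with a few lines of Python. The file path below is hypothetical, and the snippet assumes the file stores a single entry like the example above (if it stores a list of entries, iterate over them instead).

import json

# Hypothetical path; point this at a JSON file inside the generated <tasktype>_metadata folder.
with open("task.json", encoding="utf-8") as f:
    task = json.load(f)

print(task["taskname"], "-", task["tasktype"])
# Walk the key actions; the episode terminates with the "end" action.
for step in task["actions"]:
    print(step["action"], step["objectType"] or "-", "reward:", step["reward"])
print("total reward:", task["totalreward"])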
o1StyleGenerate.py and o1StyleGenerate_ordered.py can synthesize trajectories for 10 different sub-task types. Specifically, o1StyleGenerate_ordered.py is designed to synthesize more complex sequential object-transfer tasks.
You can run the following Python scripts to perform trajectory generation. You can set the task type and trajectory type within each script (typically, type 'b' is the shortest, 'a' is longer, and 'c' is the longest).
python o1StyleGenerate.py
python o1StyleGenerate_ordered.py
A generated trajectory folder contains a JSON file and the associated images for the trajectory. Below is an example of the JSON file contents:
{
"scene": "FloorPlan1",
"tasktype": "...",
"taskname": "Locate the Apple in the room.",
"trajectory": [
"<...>...</...>",
"<...>...</...>",
"..."
],
"images": [
".../init_observe.png",
"..."
],
"flag": "",
"time": "...",
"task_metadata": {
"..."
}
}
- scene: the scene where the task is performed.
- tasktype: the type of the task.
- taskname: the name of the task.
- trajectory: reasoning and decision-making content of the trajectory.
- images: paths to the corresponding images (the first image represents the initial state; each subsequent image corresponds to the state after performing each action listed in trajectory); see the sketch after this list.
- time and flag: record the generation timestamp and any exceptions encountered during trajectory generation.
- task_metadata: task information generated during Step 1.
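The sketch below shows one way to read a generated trajectory file and list its observations, using the field names from the example above; the file path is hypothetical, and the printed pairing of images to actions is only a rough listing.

import json

# Hypothetical path; point this at one generated trajectory JSON file.
with open("trajectory.json", encoding="utf-8") as f:
    traj = json.load(f)

print(traj["scene"], "-", traj["taskname"])
print(len(traj["trajectory"]), "reasoning/decision segments,", len(traj["images"]), "images")

# images[0] is the initial observation; each later image shows the state after an
# executed action, as described above.
print("initial observation:", traj["images"][0])
for idx, image_path in enumerate(traj["images"][1:], start=1):
    print(f"state after action {idx}: {image_path}")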
To view our complete trajectory dataset, please visit our Hugging Face Page.
Please refer to data_engine/README.md for more details about the data engine.
If you find our work helpful, feel free to cite us:
@article{embodied-reasoner,
title = {Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks},
author = {Wenqi Zhang and Mengna Wang and Gangao Liu and Huixin Xu and Yiwei Jiang and Yongliang Shen and Guiyang Hou and Zhe Zheng and Hang Zhang and Xin Li and Weiming Lu and Peng Li and Yueting Zhuang},
journal = {arXiv preprint arXiv:2503.21696},
year = {2025}
}
The codebase is released under an open-source license; please see the LICENSE file in the repository for details.
If you have any questions, please contact us by email: [email protected], [email protected]
Our training code uses LLaMA-Factory, and our simulator is built on AI2-THOR. Thanks for their wonderful work.