Embodied-Reasoner

✨ This is the official implementation of the paper Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks.

1. 💫 Embodied Task

2. 💫 Deep Reasoning Model

3. 💫 Multimodal Scene

4. 💫 Long-horizon Decision

5. 💫 Multi-turn Interaction

🤗 Hugging Face | arXiv | 📑 WebPage | 📺 Bilibili

Video 📷 📷

video.mp4

πŸŽ™οΈ Paper Sharing: https://www.bilibili.com/video/BV1Cs7Hz4ETk?t=1623.2

News 🔥🔥

  • 2025.03: We release our paper and dataset.

Contents 🌳🌳

  • Overview
  • Performance
  • Examples
  • QuickStart
  • Task and Trajectory Engine
  • Citation
  • License
  • Contact Us
  • Acknowledgements

Overview 🦾🦾

In this paper, we present Embodied-Reasoner, a multimodal embodied model that extends o1-style deep-reasoning capabilities to embodied interactive tasks. It can perform complex tasks in AI2-THOR, such as searching for hidden objects and manipulating and transporting items, and it offers several impressive features:

  • 👉 Deep Reasoning abilities, e.g., analysis, spatial reasoning, reflection, planning.
  • 👉 Interleaved Multimodal Processing capabilities, especially handling long sequences of interleaved image-text context.
  • 👉 Environmental Interaction abilities, enabling it to autonomously observe the environment, explore rooms, and find hidden objects.
  • 👉 Open-source Models released in 7B/2B sizes.
  • 👉 Open-source Dataset on 🤗 Hugging Face: 9.3k interleaved observation-reasoning-action trajectories, including 64K images and 8M thought tokens.

Our contributions can be summarized as follows:

Task and Trajectory Engine: Automatically synthesizes coherent Observation-Thought-Action trajectories spanning 107 diverse indoor scenes (e.g., kitchens and living rooms), covering 2,100 interactive objects (e.g., eggs, laptops) and 2,600 containers (e.g., refrigerators, drawers), with 64K first-person images captured during interaction and 8M thought tokens.

Long CoT with Diverse Thinking Patterns: analysis, spatial reasoning, reflection, planning, and verification. These coherent, image-text interleaved trajectories boost the model's spatial and temporal reasoning capabilities.

Iterative Training Pipeline: A three-stage iterative training pipeline that combines imitation learning, self-exploration tuning, and self-correction tuning.

Interactive Evaluation Framework: 809 test cases across 12 novel scenarios, each specified as <Instruction, Key Action, Final State>.

Performance 🌿🌿

We compare the performance of Embodied-Reasoner against advanced VLMs and visual reasoning models.

  • Success Rate (%) measures whether a task is successfully completed.
  • Search Efficiency (%) evaluates task efficiency: more steps indicate lower efficiency.
  • Task Completeness (%) computes the proportion of predicted actions that belong to the set of key actions (see the sketch below).
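
For concreteness, the sketch below shows one way these metrics could be computed from evaluation logs. It is illustrative only: the episode fields (success, predicted_actions, key_actions) and the search-efficiency formula are assumptions, not the exact definitions used by our evaluation code.

# Illustrative sketch only: the episode fields below ("success",
# "predicted_actions", "key_actions") are assumptions, not the exact
# structures produced by scripts/eval.sh.

def success_rate(episodes):
    """Percentage of episodes whose task was completed successfully."""
    return 100.0 * sum(ep["success"] for ep in episodes) / len(episodes)

def task_completeness(episode):
    """Percentage of predicted actions that belong to the set of key actions."""
    key_actions = set(episode["key_actions"])
    predicted = episode["predicted_actions"]
    return 100.0 * sum(a in key_actions for a in predicted) / len(predicted)

def search_efficiency(episode):
    """One plausible reading of the description above: fewer executed
    steps relative to the key actions means higher efficiency."""
    return 100.0 * len(episode["key_actions"]) / len(episode["predicted_actions"])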

Examples 👀 👀

Simulator Experiments

Embodied-Reasoner exhibits spontaneous thinking behaviors, e.g., analyzing environmental states (#1,3), reflecting on missed details (#4), reasoning based on the latest observations (#5), and recalling cues for efficient planning (#9). These thoughts remain coherent and logically consistent despite spanning multiple rounds. In contrast, general VLMs lacking thinking abilities struggle with long-horizon interactive tasks and produce unreasonable actions, e.g., forgetting the task or searching repetitively.

Real-World Experiments

To evaluate the generalization of our reasoning model, we design a real-world experiment. Our model rules out the countertop and dining table after two explorations (steps 1,2), ultimately locating the coffee (#7) in the cabinet and placing it in the microwave for heating (#11). However, we observe that OpenAI o3-mini fails to formulate a reasonable plan, heading to the microwave first instead of searching for the coffee.

QuickStart 🎯🎯

Training

Step 1. Install Requirements

conda create -n llama-factory python=3.11
conda activate llama-factory
git clone -b embodied-reasoner https://github.com/iGangao/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"
pip install wandb accelerate deepspeed importlib-metadata

Step 2. Prepare Data

Please refer to data/README.md for details about the format of the dataset files.
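
Before launching training, a quick sanity check like the one below can confirm that a dataset file parses and that any referenced images exist. The file name data/train.json and the images field are assumptions for illustration; data/README.md is the authoritative reference for the actual format.

import json
from pathlib import Path

# Hypothetical file name; substitute the dataset file described in data/README.md.
data_file = Path("data/train.json")
samples = json.loads(data_file.read_text())

print(f"{len(samples)} samples; first sample keys: {list(samples[0].keys())}")

# If samples reference image paths (the field name "images" is assumed),
# verify that the files actually exist before training.
missing = [p for s in samples for p in s.get("images", []) if not Path(p).exists()]
print(f"{len(missing)} referenced images are missing")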

Step 3. Run training scripts

Run the training scripts:

bash scripts/train.sh

Evaluation

Step 1. Install Requirements

conda create -n embodied-reasoner python=3.9
conda activate embodied-reasoner
pip install -r requirements.txt

Step 2. Run evaluation scripts

Run the evaluation scripts:

bash scripts/eval.sh

Task and Trajectory Engine ⛲⛲

You can navigate to the data_engine folder to synthesize tasks and trajectories. Below are the key files within the data_engine:

data_engine/
├── taskgenerate/               # Item information and room metadata for task generation
│   ├── bathrooms/
│   ├── bedrooms/
│   ├── kitchens/
│   ├── living_rooms/
│   └── pick_up_and_put.json
├── TaskGenerate.py             # Task synthesis script
├── o1StyleGenerate.py          # Trajectory synthesis script
├── o1StyleGenerate_ordered.py  # Complex task trajectory synthesis script
├── vlmCall.py                  # Script to call the VLM
└── vlmCallapi_keys.py          # Set your API keys here

Step 1. Generate Task

TaskGenerate.py can synthesize task templates and corresponding key actions. The generated task-related data will be stored in the <tasktype>_metadata folder under data_engine.

You can run the following Python script to perform task generation; parameters such as the task type can be modified inside this Python file.

python TaskGenerate.py

For example, one generated task data entry is shown below, where actions contains a list of key actions for the task.

{
    "taskname": "Locate the Apple in the room.",
    "tasktype": "single_search",
    "metadatapath": "taskgenerate/kitchens/FloorPlan1/metadata.json",
    "actions": [
        {
            "action": "navigate to",
            "objectId": "CounterTop|-00.08|+01.15|00.00",
            "objectType": "CounterTop",
            "baseaction": "",
            "reward": 1,
            "relatedObject": [
                "CounterTop|-00.08|+01.15|00.00",
                "Apple|-00.47|+01.15|+00.48"
            ]
        },
        {
            "action": "end",
            "objectId": "",
            "objectType": "",
            "baseaction": "",
            "reward": 1,
            "relatedObject": [
                "CounterTop|-00.08|+01.15|00.00",
                "Apple|-00.47|+01.15|+00.48"
            ]
        }
    ],
    "totalreward": 2
}
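
As a quick check on the generated metadata, the sketch below loads such a file, verifies that the per-action rewards sum to totalreward, and lists the key actions. The file name single_search_metadata/tasks.json is only a placeholder for whatever <tasktype>_metadata file TaskGenerate.py produces.

import json
from pathlib import Path

# Illustrative path: point it at a JSON file inside the generated
# <tasktype>_metadata folder under data_engine.
task_file = Path("data_engine/single_search_metadata/tasks.json")
tasks = json.loads(task_file.read_text())
if isinstance(tasks, dict):   # accept either a single entry or a list of entries
    tasks = [tasks]

for task in tasks:
    # The per-action rewards should add up to the recorded totalreward.
    assert sum(a["reward"] for a in task["actions"]) == task["totalreward"]
    key_actions = [f'{a["action"]} {a["objectType"]}'.strip() for a in task["actions"]]
    print(task["taskname"], "->", key_actions)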

Step 2. Generate O1-style Trajectory

o1StyleGenerate.py and o1StyleGenerate_ordered.py can synthesize trajectories for 10 different sub-task types. Specifically, o1StyleGenerate_ordered.py is designed to synthesize more complex sequential object transfer tasks.

You can run the following Python scripts to perform trajectory generation. Additionally, you can set the task type and trajectory type within each script (typically, 'b' is the shortest, 'a' is longer, and 'c' is the longest).

python o1StyleGenerate.py
python o1StyleGenerate_ordered.py

A generated trajectory folder contains the trajectory JSON file and its associated images. Below is an example of the JSON file contents:

{
    "scene": "FloorPlan1",
    "tasktype": "...",
    "taskname": "Locate the Apple in the room.",
    "trajectory": [
        "<...>...</...>",
        "<...>...</...>",
        "..."
    ],
    "images": [
        ".../init_observe.png",
        "..."
    ],
    "flag": "",
    "time": "...",
    "task_metadata": {
        "..."
    }
}
  • scene: the scene where the task is performed.
  • tasktype: the type of the task.
  • taskname: the name of the task.
  • trajectory: the reasoning and decision-making content of the trajectory.
  • images: paths to corresponding images (the first image represents the initial state; each subsequent image corresponds to the state after performing each action listed in trajectory); see the replay sketch below.
  • time and flag: record the generation timestamp and any exceptions encountered during trajectory generation.
  • task_metadata: task information generated during Step 1.
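
Based on the field descriptions above, a generated trajectory can be replayed by walking through its images alongside the reasoning content. The sketch below assumes only the fields listed above; the file name trajectory.json is a placeholder.

import json
from pathlib import Path

# Placeholder file name; point it at any generated trajectory JSON.
traj = json.loads(Path("trajectory.json").read_text())

print("Task:", traj["taskname"], "| Scene:", traj["scene"])
print("Trajectory steps:", len(traj["trajectory"]))
print("Initial observation:", traj["images"][0])   # first image = initial state

# The remaining images follow the executed actions in order, one image per action.
for step, image in enumerate(traj["images"][1:], start=1):
    print(f"observation after action {step}: {image}")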

To view our complete trajectory dataset, please visit our Hugging Face Page.
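
For a quick way to pull the released trajectories locally, the snippet below uses huggingface_hub; the dataset repo id is an assumption based on this repository's name, so please check the Hugging Face page for the exact id.

from huggingface_hub import snapshot_download

# The dataset repo id is assumed from this repository's name; verify the
# exact id on our Hugging Face page before downloading.
local_dir = snapshot_download(
    repo_id="zwq2018/embodied_reasoner",
    repo_type="dataset",
    local_dir="embodied_reasoner_data",
)
print("Trajectories downloaded to", local_dir)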

Please refer to data_engine/README.md for details about the data engine.

Citation

If you find our work helpful, feel free to cite us.

@article{embodied-reasoner,
    title   = {Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks}, 
    author  = {Wenqi Zhang and Mengna Wang and Gangao Liu and Huixin Xu and Yiwei Jiang and Yongliang Shen and Guiyang Hou and Zhe Zheng and Hang Zhang and Xin Li and Weiming Lu and Peng Li and Yueting Zhuang},
    journal = {arXiv preprint arXiv:2503.21696},
    year    = {2025}
}

License

Code License

The codebase is licensed under the Mulan (木兰) license.

Contact Us

If you have any questions, please contact us by email: [email protected], [email protected]

Acknowledgements

Our training code is built on LLaMA-Factory, and our simulator is based on AI2-THOR. Thanks for their wonderful work.
