Commit 8d455d9

Merge pull request #125 from simular-ai/s2_5
πŸ‘·πŸ§ πŸ€–πŸ˜΅πŸ•΄οΈ
2 parents eb6759d + 39d9635 commit 8d455d9

33 files changed: +3368, -593 lines

β€Ž.gitmodules

Lines changed: 0 additions & 3 deletions
This file was deleted.

β€ŽPerplexica

Lines changed: 0 additions & 1 deletion
This file was deleted.

β€ŽREADME.md

Lines changed: 81 additions & 116 deletions
@@ -100,116 +100,90 @@ Whether you're interested in AI, automation, or contributing to cutting-edge age
 
 
 ## 🛠️ Installation & Setup
-> **Note**: Our agent returns `pyautogui` code and is intended for a single monitor screen.
 
-> ❗**Warning**❗: If you are on a Linux machine, creating a `conda` environment will interfere with `pyatspi`. As of now, there's no clean solution for this issue. Proceed through the installation without using `conda` or any virtual environment.
+### Prerequisites
+- **Single Monitor**: Our agent is designed for single monitor screens
+- **Linux Users**: Avoid `conda` environments as they interfere with `pyatspi`
+- **Security**: The agent runs Python code to control your computer - use with care
 
-> ⚠️**Disclaimer**⚠️: To leverage the full potential of Agent S2, we utilize [UI-TARS](https://github.com/bytedance/UI-TARS) as a grounding model (7B-DPO or 72B-DPO for better performance). They can be hosted locally, or on Hugging Face Inference Endpoints. Our code supports Hugging Face Inference Endpoints. Check out [Hugging Face Inference Endpoints](https://huggingface.co/learn/cookbook/en/enterprise_dedicated_endpoints) for more information on how to set up and query this endpoint. However, running Agent S2 does not require this model, and you can use alternative API based models for visual grounding, such as Claude.
-
-Install the package:
-```
+### Installation
+```bash
 pip install gui-agents
 ```
 
-Set your LLM API Keys and other environment variables. You can do this by adding the following line to your .bashrc (Linux), or .zshrc (MacOS) file.
+### API Configuration
 
-```
+#### Option 1: Environment Variables
+Add to your `.bashrc` (Linux) or `.zshrc` (MacOS):
+```bash
 export OPENAI_API_KEY=<YOUR_API_KEY>
 export ANTHROPIC_API_KEY=<YOUR_ANTHROPIC_API_KEY>
 export HF_TOKEN=<YOUR_HF_TOKEN>
 ```
 
-Alternatively, you can set the environment variable in your Python script:
-
-```
+#### Option 2: Python Script
+```python
 import os
 os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>"
 ```
 
-We also support Azure OpenAI, Anthropic, Gemini, Open Router, and vLLM inference. For more information refer to [models.md](models.md).
+### Supported Models
+We support Azure OpenAI, Anthropic, Gemini, Open Router, and vLLM inference. See [models.md](models.md) for details.
 
-> ❗**Warning**❗: The agent will directly run python code to control your computer. Please use with care.
+### Grounding Models (Required)
+For optimal performance, we recommend [UI-TARS-1.5-7B](https://huggingface.co/ByteDance-Seed/UI-TARS-1.5-7B) hosted on Hugging Face Inference Endpoints or another provider. See [Hugging Face Inference Endpoints](https://huggingface.co/learn/cookbook/en/enterprise_dedicated_endpoints) for setup instructions.
 
 ## 🚀 Usage
 
 
-> **Note**: Our best configuration uses o3 and UI-TARS-1.5-7B.
-
-### CLI
+> ⚡️ **Recommended Setup:**
+> For the best configuration, we recommend using **OpenAI o3-2025-04-16** as the main model, paired with **UI-TARS-1.5-7B** for grounding.
 
-Run Agent S2 with a specific model (default is `gpt-4o`):
 
-```sh
-agent_s2 \
-  --provider "anthropic" \
-  --model "claude-3-7-sonnet-20250219" \
-  --grounding_model_provider "anthropic" \
-  --grounding_model "claude-3-7-sonnet-20250219" \
-```
+### CLI
 
-Or use a custom endpoint:
+Run Agent S2.5 with the required parameters:
 
 ```bash
-agent_s2 \
-  --provider "anthropic" \
-  --model "claude-3-7-sonnet-20250219" \
-  --endpoint_provider "huggingface" \
-  --endpoint_url "<endpoint_url>/v1/"
+agent_s \
+  --provider openai \
+  --model o3-2025-04-16 \
+  --ground_provider huggingface \
+  --ground_url http://localhost:8080 \
+  --ground_model ui-tars-1.5-7b \
+  --grounding_width 1920 \
+  --grounding_height 1080
 ```
 
-#### Main Model Settings
-- **`--provider`**, **`--model`**
-  - Purpose: Specifies the main generation model
-  - Supports: all model providers in [models.md](models.md)
-  - Default: `--provider "anthropic" --model "claude-3-7-sonnet-20250219"`
-- **`--model_url`**, **`--model_api_key`**
-  - Purpose: Specifies the custom endpoint for the main generation model and your API key
-  - Note: These are optional. If not specified, `gui-agents` will default to your environment variables for the URL and API key.
-  - Supports: all model providers in [models.md](models.md)
-  - Default: None
-
-#### Grounding Configuration Options
-
-You can use either Configuration 1 or Configuration 2:
-
-##### **(Default) Configuration 1: API-Based Models**
-- **`--grounding_model_provider`**, **`--grounding_model`**
-  - Purpose: Specifies the model for visual grounding (coordinate prediction)
-  - Supports: all model providers in [models.md](models.md)
-  - Default: `--grounding_model_provider "anthropic" --grounding_model "claude-3-7-sonnet-20250219"`
-- ❗**Important**❗ **`--grounding_model_resize_width`**
-  - Purpose: Some API providers automatically rescale images. Therefore, the generated (x, y) will be relative to the rescaled image dimensions, instead of the original image dimensions.
-  - Supports: [Anthropic rescaling](https://docs.anthropic.com/en/docs/build-with-claude/vision#)
-  - Tips: If your grounding is inaccurate even for very simple queries, double check your rescaling width is correct for your machine's resolution.
-  - Default: `--grounding_model_resize_width 1366` (Anthropic)
-
-##### **Configuration 2: Custom Endpoint**
-- **`--endpoint_provider`**
-  - Purpose: Specifies the endpoint provider
-  - Supports: HuggingFace TGI, vLLM, Open Router
-  - Default: None
-
-- **`--endpoint_url`**
-  - Purpose: The URL for your custom endpoint
-  - Default: None
-
-- **`--endpoint_api_key`**
-  - Purpose: Your API key for your custom endpoint
-  - Note: This is optional. If not specified, `gui-agents` will default to your environment variables for the API key.
-  - Default: None
-
-> **Note**: Configuration 2 takes precedence over Configuration 1.
-
-This will show a user query prompt where you can enter your query and interact with Agent S2. You can use any model from the list of supported models in [models.md](models.md).
+#### Required Parameters
+- **`--provider`**: Main generation model provider (e.g., openai, anthropic, etc.) - Default: "openai"
+- **`--model`**: Main generation model name (e.g., o3-2025-04-16) - Default: "o3-2025-04-16"
+- **`--ground_provider`**: The provider for the grounding model - **Required**
+- **`--ground_url`**: The URL of the grounding model - **Required**
+- **`--ground_model`**: The model name for the grounding model - **Required**
+- **`--grounding_width`**: Width of the output coordinate resolution from the grounding model - **Required**
+- **`--grounding_height`**: Height of the output coordinate resolution from the grounding model - **Required**
+
+#### Grounding Model Dimensions
+The grounding width and height should match the output coordinate resolution of your grounding model:
+- **UI-TARS-1.5-7B**: Use `--grounding_width 1920 --grounding_height 1080`
+- **UI-TARS-72B**: Use `--grounding_width 1000 --grounding_height 1000`
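(Aside: the grounding resolution above rarely matches the physical screen, so a predicted point has to be rescaled before it is clicked. A minimal sketch of that mapping — a hypothetical helper written for illustration, not part of `gui_agents`:)

```python
# Hypothetical helper: map a grounding model's (x, y) prediction from its
# output coordinate space (e.g. 1920x1080 for UI-TARS-1.5-7B) onto the
# actual screen resolution.
def rescale(x, y, ground_w, ground_h, screen_w, screen_h):
    return round(x * screen_w / ground_w), round(y * screen_h / ground_h)

# A point at the center of the 1920x1080 grounding space lands at the
# center of a 2560x1440 screen.
print(rescale(960, 540, 1920, 1080, 2560, 1440))  # (1280, 720)
```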
+
+#### Optional Parameters
+- **`--model_url`**: Custom API URL for main generation model - Default: ""
+- **`--model_api_key`**: API key for main generation model - Default: ""
+- **`--ground_api_key`**: API key for grounding model endpoint - Default: ""
+- **`--max_trajectory_length`**: Maximum number of image turns to keep in trajectory - Default: 8
+- **`--enable_reflection`**: Enable reflection agent to assist the worker agent - Default: True
 
 ### `gui_agents` SDK
 
-First, we import the necessary modules. `AgentS2` is the main agent class for Agent S2. `OSWorldACI` is our grounding agent that translates agent actions into executable python code.
+First, we import the necessary modules. `AgentS2_5` is the main agent class for Agent S2.5. `OSWorldACI` is our grounding agent that translates agent actions into executable python code.
 ```python
 import pyautogui
 import io
-from gui_agents.s2.agents.agent_s import AgentS2
-from gui_agents.s2.agents.grounding import OSWorldACI
+from gui_agents.s2_5.agents.agent_s import AgentS2_5
+from gui_agents.s2_5.agents.grounding import OSWorldACI
 
 # Load in your API keys.
 from dotenv import load_dotenv
@@ -218,7 +192,7 @@ load_dotenv()
 current_platform = "linux" # "darwin", "windows"
 ```
 
-Next, we define our engine parameters. `engine_params` is used for the main agent, and `engine_params_for_grounding` is for grounding. For `engine_params_for_grounding`, we support the Claude, GPT series, and Hugging Face Inference Endpoints.
+Next, we define our engine parameters. `engine_params` is used for the main agent, and `engine_params_for_grounding` is for grounding. For `engine_params_for_grounding`, we support custom endpoints like HuggingFace TGI, vLLM, and Open Router.
 
 ```python
 engine_params = {
@@ -228,50 +202,45 @@ engine_params = {
     "api_key": model_api_key, # Optional
 }
 
-# Grounding Configuration 1: Load the grounding engine from an API based model
-grounding_model_provider = "<your_grounding_model_provider>"
-grounding_model = "<your_grounding_model>"
-grounding_model_resize_width = 1366
-screen_width, screen_height = pyautogui.size()
+# Load the grounding engine from a custom endpoint
+ground_provider = "<your_ground_provider>"
+ground_url = "<your_ground_url>"
+ground_model = "<your_ground_model>"
+ground_api_key = "<your_ground_api_key>"
 
-engine_params_for_grounding = {
-    "engine_type": grounding_model_provider,
-    "model": grounding_model,
-    "grounding_width": grounding_model_resize_width,
-    "grounding_height": screen_height
-    * grounding_model_resize_width
-    / screen_width,
-}
-
-# Grounding Configuration 2: Load the grounding engine from a HuggingFace TGI endpoint
-endpoint_provider = "<your_endpoint_provider>"
-endpoint_url = "<your_endpoint_url>"
-endpoint_api_key = "<your_api_key>"
+# Set grounding dimensions based on your model's output coordinate resolution
+# UI-TARS-1.5-7B: grounding_width=1920, grounding_height=1080
+# UI-TARS-72B: grounding_width=1000, grounding_height=1000
+grounding_width = 1920 # Width of output coordinate resolution
+grounding_height = 1080 # Height of output coordinate resolution
 
 engine_params_for_grounding = {
-    "engine_type": endpoint_provider,
-    "base_url": endpoint_url,
-    "api_key": endpoint_api_key, # Optional
+    "engine_type": ground_provider,
+    "model": ground_model,
+    "base_url": ground_url,
+    "api_key": ground_api_key, # Optional
+    "grounding_width": grounding_width,
+    "grounding_height": grounding_height,
 }
 ```
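(Aside: since the grounding engine is configured purely through this dict, a quick sanity check before constructing the agent can catch missing keys early. The required-key list below is an assumption inferred from the example above, not an official `gui_agents` API:)

```python
# Hypothetical validation for a grounding engine config like the one above;
# the key set is assumed from the README example, not a published contract.
REQUIRED_KEYS = {"engine_type", "model", "base_url",
                 "grounding_width", "grounding_height"}

def missing_keys(params):
    # Return the required keys absent from the config, sorted for readability.
    return sorted(REQUIRED_KEYS - params.keys())

params = {
    "engine_type": "huggingface",
    "model": "ui-tars-1.5-7b",
    "base_url": "http://localhost:8080",
    "grounding_width": 1920,
}
print(missing_keys(params))  # ['grounding_height']
```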
 
-Then, we define our grounding agent and Agent S2.
+Then, we define our grounding agent and Agent S2.5.
 
 ```python
 grounding_agent = OSWorldACI(
     platform=current_platform,
     engine_params_for_generation=engine_params,
-    engine_params_for_grounding=engine_params_for_grounding
+    engine_params_for_grounding=engine_params_for_grounding,
+    width=1920, # Optional: screen width
+    height=1080 # Optional: screen height
 )
 
-agent = AgentS2(
-    engine_params,
-    grounding_agent,
-    platform=current_platform,
-    action_space="pyautogui",
-    observation_type="screenshot",
-    search_engine="Perplexica", # Assuming you have set up Perplexica.
-    embedding_engine_type="openai" # Supports "gemini", "openai"
+agent = AgentS2_5(
+    engine_params,
+    grounding_agent,
+    platform=current_platform,
+    max_trajectory_length=8, # Optional: maximum image turns to keep
+    enable_reflection=True # Optional: enable reflection agent
 )
 ```
@@ -294,19 +263,15 @@ info, action = agent.predict(instruction=instruction, observation=obs)
 exec(action[0])
 ```
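(Aside: because `action[0]` is plain Python source, the execution step really is just `exec()` over a string. A display-free stand-in — the action below is fabricated for illustration; real actions contain `pyautogui` calls and need a screen:)

```python
# Fabricated stand-in action: real agent output would be pyautogui code.
captured = []
fake_action = ["captured.append('click at (100, 200)')"]
exec(fake_action[0])  # executes the generated source, just like the CLI does
print(captured)  # ['click at (100, 200)']
```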
 
-Refer to `gui_agents/s2/cli_app.py` for more details on how the inference loop works.
+Refer to `gui_agents/s2_5/cli_app.py` for more details on how the inference loop works.
 
 ### OSWorld
 
-To deploy Agent S2 in OSWorld, follow the [OSWorld Deployment instructions](osworld_setup/s2/OSWorld.md).
-
-### WindowsAgentArena
-
-To deploy Agent S2 in WindowsAgentArena, follow the [WindowsAgentArena Deployment Instructions](WAA_setup.md).
+To deploy Agent S2.5 in OSWorld, follow the [OSWorld Deployment instructions](osworld_setup/s2_5/OSWorld.md).
 
 ## 💬 Citations
 
-If you find this codebase useful, please cite
+If you find this codebase useful, please cite:
 
 ```
 @misc{Agent-S2,
File renamed without changes.

β€Žgui_agents/s2/core/engine.py

Lines changed: 6 additions & 4 deletions
@@ -469,15 +469,17 @@ def generate(self, messages, temperature=0.0, max_new_tokens=None, **kwargs):
                 "A Parasail API key needs to be provided in either the api_key parameter or as an environment variable named PARASAIL_API_KEY"
             )
         if not self.llm_client:
-            self.llm_client = OpenAI(base_url="https://api.parasail.io/v1", api_key=api_key)
+            self.llm_client = OpenAI(
+                base_url="https://api.parasail.io/v1", api_key=api_key
+            )
         return (
             self.llm_client.chat.completions.create(
                 model=self.model,
                 messages=messages,
                 max_tokens=max_new_tokens if max_new_tokens else 4096,
                 temperature=temperature,
-                **kwargs
+                **kwargs,
             )
-            .choices[0].
-            message.content
+            .choices[0]
+            .message.content
         )

β€Žgui_agents/s2/core/mllm.py

Lines changed: 1 addition & 1 deletion
@@ -128,7 +128,7 @@ def add_message(
                 LMMEngineHuggingFace,
                 LMMEngineGemini,
                 LMMEngineOpenRouter,
-                LMMEngineParasail
+                LMMEngineParasail,
             ),
         ):
             # infer role from previous message

β€Žgui_agents/s2_5/agents/__init__.py

Whitespace-only changes.
