OSWorld
OSWorld is an agent evaluation benchmark for desktop and browser computer-use tasks. It measures whether an agent can navigate real operating-system interfaces, click the correct UI elements, and complete tasks that span multiple applications.
How it works
1. Tasks run inside real operating-system environments (Linux desktop, browsers) rather than in simulated text-only sandboxes.
2. The agent observes the screen, typically via screenshots, and emits actions like mouse clicks, keyboard input, and application launches.
3. Success is judged by whether the final state of the OS matches the task specification, not by intermediate reasoning quality.
4. Many tasks span multiple applications, which exposes weaknesses in cross-app context tracking that single-app benchmarks miss.
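The observe-act loop and the final-state check described above can be sketched in a few lines of Python. This is a toy illustration only, not the OSWorld harness or its API: `ToyDesktop`, `scripted_agent`, `task_succeeded`, and `run_episode` are hypothetical names, and the "desktop" is a plain dictionary standing in for a real OS that an agent would normally see through screenshots.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ToyDesktop:
    """Hypothetical stand-in for a real OS environment; state is a plain dict."""
    state: dict = field(default_factory=lambda: {
        "spreadsheet": {"A1": "42"},
        "clipboard": None,
        "email_body": "",
    })

    def observe(self) -> dict:
        # A real harness would return a screenshot; here we return a state snapshot.
        return copy.deepcopy(self.state)

    def step(self, action: dict) -> None:
        # Apply one agent action to the environment.
        kind = action["type"]
        if kind == "copy_cell":
            self.state["clipboard"] = self.state["spreadsheet"][action["cell"]]
        elif kind == "paste_into_email" and self.state["clipboard"] is not None:
            self.state["email_body"] += self.state["clipboard"]
        # Unknown actions are ignored, like a click on empty screen space.


def scripted_agent(observation: dict, step_index: int):
    """Toy agent with a fixed plan; a real agent would decide from the observation."""
    plan = [
        {"type": "copy_cell", "cell": "A1"},   # action in the spreadsheet app
        {"type": "paste_into_email"},          # action in the email app
    ]
    return plan[step_index] if step_index < len(plan) else None


def task_succeeded(env: ToyDesktop) -> bool:
    # Success depends only on the final state, not on which actions were taken.
    return env.state["email_body"] == env.state["spreadsheet"]["A1"]


def run_episode(env: ToyDesktop, agent, max_steps: int = 10) -> bool:
    for i in range(max_steps):
        action = agent(env.observe(), i)
        if action is None:  # agent signals it is done
            break
        env.step(action)
    return task_succeeded(env)


if __name__ == "__main__":
    print(run_episode(ToyDesktop(), scripted_agent))
```

Because the checker inspects only the end state, an agent that emits no effective actions fails the task no matter how reasonable its plan looks on paper, which is the property the third point describes.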
Example
An agent that scores well on OSWorld has shown it can drive a real desktop (open a file manager, edit a spreadsheet, copy data into an email) rather than just generate plausible-looking action sequences in text.