OSWorld

OSWorld is an agent evaluation benchmark for desktop and browser computer-use tasks. It measures whether an agent can navigate real operating-system interfaces, click the correct UI elements, and complete tasks that span multiple applications.

How it works

1. Tasks run inside real operating-system environments (Linux desktop, browsers) rather than in simulated text-only sandboxes.
2. The agent observes the screen, typically via screenshots, and emits actions like mouse clicks, keyboard input, and application launches.
3. Success is judged by whether the final state of the OS matches the task specification, not by intermediate reasoning quality.
4. Many tasks span multiple applications, which exposes weaknesses in cross-app context tracking that single-app benchmarks miss.
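The observe, act, and check cycle described above can be sketched in a few lines of Python. This is a toy illustration, not OSWorld's actual harness or API: `ToyDesktop`, `Action`, `run_episode`, and `scripted_policy` are hypothetical names, and a plain dictionary stands in for the screenshot a real agent would receive.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Action:
    kind: str          # "launch", "click", or "type"
    target: str = ""   # app or UI element the action applies to
    text: str = ""     # keyboard input for "type" actions


@dataclass
class ToyDesktop:
    """Hypothetical stand-in for a real OS environment."""
    open_apps: set = field(default_factory=set)
    files: dict = field(default_factory=dict)
    focused: str = ""

    def observe(self) -> dict:
        # A real harness would return a screenshot; this returns a snapshot dict.
        return {"open_apps": sorted(self.open_apps), "focused": self.focused}

    def step(self, action: Action) -> None:
        if action.kind == "launch":
            self.open_apps.add(action.target)
            self.focused = action.target
        elif action.kind == "click":
            self.focused = action.target
        elif action.kind == "type":
            # Typed text is appended to whatever element currently has focus.
            self.files[self.focused] = self.files.get(self.focused, "") + action.text


def run_episode(env: ToyDesktop,
                policy: Callable[[dict], Optional[Action]],
                check: Callable[[ToyDesktop], bool],
                max_steps: int = 10) -> bool:
    """Run the agent, then judge only the final state of the environment."""
    for _ in range(max_steps):
        action = policy(env.observe())
        if action is None:  # the agent declares it is done
            break
        env.step(action)
    return check(env)


def scripted_policy(actions):
    """Replay a fixed action list; a real agent would decide from the observation."""
    queue = iter(actions)
    return lambda observation: next(queue, None)


def notes_saved(env: ToyDesktop) -> bool:
    # Task specification: notes.txt must contain exactly "hello".
    return env.files.get("notes.txt") == "hello"


full_run = ToyDesktop()
full_result = run_episode(
    full_run,
    scripted_policy([
        Action("launch", "editor"),
        Action("click", "notes.txt"),
        Action("type", text="hello"),
    ]),
    notes_saved,
)

# An agent that stops before typing leaves the wrong final state and fails.
partial_run = ToyDesktop()
partial_result = run_episode(
    partial_run,
    scripted_policy([Action("launch", "editor")]),
    notes_saved,
)
```

Note that `run_episode` never inspects the agent's reasoning or the individual actions it took. Only the state left behind is checked, which mirrors the final-state judging described in step 3.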

Example

An agent that scores well on OSWorld can drive a real desktop (open a file manager, edit a spreadsheet, copy data into an email) rather than merely generate plausible-looking action sequences in text.
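To make the spreadsheet-to-email example concrete, here is a minimal sketch of what a final-state check for such a cross-app task might look like. The state layout, the `total_was_emailed` function, and the example address are assumptions made for illustration. They are not taken from OSWorld's own task definitions.

```python
def total_was_emailed(state: dict, recipient: str) -> bool:
    """Pass only if a sent email to `recipient` contains the spreadsheet total."""
    expected = str(sum(state["spreadsheet"]["amounts"]))
    return any(
        mail["to"] == recipient and expected in mail["body"].split()
        for mail in state["outbox"]
    )


# Hypothetical final state after a successful run: the total (42) was sent.
succeeded = {
    "spreadsheet": {"amounts": [10, 15, 17]},
    "outbox": [{"to": "boss@example.com", "body": "Total is 42"}],
}

# Hypothetical final state where the agent wrote a draft but never sent it.
only_drafted = {
    "spreadsheet": {"amounts": [10, 15, 17]},
    "outbox": [],
    "drafts": [{"to": "boss@example.com", "body": "Total is 42"}],
}

sent_ok = total_was_emailed(succeeded, "boss@example.com")
draft_ok = total_was_emailed(only_drafted, "boss@example.com")
```

The second state looks almost right, but the check fails because nothing reached the outbox. A plausible-looking sequence of actions earns no credit unless the desktop ends up in the required state.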
