OSWorld

OSWorld is an agent evaluation benchmark for desktop and browser computer-use tasks. It measures whether an agent can navigate real operating-system interfaces, click the correct UI elements, and complete tasks that span multiple applications.

How it works

  1. Tasks run inside real operating-system environments (Linux desktop, browsers) rather than in simulated text-only sandboxes.

  2. The agent observes the screen, typically via screenshots, and emits actions like mouse clicks, keyboard input, and application launches.

  3. Success is judged by whether the final state of the OS matches the task specification, not by intermediate reasoning quality.

  4. Many tasks span multiple applications, which exposes weaknesses in cross-app context tracking that single-app benchmarks miss.
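The observe-act loop and the final-state check described above can be sketched roughly as follows. This is a toy illustration, not OSWorld's actual API: `ToyDesktop`, `scripted_agent`, `run_episode`, and `task_succeeded` are hypothetical names, and a plain dictionary stands in for the screenshot a real harness would return.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyDesktop:
    """Hypothetical stand-in for a real OS environment (not OSWorld's API)."""
    open_apps: list = field(default_factory=list)
    typed_text: str = ""

    def observe(self) -> dict:
        # A real harness would return a screenshot; a dict stands in for it here.
        return {"open_apps": list(self.open_apps), "typed_text": self.typed_text}

    def step(self, action: dict) -> None:
        # Apply one action to the environment state.
        if action["type"] == "launch":
            self.open_apps.append(action["app"])
        elif action["type"] == "type":
            self.typed_text += action["text"]


def scripted_agent(observation: dict) -> Optional[dict]:
    """Toy policy: launch an editor, type a note, then stop."""
    if "editor" not in observation["open_apps"]:
        return {"type": "launch", "app": "editor"}
    if observation["typed_text"] != "hello":
        return {"type": "type", "text": "hello"}
    return None  # nothing left to do


def run_episode(env: ToyDesktop, max_steps: int = 10) -> int:
    """Observe-act loop; returns the number of actions taken."""
    steps = 0
    for _ in range(max_steps):
        action = scripted_agent(env.observe())
        if action is None:
            break
        env.step(action)
        steps += 1
    return steps


def task_succeeded(env: ToyDesktop) -> bool:
    # Only the final state is checked, not how the agent got there.
    return "editor" in env.open_apps and env.typed_text == "hello"


env = ToyDesktop()
steps_taken = run_episode(env)
success = task_succeeded(env)
```

Note that `task_succeeded` never inspects the action history: an agent that reached the same end state by a different route would score the same, which mirrors the final-state judging in step 3.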

Example

An agent that scores well on OSWorld can drive a real desktop (open a file manager, edit a spreadsheet, copy data into an email) rather than merely generate plausible-looking action sequences in text.
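To make the cross-app case concrete, here is a minimal sketch of a final-state check for a spreadsheet-to-email task. The snapshot layout (`spreadsheet`, `cells`, `email_draft`, `body`), the cell reference `B5`, and the function name are all assumptions chosen for illustration; they are not taken from OSWorld's task format.

```python
def check_cross_app_task(final_state: dict) -> bool:
    """Return True if the spreadsheet value in B5 appears in the email draft.

    `final_state` is a hypothetical snapshot of the desktop after the
    episode; its keys are assumptions made for this sketch.
    """
    sheet = final_state.get("spreadsheet", {})
    draft = final_state.get("email_draft", {})
    total = sheet.get("cells", {}).get("B5")
    if total is None:
        return False  # the spreadsheet was never edited
    return str(total) in draft.get("body", "")


# The value was carried correctly from one application to the other.
good_run = {
    "spreadsheet": {"cells": {"B5": 1240}},
    "email_draft": {"body": "Q3 total: 1240"},
}

# A plausible-looking email, but the digits were transposed along the way.
bad_run = {
    "spreadsheet": {"cells": {"B5": 1240}},
    "email_draft": {"body": "Q3 total: 1204"},
}

good_result = check_cross_app_task(good_run)
bad_result = check_cross_app_task(bad_run)
```

The second run fails even though the email reads sensibly, which is the kind of cross-app tracking error that a check confined to a single application would not catch.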

