Skill Detail

Drive web and app UIs with vision-grounded steps when selectors are brittle or unavailable

Use Midscene.js when an agent needs screenshot-grounded UI actions and assertions across web, mobile, or desktop surfaces where DOM selectors are fragile, unavailable, or not the right abstraction.

Browser Automation | Multi-Framework | Security Reviewed
⭐ 12.6k GitHub stars ⬇ 83.7k/wk npm
INSTALL WITH ANY AGENT
npx skills add agentskillexchange/skills --skill drive-web-and-app-uis-with-vision-grounded-steps-when-selectors-are-brittle-or-unavailable
Works best when you want a reusable capability, not another fragile one-off prompt.
At a glance
Tools required
Midscene.js, Node.js, a supported vision model, and a target automation surface such as Playwright, Puppeteer, Android adb, or iOS WebDriverAgent
Install & setup
Install the core package with `npm install @midscene/core`, connect it to your browser or device automation surface using the upstream setup guide, then author natural-language UI actions, assertions, and extraction steps through the SDK, YAML flow, or playground tooling.
Author
web-infra-dev
Publisher
Organization
Last updated
Apr 14, 2026
Quick brief

Use Midscene.js when the workflow depends on visual understanding instead of stable selectors. It lets an agent describe goals in natural language, operate interfaces through screenshot-based localization, extract data, assert outcomes, and replay runs across browser, Android, iOS, and other UI surfaces. The scope is deliberately narrow: this skill covers authoring and debugging vision-driven UI actions when selector-first automation breaks down. It is not a general browser framework listing or a product platform pitch.
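
The action, extraction, and assertion steps described above can be sketched as a small flow runner. This is a minimal illustration, not Midscene's own implementation: the `VisionAgent` interface is a hypothetical stand-in whose method names (`aiAction`, `aiQuery`, `aiAssert`) are modeled on the natural-language agent style. Their exact signatures are assumptions here and should be checked against the upstream documentation. `Step`, `runFlow`, `StubAgent`, and `demoFlow` are invented for this example, and the stub returns canned data so the sketch runs without a browser, device, or vision model.

```typescript
// Hypothetical stand-in for a vision-grounded agent (assumption, not the real SDK type).
interface VisionAgent {
  aiAction(instruction: string): Promise<void>;
  aiQuery<T>(dataDemand: string): Promise<T>;
  aiAssert(assertion: string): Promise<void>;
}

// One natural-language step: act on the UI, extract data, or assert an outcome.
type Step =
  | { kind: "action"; prompt: string }
  | { kind: "query"; prompt: string; saveAs: string }
  | { kind: "assert"; prompt: string };

interface RunReport {
  log: string[];                  // ordered record of executed steps, useful for replay/debugging
  data: Record<string, unknown>;  // values captured by query steps, keyed by saveAs
}

// Runs steps in order; a rejected step (e.g. a failed assertion) aborts the flow.
async function runFlow(agent: VisionAgent, steps: Step[]): Promise<RunReport> {
  const report: RunReport = { log: [], data: {} };
  for (const step of steps) {
    switch (step.kind) {
      case "action":
        await agent.aiAction(step.prompt);
        break;
      case "query":
        report.data[step.saveAs] = await agent.aiQuery<unknown>(step.prompt);
        break;
      case "assert":
        await agent.aiAssert(step.prompt);
        break;
    }
    report.log.push(`${step.kind}: ${step.prompt}`);
  }
  return report;
}

// Stub agent with canned behavior so the sketch is runnable offline.
class StubAgent implements VisionAgent {
  async aiAction(_instruction: string): Promise<void> {
    // A real agent would locate the target in a screenshot and act on it.
  }
  async aiQuery<T>(_dataDemand: string): Promise<T> {
    return ["Item A", "Item B"] as unknown as T; // canned extraction result
  }
  async aiAssert(assertion: string): Promise<void> {
    if (assertion.trim() === "") throw new Error("empty assertion");
  }
}

// Hypothetical flow; the prompts are illustrative only.
const demoFlow: Step[] = [
  { kind: "action", prompt: "open the cart from the top navigation" },
  { kind: "query", prompt: "string[], names of the items in the cart", saveAs: "cartItems" },
  { kind: "assert", prompt: "the checkout button is visible" },
];

runFlow(new StubAgent(), demoFlow).then((report) => {
  console.log(report.log.join("\n"));
});
```

Keeping the steps as plain data means the same flow can be logged, replayed, or handed to a real agent bound to Playwright, Puppeteer, or a device driver, with only the `VisionAgent` implementation swapped out.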