
Evaluate long-horizon agents against WildClawBench

Use WildClawBench to benchmark agents on hard end-to-end OpenClaw tasks covering tool orchestration, multimodal work, coding, safety, and long-horizon planning.


Templates & Workflows OpenClaw Published

⭐ 359 GitHub stars

INSTALL WITH ANY AGENT

npx skills add agentskillexchange/skills --skill evaluate-long-horizon-agents-against-wildclawbench

Works best when you want a reusable capability, not another fragile one-off prompt.

View source Documentation

At a glance

Tools required

WildClawBench assets; OpenClaw environment; target agent/model under test

Install & setup

Follow the WildClawBench repository and documentation to obtain benchmark assets, configure an OpenClaw evaluation environment, run supported harnesses against the target agent, and compare results with the published task categories and leaderboard.

Author

InternLM

Publisher

Research Open Source

Last updated

May 13, 2026

Quick brief

Approve this skill for model, agent, or platform teams that need a practical evaluation harness beyond toy function-calling tests. The operator uses WildClawBench’s task set, harnesses, leaderboard materials, and report to run or compare agents in a realistic OpenClaw environment, then reviews pass rates and failure modes across agency, multimodal, long-horizon, coding, and safety categories. Use this instead of ad hoc manual testing when selecting models, validating an agent release, or tracking regressions on real multi-step tasks. The scope boundary is agent evaluation and benchmark analysis; it is not a general OpenClaw product page or a generic academic paper listing.
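The review of pass rates across categories can be sketched as a small aggregation. This is a minimal illustration only: the record shape (`task_id`, `category`, `passed`) and the function name `pass_rates_by_category` are assumptions made for this example, not WildClawBench's documented output format, and the sample records are toy placeholders rather than real benchmark results.

```python
from collections import defaultdict


def pass_rates_by_category(results):
    """Return {category: fraction of tasks passed} for a list of result dicts.

    Assumed (hypothetical) record shape: {"task_id": str, "category": str, "passed": bool}.
    The real harness output may differ; adapt the keys to the upstream format.
    """
    totals = defaultdict(int)
    passed = defaultdict(int)
    for record in results:
        totals[record["category"]] += 1
        if record["passed"]:
            passed[record["category"]] += 1
    return {category: passed[category] / totals[category] for category in totals}


# Toy placeholder data for illustration, not real benchmark results.
example_results = [
    {"task_id": "t1", "category": "coding", "passed": True},
    {"task_id": "t2", "category": "coding", "passed": False},
    {"task_id": "t3", "category": "safety", "passed": True},
]

rates = pass_rates_by_category(example_results)
for category, rate in sorted(rates.items()):
    print(f"{category}: {rate:.0%}")
```

Grouping by category before averaging keeps a weak category from being hidden behind a strong overall pass rate, which is the point of reviewing failure modes per category.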

How it works

What this skill actually does

Inputs and prerequisites: WildClawBench assets; OpenClaw environment; target agent/model under test.

Setup notes: Follow the WildClawBench repository and documentation to obtain benchmark assets, configure an OpenClaw evaluation environment, run supported harnesses against the target agent, and compare results with the published task categories and leaderboard.
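One way to make the comparison step concrete is a per-category diff between two runs, for example a baseline agent release and a candidate. The sketch below assumes each run has already been reduced to a `{category: pass_rate}` mapping; the function name `find_regressions`, the `tolerance` parameter, and the sample numbers are all hypothetical and do not come from the WildClawBench harness or leaderboard.

```python
def find_regressions(baseline, candidate, tolerance=0.02):
    """Return {category: (baseline_rate, candidate_rate)} for categories that dropped.

    A category counts as regressed when the candidate's pass rate falls below
    the baseline by more than `tolerance`. Categories missing from the
    candidate run are skipped rather than treated as failures.
    """
    regressions = {}
    for category, base_rate in baseline.items():
        new_rate = candidate.get(category)
        if new_rate is None:
            continue
        if base_rate - new_rate > tolerance:
            regressions[category] = (base_rate, new_rate)
    return regressions


# Toy placeholder pass rates for illustration, not real benchmark results.
baseline = {"coding": 0.50, "safety": 0.90}
candidate = {"coding": 0.55, "safety": 0.80}

regressed = find_regressions(baseline, candidate)
for category, (old, new) in regressed.items():
    print(f"{category}: {old:.0%} -> {new:.0%}")
```

The tolerance is there because small run-to-run differences are expected on multi-step tasks; what value is appropriate depends on the number of tasks per category and is left to the operator.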

Source and verification boundary: use https://internlm.github.io/WildClawBench/ as the canonical reference before running the workflow; keep commands, API calls, CLI usage, and generated outputs reviewable against that upstream source.

Framework fit: publish this as an OpenClaw workflow only when the operator can invoke the documented toolchain directly, rather than treating the upstream project as a generic product listing.

Best fit

When to reach for it

Best when the job fits Templates & Workflows.
Works naturally with OpenClaw setups.
Requires WildClawBench assets; OpenClaw environment; target agent/model under test.
Installation follows the upstream WildClawBench repository and documentation; see Install & setup above for the steps.

Trust & provenance

Why this listing is credible

Trust status: Published.
359 GitHub stars on the linked upstream source.
Last updated May 13, 2026.
