Skill Detail

Evaluate long-horizon agents against WildClawBench

Use WildClawBench to benchmark agents on hard end-to-end OpenClaw tasks covering tool orchestration, multimodal work, coding, safety, and long-horizon planning.

Templates & Workflows OpenClaw Published
⭐ 359 GitHub stars
INSTALL WITH ANY AGENT
npx skills add agentskillexchange/skills --skill evaluate-long-horizon-agents-against-wildclawbench
Works best when you want a reusable capability, not another fragile one-off prompt.
At a glance
Tools required
WildClawBench assets; OpenClaw environment; target agent/model under test
Install & setup
Follow the WildClawBench repository and documentation to obtain benchmark assets, configure an OpenClaw evaluation environment, run supported harnesses against the target agent, and compare results with the published task categories and leaderboard.
Author
InternLM
Publisher
Research Open Source
Last updated
May 13, 2026
Quick brief

Approve this skill for model, agent, or platform teams that need a practical evaluation harness beyond toy function-calling tests. The operator uses WildClawBench’s task set, harnesses, leaderboard materials, and report to run or compare agents in a realistic OpenClaw environment, then reviews pass rates and failure modes across agency, multimodal, long-horizon, coding, and safety categories. Use this instead of ad hoc manual testing when selecting models, validating an agent release, or tracking regressions on real multi-step tasks. The scope boundary is agent evaluation and benchmark analysis; it is not a general OpenClaw product page or a generic academic paper listing.
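The review step described above (pass rates and failure modes per category) can be sketched in a few lines of Python. This is a minimal illustration, not WildClawBench's own tooling: the record fields `task`, `category`, and `passed`, and the sample records themselves, are hypothetical. Check the upstream repository for the real harness output format before adapting it.

```python
from collections import defaultdict

# Hypothetical result records. The field names and values are illustrative
# assumptions, not the actual WildClawBench harness output.
results = [
    {"task": "t1", "category": "coding", "passed": True},
    {"task": "t2", "category": "coding", "passed": False},
    {"task": "t3", "category": "safety", "passed": True},
    {"task": "t4", "category": "multimodal", "passed": False},
]


def pass_rates(records):
    """Return the fraction of passed tasks for each category."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for record in records:
        totals[record["category"]] += 1
        passes[record["category"]] += int(record["passed"])
    return {category: passes[category] / totals[category] for category in totals}


def failures(records):
    """Group the names of failed tasks by category for manual review."""
    by_category = defaultdict(list)
    for record in records:
        if not record["passed"]:
            by_category[record["category"]].append(record["task"])
    return dict(by_category)


rates = pass_rates(results)
failed = failures(results)
print(rates)
print(failed)
```

Keeping the failed task names next to the rates makes it easier to read failure modes rather than a single aggregate score.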

How it works

What this skill actually does

Inputs and prerequisites: WildClawBench assets; OpenClaw environment; target agent/model under test.

Setup notes: Follow the WildClawBench repository and documentation to obtain benchmark assets, configure an OpenClaw evaluation environment, run supported harnesses against the target agent, and compare results with the published task categories and leaderboard.
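Once per-category pass rates exist for two runs, a comparison step might look like the sketch below. It assumes each run has already been reduced to a mapping from category to pass rate. The function name `regressions`, the `tolerance` threshold, and all numbers are made up for illustration and are not published WildClawBench results.

```python
def regressions(baseline, candidate, tolerance=0.02):
    """Return categories whose pass rate dropped by more than `tolerance`.

    Both arguments map category name -> pass rate in [0, 1]. Categories
    missing from the candidate run are skipped rather than flagged.
    """
    dropped = {}
    for category, base_rate in baseline.items():
        new_rate = candidate.get(category)
        if new_rate is None:
            continue
        delta = new_rate - base_rate
        if delta < -tolerance:
            dropped[category] = round(delta, 4)
    return dropped


# Illustrative numbers only (hypothetical runs, not real benchmark scores).
baseline = {"coding": 0.50, "safety": 0.90, "long_horizon": 0.40}
candidate = {"coding": 0.55, "safety": 0.80, "long_horizon": 0.39}

flagged = regressions(baseline, candidate)
print(flagged)
```

The tolerance keeps small run-to-run noise from being reported as a regression. A suitable value depends on how many tasks each category contains.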

Source and verification boundary: use https://internlm.github.io/WildClawBench/ as the canonical reference before running the workflow; keep commands, API calls, CLI usage, and generated outputs reviewable against that upstream source.

Framework fit: publish this as an OpenClaw workflow only when the operator can invoke the documented toolchain directly, rather than treating the upstream project as a generic product listing.