Evaluate long-horizon agents against WildClawBench
Use WildClawBench to benchmark agents on hard end-to-end OpenClaw tasks covering tool orchestration, multimodal work, coding, safety, and long-horizon planning.
npx skills add agentskillexchange/skills --skill evaluate-long-horizon-agents-against-wildclawbench
Approve this skill for model, agent, or platform teams that need a practical evaluation harness beyond toy function-calling tests. The operator uses WildClawBench’s task set, harnesses, leaderboard materials, and report to run or compare agents in a realistic OpenClaw environment, then reviews pass rates and failure modes across agency, multimodal, long-horizon, coding, and safety categories. Use this instead of ad hoc manual testing when selecting models, validating an agent release, or tracking regressions on real multi-step tasks. The scope boundary is agent evaluation and benchmark analysis; it is not a general OpenClaw product page or a generic academic paper listing.
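As a rough illustration of the "reviews pass rates ... across categories" step, the sketch below aggregates per-category pass rates from a list of per-task records. The record layout (task_id, category, passed) and the sample values are assumptions made for this example only; they are not WildClawBench's actual output format, which should be taken from the upstream repository.

```python
from collections import defaultdict

# Hypothetical per-task records. The field names and values are illustrative
# assumptions, not the benchmark's real result schema.
results = [
    {"task_id": "t1", "category": "coding", "passed": True},
    {"task_id": "t2", "category": "coding", "passed": False},
    {"task_id": "t3", "category": "safety", "passed": True},
    {"task_id": "t4", "category": "multimodal", "passed": False},
]


def pass_rates_by_category(records):
    """Return {category: fraction of tasks passed} for the given records."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for record in records:
        totals[record["category"]] += 1
        if record["passed"]:
            passes[record["category"]] += 1
    return {category: passes[category] / totals[category] for category in totals}


rates = pass_rates_by_category(results)
for category, rate in sorted(rates.items()):
    print(f"{category}: {rate:.0%}")
```

Reporting rates per category rather than as one overall number keeps a weak category (for example safety) from being hidden by strong results elsewhere.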
What this skill actually does
Inputs and prerequisites: WildClawBench assets; OpenClaw environment; target agent/model under test.
Setup notes: Follow the WildClawBench repository and documentation to obtain benchmark assets, configure an OpenClaw evaluation environment, run supported harnesses against the target agent, and compare results with the published task categories and leaderboard.
Source and verification boundary: use https://internlm.github.io/WildClawBench/ as the canonical reference before running the workflow; keep commands, API calls, CLI usage, and generated outputs reviewable against that upstream source.
Framework fit: publish this as an OpenClaw workflow only when the operator can invoke the documented toolchain directly, rather than treating the upstream project as a generic product listing.
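For the regression-tracking use case, one possible approach is to diff task outcomes between a baseline run and a candidate run. This is a minimal sketch under stated assumptions: each run is represented as a plain mapping from task id to a pass/fail boolean, and the task ids shown are hypothetical. How those mappings are extracted from real harness output is left to the upstream documentation.

```python
def compare_runs(baseline, candidate):
    """Compare two {task_id: passed} mappings (an assumed, simplified format).

    regressions: passed in baseline, failed in candidate
    fixes:       failed in baseline, passed in candidate
    missing:     present in baseline, absent from candidate
    """
    regressions = sorted(
        task for task, ok in baseline.items() if ok and candidate.get(task) is False
    )
    fixes = sorted(
        task for task, ok in baseline.items() if not ok and candidate.get(task) is True
    )
    missing = sorted(task for task in baseline if task not in candidate)
    return {"regressions": regressions, "fixes": fixes, "missing": missing}


# Hypothetical task ids and outcomes, for illustration only.
baseline_run = {"task-a": True, "task-b": True, "task-c": False, "task-d": True}
candidate_run = {"task-a": True, "task-b": False, "task-c": True}

report = compare_runs(baseline_run, candidate_run)
print(report)
```

Listing missing tasks separately avoids counting a task that never ran as either a pass or a regression.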