Skill Detail
Prove whether a prompt or model variant really won before shipping with promptstats
Run statistically sound comparisons on eval results so prompt and model changes are judged by confidence bounds, not bar-chart vibes.
Code Quality & Review
Multi-Framework
Security Reviewed
97 GitHub stars
678/wk npm
INSTALL WITH ANY AGENT
npx skills add agentskillexchange/skills --skill prove-whether-a-prompt-or-model-variant-really-won-before-shipping-with-promptstats
Works best when you want a reusable capability, not another fragile one-off prompt.
At a glance
Tools required
Python environment, promptstats package, eval result tables or per-input score arrays, prompt or model experiment outputs to compare
Install & setup
Install promptstats by following the upstream Python package instructions, format your eval results into the input shape the package documents, then run the analysis methods and review the generated statistical report before making rollout decisions.
Author
Ian Arawjo
Publisher
Individual
Last updated
Apr 16, 2026
Quick brief
Use promptstats when the job is to analyze eval results and decide whether one prompt or model variant truly outperformed another, not when a user simply wants a generic benchmark dashboard. The workflow is short: feed in benchmark data, run the statistical analysis, inspect confidence bounds and pairwise comparisons, and decide whether the observed lift is real enough to act on. That scope, statistical adjudication of prompt and model experiments, is what makes it a distinct skill rather than a plain library listing.
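To make "inspect confidence bounds" concrete, here is a minimal sketch of the underlying idea: a paired bootstrap on per-input score differences between two variants. It uses only the Python standard library and does not use or represent the promptstats API. The function name `paired_bootstrap_ci`, its parameters, and the score arrays are all hypothetical, chosen for illustration.

```python
import random
from statistics import mean


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-input lift of B over A.

    Hypothetical helper for illustration; not part of promptstats.
    Both score lists must be aligned on the same eval inputs.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must cover the same inputs")
    if not scores_a:
        raise ValueError("score lists must not be empty")

    # Pairing by input removes between-input variance from the comparison.
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)
    n = len(diffs)

    # Resample the paired differences with replacement and record each mean.
    boot_means = sorted(
        mean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )

    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return mean(diffs), boot_means[lo_idx], boot_means[hi_idx]


# Made-up per-input scores for a baseline and a candidate prompt.
baseline = [0.62, 0.70, 0.55, 0.81, 0.66, 0.73, 0.59, 0.68, 0.77, 0.64]
candidate = [0.68, 0.74, 0.61, 0.83, 0.71, 0.79, 0.66, 0.70, 0.82, 0.69]

lift, lower, upper = paired_bootstrap_ci(baseline, candidate)

# Treat the candidate as a clear win only if the whole interval is above zero.
clear_win = lower > 0
print(f"mean lift={lift:.3f}, 95% CI=[{lower:.3f}, {upper:.3f}], clear win={clear_win}")
```

The decision rule at the end is the point of the sketch: a positive mean lift alone is not enough, and the candidate ships only when the lower bound also clears zero. With more than two variants, pairwise comparisons would also need a multiple-comparison correction, which this sketch omits.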