Skill Detail
Benchmark deep research agents across factual, quality, and process dimensions with MiroEval
Score deep research agents on benchmark tasks using factual verification, report-quality scoring, and process evaluation before model or workflow changes ship.
Code Quality & Review
Multi-Framework
Published
34 GitHub stars
INSTALL WITH ANY AGENT
npx skills add agentskillexchange/skills --skill benchmark-deep-research-agents-across-factual-quality-and-process-dimensions-with-miroeval
Works best when you want a reusable capability, not another fragile one-off prompt.
At a glance
Tools required
Python, uv, model result JSON, required API keys for judge and retrieval services
Install & setup
Run `uv sync`, copy `.env.template` to `.env` and add the required API keys, prepare a model-results JSON file, then execute `bash run_eval.sh --input data/method_results/my_model.json --model_name my_model`.
Author
MiroMindAI
Publisher
Organization
Last updated
Apr 18, 2026
Quick brief
Use MiroEval when you need to benchmark a deep research system against a fixed task set and score not only the final report but also its factual correctness and the quality of the research process. Reach for it when the job is comparative evaluation of research-agent outputs before a rollout or model change, not general web research. Its scope is a benchmarked deep-research evaluation workflow with defined input and result schemas and fixed scoring dimensions, which keeps it a focused skill rather than a generic platform listing.
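The final setup step listed under "Install & setup" can be wrapped in a small preflight helper. This is a minimal sketch, not part of MiroEval itself: the function name `build_eval_command` is hypothetical, and the only details taken from the listing are the `run_eval.sh` script and its `--input` and `--model_name` flags. It checks only that the results file parses as JSON; it does not validate MiroEval's result schema, which this listing does not describe.

```python
import json
from pathlib import Path


def build_eval_command(input_path: str, model_name: str) -> list[str]:
    """Return the argv list for the run_eval.sh call quoted in the setup notes.

    Hypothetical helper: it only confirms that the results file exists and
    contains syntactically valid JSON before the (potentially long) run starts.
    """
    path = Path(input_path)
    # Raises FileNotFoundError if the file is missing and
    # json.JSONDecodeError (a ValueError) if it is not valid JSON.
    with path.open(encoding="utf-8") as handle:
        json.load(handle)
    return ["bash", "run_eval.sh", "--input", str(path), "--model_name", model_name]
```

From the repository root, after `uv sync` has run and `.env` holds the required API keys, the returned list can be passed to `subprocess.run(cmd, check=True)`. Building the command as a list rather than a single string avoids shell-quoting problems with unusual file names.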