Use Prompt Flow for LLM workflow testing and evaluation
Build a Prompt Flow graph, run interactive and batch tests, inspect traces and evaluation metrics, and promote only reviewed LLM workflow versions.
npx skills add agentskillexchange/skills --skill use-prompt-flow-for-llm-workflow-testing-and-evaluation
Use Prompt Flow when an operator needs a reviewable development loop for an LLM workflow that links prompts, LLM calls, Python code, and tool steps. The workflow is to create or load a flow, configure provider connections, run interactive tests, evaluate the flow over a dataset, inspect traces and metrics, then use the results to decide whether a prompt, model, or tool-chain change is ready to ship. Invoke this instead of editing prompts directly in an app when the team needs traceable runs, repeatable evaluation data, and CI-friendly quality checks before production deployment. Good runs identify the flow version, dataset, model connection, metrics, failing examples, and approval decision. The scope boundary is LLM flow prototyping, testing, tracing, and evaluation. It is not a generic Microsoft platform card, a catch-all Azure AI listing, or a replacement for application-specific release approval.
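The review record described above (flow version, dataset, model connection, metrics, failing examples, approval decision) can be sketched as a small data structure with a promotion gate. This is a minimal illustration, not part of the promptflow package: `FlowRunRecord`, `decide_promotion`, the metric names, and the thresholds are all hypothetical, and the values are placeholders rather than real results.

```python
from dataclasses import dataclass, field


@dataclass
class FlowRunRecord:
    """Hypothetical review record; not a promptflow type."""
    flow_version: str
    dataset: str
    model_connection: str
    metrics: dict
    failing_examples: list = field(default_factory=list)
    approved: bool = False


def decide_promotion(record, thresholds, max_failures=0):
    """Approve only if every thresholded metric is met and failures fit the budget."""
    reasons = []
    for name, minimum in thresholds.items():
        value = record.metrics.get(name)
        if value is None:
            reasons.append(f"metric '{name}' is missing")
        elif value < minimum:
            reasons.append(f"metric '{name}' is {value}, below {minimum}")
    if len(record.failing_examples) > max_failures:
        reasons.append(
            f"{len(record.failing_examples)} failing examples exceed budget of {max_failures}"
        )
    record.approved = not reasons
    return record.approved, reasons


# Placeholder values for illustration only.
record = FlowRunRecord(
    flow_version="my_chatbot@candidate",
    dataset="./data/eval.jsonl",
    model_connection="open_ai_connection",
    metrics={"accuracy": 0.82, "groundedness": 0.91},
    failing_examples=["row 14"],
)
approved, reasons = decide_promotion(
    record, thresholds={"accuracy": 0.85, "groundedness": 0.90}
)
print("approved:", approved)
for reason in reasons:
    print(" -", reason)
```

Keeping the rejection reasons alongside the decision gives a reviewer a concrete record of why a version was held back, and a CI job could fail the build whenever `approved` is false.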
What this skill actually does
Inputs and prerequisites: the Prompt Flow CLI (`pf`), Python 3.9 through 3.11, the promptflow and promptflow-tools packages, a configured OpenAI or Azure OpenAI connection, and optionally the VS Code extension.
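A quick preflight check for the prerequisites listed above might look like the following. It is only a sketch: the helper names are hypothetical, the version range is taken from this listing rather than verified against upstream, and it checks installation only, not whether a provider connection has been configured.

```python
import shutil
import sys
from importlib import metadata

# Version range as stated in this listing (assumption: confirm against upstream docs).
SUPPORTED_MIN = (3, 9)
SUPPORTED_MAX = (3, 11)
REQUIRED_PACKAGES = ("promptflow", "promptflow-tools")


def python_version_ok(version=None):
    """Return True if the (major, minor) version falls inside the listed range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return SUPPORTED_MIN <= major_minor <= SUPPORTED_MAX


def package_installed(name):
    """Return True if a distribution with this name is installed."""
    try:
        metadata.version(name)
        return True
    except metadata.PackageNotFoundError:
        return False


def missing_prerequisites(version=None):
    """Collect human-readable problems instead of failing on the first one."""
    problems = []
    if not python_version_ok(version):
        problems.append("Python version is outside 3.9 through 3.11")
    for name in REQUIRED_PACKAGES:
        if not package_installed(name):
            problems.append(f"package '{name}' is not installed")
    if shutil.which("pf") is None:
        problems.append("'pf' CLI is not on PATH")
    return problems


for problem in missing_prerequisites():
    print("missing:", problem)
```

Running it before the first `pf` command reports every gap at once rather than one failure at a time.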
Setup notes: Install with `pip install promptflow promptflow-tools`, initialize a flow with `pf flow init --flow ./my_chatbot --type chat`, create the provider connection with `pf connection create`, then run `pf flow test --flow ./my_chatbot --interactive` before adding dataset evaluation.
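The setup sequence above can be scripted so the exact commands stay reviewable. The sketch below builds each command as an argument list with double-hyphen flags and only prints them by default. The `--file` argument to `pf connection create` and the `./connection.yaml` path are assumptions added for illustration; the listing itself only names `pf connection create`, so check the upstream CLI reference before executing.

```python
import shlex
import subprocess

FLOW_DIR = "./my_chatbot"
CONNECTION_FILE = "./connection.yaml"  # hypothetical path to a connection definition


def setup_commands(flow_dir=FLOW_DIR, connection_file=CONNECTION_FILE):
    """Return the setup steps as argument lists, in the order they should run."""
    return [
        ["pip", "install", "promptflow", "promptflow-tools"],
        ["pf", "flow", "init", "--flow", flow_dir, "--type", "chat"],
        # Assumption: the connection is defined in a YAML file passed with --file.
        ["pf", "connection", "create", "--file", connection_file],
        ["pf", "flow", "test", "--flow", flow_dir, "--interactive"],
    ]


def run_setup(commands, execute=False):
    """Print each command; run it only when execute=True (dry run by default)."""
    for argv in commands:
        print(shlex.join(argv))
        if execute:
            subprocess.run(argv, check=True)


run_setup(setup_commands())
```

The dry-run default lets an operator review the full sequence first; passing `execute=True` runs the steps in order and stops at the first failure because of `check=True`.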
Source and verification boundary: use https://microsoft.github.io/promptflow/index.html as the canonical reference before running the workflow; keep commands, API calls, CLI usage, and generated outputs reviewable against that upstream source.
Framework fit: publish this as a Multi-Framework workflow only when the operator can invoke the documented toolchain directly, rather than treating the upstream project as a generic product listing.