Evals Skills for Coding Agents
Teach your coding agent evals.
Today, I’m publishing evals-skills, a set of skills for AI product evals [1]. They distill what I’ve learned helping 50+ companies and teaching 4,000+ students build evaluation systems.
Why Skills for Evals
Coding agents now instrument applications, run experiments, analyze data, and build interfaces. I’ve been pointing them at evals.
OpenAI’s Harness Engineering article makes the case well: they built a product entirely with Codex agents (~1 million lines of code, 1,500 PRs, three engineers, five months) and found that improving the infrastructure around the agent yielded better returns than improving the model. Their agents queried distributed traces to verify their own work against runtime evidence. Documentation tells the agent what to do. Telemetry tells it whether it worked. Evals apply the same principle to AI output quality.
All major eval vendors now ship an MCP server [2]. The tedious parts (instrumenting your app, orchestrating experiments, building annotation tools) now belong to coding agents.
But an agent with an eval platform still needs to know what to do with it. Say a support bot tells a customer “your plan includes free returns” when it doesn’t. Another says “I’ve canceled your order” when nobody asked. Both are hallucinations, but one gets a fact wrong and the other makes up a user action. If you lump them together in a generic “hallucination score,” real problems will likely go undetected.
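That distinction is easy to make operational: give each failure mode its own binary judge instead of pooling everything into one score. A minimal sketch (the rubric text, function names, and stubbed judge call are all illustrative, not from the repo):

```python
# Hypothetical sketch: one binary judge per failure mode instead of a
# single generic "hallucination score". Rubrics and names are
# illustrative; the judge call is a stub standing in for an LLM call.

FACT_RUBRIC = (
    "Does the response state a policy or fact unsupported by the "
    "provided context? Answer Pass (no unsupported facts) or Fail."
)
ACTION_RUBRIC = (
    "Does the response claim an action (refund, cancellation) that the "
    "user never requested? Answer Pass or Fail."
)

def judge(response: str, context: str, rubric: str) -> str:
    # Stub: a real implementation would send the rubric, response, and
    # context to an LLM and parse its Pass/Fail verdict.
    return "Pass"

def evaluate(response: str, context: str) -> dict[str, str]:
    # Separate verdicts: a wrong fact and a fabricated action fail for
    # different reasons and point to different fixes.
    return {
        "unsupported_fact": judge(response, context, FACT_RUBRIC),
        "fabricated_action": judge(response, context, ACTION_RUBRIC),
    }
```

Each verdict then maps to a specific fix: tighten grounding for the first, constrain claims about tool use for the second.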
These skills fill in the gaps. They complement the vendor MCP servers: those give your agent access to traces and experiments, these teach it what to do with them.
The Skills
If you’re new to evals or inheriting an existing eval pipeline, start with eval-audit. It inspects your current setup (or lack of one), runs diagnostic checks across six areas (error analysis, evaluator design, judge validation, human review, labeled data, pipeline hygiene), and produces a prioritized list of problems with next steps. Install the skills and give your agent this prompt:
Install the eval skills plugin from https://github.com/hamelsmu/evals-skills, then run /evals-skills:eval-audit on my eval pipeline. Investigate each diagnostic area using a separate subagent in parallel, then synthesize the findings into a single report. Use other skills in the plugin as recommended by the audit.
If you’re experienced with evals, you can skip the audit and pick the skill you need:
error-analysis: Read traces, categorize failures, build a vocabulary of what’s broken
generate-synthetic-data: Create diverse test inputs when real data is sparse
write-judge-prompt: Design binary Pass/Fail LLM-as-Judge evaluators
validate-evaluator: Calibrate judges against human labels using TPR/TNR and bias correction
evaluate-rag: Evaluate retrieval and generation quality separately
build-review-interface: Generate annotation interfaces for human trace review
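As an illustration of the arithmetic behind validate-evaluator: a minimal sketch of judge calibration, assuming you have paired human labels and judge verdicts (True = Pass). The bias correction inverts the judge’s error rates to estimate the true pass rate via the standard formula theta = (observed - FPR) / (TPR - FPR); function names here are mine, not the skill’s:

```python
# A minimal sketch of judge calibration against human labels,
# assuming paired boolean verdicts where True means Pass.

def calibrate(human: list[bool], judged: list[bool]) -> dict[str, float]:
    # TPR: how often the judge says Pass when humans say Pass.
    # TNR: how often the judge says Fail when humans say Fail.
    tp = sum(h and j for h, j in zip(human, judged))
    tn = sum((not h) and (not j) for h, j in zip(human, judged))
    pos = sum(human)
    neg = len(human) - pos
    return {"tpr": tp / pos, "tnr": tn / neg}

def corrected_pass_rate(observed: float, tpr: float, tnr: float) -> float:
    # Invert the judge's error rates to estimate the true pass rate:
    # observed = theta * TPR + (1 - theta) * FPR, solved for theta.
    fpr = 1 - tnr
    return (observed - fpr) / (tpr - fpr)
```

A judge with TPR 0.67 and TNR 0.5 reporting a 60% observed pass rate implies a corrected rate of 60% in this toy case; with a more skewed judge the gap between observed and corrected rates can be large, which is why validating the judge comes before trusting its numbers.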
These skills are a starting point. They only cover the parts of evals that generalize across projects. Skills grounded in your stack, your domain, and your data will outperform them. Start here, then write your own. Our AI Evals course teaches the end-to-end workflow.
👉 The repo is here: github.com/hamelsmu/evals-skills 👈
P.S. Our next AI Evals course cohort starts March 16th. The one after that won’t run until fall. You’ll get lifetime access to all materials and learn with students from OpenAI, Meta, Google, Walmart, Airbnb and more. Use code newsletter-25 for a discount.
Footnotes
[1] Not foundation model benchmarks like MMLU or HELM that measure general LLM capabilities. Product evals measure whether your pipeline works on your task with your data. If you aren’t familiar with product-specific AI evals, check out my AI Evals FAQ.
[2] Braintrust, LangSmith, Phoenix, Truesight, and others.