Free, open-source tools for testing the quality of AI skills before they reach production. Catch regressions. Validate tool-use accuracy. Ship with confidence.
Plugs into your existing workflow. No external services, nothing sent anywhere — everything runs locally or in CI.
Write a YAML or JSON spec for your skill's expected behavior — inputs, outputs, and edge cases.
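A spec might look something like the sketch below. Every field name here (`skill`, `cases`, `input`, `expect`) is illustrative, not a fixed schema:

```yaml
# Hypothetical spec sketch; field names are illustrative, not the suite's actual schema.
skill: order-lookup
cases:
  - name: happy-path
    input: "Where is order #1042?"
    expect:
      tool_call: get_order_status   # the skill should invoke this tool
      output_contains: "1042"
  - name: unknown-order-edge-case
    input: "Where is order #9999999?"
    expect:
      output_contains: "couldn't find"
```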
The CLI runs every test case. Correctness, tool-use accuracy, context adherence, response quality — all scored automatically.
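In practice that's one command. The CLI name and flags below are placeholders, not the suite's documented interface:

```bash
# Hypothetical invocation; substitute the real CLI name and flags from the README.
skill-review run specs/order-lookup.yaml --model your-model-id --baseline main
```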
Pass/fail status, edge case warnings, regression delta vs. baseline, and failure explanations for every failing case.
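Purely as an illustration of the report's shape (not actual output):

```text
order-lookup .......... 9/10 passed
  ✗ unknown-order-edge-case
      expected output to contain "couldn't find"
      got: "Order #9999999 is on its way!"
  Δ vs. baseline: 1 newly failing case (regression)
```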
Drop the CLI into your GitHub Actions or GitLab CI pipeline. Failed skill reviews exit with a non-zero code — regressions never reach production.
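A minimal GitHub Actions job might look like the sketch below; the package name and CLI invocation are placeholders:

```yaml
# Hypothetical workflow; swap in the suite's real package and CLI names.
name: skill-review
on: [pull_request]
jobs:
  review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm install -g skill-review-suite        # placeholder package name
      - run: skill-review run specs/ --baseline origin/main
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
# A non-zero exit from the review step fails the job, so the regression never merges.
```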
Install with npm or pip, then run your first review:
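A sketch, with placeholder package and command names (the repo README has the real ones):

```bash
npm install -g skill-review-suite   # placeholder package name; or: pip install skill-review-suite
skill-review run specs/             # placeholder CLI name
```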
Works with any LLM. Bring your own API key.
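Provider setup is typically just an API key in the environment; the variable and flag names here are illustrative:

```bash
# Illustrative only; the suite's actual env vars and flags may differ.
export OPENAI_API_KEY=...           # or ANTHROPIC_API_KEY, etc., for other providers
skill-review run specs/ --model your-model-id
```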
When a new model ships, don't find out your skills broke in production. Run the suite first, review the delta, then merge.
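That workflow might look like two runs and a comparison (placeholder commands):

```bash
# Placeholder commands: snapshot a baseline on today's model,
# then review the new model against it before merging.
skill-review run specs/ --model current-model --save-baseline baselines/current.json
skill-review run specs/ --model new-model --baseline baselines/current.json
```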
Rapid iteration shouldn't mean broken AI features. Catch regressions before your users do, without slowing your deploy cycle.
Compliance and consistency requirements mean you can't afford behavioral drift. Automated skill testing gives you an auditable record of every change and its effect on output quality.
Need a structured, repeatable way to evaluate a new skill or prompt strategy across multiple models? The Suite gives you a framework that produces comparable, shareable results.
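One hypothetical way to do that is a loop over models in CI; a real setup might support a model matrix in config instead:

```bash
# Illustrative loop; model IDs and flags are placeholders.
for model in model-a model-b model-c; do
  skill-review run specs/ --model "$model" --report "reports/$model.json"
done
```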
No sign-up. No credit card. Clone the repo and run your first review in under five minutes.