Free & Open Source · Live Now

Ship AI skills you can actually trust

Free, open-source tools for testing AI skill quality before it reaches production. Catch regressions. Validate tool-use accuracy. Ship with confidence.

skill-review · run
$ skill-review run ./skills/

Discovered 4 skills · 47 test cases
────────────────────────────────────
summarize         PASS  14/14  1.2s
extract-entities  PASS  12/12  0.9s
classify-intent   WARN  10/11  1.1s
  └ edge case: ambiguous negation
tool-caller       PASS  10/10  2.3s
────────────────────────────────────
Overall: 46/47 PASS  5.5s · 1 warning
Regression vs. gpt-4o baseline: Δ +1.8%
// how it works

From zero to tested in minutes

Plugs into your existing workflow. No external services, nothing sent anywhere — everything runs locally or in CI.

1. Define your skill

Write a YAML or JSON spec for your skill's expected behavior — inputs, outputs, and edge cases.
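A spec might look something like the following sketch. The field names here (`cases`, `expect`, `contains`, `refusal`) are illustrative only, not the Suite's actual schema:

```yaml
# skills/summarize.skill.yaml — illustrative field names, not the real schema
skill: summarize
description: Condense a document into a short abstract
cases:
  - name: plain-article
    input: "The quarterly report shows revenue grew 12% year over year..."
    expect:
      contains: ["revenue", "12%"]
      max_words: 60
  - name: empty-input        # edge case: no content to summarize
    input: ""
    expect:
      refusal: true
```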

2. Run the benchmark

The CLI runs every test case. Correctness, tool-use accuracy, context adherence, response quality — all scored automatically.

3. Review the report

Pass/fail status, edge case warnings, regression delta vs. baseline, and failure explanations for every failing case.

4. Block bad deploys in CI

Drop the CLI into your GitHub Actions or GitLab CI pipeline. Failed skill reviews exit with a non-zero code — regressions never reach production.
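In GitHub Actions, that gate can be a single step, because a non-zero exit code fails the job. A sketch, with placeholder step names and secret name:

```yaml
# .github/workflows/skill-review.yml — illustrative; adjust paths and secrets to your setup
name: skill-review
on: [pull_request]
jobs:
  review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm install -g @daedalus/skill-review
      - run: skill-review run ./skills/ --model gpt-4o
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      # A failing review exits non-zero, which fails this job and blocks the merge.
```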

Install in seconds

npm

npm install -g @daedalus/skill-review

pip

pip install daedalus-skill-review

Run your first review

skill-review run ./skills/ --model gpt-4o

Works with any LLM. Bring your own API key.

Everything you need to ship skills confidently

🎯 Intent Recognition Testing
Verify that your skill correctly understands user intent across a wide range of phrasings, including ambiguous and out-of-scope inputs.

🔧 Tool-Use Accuracy
For skills that invoke tools or function calls, validate that the right tool is called with the right parameters for every test case.

🔁 Regression Detection
Capture a behavioral baseline and automatically flag when a model upgrade or prompt change causes unexpected output changes.

🛡 Adversarial Coverage
Test your skill against prompt injection attempts, jailbreak patterns, and adversarial inputs designed to break expected behavior.

⚙ CI/CD Integration
First-class support for GitHub Actions, GitLab CI, and any pipeline that reads exit codes. Regressions block the merge — automatically.

🌐 Model Agnostic
Runs against OpenAI, Anthropic, Google, Mistral, Cohere, and local models via OpenAI-compatible endpoints. One spec, any model.
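At its core, tool-use validation is a comparison between the call the model emitted and the call the spec expects. A minimal sketch of the idea — the `ToolCall` shape and field names here are hypothetical, not the Suite's internal format:

```python
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    # Hypothetical shape of a recorded tool/function call.
    name: str
    arguments: dict = field(default_factory=dict)

def validate_tool_call(expected: ToolCall, actual: ToolCall) -> list[str]:
    """Return human-readable mismatches; an empty list means the case passes."""
    errors = []
    if actual.name != expected.name:
        errors.append(f"wrong tool: expected {expected.name!r}, got {actual.name!r}")
    for key, want in expected.arguments.items():
        got = actual.arguments.get(key)
        if got != want:
            errors.append(f"param {key!r}: expected {want!r}, got {got!r}")
    return errors

# A correct call passes; a wrong parameter value is reported by name.
expected = ToolCall("get_weather", {"city": "Paris", "unit": "celsius"})
assert validate_tool_call(expected, ToolCall("get_weather", {"city": "Paris", "unit": "celsius"})) == []
assert validate_tool_call(expected, ToolCall("get_weather", {"city": "Paris", "unit": "kelvin"})) == [
    "param 'unit': expected 'celsius', got 'kelvin'"
]
```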
// who it's for

Built for teams shipping LLM features

🏗 Teams upgrading model versions

When a new model ships, don't find out your skills broke in production. Run the suite first, review the delta, then merge.
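The delta reviewed in that workflow is just the change in pass rate between the stored baseline run and the run against the new model. A minimal sketch of the computation — the per-case result format is invented for illustration:

```python
def regression_delta(baseline: dict[str, bool], current: dict[str, bool]) -> float:
    """Percentage-point change in pass rate from baseline to current run.

    Each dict maps a test-case id to its outcome: pass (True) or fail (False).
    """
    def pass_rate(results: dict[str, bool]) -> float:
        return 100.0 * sum(results.values()) / len(results)
    return pass_rate(current) - pass_rate(baseline)

baseline = {"case-1": True, "case-2": True, "case-3": False, "case-4": False}
current  = {"case-1": True, "case-2": True, "case-3": True,  "case-4": False}
print(f"Δ {regression_delta(baseline, current):+.1f}%")  # Δ +25.0%
```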

🚀 Startups moving fast

Rapid iteration shouldn't mean broken AI features. Catch regressions before your users do, without slowing your deploy cycle.

🏢 Enterprise AI teams

Compliance and consistency requirements mean you can't afford behavioral drift. Automated skill testing gives you an auditable record of every change and its effect on output quality.

🔬 AI researchers and evaluators

Need a structured, repeatable way to evaluate a new skill or prompt strategy across multiple models? The Suite gives you a framework that produces comparable, shareable results.

// faq

Common questions

What is an AI skill?
An AI skill is a discrete, named capability that an LLM-powered system is expected to perform — for example, "summarize a document," "extract named entities," or "classify a support ticket." Skills are the building blocks of AI-powered features, and each one should behave consistently and correctly across a range of inputs.

Is it really free?
Yes, completely. The Skill Review Suite is free and open source under an MIT license. You only pay for the LLM API calls you make while running tests — we don't proxy those calls or add any markup. Bring your own API key.

Which models are supported?
The Skill Review Suite is model-agnostic. It natively supports OpenAI (GPT-4o, GPT-4, GPT-3.5), Anthropic (Claude 3.5, Claude 3), Google (Gemini 1.5 Pro/Flash), and any model that exposes an OpenAI-compatible API — including locally hosted models via Ollama or LM Studio.

Does my data leave my environment?
No. The Skill Review Suite runs entirely in your environment. Test cases, prompts, and outputs are sent only to whichever LLM provider you configure. Nothing passes through Daedalus Dynamics servers. This is by design — we know sensitive data lives in test cases.

How is this different from general eval frameworks?
General eval frameworks are great for research and benchmarking. The Skill Review Suite is purpose-built for engineering teams shipping product — with CI/CD as a first-class concern, tool-use validation, and a test-spec format that's as close to a unit test as possible. Think of it as the difference between a research evaluation suite and a production quality gate.

What's on the roadmap?
We have several tools in active development, including a multi-turn conversation tester, a skill-to-skill dependency validator, and a web-based dashboard for teams who want a visual alternative to the CLI. Star the GitHub repo to stay notified as new tools ship.
Free & Open Source

Start testing your skills today — for free

No sign-up. No credit card. Clone the repo and run your first review in under five minutes.

View on GitHub · Talk to the team