Ruby Skill Bench
You write a SKILL.md — a Markdown guide for an AI agent. But you have no way to prove it makes the agent write better code. Ruby Skill Bench measures exactly that.
Run the agent twice — once without your skill, once with it — then let an LLM judge score both outputs. The difference is your skill's ROI.
The Problem
Without a bench
You guess. You cannot tell whether a SKILL.md edit helped, hurt, or did nothing.
With Ruby Skill Bench
You measure. Every run produces a baseline score, a context score, and the delta between them.
Key insight
A skill that reads well to humans can still hurt the agent.
If baseline = 89 and context = 87, your skill confused the agent or added noise. Ruby Skill Bench catches this before you ship the skill. The most common "unexpected FAIL" is a skill that's too long, contradictory, or full of boilerplate the judge penalizes.
The Mental Model
The engine runs every eval twice in isolated Git sandboxes, then scores both outputs independently. The judge never sees both at the same time — no halo-effect bias.
Step 1 — Baseline run
Agent receives task.md only. No skill. Produces Output A from an isolated Git sandbox.
Step 2 — Context run
Agent receives task.md + SKILL.md. Produces Output B from a fresh sandbox.
Step 3 — Blind judging
LLM judge scores Output A and Output B in two separate calls. Never sees both at once.
Step 4 — Delta + verdict
Delta = Score B − Score A. Pass if context ≥ threshold AND delta ≥ minimum.
Isolated Git sandboxes
Every run uses Dir.mktmpdir to create a clean temporary repo. State changes are captured as a git diff. The host filesystem is never touched. Sandboxes are deleted after each run.
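A minimal sketch of that sandbox lifecycle, assuming a helper like the one below; the method names are illustrative, not SkillBench's real internals:

```ruby
require "tmpdir"
require "open3"

# Hypothetical helper: run a block inside a throwaway Git repo and return the diff.
def with_git_sandbox
  Dir.mktmpdir("skill-bench-") do |dir|
    Open3.capture2("git", "init", "--quiet", chdir: dir)        # fresh repo per run
    yield dir                                                    # the agent writes files here
    Open3.capture2("git", "-C", dir, "add", "-A")                # stage everything it produced
    diff, = Open3.capture2("git", "-C", dir, "diff", "--cached")
    diff                                                         # state change captured as a git diff
  end                                                            # tmpdir is deleted when the block exits
end
```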
How It Works
Initialize configuration
Choose a provider. SkillBench creates skill-bench.json with your API key, model, timeout, and the shell commands the agent is allowed to run.
Providers: OpenAI, Anthropic, Gemini, Azure, Ollama, Groq, DeepSeek, OpenCode
skill-bench init --openai
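The exact schema depends on your version; a hypothetical skill-bench.json produced by init might look roughly like this (field names are illustrative):

```json
{
  "provider": "openai",
  "api_key": "sk-...",
  "model": "gpt-4o",
  "timeout": 300,
  "allowed_commands": ["rspec", "bundle", "git"]
}
```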
Write a skill
A skill is free-form Markdown in skills/my-service/SKILL.md. It contains the patterns, hard rules, code examples, and return-format expectations you want the agent to follow.
Shorter skills usually beat longer ones; the agent has a limited context window.
skill-bench skill new my-service \
--mode=rails \
--template=service_object
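For example, a short skills/my-service/SKILL.md might contain something like this (the content is purely illustrative):

```markdown
# My Service: Service Object Skill

## Hard rules
- Every service object exposes a single `call` method and returns a Result.
- Never raise inside `call`; return `Result.failure(error)` instead.

## Example

    class CreateOrder
      def call(params)
        order = Order.create!(params)
        Result.success(order)
      rescue ActiveRecord::RecordInvalid => e
        Result.failure(e.message)
      end
    end
```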
Create an eval
An eval has two files: task.md (the specific coding task for the agent) and criteria.json (the scoring rules). Or generate both automatically from a skill.
max_score values across all dimensions must sum to exactly 100.
skill-bench eval new my-eval \
--runtime=rails
# or auto-generate from skill:
skill-bench eval generate my-service \
--name my-eval
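A task.md is just a plain description of the work; a hypothetical one might read:

```markdown
# Task

Implement a `CreateOrder` service object that validates params, persists the
order, and returns a success/failure result. Include RSpec tests for the
happy path and at least one failure path.
```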
Run the eval
SkillBench runs the baseline, then the context agent, judges both outputs, computes deltas, records history, and prints the verdict table.
Results are appended to .skill-bench-history.json for trend tracking.
skill-bench run my-eval \
--skill=my-service
# chain multiple skills:
skill-bench run my-eval \
--skill=skill-a \
--skill=skill-b
What lands on disk
project-root/
├── skill-bench.json # you write once (init)
├── skills/
│ └── my-service/
│ └── SKILL.md # you write and iterate
├── evals/
│ └── my-eval/
│ ├── task.md # agent's coding task
│ └── criteria.json # scoring rules
└── .skill-bench-history.json # auto-generated, never edit
The Scoring Engine
Every criteria.json must include these five dimensions. Their max_score values must sum to 100. You can add custom dimensions (e.g. performance) as long as the total remains 100.
| Dimension | Typical max_score | What the judge evaluates |
|---|---|---|
| Correctness | 25–35 | Does the output fulfill the task requirements? Are all specified behaviors present and correct? |
| Skill Adherence | 20–30 | Did the agent follow the specific patterns, hard gates, and workflows defined in the skill? |
| Code Quality | 15–25 | Is the code clean, well structured, free of smells, and does it follow SRP and avoid duplication? |
| Test Coverage | 10–20 | Are there meaningful tests? Do they test the right things? Do they follow TDD best practices? |
| Documentation | 5–15 | Is there adequate YARD documentation, clear intent, and helpful inline comments where needed? |
| Custom dimension | optional | Any domain-specific dimension (e.g. performance, security). Still counts toward the 100-point total. |
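Put together, a criteria.json using the weights from the concrete example below might look roughly like this; the exact schema and key names may differ, so treat it as a sketch:

```json
{
  "pass_threshold": 70,
  "minimum_delta": 10,
  "dimensions": [
    { "name": "correctness",     "max_score": 30 },
    { "name": "skill_adherence", "max_score": 25 },
    { "name": "code_quality",    "max_score": 20 },
    { "name": "test_coverage",   "max_score": 15 },
    { "name": "documentation",   "max_score": 10 }
  ]
}
```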
Pass/fail logic — both conditions must be true
Condition 1
context_total ≥ pass_threshold
The agent with the skill scored high enough. Default threshold: 70.
Condition 2
total_delta ≥ minimum_delta
The skill made a meaningful difference. Default minimum: 10.
Verdict scenarios
| Context score | Delta | Verdict | Why |
|---|---|---|---|
| 87 | +55 | PASS | Both conditions met. Skill helped a lot. |
| 87 | −2 | FAIL | Score fine, but skill made it worse. |
| 65 | +15 | FAIL | Delta good, absolute score too low. |
| 65 | +5 | FAIL | Both conditions failed. |
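In code, the verdict reduces to those two comparisons; a minimal sketch, not SkillBench's actual implementation:

```ruby
# Both conditions must hold for a PASS.
def verdict(context_total:, baseline_total:, pass_threshold: 70, minimum_delta: 10)
  delta = context_total - baseline_total
  passed = context_total >= pass_threshold && delta >= minimum_delta
  { delta: delta, verdict: passed ? "PASS" : "FAIL" }
end

verdict(context_total: 87, baseline_total: 32)  # => { delta: 55,  verdict: "PASS" }
verdict(context_total: 87, baseline_total: 89)  # => { delta: -2,  verdict: "FAIL" }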
Concrete Example
This is real output from running a service-object skill eval. The agent without context scored 32/100. With the skill, it scored 87/100, a +55 total delta with gains in every dimension.
| Dimension | Baseline | Context | Delta |
|---|---|---|---|
| Correctness (30) | 12 | 28 | +16 |
| Skill Adherence (25) | 5 | 22 | +17 |
| Code Quality (20) | 10 | 16 | +6 |
| Test Coverage (15) | 3 | 13 | +10 |
| Documentation (10) | 2 | 8 | +6 |
| TOTAL | 32/100 | 87/100 | +55 |
What the TREND line tells you
After each run, SkillBench appends to .skill-bench-history.json. The TREND compares the current run against the previous run of the same eval + skill. ↑ (+7) means the context score improved by 7 points since the last run — your last SKILL.md edit worked. → (0) means the score is stable. A negative trend means a recent edit made things worse.
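The field names below are illustrative, but an entry in .skill-bench-history.json might record something like:

```json
{
  "eval": "my-eval",
  "skills": ["my-service"],
  "timestamp": "2024-05-01T12:00:00Z",
  "baseline_total": 32,
  "context_total": 87,
  "delta": 55,
  "verdict": "PASS"
}
```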
Under the Hood
ReAct loop: Thought → Tool → Observation
The agent is not a one-shot prompt. It uses a stateful Thought → Tool → Observation loop to handle multi-step engineering tasks — reading files, writing code, running tests, and checking output.
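Conceptually, the loop looks like this; the `llm` and `tools` objects and their methods are hypothetical stand-ins, not the gem's real internals:

```ruby
# Illustrative ReAct loop: the model thinks, picks a tool, observes the result, repeats.
def react_loop(task, llm:, tools:, max_steps: 20)
  history = [task]
  max_steps.times do
    thought = llm.next_step(history)                      # Thought: decide what to do
    return thought.answer if thought.done?                # stop when the task is complete

    observation = tools.run(thought.tool, thought.args)   # Tool: read/write files, run tests
    history << { thought: thought.text, observation: observation }  # Observation feeds the next step
  end
  raise "agent did not converge within #{max_steps} steps"
end
```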
Multi-provider ecosystem
The BaseClient uses the Template Method pattern so every provider shares the same connection setup, error logging, and response normalization. Swap providers by editing one line in skill-bench.json.
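The pattern, roughly: the base class owns the shared plumbing and each provider overrides only provider-specific hooks. Class names and hooks below are illustrative, not the gem's actual layout:

```ruby
class BaseClient
  def complete(prompt)
    response = post(request_body(prompt))   # shared connection setup and error logging
    normalize(response)                     # shared response normalization
  end

  private

  def post(body)
    # stands in for the shared HTTP call, timeouts, retries, and logging
    body
  end

  # Hooks each provider overrides:
  def request_body(prompt) = raise NotImplementedError
  def normalize(response)  = raise NotImplementedError
end

class OpenAIClient < BaseClient
  private

  def request_body(prompt) = { model: "gpt-4o", messages: [{ role: "user", content: prompt }] }
  def normalize(response)  = response.dig("choices", 0, "message", "content")
end
```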
Security model
Command allowlist
By default, no shell commands are permitted. You explicitly list what's allowed in skill-bench.json — e.g. "rspec", "bundle", "git". Dangerous commands (bash, curl, sudo) are always blocked.
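Enforcement amounts to a membership check before any command runs; a sketch, not the actual code:

```ruby
# Illustrative allowlist check: only explicitly configured commands may run.
ALWAYS_BLOCKED = %w[bash curl sudo].freeze

def command_allowed?(command, allowlist)
  name = command.split.first
  !ALWAYS_BLOCKED.include?(name) && allowlist.include?(name)
end

command_allowed?("rspec spec/", %w[rspec bundle git])    # => true
command_allowed?("curl https://example.com", %w[curl])   # => false, always blocked
```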
Atomic history writes
History is written with file locking. If you kill the process mid-write, SkillBench recovers from the auto-generated .skill-bench-history.json.bak backup.
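The write pattern is roughly exclusive-lock, back up, rewrite; a minimal sketch under those assumptions:

```ruby
require "json"
require "fileutils"

HISTORY = ".skill-bench-history.json"

# Illustrative locked, backed-up history append; not the gem's actual code.
def append_history(entry)
  File.open(HISTORY, File::RDWR | File::CREAT) do |f|
    f.flock(File::LOCK_EX)                       # one writer at a time
    FileUtils.cp(HISTORY, "#{HISTORY}.bak")      # backup survives a mid-write kill
    raw = f.read
    history = raw.empty? ? [] : JSON.parse(raw)
    history << entry
    f.rewind
    f.truncate(0)
    f.write(JSON.pretty_generate(history))
  end
end
```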
Path validation
Eval paths are validated to prevent directory traversal attacks. The agent cannot escape the sandbox.
373+ tests
Core engine, CLI commands, and all provider clients are covered. Tests use WebMock for HTTP stubbing.
Iterating on Skills
A good skill is rarely written in one pass. The history file is your guide. Focus on the dimension with the smallest delta — that is where your skill is weakest.
Run the eval and read the table
Look at two things: (1) did it pass? (2) which dimension has the smallest delta?
Inspect history
cat .skill-bench-history.json | jq '.[-1]' — focus on the weakest dimension.
Edit SKILL.md with a concrete rule
If test_coverage delta was +3, add a rule: "Include at least one happy-path and one error-path test using let and subject."
Re-run and watch the trend
If test_coverage jumped from +3 to +8 and the TREND shows context ↑ (+5), the edit worked.
Stop when stable
Context score ~95+ and deltas are flat across 2-3 consecutive runs. Your skill is mature.
What to do next
Install
gem install ruby-skill-bench
Requires Ruby 3.1+.
Initialize
skill-bench init --openai
Creates skill-bench.json. Set your API key.
Run your first eval
skill-bench run my-eval \
--skill=my-service
Watch the delta table.
Output formats for CI/CD
skill-bench run my-eval --skill=x \
  --format json
# or:
skill-bench run my-eval --skill=x \
  --format junit
JUnit XML integrates with GitHub Actions. The repo includes a CI workflow at .github/workflows/ci.yml.
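A minimal GitHub Actions step might look like the snippet below; it is an illustration, not the workflow shipped in the repo:

```yaml
- name: Run skill bench
  run: |
    gem install ruby-skill-bench
    skill-bench run my-eval --skill=my-service --format junit
```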
"A skill that only beats baseline marginally is under-specified — it should change the model's output meaningfully."
— Ruby Skill Bench docs