Ruby Skill Bench

Does your AI skill actually help?

You write a SKILL.md — a Markdown guide for an AI agent. But you have no way to prove it makes the agent write better code. Ruby Skill Bench measures exactly that.

gem install ruby-skill-bench
skill-bench init --openai

Run the agent twice — once without your skill, once with it — then let an LLM judge score both outputs. The difference is your skill's ROI.


The Problem

Skills are written blind

Without a bench

  • You write a SKILL.md and hope the agent reads it
  • You read the output manually — inconsistent, slow
  • You can't tell if a skill edit helped or hurt
  • A longer skill might confuse the agent more
  • No baseline to compare against

With Ruby Skill Bench

  • Every run scores both baseline and context outputs
  • An LLM judge scores across 5 named dimensions
  • Delta shows exactly where your skill helped
  • History file tracks trends run-over-run
  • Pass/fail verdict is reproducible and auditable

Key insight

A skill that reads well to humans can still hurt the agent.

If baseline = 89 and context = 87, your skill confused the agent or added noise. Ruby Skill Bench catches this before you ship the skill. The most common "unexpected FAIL" is a skill that's too long, contradictory, or full of boilerplate the judge penalizes.


The Mental Model

One task, two runs, one judge

The engine runs every eval twice in isolated Git sandboxes, then scores both outputs independently. The judge never sees both at the same time — no halo-effect bias.

Step 1 — Baseline run

Agent receives task.md only. No skill. Produces Output A from an isolated Git sandbox.

Step 2 — Context run

Agent receives task.md + SKILL.md. Produces Output B from a fresh sandbox.

Step 3 — Blind judging

LLM judge scores Output A and Output B in two separate calls. Never sees both at once.

Step 4 — Delta + verdict

Delta = Score B − Score A. Pass if context ≥ threshold AND delta ≥ minimum.

📦 Isolated Git sandboxes

Every run uses Dir.mktmpdir to create a clean temporary repo. State changes are captured as a git diff. The host filesystem is never touched. Sandboxes are deleted after each run.
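
A minimal sketch of that flow, assuming plain Dir.mktmpdir plus shelling out to git (illustrative only; the method name and details are made up, not SkillBench's internals):

require "tmpdir"
require "open3"

def with_git_sandbox
  Dir.mktmpdir("skill-bench-") do |dir|
    git = ->(*args) { Open3.capture2e("git", *args, chdir: dir).first }

    git.call("init", "--quiet")
    git.call("-c", "user.email=bench@example.com", "-c", "user.name=bench",
             "commit", "--allow-empty", "-m", "baseline", "--quiet")

    yield dir                        # the agent only ever works inside this directory

    git.call("add", "-A")            # stage whatever the agent changed
    git.call("diff", "--cached")     # state changes captured as a git diff
  end                                # mktmpdir deletes the sandbox when the block exits
end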


How It Works

Four commands, one result

1. Initialize configuration

Choose a provider. SkillBench creates skill-bench.json with your API key, model, timeout, and the shell commands the agent is allowed to run.

Providers: OpenAI, Anthropic, Gemini, Azure, Ollama, Groq, DeepSeek, OpenCode

skill-bench init --openai
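
The generated file might look roughly like this (a hedged sketch: treat the exact key names as assumptions and check what init actually writes for you):

{
  "provider": "openai",
  "model": "gpt-4o",
  "api_key": "YOUR_API_KEY",
  "timeout": 300,
  "allowed_commands": ["rspec", "bundle", "git"]
}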

2. Write a skill

A skill is free-form Markdown in skills/my-service/SKILL.md. It contains the patterns, hard rules, code examples, and return-format expectations you want the agent to follow.

Shorter skills usually beat longer ones — the agent has a context window.

skill-bench skill new my-service \
  --mode=rails \
  --template=service_object
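
Skills are free-form, so there is no required layout. A skeleton along these lines (purely illustrative) covers the pieces listed above:

# my-service: service object conventions

## Hard rules
- Every service exposes a single `call` method and returns a Result object.
- Keep business logic out of controllers; controllers only invoke services.

## Return format
- Reply with complete file contents, then list the commands you ran.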

3. Create an eval

An eval has two files: task.md (the specific coding task for the agent) and criteria.json (the scoring rules). Or generate both automatically from a skill.

max_score values across all dimensions must sum to exactly 100.

skill-bench eval new my-eval \
  --runtime=rails

# or auto-generate from skill:
skill-bench eval generate my-service \
  --name my-eval
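
A task.md is just the coding task written in prose, for example (a hypothetical task; the class name is invented for illustration):

Build a Payments::RefundService that refunds an order through the payment gateway, handles the gateway-timeout error path, and returns a result object. Include RSpec tests for the happy path and the error path.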

4. Run the eval

SkillBench runs the baseline, then the context agent, judges both outputs, computes deltas, records history, and prints the verdict table.

Results are appended to .skill-bench-history.json for trend tracking.

skill-bench run my-eval \
  --skill=my-service

# chain multiple skills:
skill-bench run my-eval \
  --skill=skill-a \
  --skill=skill-b

What lands on disk

project-root/
├── skill-bench.json              # you write once (init)
├── skills/
│   └── my-service/
│       └── SKILL.md              # you write and iterate
├── evals/
│   └── my-eval/
│       ├── task.md               # agent's coding task
│       └── criteria.json         # scoring rules
└── .skill-bench-history.json     # auto-generated, never edit

The Scoring Engine

Five dimensions, always blind

Every criteria.json must include these five dimensions. Their max_score values must sum to 100. You can add custom dimensions (e.g. performance) as long as the total remains 100.

  • Correctness (25–35 points): Does the output fulfill the task requirements? Are all specified behaviors present and correct?
  • Skill Adherence (20–30 points): Did the agent follow the specific patterns, hard gates, and workflows defined in the skill?
  • Code Quality (15–25 points): Is the code clean, well-structured, and free of smells? Does it follow SRP and avoid duplication?
  • Test Coverage (10–20 points): Are there meaningful tests? Do they test the right things? Do they follow TDD best practices?
  • Documentation (5–15 points): Is there adequate YARD documentation, clear intent, and helpful inline comments where needed?
  • Custom dimension (optional): Add any domain-specific dimension (e.g. performance, security). It still counts toward the 100-point total.
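
Put together, a criteria.json in this shape satisfies the 100-point rule. The weights match the concrete example later on; the exact key names are an assumption, not a guaranteed schema:

{
  "pass_threshold": 70,
  "minimum_delta": 10,
  "dimensions": [
    { "name": "correctness",     "max_score": 30, "description": "Fulfills all task requirements" },
    { "name": "skill_adherence", "max_score": 25, "description": "Follows the skill's patterns and hard gates" },
    { "name": "code_quality",    "max_score": 20, "description": "Clean, SRP-respecting, duplication-free code" },
    { "name": "test_coverage",   "max_score": 15, "description": "Meaningful happy-path and error-path tests" },
    { "name": "documentation",   "max_score": 10, "description": "YARD docs and clear intent" }
  ]
}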

Pass/fail logic — both conditions must be true

  • Condition 1: context_total ≥ pass_threshold. The agent with the skill scored high enough. Default threshold: 70.
  • Condition 2: total_delta ≥ minimum_delta. The skill made a meaningful difference. Default minimum: 10.
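
In code, the rule reduces to two comparisons (a sketch of the stated logic, not SkillBench's implementation):

PASS_THRESHOLD = 70   # default pass_threshold
MINIMUM_DELTA  = 10   # default minimum_delta

def verdict(baseline_total, context_total)
  delta  = context_total - baseline_total
  passed = context_total >= PASS_THRESHOLD && delta >= MINIMUM_DELTA
  { delta: delta, verdict: passed ? "PASS" : "FAIL" }
end

verdict(32, 87)   # delta +55, "PASS": both conditions met
verdict(89, 87)   # delta -2, "FAIL": high score, but the skill hurt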

Verdict scenarios

Context score   Delta   Verdict   Why
87              +55     PASS      Both conditions met. Skill helped a lot.
87              −2      FAIL      Score fine, but skill made it worse.
65              +15     FAIL      Delta good, absolute score too low.
65              +5      FAIL      Both conditions failed.

Concrete Example

A skill that works: 32 → 87

This is real output from running a service-object skill eval. The agent without context scored 32/100. With the skill, it scored 87/100 — a +55 delta across all five dimensions.

Dimension              Baseline   Context   Delta
Correctness (30)             12        28     +16
Skill Adherence (25)          5        22     +17
Code Quality (20)            10        16      +6
Test Coverage (15)            3        13     +10
Documentation (10)            2         8      +6
TOTAL                    32/100    87/100     +55

VERDICT: PASS  ·  threshold: 70  ·  minimum delta: 10  ·  TREND: context ↑ (+7)

What the TREND line tells you

After each run, SkillBench appends to .skill-bench-history.json. The TREND compares the current run against the previous run of the same eval + skill. ↑ (+7) means the context score improved by 7 points since the last run — your last SKILL.md edit worked. → (0) means the score is stable. A negative trend means a recent edit made things worse.
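
As a rough illustration (the real schema may differ; every field name here is an assumption), two consecutive entries for the same eval and skill could look like this and would produce the ↑ (+7) trend shown above:

[
  { "eval": "my-eval", "skill": "my-service", "baseline_total": 35, "context_total": 80, "delta": 45, "verdict": "PASS" },
  { "eval": "my-eval", "skill": "my-service", "baseline_total": 32, "context_total": 87, "delta": 55, "verdict": "PASS" }
]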


Under the Hood

The agent that runs each eval

ReAct loop: Thought → Tool → Observation

The agent is not a one-shot prompt. It uses a stateful Thought → Tool → Observation loop to handle multi-step engineering tasks — reading files, writing code, running tests, and checking output.

Thought (plan the next action) → Tool (e.g. read_file, write_file) → Observation (result of the tool call), looping until the task is done.
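
A stripped-down sketch of that loop (illustrative only; client, the tool set, and the step format are placeholders rather than the real agent API):

def react_loop(client, task, max_steps: 20)
  tools = {
    "read_file"  => ->(path)          { File.read(path) },
    "write_file" => ->(path, content) { File.write(path, content); "written" }
  }
  transcript = [task]

  max_steps.times do
    step = client.next_step(transcript)        # Thought: plan the next action
    break if step["done"]

    tool        = tools.fetch(step["tool"])    # Tool: look up the requested tool
    observation = tool.call(*step["args"])     # run it and capture the result

    transcript << { step: step, observation: observation }  # Observation feeds the next Thought
  end

  transcript
end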

Multi-provider ecosystem

The BaseClient uses the Template Method pattern so every provider shares the same connection setup, error logging, and response normalization. Swap providers by editing one line in skill-bench.json.

OpenAI
Anthropic
Google Gemini
Azure OpenAI
Ollama (local)
Groq
DeepSeek
OpenCode
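
The Template Method idea, roughly (class and method names are assumptions for illustration, not the gem's real API):

class BaseClient
  # Template method: the shared flow every provider client inherits.
  def complete(prompt)
    response = perform_request(prompt)   # the only provider-specific step
    normalize(response)                  # shared response normalization
  rescue StandardError => e
    log_error(e)                         # shared error logging
    raise
  end

  private

  def perform_request(_prompt) = raise NotImplementedError
  def normalize(response)      = response.fetch("text", response.to_s)
  def log_error(error)         = warn("[skill-bench] #{error.class}: #{error.message}")
end

class OpenAIClient < BaseClient
  private

  def perform_request(prompt)
    # provider-specific HTTP call would go here; stubbed for the sketch
    { "text" => "stubbed completion for #{prompt}" }
  end
end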

Security model

Command allowlist

By default, no shell commands are permitted. You explicitly list what's allowed in skill-bench.json — e.g. "rspec", "bundle", "git". Dangerous commands (bash, curl, sudo) are always blocked.

Atomic history writes

History is written with file locking. If you kill the process mid-write, SkillBench recovers from the auto-generated .skill-bench-history.json.bak backup.
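
One common way to get that behavior, sketched under the assumption that the history file is a plain JSON array (not the gem's actual code):

require "json"
require "fileutils"

def append_history(path, entry)
  FileUtils.cp(path, "#{path}.bak") if File.exist?(path)   # backup before writing
  File.open(path, File::RDWR | File::CREAT) do |f|
    f.flock(File::LOCK_EX)                                  # one writer at a time
    runs = f.size.zero? ? [] : JSON.parse(f.read)
    runs << entry
    f.rewind
    f.truncate(0)
    f.write(JSON.pretty_generate(runs))
  end
end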

Path validation

Eval paths are validated to prevent directory traversal attacks. The agent cannot escape the sandbox.

373+ tests

Core engine, CLI commands, and all provider clients are covered. Tests use WebMock for HTTP stubbing.


Iterating on Skills

Skills improve through feedback

A good skill is rarely written in one pass. The history file is your guide. Focus on the dimension with the smallest delta — that is where your skill is weakest.

1. Run the eval and read the table

Look at two things: (1) did it pass? (2) which dimension has the smallest delta?

2. Inspect history

cat .skill-bench-history.json | jq '.[-1]' — focus on the weakest dimension.

3. Edit SKILL.md with a concrete rule

If test_coverage delta was +3, add a rule: "Include at least one happy-path and one error-path test using let and subject."

4. Re-run and watch the trend

If test_coverage jumped from +3 to +8 and the TREND shows context ↑ (+5), the edit worked.

Stop when stable

When the context score sits around 95+ and the deltas stay flat across 2–3 consecutive runs, your skill is mature.


What to do next

Start in five minutes

Install

gem install ruby-skill-bench

Requires Ruby 3.1+.

Initialize

skill-bench init --openai

Creates skill-bench.json. Set your API key.

Run your first eval

skill-bench run my-eval \
  --skill=my-service

Watch the delta table.

Output formats for CI/CD

skill-bench run my-eval --skill=x --format json --format junit

JUnit XML integrates with GitHub Actions. The repo includes a CI workflow at .github/workflows/ci.yml.

"A skill that only beats baseline marginally is under-specified — it should change the model's output meaningfully."

— Ruby Skill Bench docs