RFC: Experimentation Framework for agents-fleet
Draft / RFC
This is a design RFC, not shipped behavior. Only Phase 0-A (definition + persistence) has landed. Interfaces and flags described below may change.
- Status: Draft (Phase 0-A landed: definition + persistence)
- Owner: Fleet platform
- Scope: A first-class way to define, run, grade, and compare variants of a skill / role / crew against a fixed task suite, then promote the winner.
1. Motivation
agents-fleet already has an evolution subsystem (shadow skills, EvalRunner, EvalScheduler, EvalJudge) that answers a narrow question: "is this one auto-proposed shadow better than the current skill?" It is single-variant, single-candidate, and tightly coupled to the shadow lifecycle.
What we do not have is a general A/B/n experimentation primitive:
Given N hand-authored or generated variants of a skill/role/crew, run each over a shared, versioned task suite, grade every trial with a pluggable grader, repeat for statistical confidence, and produce a comparison report with a recommended promotion.
This RFC defines that primitive. It is reuse-first: wherever the evolution subsystem already solved a sub-problem (sandboxed runs, verification, judging, scheduling, daemon control), the experiment framework composes it rather than reimplementing it.
Reuse, honestly: at P0 the framework achieves ~30% code reuse (shared stats math, row/store conventions, config-clamp patterns). The ~80% reuse target (wrapping EvalRunner, VerifierNode, EvalJudge, and EvalScheduler wholesale) is a P4 goal, not a P0 reality. The Grader abstraction is a thin composition adapter over those pieces, not a unification of them: each grader delegates to an existing primitive behind a common grade() signature; it does not merge their internals.
2. Concept model
```text
Experiment
├─ id, name, kind: 'skill' | 'role' | 'crew'
├─ Variants[]        # the things being compared (baseline + candidates)
│   └─ variantId, label, source (skill body / role md / crew spec / patch)
├─ TaskSuite         # a versioned set of Cases the variants run against
│   └─ Cases[]       # { caseId, prompt, expected?, criteria? }
├─ Grader            # how a single trial's output → score + passed
└─ Config            # repeats, maxVariants, perTrialTimeoutMs, ...

ExperimentRun        # one execution of an Experiment
└─ Trial[]           # one per (variant × case × repeatIdx)
    └─ score, passed, grader, durationMs, tokensIn/out, outputHash, metadata

Report               # derived: per-variant aggregates + winner + deltas
```

- An Experiment is the immutable-ish definition (its spec is JSON).
- A Variant is one contestant. The baseline (current skill/role/crew) is always variant 0 so every report has a reference point.
- A TaskSuite is a named, versioned bag of Cases. Versioning lets a report cite exactly which cases produced it; re-running after editing a suite bumps the version rather than mutating history.
- A Case is a single prompt plus optional expected answer-key and criteria (grader hints). Both are free-form JSON.
- A Grader turns one trial's raw output into { score, passed }.
- An ExperimentRun is one execution; it fans out into Trials, the atomic graded unit, keyed by (variantId, caseId, repeatIdx).
- A Report is derived (not persisted in P0-A) by aggregating trials.
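The fan-out from one ExperimentRun into Trials can be sketched as a plain nested enumeration. This is a minimal illustration, not the Runner's actual code: the `TrialCoordinate` type and `enumerateTrials` function are hypothetical names, and only the `(variantId, caseId, repeatIdx)` key comes from the model above.

```typescript
// Hypothetical shape for illustration; only the three key fields come from the RFC.
interface TrialCoordinate {
  variantId: string;
  caseId: string;
  repeatIdx: number;
}

// Enumerate every (variant × case × repeat) coordinate in variant, case, repeat order.
function enumerateTrials(
  variantIds: string[],
  caseIds: string[],
  repeats: number,
): TrialCoordinate[] {
  const coords: TrialCoordinate[] = [];
  for (const variantId of variantIds) {
    for (const caseId of caseIds) {
      for (let repeatIdx = 0; repeatIdx < repeats; repeatIdx++) {
        coords.push({ variantId, caseId, repeatIdx });
      }
    }
  }
  return coords;
}

// Baseline is listed first so it is always variant 0.
const fanOut = enumerateTrials(["baseline", "terse-v1"], ["case-a", "case-b", "case-c"], 3);
console.log(fanOut.length); // 2 variants × 3 cases × 3 repeats = 18
```

Because the trial count is the product of three inputs, it grows quickly; the configuration section's maxTrialsPerRun cap exists for that reason.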
3. Architecture (reuse-first)
| Component | Responsibility | Reuses / mirrors |
|---|---|---|
| ExperimentStore | Persist experiments, suites, cases, runs, trials. | Mirrors EvalRunStore / EvolutionRunStore (*.rows.ts + prepared-statement store); registered on IntelDatabase. Uses eval_runs-style table conventions. |
| ExperimentRunner | Fan out (variant × case × repeat) → execute → grade → persist trials. | Wraps EvalRunner for sandboxed execution + VerifierNode for command checks. |
| VariantProvisioner | Materialize a variant (skill body / role / crew) into a runnable form. | Reuses ShadowStore provisioning + createEvolverFixture test scaffolding. |
| Grader | grade(trial) → { score, passed }. Pluggable strategies. | Reuses EvalJudge (LLM verdict) and VerifierNode (deterministic command pass/fail). |
| stats | Aggregate trials → per-variant mean/pass-rate + winner + deltas. | Pure functions; no new deps. Mirrors evolution win-rate math. |
| scheduler | Optionally run experiments in the background, budgeted. | Reuses EvalScheduler + DaemonControlChannel (same budget/tick model). |
| command | /experiment slash command surface. | Mirrors /learn command structure + *.experiment.md like *.skill.md. |
| console | FleetConsole "Experiments" channel + comparison BarChart. | Reuses existing console channel + chart widgets. |
| config | experiment sub-config block. | Parallel to EvolutionConfig; bounds-clamped in loadConfig. |
The dependency direction is strictly: command/console → Runner → {Provisioner, Grader, ExperimentStore} → IntelDatabase. The store layer (this RFC's P0-A) has no upward dependencies, which is why it can land and be tested first.
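The stats row in the table above describes pure functions that turn trials into per-variant aggregates. A minimal sketch of that idea follows, assuming a reduced trial shape; `TrialLike`, `VariantAggregate`, and `aggregateByVariant` are illustrative names, not the framework's actual exports.

```typescript
// Assumed minimal trial shape; undefined score/passed means "ungraded".
interface TrialLike {
  variantId: string;
  score?: number;
  passed?: boolean;
}

interface VariantAggregate {
  variantId: string;
  n: number;         // graded trials counted
  meanScore: number;
  passRate: number;
}

// Pure aggregation: no I/O, no dependencies. Ungraded trials are skipped.
function aggregateByVariant(trials: TrialLike[]): VariantAggregate[] {
  const buckets = new Map<string, { n: number; scoreSum: number; passCount: number }>();
  for (const t of trials) {
    if (t.score === undefined || t.passed === undefined) continue;
    const b = buckets.get(t.variantId) ?? { n: 0, scoreSum: 0, passCount: 0 };
    b.n += 1;
    b.scoreSum += t.score;
    if (t.passed) b.passCount += 1;
    buckets.set(t.variantId, b);
  }
  return Array.from(buckets, ([variantId, b]) => ({
    variantId,
    n: b.n,
    meanScore: b.scoreSum / b.n,
    passRate: b.passCount / b.n,
  }));
}

const aggregates = aggregateByVariant([
  { variantId: "baseline", score: 0.5, passed: true },
  { variantId: "baseline", score: 1.0, passed: false },
  { variantId: "terse-v1", score: 1.0, passed: true },
  { variantId: "terse-v1" }, // ungraded, ignored
]);
console.log(aggregates);
```

Keeping this layer free of store or runner imports is what lets it sit anywhere in the dependency chain without creating an upward edge.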
4. Data model (DDL)
Added via a single idempotent SchemaManager migration (create_experiment_framework), registered in the schema_migrations ledger so existing DBs upgrade on next open. All tables/indexes use IF NOT EXISTS.
```sql
CREATE TABLE IF NOT EXISTS experiments (
  id         TEXT PRIMARY KEY,
  name       TEXT NOT NULL,
  kind       TEXT NOT NULL CHECK(kind IN ('skill','role','crew')),
  spec_json  TEXT NOT NULL,
  status     TEXT NOT NULL CHECK(status IN ('draft','ready','running','complete','cancelled')),
  created_at INTEGER NOT NULL,
  suite_id   TEXT -- denormalized from spec.suiteId for suite lookups (M2)
);

CREATE TABLE IF NOT EXISTS experiment_task_suites (
  suite_id   TEXT PRIMARY KEY,
  name       TEXT NOT NULL,
  version    INTEGER NOT NULL,
  created_at INTEGER NOT NULL
);

CREATE TABLE IF NOT EXISTS experiment_cases (
  suite_id      TEXT NOT NULL,
  case_id       TEXT NOT NULL,
  prompt        TEXT NOT NULL,
  expected_json TEXT,
  criteria_json TEXT,
  PRIMARY KEY (suite_id, case_id)
);

CREATE TABLE IF NOT EXISTS experiment_runs (
  run_id        TEXT PRIMARY KEY,
  experiment_id TEXT NOT NULL,
  started_at    INTEGER NOT NULL,
  completed_ts  INTEGER,
  status        TEXT NOT NULL CHECK(status IN ('running','complete','cancelled','error'))
);

CREATE TABLE IF NOT EXISTS experiment_trials (
  trial_id      TEXT PRIMARY KEY,
  run_id        TEXT NOT NULL,
  variant_id    TEXT NOT NULL,
  case_id       TEXT NOT NULL,
  repeat_idx    INTEGER NOT NULL,
  score         REAL,
  passed        INTEGER, -- 0/1, NULL = ungraded
  grader        TEXT,
  duration_ms   INTEGER,
  tokens_in     INTEGER,
  tokens_out    INTEGER,
  output_hash   TEXT,
  metadata_json TEXT
);

-- Indexes
CREATE INDEX IF NOT EXISTS idx_experiment_runs_experiment ON experiment_runs(experiment_id, started_at DESC);
CREATE INDEX IF NOT EXISTS idx_experiment_trials_run ON experiment_trials(run_id);
CREATE UNIQUE INDEX IF NOT EXISTS idx_experiment_trials_identity ON experiment_trials(run_id, variant_id, case_id, repeat_idx); -- H1: duplicate-trial guard
CREATE INDEX IF NOT EXISTS idx_experiment_cases_suite ON experiment_cases(suite_id);
CREATE INDEX IF NOT EXISTS idx_experiments_suite ON experiments(suite_id); -- M2

-- Production-analytics tagging (L2): forward-prep only, NULL = production.
-- SQLite has no ADD COLUMN IF NOT EXISTS; this step relies on the schema_migrations ledger to run once.
ALTER TABLE skill_outcomes ADD COLUMN experiment_run_id TEXT;
CREATE INDEX IF NOT EXISTS idx_skill_outcomes_experiment_run ON skill_outcomes(experiment_run_id);
```

Hardening migrations (post-P0 architecture review)
These ship as additional idempotent steps in the schema_migrations ledger, so a fresh db and an existing db both converge to the same shape:
- create_experiment_trials_unique: replaces the non-unique (run_id, variant_id) index with a UNIQUE index on the full trial identity (run_id, variant_id, case_id, repeat_idx). The store writes trials with INSERT OR REPLACE, so re-grading a coordinate overwrites in place rather than duplicating. The old idx_experiment_trials_run_variant is dropped (its prefix is covered by the unique index).
- experiments_add_suite_id / idx_experiments_suite: adds and indexes the suite_id column, populated from spec.suiteId on create; surfaced via listExperimentsBySuite.
- skill_outcomes_add_experiment_run_id / idx_skill_outcomes_experiment_run: forward-prep tag to exclude experiment-driven trials from production analytics (NULL = production). No writer wiring yet.
Design notes
- spec_json holds the full experiment definition (variants, grader choice, config overrides). Keeping it as opaque JSON means the variant/grader schema can evolve without further migrations. The in-memory spec is the typed ExperimentSpec (H3): the JSON is given structure at the store boundary via a cast (there is no runtime .parse(); the type is a compile-time contract only):

  ```ts
  interface ExperimentSpec {
    variants: { id: string; label: string; patch?: string; roleName?: string; skills?: string[] }[];
    grader: 'heuristic' | 'verifier' | 'judge';
    suiteId: string;
    repeats?: number;
    baselineVariantId?: string;
    config?: Partial<ExperimentConfig>;
  }
  ```
- passed is an INTEGER (not BOOLEAN; SQLite has none) with NULL reserved for ungraded trials, distinct from 0 (graded-and-failed). The store maps null → undefined, 1 → true, 0 → false.
- output_hash lets the report flag flaky / identical outputs across repeats without storing raw (potentially large / sensitive) output text.
- Repeats are first-class via repeat_idx; the unique trial identity is (run_id, variant_id, case_id, repeat_idx), enforced by the UNIQUE index idx_experiment_trials_identity (H1), not merely by the minted trial_id.
- No FKs between experiment_runs.experiment_id and experiments, matching the loose-coupling convention of eval_runs / evolution_runs, so a pruned experiment never cascades away its historical run telemetry.
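The tri-state passed column and the output_hash comparison described in the notes above can be sketched as small pure helpers. The function names here are hypothetical and chosen for illustration; the actual row mappers live in the store's *.rows.ts files and may look different.

```typescript
// Column → record: NULL = ungraded, 1 = passed, 0 = graded-and-failed.
function passedFromColumn(value: number | null): boolean | undefined {
  if (value === null) return undefined;
  return value === 1;
}

// Record → column: the inverse mapping used on write.
function passedToColumn(passed: boolean | undefined): number | null {
  if (passed === undefined) return null;
  return passed ? 1 : 0;
}

// Assumed flakiness signal: repeats of one (variant, case) pair that produced
// more than one distinct output hash.
function hasDivergentOutputs(outputHashes: string[]): boolean {
  return new Set(outputHashes).size > 1;
}

console.log(passedFromColumn(null)); // undefined (ungraded)
console.log(passedFromColumn(0));    // false (graded and failed)
console.log(hasDivergentOutputs(["abc", "abc", "def"])); // true
```

Keeping undefined distinct from false matters for aggregation: an ungraded trial should be excluded from a pass rate, not counted as a failure.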
5. The ExperimentStore API (P0-A — landed)
Typed, camelCase records with JSON parse/stringify for *_json columns. Every write is best-effort (try/catch swallow); every read returns a sensible default (null / []) so a corrupt row never crashes a caller. Prepared statements are stored as class properties.
Write semantics (H2): the root entities experiments, experiment_runs, experiment_task_suites use INSERT OR IGNORE — a second create with an existing id is a no-op, never a silent overwrite of historical state. The append-style children experiment_cases and experiment_trials keep INSERT OR REPLACE (re-grading a coordinate / re-adding a case is expected).
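The difference between the two write modes can be modelled without a database. The sketch below is an in-memory stand-in (the `KeyedTable` class is hypothetical, not part of ExperimentStore) that shows only the conflict behaviour: first write wins for root entities, latest write wins for trials.

```typescript
// In-memory model of the two SQLite conflict modes, keyed by primary/unique key.
class KeyedTable<Row> {
  private readonly rows = new Map<string, Row>();

  /** Models INSERT OR IGNORE: the first write for a key wins. */
  insertOrIgnore(key: string, row: Row): boolean {
    if (this.rows.has(key)) return false;
    this.rows.set(key, row);
    return true;
  }

  /** Models INSERT OR REPLACE: the latest write for a key wins. */
  insertOrReplace(key: string, row: Row): void {
    this.rows.set(key, row);
  }

  get(key: string): Row | undefined {
    return this.rows.get(key);
  }
}

// Root entity: a second create with the same id is a no-op.
const experimentsTable = new KeyedTable<{ name: string }>();
experimentsTable.insertOrIgnore("exp-1", { name: "original" });
experimentsTable.insertOrIgnore("exp-1", { name: "accidental re-create" });

// Child row: re-grading the same trial coordinate overwrites in place.
const trialsTable = new KeyedTable<{ score: number }>();
const trialKey = ["run-1", "terse-v1", "case-a", 0].join("|");
trialsTable.insertOrReplace(trialKey, { score: 0.4 });
trialsTable.insertOrReplace(trialKey, { score: 0.9 });

console.log(experimentsTable.get("exp-1")?.name); // "original"
console.log(trialsTable.get(trialKey)?.score);    // 0.9
```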
Lifecycle ownership (M6): the ExperimentRunner owns run status only (createRun / completeRun); it does not mutate the experiments.status definition row. Callers derive a live experiment status from getExperimentStatus(experimentId) (the latest run's status) or own experiments.status at the create/command layer.
```ts
// Experiments
createExperiment(rec: ExperimentRecord): void                     // INSERT OR IGNORE: never clobbers (H2)
getExperiment(id: string): ExperimentRecord | null
listExperiments(): ExperimentRecord[]                             // newest-first
listExperimentsBySuite(suiteId: string): ExperimentRecord[]       // newest-first (M2)
updateExperimentStatus(id: string, status: ExperimentStatus): void
getExperimentStatus(experimentId: string): ExperimentRunStatus | null // latest run's status (M6)

// Suites / cases
createTaskSuite(rec: TaskSuiteRecord): void
getTaskSuite(suiteId: string): TaskSuiteRecord | null
addCase(rec: CaseRecord): void
listCases(suiteId: string): CaseRecord[]

// Runs
createRun(rec: ExperimentRunRecord): void
completeRun(runId: string, status: ExperimentRunStatus, completedTs?: number): void
getRun(runId: string): ExperimentRunRecord | null
listRunsForExperiment(experimentId: string): ExperimentRunRecord[] // newest-first

// Trials
insertTrial(rec: TrialRecord): void
listTrials(runId: string): TrialRecord[]                          // (variant, case, repeat) order
listTrialsByVariant(runId: string, variantId: string): TrialRecord[]
```

String-literal-union types (no enums):

```ts
type ExperimentKind = 'skill' | 'role' | 'crew';
type ExperimentStatus = 'draft' | 'ready' | 'running' | 'complete' | 'cancelled';
type ExperimentRunStatus = 'running' | 'complete' | 'cancelled' | 'error';
```

IntelDatabase composes ExperimentStore next to the other stores and exposes a thin facade (createExperiment, recordExperimentTrial, listExperimentTrials, completeExperimentRun, …) mirroring how recordEvalRun is surfaced.
6. The Grader abstraction
A grader maps one trial to a verdict:
```ts
interface Grader {
  readonly name: string; // persisted into trial.grader
  grade(input: {
    case: CaseRecord;
    output: string;
    durationMs: number;
  }): Promise<{ score: number; passed: boolean; metadata?: Record<string, unknown> }>;
}
```

Planned built-ins (P1), all composed from existing pieces:

- exact / contains: deterministic string match against case.expected.
- verifier: runs case.criteria.commands through VerifierNode; passed = all commands exit 0. Reuses the evolution verifier wholesale.
- judge: delegates to EvalJudge for an LLM rubric verdict, mapping its JudgeVerdict to { score, passed }.
grader is stored per-trial (not per-experiment) so a single run can mix graders (e.g. cheap deterministic gate first, LLM judge only on ties).
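As an illustration of how small a deterministic grader can be, here is a sketch of a contains-style grader against the interface above. The local `CaseRecord` shape is an assumption reduced to the fields the concept model lists, and `gradeContains` / `containsGrader` are hypothetical names rather than the shipped built-ins.

```typescript
// Assumed minimal CaseRecord; `expected` is free-form JSON per the concept model.
interface CaseRecord {
  suiteId: string;
  caseId: string;
  prompt: string;
  expected?: unknown;
  criteria?: unknown;
}

interface GradeResult {
  score: number;
  passed: boolean;
  metadata?: Record<string, unknown>;
}

interface Grader {
  readonly name: string;
  grade(input: { case: CaseRecord; output: string; durationMs: number }): Promise<GradeResult>;
}

// Pure core: pass when the expected string occurs anywhere in the output.
function gradeContains(expected: unknown, output: string): GradeResult {
  if (typeof expected !== "string") {
    return { score: 0, passed: false, metadata: { reason: "expected is not a string" } };
  }
  const passed = output.includes(expected);
  return { score: passed ? 1 : 0, passed };
}

// Thin adapter that satisfies the async Grader signature.
const containsGrader: Grader = {
  name: "contains",
  async grade({ case: c, output }) {
    return gradeContains(c.expected, output);
  },
};

const sampleVerdict = gradeContains("42", "The answer is 42.");
console.log(containsGrader.name, sampleVerdict.passed); // contains true
```

Splitting the pure check from the async adapter keeps the deterministic logic trivially testable, while the adapter's name is what would be persisted into trial.grader.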
6.1 Shared Verdict type (arch rec M1)
The review noted that the framework carried four disconnected result shapes — TrialScore (runner grade), JudgeResult (EvalJudge), VerifierResult (VerifierNode), and GateResult (validation gates) — and that the Grader interface is a composition adapter, not a unification: it wraps each scorer behind grade() but never gives the four shapes a common type.
src/intel/experiment/verdict.ts supplies that common shape without breaking any producer:
```ts
interface Verdict { passed: boolean; score: number; /* 0..1 */ evidence?: string }
```

- TrialScore now extends Verdict (its { score, passed, evidence } fields are unchanged, so all existing usage compiles untouched).
- Pure adapters map the other three shapes into a Verdict: verdictFromJudge(j, { aIsVariant? }), verdictFromVerifier(v), verdictFromGate(g).
The underlying JudgeResult / VerifierResult / GateResult source types are left unchanged — this is a lightweight, non-breaking unification that lets future framework code treat any grader output uniformly via Verdict without refactoring the producers today.
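To make the adapter idea concrete, the sketch below maps a verifier-style result into a Verdict. The real VerifierResult shape is not shown in this RFC, so `AssumedVerifierResult` is a stand-in (a list of commands with exit codes), and `sketchVerdictFromVerifier` is deliberately named differently from the real verdictFromVerifier.

```typescript
interface Verdict {
  passed: boolean;
  score: number; // 0..1
  evidence?: string;
}

// ASSUMPTION: a stand-in for VerifierResult; the real type may differ.
interface AssumedVerifierResult {
  commands: { command: string; exitCode: number }[];
}

// Pure adapter: passed = every command exited 0; score = fraction that did.
// An empty command list is treated as a failure here (an illustrative choice).
function sketchVerdictFromVerifier(v: AssumedVerifierResult): Verdict {
  const total = v.commands.length;
  if (total === 0) {
    return { passed: false, score: 0, evidence: "no commands were run" };
  }
  const failed = v.commands.filter((c) => c.exitCode !== 0);
  return {
    passed: failed.length === 0,
    score: (total - failed.length) / total,
    evidence: failed.length === 0
      ? `all ${total} commands exited 0`
      : `failed: ${failed.map((c) => c.command).join(", ")}`,
  };
}

const verifierVerdict = sketchVerdictFromVerifier({
  commands: [
    { command: "npm test", exitCode: 0 },
    { command: "npm run lint", exitCode: 1 },
  ],
});
console.log(verifierVerdict); // passed: false, score: 0.5
```

The adapter reads its input and returns a new object, so the producer type stays untouched, which is the non-breaking property this section relies on.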
7. CLI / console surface
/experiment slash command (P2)
```text
/experiment create <file.experiment.md>   # register an experiment from spec
/experiment run <experimentId>            # execute (fan out → grade → persist)
/experiment list                          # table of experiments + status
/experiment report <runId>                # per-variant comparison + winner
/experiment cancel <runId>                # mark run cancelled, stop scheduling
/experiment promote <runId> <variantId>   # apply winner via existing promotion
```

*.experiment.md format (mirrors *.skill.md)

```md
---
name: tighten-coder-skill
kind: skill
suite: core-coding-cases
grader: verifier
config: { repeats: 3, maxVariants: 4 }
variants:
  - id: baseline            # variant 0 = current skill
  - id: terse-v1
    patch: variants/terse-v1.md
  - id: checklist-v1
    patch: variants/checklist-v1.md
---
# Tighten coder skill
Free-form rationale / notes for the experiment.
```

The YAML front-matter is parsed into spec_json; the body is documentation.
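The first step of handling such a file, separating front-matter from body, can be sketched without any YAML dependency. `splitFrontMatter` is a hypothetical helper for illustration; it only splits on the `---` delimiters and leaves actual YAML parsing (and the mapping into spec_json) to whatever parser the project uses.

```typescript
// Split a document into its front-matter text and body.
// Returns null when the file does not open with a `---` line or never closes it.
function splitFrontMatter(text: string): { frontMatter: string; body: string } | null {
  const lines = text.split(/\r?\n/);
  if (lines.length === 0 || lines[0].trim() !== "---") return null;
  const end = lines.findIndex((line, i) => i > 0 && line.trim() === "---");
  if (end === -1) return null;
  return {
    frontMatter: lines.slice(1, end).join("\n"),
    body: lines.slice(end + 1).join("\n"),
  };
}

const sampleDoc = [
  "---",
  "name: tighten-coder-skill",
  "kind: skill",
  "---",
  "# Tighten coder skill",
  "Free-form rationale.",
].join("\n");

const parts = splitFrontMatter(sampleDoc);
console.log(parts?.frontMatter); // the two YAML lines
console.log(parts?.body);        // the markdown body
```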
FleetConsole "Experiments" channel (P3)
A dedicated console channel that:
- lists active/recent runs with live status + progress (trials done / total);
- renders a comparison BarChart of per-variant mean score (and a second of pass-rate), baseline highlighted, winner annotated with the delta;
- surfaces confidence (n = repeats × cases) so a thin sample is visually flagged.
All of this reads exclusively through the ExperimentStore facade — no direct SQL in the console layer.
8. Configuration
ExperimentConfig lives parallel to EvolutionConfig in src/config.ts, defaulted in DEFAULT_CONFIG, merged from disk in the same loadConfig block, and bounds-clamped like the eval-daemon knobs.
```ts
interface ExperimentConfig {
  enabled: boolean;          // default false (master switch)
  defaultRepeats: number;    // default 3 (clamp 1..50)
  maxVariants: number;       // default 6 (clamp 1..20)
  perTrialTimeoutMs: number; // default 120000 (clamp 1000..600000)
  maxTrialsPerRun: number;   // default 200 (clamp 1..5000)
}
```

maxTrialsPerRun is the safety valve: variants × cases × repeats is rejected by the Runner before launch if it exceeds the cap, preventing a runaway fan-out.
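Both mechanisms, bounds-clamping on load and the pre-launch fan-out check, are simple enough to sketch. The helpers below (`clampOr`, `checkFanOut`) are hypothetical names; the defaults and bounds in the usage lines are the ones listed in the interface comments, but the real loadConfig and Runner code may be structured differently.

```typescript
// Clamp a loaded value into [min, max]; fall back to the default when the
// on-disk value is missing or not a finite number.
function clampOr(value: unknown, fallback: number, min: number, max: number): number {
  if (typeof value !== "number" || !Number.isFinite(value)) return fallback;
  return Math.min(max, Math.max(min, value));
}

// Pre-launch guard: reject a run whose full fan-out exceeds the cap.
function checkFanOut(
  variants: number,
  cases: number,
  repeats: number,
  maxTrialsPerRun: number,
): { ok: boolean; total: number } {
  const total = variants * cases * repeats;
  return { ok: total <= maxTrialsPerRun, total };
}

const repeatsFromDisk = clampOr(999, 3, 1, 50);       // over the bound, clamped to 50
const repeatsMissing = clampOr(undefined, 3, 1, 50);  // absent, falls back to 3
const tooBig = checkFanOut(6, 20, 3, 200);            // 360 trials against a cap of 200
const fits = checkFanOut(4, 10, 3, 200);              // 120 trials, within the cap

console.log(repeatsFromDisk, repeatsMissing, tooBig, fits);
```

Checking the product before any trial starts means an oversized experiment fails fast with a clear number, rather than being cut off partway through a run.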
9. Phased plan
| Phase | Deliverable | State |
|---|---|---|
| P0-A | Deterministic core: DDL migration, ExperimentStore (+ rows), IntelDatabase facade, ExperimentConfig, tests, this RFC. | Landed |
| P0-B | stats aggregation helpers + report data shape (pure functions over listTrials). | Shipped |
| P1 | Real runs: ExperimentRunner + LiveVariantExecutor (wraps FleetManager.spawnWorker) + graders (heuristic, verifier, judge) + token cost caps. | Shipped |
| P2 | /experiment command (create/run/report/list/dataset export/promote) + *.experiment.md parser. | Shipped |
| P3 | FleetConsole Experiments channel + comparison BarChart (via broadcastFleetExperiment frame). | Shipped |
| P4 | Promotion (/experiment promote → existing promoteShadow/evolution pipeline) + optional ExperimentScheduler-driven background runs. | Shipped |
Status (2026-06-19): all phases P0–P4 shipped to master (PRs #573–#581 in agents-fleet, PR #2 in agents-fleet-console), CI green. Follow-ups also landed: hardening (#574), architect recs L3/L4/M1 + deep trial-tagging (#577), auto-labeled outcomes at collection + expectOutcome deterministic grading (#578). The live-LLM path is verified by the gated evolution e2e (real +0.54 quality lift).
Each phase is independently shippable; P0-A introduces zero behavior change (the framework is enabled: false by default and nothing dispatches to it yet).
10. Non-goals (for now)
- Cross-session statistical significance testing beyond a transparent, variance-aware meetsImprovementThreshold heuristic (renamed from the earlier, over-claiming meetsSignificance; M3). It gates on lift > noiseFloor, lift >= minImprovement, and pooledStddev === 0 || lift >= significanceK × pooledStddev (default significanceK = 1.0); it is not a p-value / t-test. The report's winnerSignificant flag (M5) is computed by this same heuristic over the winner-vs-baseline scores.
- Distributed / multi-host trial execution: runs are in-process via the existing worker/eval machinery.
- Persisting derived Reports — they are recomputed from trials on demand; trials are the durable source of truth.
- Auto-generating variants (a future tie-in with the evolution proposer).
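The three gates of the improvement heuristic named in the first bullet above can be sketched as follows. This is a reading of the stated conditions, not the shipped function: it assumes lift is the candidate mean minus the baseline mean and that pooledStddev is the standard pooled sample standard deviation, neither of which this RFC spells out. The noiseFloor and minImprovement values in the usage lines are illustrative only.

```typescript
function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Unbiased sample variance; 0 when fewer than two samples.
function sampleVariance(xs: number[]): number {
  if (xs.length < 2) return 0;
  const m = mean(xs);
  return xs.reduce((acc, x) => acc + (x - m) * (x - m), 0) / (xs.length - 1);
}

// ASSUMPTION: standard pooled standard deviation across the two groups.
function pooledStddev(a: number[], b: number[]): number {
  const dof = a.length + b.length - 2;
  if (dof <= 0) return 0;
  const pooled = ((a.length - 1) * sampleVariance(a) + (b.length - 1) * sampleVariance(b)) / dof;
  return Math.sqrt(pooled);
}

interface ThresholdOptions {
  noiseFloor: number;
  minImprovement: number;
  significanceK?: number; // the RFC's stated default is 1.0
}

// Hypothetical name: a sketch of the gates, not the real meetsImprovementThreshold.
function sketchMeetsImprovementThreshold(
  baseline: number[],
  candidate: number[],
  opts: ThresholdOptions,
): boolean {
  if (baseline.length === 0 || candidate.length === 0) return false;
  const k = opts.significanceK ?? 1.0;
  const lift = mean(candidate) - mean(baseline);
  const sd = pooledStddev(baseline, candidate);
  return lift > opts.noiseFloor && lift >= opts.minImprovement && (sd === 0 || lift >= k * sd);
}

const gateOpts = { noiseFloor: 0.01, minImprovement: 0.05 }; // illustrative values
const steadyWin = sketchMeetsImprovementThreshold([0.5, 0.5, 0.5], [0.75, 0.75, 0.75], gateOpts);
const noisyWin = sketchMeetsImprovementThreshold([0, 1], [0.25, 1.25], gateOpts);
console.log(steadyWin, noisyWin); // true false
```

The second call shows why the variance gate exists: the lift is the same 0.25 in both cases, but with a pooled standard deviation of about 0.71 it is well inside the noise and is not reported as an improvement.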