RFC: Experimentation Framework for agents-fleet

Draft / RFC

This is a design RFC, not shipped behavior. Only Phase 0-A (definition + persistence) has landed. Interfaces and flags described below may change.

Status: Draft (Phase 0-A landed: definition + persistence)
Owner: Fleet platform
Scope: A first-class way to define, run, grade, and compare variants of a skill / role / crew against a fixed task suite, then promote the winner.

1. Motivation

agents-fleet already has an evolution subsystem (shadow skills, EvalRunner, EvalScheduler, EvalJudge) that answers a narrow question: "is this one auto-proposed shadow better than the current skill?" It is single-variant, single-candidate, and tightly coupled to the shadow lifecycle.

What we do not have is a general A/B/n experimentation primitive:

Given N hand-authored or generated variants of a skill/role/crew, run each over a shared, versioned task suite, grade every trial with a pluggable grader, repeat for statistical confidence, and produce a comparison report with a recommended promotion.

This RFC defines that primitive. It is reuse-first: wherever the evolution subsystem already solved a sub-problem (sandboxed runs, verification, judging, scheduling, daemon control), the experiment framework composes it rather than reimplementing it.

Reuse, honestly: at P0 the framework achieves ~30% code reuse (shared stats math, row/store conventions, config-clamp patterns). The ~80% reuse target — wrapping EvalRunner, VerifierNode, EvalJudge, EvalScheduler wholesale — is a P4 goal, not a P0 reality. The Grader abstraction is a thin composition adapter over those pieces, not a unification of them: each grader delegates to an existing primitive behind a common grade() signature; it does not merge their internals.

2. Concept model

Experiment
 ├─ id, name, kind: 'skill' | 'role' | 'crew'
 ├─ Variants[]            # the things being compared (baseline + candidates)
 │   └─ variantId, label, source (skill body / role md / crew spec / patch)
 ├─ TaskSuite             # a versioned set of Cases the variants run against
 │   └─ Cases[]           # { caseId, prompt, expected?, criteria? }
 ├─ Grader                # how a single trial's output → score + passed
 └─ Config                # repeats, maxVariants, perTrialTimeoutMs, ...

ExperimentRun             # one execution of an Experiment
 └─ Trial[]               # one per (variant × case × repeatIdx)
     └─ score, passed, grader, durationMs, tokensIn/out, outputHash, metadata

Report                    # derived: per-variant aggregates + winner + deltas

An Experiment is the immutable-ish definition (its spec is JSON).
A Variant is one contestant. The baseline (current skill/role/crew) is always variant 0 so every report has a reference point.
A TaskSuite is a named, versioned bag of Cases. Versioning lets a report cite exactly which cases produced it; re-running after editing a suite bumps the version rather than mutating history.
A Case is a single prompt plus optional expected answer-key and criteria (grader hints). Both are free-form JSON.
A Grader turns one trial's raw output into { score, passed }.
An ExperimentRun is one execution; it fans out into Trials, the atomic graded unit, keyed by (variantId, caseId, repeatIdx).
A Report is derived (not persisted in P0-A) by aggregating trials.
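
The fan-out from an ExperimentRun into Trials can be sketched as a plain triple loop. This is a minimal illustration, not the Runner's code: `TrialKey` and `enumerateTrials` are hypothetical names introduced here.

```typescript
// Hypothetical illustration type; the real trial record carries more fields.
interface TrialKey {
  variantId: string;
  caseId: string;
  repeatIdx: number;
}

// Enumerate one trial coordinate per (variant × case × repeatIdx),
// in (variant, case, repeat) order.
function enumerateTrials(variantIds: string[], caseIds: string[], repeats: number): TrialKey[] {
  const keys: TrialKey[] = [];
  for (const variantId of variantIds) {
    for (const caseId of caseIds) {
      for (let repeatIdx = 0; repeatIdx < repeats; repeatIdx++) {
        keys.push({ variantId, caseId, repeatIdx });
      }
    }
  }
  return keys;
}

// 2 variants × 3 cases × 3 repeats = 18 trial coordinates.
const exampleKeys = enumerateTrials(['baseline', 'terse-v1'], ['case-a', 'case-b', 'case-c'], 3);
console.log(exampleKeys.length); // 18
```

Placing the baseline first in `variantIds` keeps it as variant 0 in the enumeration order.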

3. Architecture (reuse-first)

| Component | Responsibility | Reuses / mirrors |
| --- | --- | --- |
| `ExperimentStore` | Persist experiments, suites, cases, runs, trials. | Mirrors `EvalRunStore` / `EvolutionRunStore` (`*.rows.ts` + prepared-statement store); registered on `IntelDatabase`. Uses `eval_runs`-style table conventions. |
| `ExperimentRunner` | Fan out (variant × case × repeat) → execute → grade → persist trials. | Wraps `EvalRunner` for sandboxed execution + `VerifierNode` for command checks. |
| `VariantProvisioner` | Materialize a variant (skill body / role / crew) into a runnable form. | Reuses `ShadowStore` provisioning + `createEvolverFixture` test scaffolding. |
| `Grader` | `grade(trial) → { score, passed }`. Pluggable strategies. | Reuses `EvalJudge` (LLM verdict) and `VerifierNode` (deterministic command pass/fail). |
| `stats` | Aggregate trials → per-variant mean/pass-rate + winner + deltas. | Pure functions; no new deps. Mirrors evolution win-rate math. |
| `scheduler` | Optionally run experiments in the background, budgeted. | Reuses `EvalScheduler` + `DaemonControlChannel` (same budget/tick model). |
| `command` | `/experiment` slash command surface. | Mirrors `/learn` command structure + `.experiment.md` like `.skill.md`. |
| `console` | FleetConsole "Experiments" channel + comparison `BarChart`. | Reuses existing console channel + chart widgets. |
| `config` | `experiment` sub-config block. | Parallel to `EvolutionConfig`; bounds-clamped in `loadConfig`. |

The dependency direction is strictly: command/console → Runner → {Provisioner, Grader, ExperimentStore} → IntelDatabase. The store layer (this RFC's P0-A) has no upward dependencies, which is why it can land and be tested first.

4. Data model (DDL)

Added via a single idempotent SchemaManager migration (create_experiment_framework), registered in the schema_migrations ledger so existing DBs upgrade on next open. All tables/indexes use IF NOT EXISTS.

sql

experiments(
  id TEXT PRIMARY KEY,
  name TEXT NOT NULL,
  kind TEXT NOT NULL CHECK(kind IN ('skill','role','crew')),
  spec_json TEXT NOT NULL,
  status TEXT NOT NULL CHECK(status IN ('draft','ready','running','complete','cancelled')),
  created_at INTEGER NOT NULL,
  suite_id TEXT                 -- denormalized from spec.suiteId for suite lookups (M2)
);

experiment_task_suites(
  suite_id TEXT PRIMARY KEY,
  name TEXT NOT NULL,
  version INTEGER NOT NULL,
  created_at INTEGER NOT NULL
);

experiment_cases(
  suite_id TEXT NOT NULL,
  case_id TEXT NOT NULL,
  prompt TEXT NOT NULL,
  expected_json TEXT,
  criteria_json TEXT,
  PRIMARY KEY (suite_id, case_id)
);

experiment_runs(
  run_id TEXT PRIMARY KEY,
  experiment_id TEXT NOT NULL,
  started_at INTEGER NOT NULL,
  completed_ts INTEGER,
  status TEXT NOT NULL CHECK(status IN ('running','complete','cancelled','error'))
);

experiment_trials(
  trial_id TEXT PRIMARY KEY,
  run_id TEXT NOT NULL,
  variant_id TEXT NOT NULL,
  case_id TEXT NOT NULL,
  repeat_idx INTEGER NOT NULL,
  score REAL,
  passed INTEGER,            -- 0/1, NULL = ungraded
  grader TEXT,
  duration_ms INTEGER,
  tokens_in INTEGER,
  tokens_out INTEGER,
  output_hash TEXT,
  metadata_json TEXT
);

-- Indexes
CREATE INDEX IF NOT EXISTS        idx_experiment_runs_experiment      ON experiment_runs(experiment_id, started_at DESC);
CREATE INDEX IF NOT EXISTS        idx_experiment_trials_run           ON experiment_trials(run_id);
CREATE UNIQUE INDEX IF NOT EXISTS idx_experiment_trials_identity      ON experiment_trials(run_id, variant_id, case_id, repeat_idx);  -- H1: duplicate-trial guard
CREATE INDEX IF NOT EXISTS        idx_experiment_cases_suite          ON experiment_cases(suite_id);
CREATE INDEX IF NOT EXISTS        idx_experiments_suite               ON experiments(suite_id);                                       -- M2

-- Production-analytics tagging (L2): forward-prep only, NULL = production.
ALTER TABLE skill_outcomes ADD COLUMN experiment_run_id TEXT;
CREATE INDEX IF NOT EXISTS idx_skill_outcomes_experiment_run ON skill_outcomes(experiment_run_id);

Hardening migrations (post-P0 architecture review)

These ship as additional idempotent steps in the schema_migrations ledger, so a fresh db and an existing db both converge to the same shape:

create_experiment_trials_unique — replaces the non-unique (run_id, variant_id) index with a UNIQUE index on the full trial identity (run_id, variant_id, case_id, repeat_idx). The store writes trials with INSERT OR REPLACE, so re-grading a coordinate overwrites in place rather than duplicating. The old idx_experiment_trials_run_variant is dropped (its prefix is covered by the unique index).
experiments_add_suite_id / idx_experiments_suite — adds + indexes the suite_id column, populated from spec.suiteId on create; surfaced via listExperimentsBySuite.
skill_outcomes_add_experiment_run_id / idx_skill_outcomes_experiment_run — forward-prep tag to exclude experiment-driven trials from production analytics (NULL = production). No writer wiring yet.

Design notes

spec_json holds the full experiment definition (variants, grader choice, config overrides). Keeping it as opaque JSON means variant/grader schema can evolve without further migrations. The in-memory spec is the typed ExperimentSpec (H3) — the JSON is given structure at the store boundary via a cast (there is no runtime .parse(); the type is a compile-time contract only):
```ts
interface ExperimentSpec {
  variants: { id: string; label: string; patch?: string; roleName?: string; skills?: string[] }[];
  grader: 'heuristic' | 'verifier' | 'judge';
  suiteId: string;
  repeats?: number;
  baselineVariantId?: string;
  config?: Partial<ExperimentConfig>;
}
```
passed is stored as INTEGER (SQLite has no BOOLEAN type), with NULL reserved for ungraded trials, distinct from 0 (graded and failed). The store maps NULL → undefined, 1 → true, 0 → false.
output_hash lets the report flag flaky / identical outputs across repeats without storing raw (potentially large / sensitive) output text.
Repeats are first-class via repeat_idx; the unique trial identity is (run_id, variant_id, case_id, repeat_idx), enforced by the UNIQUE index idx_experiment_trials_identity (H1) — not merely by the minted trial_id.
No FKs between experiment_runs.experiment_id and experiments — matching the loose-coupling convention of eval_runs/evolution_runs, so a pruned experiment never cascades away its historical run telemetry.

5. The `ExperimentStore` API (P0-A — landed)

Typed, camelCase records with JSON parse/stringify for *_json columns. Every write is best-effort (wrapped in try/catch, with errors swallowed); every read returns a safe default (null / []) so a corrupt row never crashes a caller. Prepared statements are stored as class properties.
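
As a rough sketch of the row-mapping conventions above (the tri-state `passed` column and default-on-corrupt JSON reads), the helpers below show one way the store boundary could behave. The names `passedFromRow`, `passedToRow`, and `parseJsonColumn` are hypothetical, not the store's actual internals.

```typescript
// Map the SQLite INTEGER column to the in-memory tri-state:
// NULL → undefined (ungraded), 1 → true, 0 → false.
function passedFromRow(value: number | null): boolean | undefined {
  if (value === null) return undefined;
  return value === 1;
}

// Inverse mapping used on writes.
function passedToRow(passed: boolean | undefined): number | null {
  if (passed === undefined) return null;
  return passed ? 1 : 0;
}

// Parse a *_json column, returning the fallback on NULL or corrupt JSON
// so a bad row never throws into the caller.
function parseJsonColumn<T>(raw: string | null, fallback: T): T {
  if (raw === null) return fallback;
  try {
    return JSON.parse(raw) as T;
  } catch {
    return fallback;
  }
}
```

The cast in `parseJsonColumn` is deliberate: it matches the compile-time-only contract described for spec_json, with no runtime schema validation.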

Write semantics (H2): the root entities experiments, experiment_runs, experiment_task_suites use INSERT OR IGNORE — a second create with an existing id is a no-op, never a silent overwrite of historical state. The append-style children experiment_cases and experiment_trials keep INSERT OR REPLACE (re-grading a coordinate / re-adding a case is expected).
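
The difference between the two write modes can be illustrated with an in-memory stand-in. This is a simplified model under stated assumptions: `KeyedTable` and `trialKey` are hypothetical names, and a `Map` replaces the SQLite tables and prepared statements.

```typescript
// In-memory stand-in for the two SQLite write modes (illustration only).
class KeyedTable<T> {
  private rows = new Map<string, T>();

  // INSERT OR IGNORE: the first write for a key wins; later writes are no-ops.
  insertOrIgnore(key: string, row: T): void {
    if (!this.rows.has(key)) this.rows.set(key, row);
  }

  // INSERT OR REPLACE: the latest write for a key wins.
  insertOrReplace(key: string, row: T): void {
    this.rows.set(key, row);
  }

  get(key: string): T | undefined {
    return this.rows.get(key);
  }

  get size(): number {
    return this.rows.size;
  }
}

// Trial identity mirrors the columns of the UNIQUE index.
function trialKey(runId: string, variantId: string, caseId: string, repeatIdx: number): string {
  return JSON.stringify([runId, variantId, caseId, repeatIdx]);
}

// Root entity: a second create with the same id does not clobber the first.
const experimentsTable = new KeyedTable<{ name: string }>();
experimentsTable.insertOrIgnore('exp-1', { name: 'original' });
experimentsTable.insertOrIgnore('exp-1', { name: 'clobber attempt' });

// Child entity: re-grading the same coordinate overwrites in place.
const trialsTable = new KeyedTable<{ score: number }>();
const coordinate = trialKey('run-1', 'baseline', 'case-a', 0);
trialsTable.insertOrReplace(coordinate, { score: 0.4 });
trialsTable.insertOrReplace(coordinate, { score: 0.9 });
```

After these calls, `exp-1` still holds the `original` row, and the trials table holds a single row with score 0.9.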

Lifecycle ownership (M6): the ExperimentRunner owns run status only (createRun / completeRun); it does not mutate the experiments.status definition row. Callers derive a live experiment status from getExperimentStatus(experimentId) (the latest run's status) or own experiments.status at the create/command layer.

// Experiments
createExperiment(rec: ExperimentRecord): void        // INSERT OR IGNORE — never clobbers (H2)
getExperiment(id: string): ExperimentRecord | null
listExperiments(): ExperimentRecord[]                       // newest-first
listExperimentsBySuite(suiteId: string): ExperimentRecord[] // newest-first (M2)
updateExperimentStatus(id: string, status: ExperimentStatus): void
getExperimentStatus(experimentId: string): ExperimentRunStatus | null  // latest run's status (M6)

// Suites / cases
createTaskSuite(rec: TaskSuiteRecord): void
getTaskSuite(suiteId: string): TaskSuiteRecord | null
addCase(rec: CaseRecord): void
listCases(suiteId: string): CaseRecord[]

// Runs
createRun(rec: ExperimentRunRecord): void
completeRun(runId: string, status: ExperimentRunStatus, completedTs?: number): void
getRun(runId: string): ExperimentRunRecord | null
listRunsForExperiment(experimentId: string): ExperimentRunRecord[]   // newest-first

// Trials
insertTrial(rec: TrialRecord): void
listTrials(runId: string): TrialRecord[]                    // (variant, case, repeat) order
listTrialsByVariant(runId: string, variantId: string): TrialRecord[]

String-literal-union types (no enums):

type ExperimentKind      = 'skill' | 'role' | 'crew';
type ExperimentStatus    = 'draft' | 'ready' | 'running' | 'complete' | 'cancelled';
type ExperimentRunStatus = 'running' | 'complete' | 'cancelled' | 'error';

IntelDatabase composes ExperimentStore next to the other stores and exposes a thin facade (createExperiment, recordExperimentTrial, listExperimentTrials, completeExperimentRun, …) mirroring how recordEvalRun is surfaced.

6. The Grader abstraction

A grader maps one trial to a verdict:

interface Grader {
  readonly name: string;                       // persisted into trial.grader
  grade(input: {
    case: CaseRecord;
    output: string;
    durationMs: number;
  }): Promise<{ score: number; passed: boolean; metadata?: Record<string, unknown> }>;
}

Planned built-ins (P1), all composed from existing pieces:

exact / contains — deterministic string match against case.expected.
verifier — runs case.criteria.commands through VerifierNode; passed = all commands exit 0. Reuses the evolution verifier wholesale.
judge — delegates to EvalJudge for an LLM rubric verdict, mapping its JudgeVerdict to { score, passed }.

grader is stored per-trial (not per-experiment) so a single run can mix graders (e.g. cheap deterministic gate first, LLM judge only on ties).
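
To make the interface concrete, here is a minimal sketch of how the planned `contains` grader might be written against it. The `CaseRecord` shape is a cut-down stand-in, `gradeContains` is a hypothetical helper, and the sketch assumes `case.expected` is a plain string for this grader.

```typescript
// Cut-down stand-in for the store's CaseRecord; only the fields used here.
interface CaseRecord {
  caseId: string;
  prompt: string;
  expected?: unknown;
}

interface GradeResult {
  score: number;
  passed: boolean;
  metadata?: Record<string, unknown>;
}

interface Grader {
  readonly name: string;
  grade(input: { case: CaseRecord; output: string; durationMs: number }): Promise<GradeResult>;
}

// Deterministic core: pass when the output contains the expected string.
// A missing or non-string expected value is graded as a failure (assumption).
function gradeContains(expected: unknown, output: string): GradeResult {
  if (typeof expected !== 'string' || expected.length === 0) {
    return { score: 0, passed: false, metadata: { reason: 'no string expected value' } };
  }
  const passed = output.includes(expected);
  return { score: passed ? 1 : 0, passed };
}

// Thin async wrapper so the deterministic check fits the Grader signature.
const containsGrader: Grader = {
  name: 'contains',
  async grade(input) {
    return gradeContains(input.case.expected, input.output);
  },
};
```

Keeping the check in a synchronous pure function makes it trivially testable; only the wrapper needs to be async to share a signature with the verifier and judge graders.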

6.1 Shared `Verdict` type (arch rec M1)

The review noted that the framework carried four disconnected result shapes — TrialScore (runner grade), JudgeResult (EvalJudge), VerifierResult (VerifierNode), and GateResult (validation gates) — and that the Grader interface is a composition adapter, not a unification: it wraps each scorer behind grade() but never gives the four shapes a common type.

src/intel/experiment/verdict.ts supplies that common shape without breaking any producer:

interface Verdict { passed: boolean; score: number; /* 0..1 */ evidence?: string }

TrialScore now extends Verdict (its { score, passed, evidence } fields are unchanged, so all existing usage compiles untouched).
Pure adapters map the other three shapes into a Verdict: verdictFromJudge(j, { aIsVariant? }), verdictFromVerifier(v), verdictFromGate(g).

The underlying JudgeResult / VerifierResult / GateResult source types are left unchanged — this is a lightweight, non-breaking unification that lets future framework code treat any grader output uniformly via Verdict without refactoring the producers today.
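
A pure adapter of this kind might look like the sketch below. The real VerifierResult shape is not shown in this RFC, so `VerifierResultSketch` is a hypothetical stand-in. Two further assumptions are made here: the score is the fraction of commands that exited 0, and an empty command list is treated as not passed.

```typescript
interface Verdict {
  passed: boolean;
  score: number; // 0..1
  evidence?: string;
}

// Hypothetical source shape; the real VerifierResult may differ.
interface VerifierResultSketch {
  commands: { command: string; exitCode: number }[];
}

// Pure adapter: passed = every command exited 0.
function verdictFromVerifierSketch(v: VerifierResultSketch): Verdict {
  const total = v.commands.length;
  const failed = v.commands.filter((c) => c.exitCode !== 0);
  const verdict: Verdict = {
    passed: total > 0 && failed.length === 0,
    score: total === 0 ? 0 : (total - failed.length) / total,
  };
  // Only attach evidence when there is something to report.
  if (failed.length > 0) {
    verdict.evidence = `failed: ${failed.map((c) => c.command).join(', ')}`;
  }
  return verdict;
}
```

Because the adapter only reads its input and returns a new object, the producer type stays untouched, which is the non-breaking property described above.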

7. CLI / console surface

`/experiment` slash command (P2)

/experiment create <file.experiment.md>     # register an experiment from spec
/experiment run <experimentId>              # execute (fan out → grade → persist)
/experiment list                            # table of experiments + status
/experiment report <runId>                  # per-variant comparison + winner
/experiment cancel <runId>                  # mark run cancelled, stop scheduling
/experiment promote <runId> <variantId>     # apply winner via existing promotion

`.experiment.md` format (mirrors `.skill.md`)

markdown

---
name: tighten-coder-skill
kind: skill
suite: core-coding-cases
grader: verifier
config: { repeats: 3, maxVariants: 4 }
variants:
  - id: baseline            # variant 0 = current skill
  - id: terse-v1
    patch: variants/terse-v1.md
  - id: checklist-v1
    patch: variants/checklist-v1.md
---
# Tighten coder skill
Free-form rationale / notes for the experiment.

The YAML front-matter is parsed into spec_json; the body is documentation.
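
The split between front-matter and body can be sketched as follows. `splitFrontMatter` is a hypothetical helper, not the shipped parser, and it deliberately stops short of YAML parsing, which would be delegated to whatever YAML library the project already uses.

```typescript
// Split a .experiment.md document into raw front-matter text and body text.
// Returns null when the document does not start with a '---' fence
// or the closing fence is missing.
function splitFrontMatter(text: string): { frontMatter: string; body: string } | null {
  const lines = text.split(/\r?\n/);
  if (lines[0]?.trim() !== '---') return null;

  // Find the closing '---' fence after the opening one.
  let end = -1;
  for (let i = 1; i < lines.length; i++) {
    if (lines[i]?.trim() === '---') {
      end = i;
      break;
    }
  }
  if (end === -1) return null;

  return {
    frontMatter: lines.slice(1, end).join('\n'),
    body: lines.slice(end + 1).join('\n'),
  };
}
```

The `frontMatter` string is what would be parsed into spec_json; the `body` string is documentation only.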

FleetConsole "Experiments" channel (P3)

A dedicated console channel that:

lists active/recent runs with live status + progress (trials done / total);
renders a comparison BarChart of per-variant mean score (and a second of pass-rate), baseline highlighted, winner annotated with the delta;
surfaces confidence (n = repeats × cases) so a thin sample is visually flagged.

All of this reads exclusively through the ExperimentStore facade — no direct SQL in the console layer.
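
The per-variant numbers behind those charts can be derived with a small pure function. This is a sketch, not the shipped `stats` module: `TrialLike`, `VariantSummary`, and `summarize` are hypothetical names, and skipping ungraded trials is an assumption.

```typescript
// Illustrative trial shape; only the fields the aggregate needs.
interface TrialLike {
  variantId: string;
  score?: number;   // undefined = ungraded
  passed?: boolean; // undefined = ungraded
}

interface VariantSummary {
  variantId: string;
  n: number; // graded trials counted for this variant
  meanScore: number;
  passRate: number;
}

// Aggregate graded trials into per-variant mean score and pass rate.
function summarize(trials: TrialLike[]): VariantSummary[] {
  const acc = new Map<string, { n: number; scoreSum: number; passes: number }>();
  for (const t of trials) {
    if (t.score === undefined || t.passed === undefined) continue; // skip ungraded
    const a = acc.get(t.variantId) ?? { n: 0, scoreSum: 0, passes: 0 };
    a.n += 1;
    a.scoreSum += t.score;
    if (t.passed) a.passes += 1;
    acc.set(t.variantId, a);
  }
  return Array.from(acc.entries()).map(([variantId, a]) => ({
    variantId,
    n: a.n,
    meanScore: a.scoreSum / a.n,
    passRate: a.passes / a.n,
  }));
}
```

Reporting `n` next to each mean is what lets the console flag a thin sample: with every trial graded, `n` equals repeats × cases for that variant.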

8. Configuration

ExperimentConfig lives parallel to EvolutionConfig in src/config.ts, defaulted in DEFAULT_CONFIG, merged from disk in the same loadConfig block, and bounds-clamped like the eval-daemon knobs.

interface ExperimentConfig {
  enabled: boolean;            // default false (master switch)
  defaultRepeats: number;      // default 3      (clamp 1..50)
  maxVariants: number;         // default 6      (clamp 1..20)
  perTrialTimeoutMs: number;   // default 120000 (clamp 1000..600000)
  maxTrialsPerRun: number;     // default 200    (clamp 1..5000)
}

maxTrialsPerRun is the safety valve: before launch, the Runner computes variants × cases × repeats and rejects the run if the product exceeds the cap, preventing a runaway fan-out.
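
A minimal sketch of both mechanisms, bounds clamping and the pre-launch cap, is shown below. `clampInt` and `checkFanOut` are hypothetical helpers; rounding non-integer values and falling back to the default on non-numeric input are assumptions, not documented `loadConfig` behavior.

```typescript
// Clamp a possibly-missing or non-finite config value into [min, max].
function clampInt(value: unknown, fallback: number, min: number, max: number): number {
  const n = typeof value === 'number' && Number.isFinite(value) ? Math.round(value) : fallback;
  return Math.min(max, Math.max(min, n));
}

type FanOutCheck =
  | { ok: true; total: number }
  | { ok: false; total: number; reason: string };

// Pre-launch guard: reject a fan-out larger than maxTrialsPerRun.
function checkFanOut(variants: number, cases: number, repeats: number, maxTrialsPerRun: number): FanOutCheck {
  const total = variants * cases * repeats;
  if (total > maxTrialsPerRun) {
    return { ok: false, total, reason: `${total} trials exceeds maxTrialsPerRun=${maxTrialsPerRun}` };
  }
  return { ok: true, total };
}

// defaultRepeats: an out-of-range value is pulled back into 1..50.
const repeats = clampInt(500, 3, 1, 50); // 50
// 6 variants × 20 cases × 3 repeats = 360 trials, above the default cap of 200.
const check = checkFanOut(6, 20, 3, 200); // ok: false
```

Running the check before any trial is dispatched means an oversized experiment fails fast, without leaving a partially populated run behind.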

9. Phased plan

| Phase | Deliverable | State |
| --- | --- | --- |
| P0-A | Deterministic core: DDL migration, `ExperimentStore` (+ rows), `IntelDatabase` facade, `ExperimentConfig`, tests, this RFC. | Landed |
| P0-B | `stats` aggregation helpers + report data shape (pure functions over `listTrials`). | Shipped |
| P1 | Real runs: `ExperimentRunner` + `LiveVariantExecutor` (wraps `FleetManager.spawnWorker`) + graders (`heuristic`, `verifier`, `judge`) + token cost caps. | Shipped |
| P2 | `/experiment` command (create/run/report/list/dataset export/promote) + `*.experiment.md` parser. | Shipped |
| P3 | FleetConsole Experiments channel + comparison BarChart (via `broadcastFleetExperiment` frame). | Shipped |
| P4 | Promotion (`/experiment promote` → existing `promoteShadow`/evolution pipeline) + optional `ExperimentScheduler`-driven background runs. | Shipped |

Status (2026-06-19): all phases P0–P4 shipped to master (PRs #573–#581 in agents-fleet, PR #2 in agents-fleet-console), CI green. Follow-ups also landed: hardening (#574), architect recs L3/L4/M1 + deep trial-tagging (#577), auto-labeled outcomes at collection + expectOutcome deterministic grading (#578). The live-LLM path is verified by the gated evolution e2e (real +0.54 quality lift).

Each phase is independently shippable; P0-A introduces zero behavior change (the framework is enabled: false by default and nothing dispatches to it yet).

10. Non-goals (for now)

Cross-session statistical significance testing beyond a transparent, variance-aware meetsImprovementThreshold heuristic (renamed from the earlier, over-claiming meetsSignificance — M3). It gates on lift > noiseFloor, lift >= minImprovement, and pooledStddev === 0 || lift >= significanceK × pooledStddev (default significanceK = 1.0); it is not a p-value / t-test. The report's winnerSignificant flag (M5) is computed by this same heuristic over the winner-vs-baseline scores.
Distributed / multi-host trial execution — runs are in-process via the existing worker/eval machinery.
Persisting derived Reports — they are recomputed from trials on demand; trials are the durable source of truth.
Auto-generating variants (a future tie-in with the evolution proposer).
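
The three-part gate described in the first non-goal can be sketched as below. This is not the shipped `meetsImprovementThreshold`: the function name carries a `Sketch` suffix, `noiseFloor` and `minImprovement` are caller-supplied because the RFC gives no defaults for them, and the pooled standard deviation uses the classic two-sample formula with n - 1 denominators, which is an assumption about how `pooledStddev` is computed.

```typescript
interface ThresholdOptions {
  noiseFloor: number;
  minImprovement: number;
  significanceK?: number; // defaults to 1.0, as stated in the text
}

function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Sample variance (n - 1 denominator); 0 when fewer than two samples.
function sampleVariance(xs: number[]): number {
  if (xs.length < 2) return 0;
  const m = mean(xs);
  return xs.reduce((acc, x) => acc + (x - m) * (x - m), 0) / (xs.length - 1);
}

// Assumption: classic pooled standard deviation of two samples.
function pooledStddev(a: number[], b: number[]): number {
  const dof = a.length + b.length - 2;
  if (dof <= 0) return 0;
  const pooledVar = ((a.length - 1) * sampleVariance(a) + (b.length - 1) * sampleVariance(b)) / dof;
  return Math.sqrt(pooledVar);
}

// Heuristic gate: all three conditions must hold. Not a p-value or t-test.
function meetsImprovementThresholdSketch(
  baseline: number[],
  candidate: number[],
  opts: ThresholdOptions,
): boolean {
  if (baseline.length === 0 || candidate.length === 0) return false;
  const k = opts.significanceK ?? 1.0;
  const lift = mean(candidate) - mean(baseline);
  const sd = pooledStddev(baseline, candidate);
  return lift > opts.noiseFloor
    && lift >= opts.minImprovement
    && (sd === 0 || lift >= k * sd);
}
```

With noisy scores such as baseline `[0.2, 0.8]` and candidate `[0.3, 0.9]`, the lift of roughly 0.1 is well below the pooled standard deviation of roughly 0.42, so the gate returns false even though the candidate's mean is higher.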
