RFC: Experimentation Framework for agents-fleet

Draft / RFC

This is a design RFC, not shipped behavior. Only Phase 0-A (definition + persistence) has landed. Interfaces and flags described below may change.

  • Status: Draft (Phase 0-A landed: definition + persistence)
  • Owner: Fleet platform
  • Scope: A first-class way to define, run, grade, and compare variants of a skill / role / crew against a fixed task suite, then promote the winner.

1. Motivation

agents-fleet already has an evolution subsystem (shadow skills, EvalRunner, EvalScheduler, EvalJudge) that answers a narrow question: "is this one auto-proposed shadow better than the current skill?" It is single-variant, single-candidate, and tightly coupled to the shadow lifecycle.

What we do not have is a general A/B/n experimentation primitive:

Given N hand-authored or generated variants of a skill/role/crew, run each over a shared, versioned task suite, grade every trial with a pluggable grader, repeat for statistical confidence, and produce a comparison report with a recommended promotion.

This RFC defines that primitive. It is reuse-first: wherever the evolution subsystem already solved a sub-problem (sandboxed runs, verification, judging, scheduling, daemon control), the experiment framework composes it rather than reimplementing it.

Reuse, honestly: at P0 the framework achieves ~30% code reuse (shared stats math, row/store conventions, config-clamp patterns). The ~80% reuse target — wrapping EvalRunner, VerifierNode, EvalJudge, EvalScheduler wholesale — is a P4 goal, not a P0 reality. The Grader abstraction is a thin composition adapter over those pieces, not a unification of them: each grader delegates to an existing primitive behind a common grade() signature; it does not merge their internals.


2. Concept model

Experiment
 ├─ id, name, kind: 'skill' | 'role' | 'crew'
 ├─ Variants[]            # the things being compared (baseline + candidates)
 │   └─ variantId, label, source (skill body / role md / crew spec / patch)
 ├─ TaskSuite             # a versioned set of Cases the variants run against
 │   └─ Cases[]           # { caseId, prompt, expected?, criteria? }
 ├─ Grader                # how a single trial's output → score + passed
 └─ Config                # repeats, maxVariants, perTrialTimeoutMs, ...

ExperimentRun             # one execution of an Experiment
 └─ Trial[]               # one per (variant × case × repeatIdx)
     └─ score, passed, grader, durationMs, tokensIn/out, outputHash, metadata

Report                    # derived: per-variant aggregates + winner + deltas
  • An Experiment is the immutable-ish definition (its spec is JSON).
  • A Variant is one contestant. The baseline (current skill/role/crew) is always variant 0 so every report has a reference point.
  • A TaskSuite is a named, versioned bag of Cases. Versioning lets a report cite exactly which cases produced it; re-running after editing a suite bumps the version rather than mutating history.
  • A Case is a single prompt plus optional expected answer-key and criteria (grader hints). Both are free-form JSON.
  • A Grader turns one trial's raw output into { score, passed }.
  • An ExperimentRun is one execution; it fans out into Trials, the atomic graded unit, keyed by (variantId, caseId, repeatIdx).
  • A Report is derived (not persisted in P0-A) by aggregating trials.
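The fan-out from an ExperimentRun into Trials can be sketched as a plain triple loop. This is a minimal illustration only: `TrialKey` and `enumerateTrialKeys` are hypothetical names, and a 0-based `repeatIdx` is an assumption rather than something this RFC specifies.

```typescript
// Hypothetical shape for illustration; the real trial record lives in ExperimentStore.
interface TrialKey {
  variantId: string;
  caseId: string;
  repeatIdx: number; // assumed 0-based
}

// Enumerate one trial coordinate per (variant × case × repeat).
function enumerateTrialKeys(
  variantIds: readonly string[],
  caseIds: readonly string[],
  repeats: number,
): TrialKey[] {
  const keys: TrialKey[] = [];
  for (const variantId of variantIds) {
    for (const caseId of caseIds) {
      for (let repeatIdx = 0; repeatIdx < repeats; repeatIdx++) {
        keys.push({ variantId, caseId, repeatIdx });
      }
    }
  }
  return keys;
}

// The baseline is listed first, so it is variant 0 in the fan-out.
const trialKeys = enumerateTrialKeys(
  ["baseline", "terse-v1"],
  ["case-a", "case-b", "case-c"],
  2,
);
// 2 variants × 3 cases × 2 repeats = 12 trial coordinates
```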

3. Architecture (reuse-first)

| Component | Responsibility | Reuses / mirrors |
|---|---|---|
| ExperimentStore | Persist experiments, suites, cases, runs, trials. | Mirrors EvalRunStore / EvolutionRunStore (*.rows.ts + prepared-statement store); registered on IntelDatabase. Uses eval_runs-style table conventions. |
| ExperimentRunner | Fan out (variant × case × repeat) → execute → grade → persist trials. | Wraps EvalRunner for sandboxed execution + VerifierNode for command checks. |
| VariantProvisioner | Materialize a variant (skill body / role / crew) into a runnable form. | Reuses ShadowStore provisioning + createEvolverFixture test scaffolding. |
| Grader | grade(trial) → { score, passed }. Pluggable strategies. | Reuses EvalJudge (LLM verdict) and VerifierNode (deterministic command pass/fail). |
| stats | Aggregate trials → per-variant mean/pass-rate + winner + deltas. | Pure functions; no new deps. Mirrors evolution win-rate math. |
| scheduler | Optionally run experiments in the background, budgeted. | Reuses EvalScheduler + DaemonControlChannel (same budget/tick model). |
| command | /experiment slash command surface. | Mirrors /learn command structure + *.experiment.md like *.skill.md. |
| console | FleetConsole "Experiments" channel + comparison BarChart. | Reuses existing console channel + chart widgets. |
| config | experiment sub-config block. | Parallel to EvolutionConfig; bounds-clamped in loadConfig. |

The dependency direction is strictly: command/console → Runner → {Provisioner, Grader, ExperimentStore} → IntelDatabase. The store layer (this RFC's P0-A) has no upward dependencies, which is why it can land and be tested first.


4. Data model (DDL)

Added via a single idempotent SchemaManager migration (create_experiment_framework), registered in the schema_migrations ledger so existing DBs upgrade on next open. All tables/indexes use IF NOT EXISTS.

sql
CREATE TABLE IF NOT EXISTS experiments (
  id TEXT PRIMARY KEY,
  name TEXT NOT NULL,
  kind TEXT NOT NULL CHECK(kind IN ('skill','role','crew')),
  spec_json TEXT NOT NULL,
  status TEXT NOT NULL CHECK(status IN ('draft','ready','running','complete','cancelled')),
  created_at INTEGER NOT NULL,
  suite_id TEXT                 -- denormalized from spec.suiteId for suite lookups (M2)
);

CREATE TABLE IF NOT EXISTS experiment_task_suites (
  suite_id TEXT PRIMARY KEY,
  name TEXT NOT NULL,
  version INTEGER NOT NULL,
  created_at INTEGER NOT NULL
);

CREATE TABLE IF NOT EXISTS experiment_cases (
  suite_id TEXT NOT NULL,
  case_id TEXT NOT NULL,
  prompt TEXT NOT NULL,
  expected_json TEXT,
  criteria_json TEXT,
  PRIMARY KEY (suite_id, case_id)
);

CREATE TABLE IF NOT EXISTS experiment_runs (
  run_id TEXT PRIMARY KEY,
  experiment_id TEXT NOT NULL,
  started_at INTEGER NOT NULL,
  completed_ts INTEGER,
  status TEXT NOT NULL CHECK(status IN ('running','complete','cancelled','error'))
);

CREATE TABLE IF NOT EXISTS experiment_trials (
  trial_id TEXT PRIMARY KEY,
  run_id TEXT NOT NULL,
  variant_id TEXT NOT NULL,
  case_id TEXT NOT NULL,
  repeat_idx INTEGER NOT NULL,
  score REAL,
  passed INTEGER,            -- 0/1, NULL = ungraded
  grader TEXT,
  duration_ms INTEGER,
  tokens_in INTEGER,
  tokens_out INTEGER,
  output_hash TEXT,
  metadata_json TEXT
);

-- Indexes
CREATE INDEX IF NOT EXISTS        idx_experiment_runs_experiment  ON experiment_runs(experiment_id, started_at DESC);
CREATE INDEX IF NOT EXISTS        idx_experiment_trials_run       ON experiment_trials(run_id);
CREATE UNIQUE INDEX IF NOT EXISTS idx_experiment_trials_identity  ON experiment_trials(run_id, variant_id, case_id, repeat_idx);  -- H1: duplicate-trial guard
CREATE INDEX IF NOT EXISTS        idx_experiment_cases_suite      ON experiment_cases(suite_id);
CREATE INDEX IF NOT EXISTS        idx_experiments_suite           ON experiments(suite_id);                                       -- M2

-- Production-analytics tagging (L2): forward-prep only, NULL = production.
ALTER TABLE skill_outcomes ADD COLUMN experiment_run_id TEXT;
CREATE INDEX IF NOT EXISTS idx_skill_outcomes_experiment_run ON skill_outcomes(experiment_run_id);

Hardening migrations (post-P0 architecture review)

These ship as additional idempotent steps in the schema_migrations ledger, so a fresh DB and an existing DB both converge to the same shape:

  • create_experiment_trials_unique — replaces the non-unique (run_id, variant_id) index with a UNIQUE index on the full trial identity (run_id, variant_id, case_id, repeat_idx). The store writes trials with INSERT OR REPLACE, so re-grading a coordinate overwrites in place rather than duplicating. The old idx_experiment_trials_run_variant is dropped (its prefix is covered by the unique index).
  • experiments_add_suite_id / idx_experiments_suite — adds + indexes the suite_id column, populated from spec.suiteId on create; surfaced via listExperimentsBySuite.
  • skill_outcomes_add_experiment_run_id / idx_skill_outcomes_experiment_run — forward-prep tag to exclude experiment-driven trials from production analytics (NULL = production). No writer wiring yet.

Design notes

  • spec_json holds the full experiment definition (variants, grader choice, config overrides). Keeping it as opaque JSON means variant/grader schema can evolve without further migrations. The in-memory spec is the typed ExperimentSpec (H3) — the JSON is given structure at the store boundary via a cast (there is no runtime .parse(); the type is a compile-time contract only):

    ts
    interface ExperimentSpec {
      variants: { id: string; label: string; patch?: string; roleName?: string; skills?: string[] }[];
      grader: 'heuristic' | 'verifier' | 'judge';
      suiteId: string;
      repeats?: number;
      baselineVariantId?: string;
      config?: Partial<ExperimentConfig>;
    }
  • passed is stored as INTEGER (SQLite has no native boolean type), with NULL reserved for ungraded trials, distinct from 0 (graded and failed). The store maps NULL → undefined, 1 → true, 0 → false.

  • output_hash lets the report flag flaky / identical outputs across repeats without storing raw (potentially large / sensitive) output text.

  • Repeats are first-class via repeat_idx; the unique trial identity is (run_id, variant_id, case_id, repeat_idx), enforced by the UNIQUE index idx_experiment_trials_identity (H1) — not merely by the minted trial_id.

  • No FKs between experiment_runs.experiment_id and experiments — matching the loose-coupling convention of eval_runs/evolution_runs, so a pruned experiment never cascades away its historical run telemetry.
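The `passed` mapping and the `output_hash` comparison described in the notes above can be sketched as small pure helpers. The names `passedFromColumn`, `passedToColumn`, and `distinctOutputCount` are hypothetical and chosen for illustration; the store's actual row-mapping code may look different.

```typescript
// Column value as read from SQLite: 1, 0, or NULL.
type PassedColumn = 0 | 1 | null;

// NULL means "ungraded", which is distinct from 0 ("graded and failed").
function passedFromColumn(value: PassedColumn): boolean | undefined {
  if (value === null) return undefined;
  return value === 1;
}

// Inverse mapping used when writing a trial row.
function passedToColumn(passed: boolean | undefined): PassedColumn {
  if (passed === undefined) return null;
  return passed ? 1 : 0;
}

// Count distinct output hashes across the repeats of one (variant, case).
// 1 means every repeat produced identical output; more than 1 means the
// output varied between repeats. Missing hashes are ignored.
function distinctOutputCount(hashes: readonly (string | null)[]): number {
  const seen = new Set<string>();
  for (const h of hashes) {
    if (h !== null) seen.add(h);
  }
  return seen.size;
}
```

A report could, for example, flag a (variant, case) pair as unstable when `distinctOutputCount` exceeds 1 while the scores disagree; that policy is an illustration, not something this RFC prescribes.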


5. The ExperimentStore API (P0-A — landed)

Typed, camelCase records with JSON parse/stringify for *_json columns. Every write is best-effort (try/catch swallow); every read returns a sensible default (null / []) so a corrupt row never crashes a caller. Prepared statements are stored as class properties.
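A minimal sketch of the best-effort pattern described above, assuming hypothetical helper names (`safeRead`, `bestEffortWrite`, `parseJsonColumn`) that are not taken from the codebase:

```typescript
// Run a read and fall back to a default if anything throws.
function safeRead<T>(read: () => T, fallback: T): T {
  try {
    return read();
  } catch {
    return fallback;
  }
}

// Run a write and swallow any error; the boolean is for callers that care.
function bestEffortWrite(write: () => void): boolean {
  try {
    write();
    return true;
  } catch {
    return false;
  }
}

// Parse a *_json column; a NULL or corrupt value yields undefined
// instead of crashing the caller.
function parseJsonColumn(text: string | null): unknown {
  if (text === null) return undefined;
  return safeRead<unknown>(() => JSON.parse(text), undefined);
}
```

The trade-off of this style is that failures are silent by design, so any diagnostics have to come from logging inside the wrappers rather than from exceptions.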

Write semantics (H2): the root entities experiments, experiment_runs, experiment_task_suites use INSERT OR IGNORE — a second create with an existing id is a no-op, never a silent overwrite of historical state. The append-style children experiment_cases and experiment_trials keep INSERT OR REPLACE (re-grading a coordinate / re-adding a case is expected).
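The difference between the two conflict policies can be modeled in memory. The sketch below is not the store: `KeyedTable` and the row shapes are hypothetical, and it only models conflicts on a single key, whereas SQLite's INSERT OR REPLACE resolves conflicts against every uniqueness constraint on the table.

```typescript
// In-memory stand-in for one table with a single uniqueness key.
class KeyedTable<Row> {
  private readonly rows = new Map<string, Row>();
  private readonly keyOf: (row: Row) => string;

  constructor(keyOf: (row: Row) => string) {
    this.keyOf = keyOf;
  }

  // Models INSERT OR IGNORE: an existing row wins; returns false on conflict.
  insertOrIgnore(row: Row): boolean {
    const key = this.keyOf(row);
    if (this.rows.has(key)) return false;
    this.rows.set(key, row);
    return true;
  }

  // Models INSERT OR REPLACE: the new row overwrites the existing one.
  insertOrReplace(row: Row): void {
    this.rows.set(this.keyOf(row), row);
  }

  get(key: string): Row | undefined {
    return this.rows.get(key);
  }

  get size(): number {
    return this.rows.size;
  }
}

interface ExperimentRowSketch { id: string; name: string }
interface TrialRowSketch {
  runId: string; variantId: string; caseId: string; repeatIdx: number; score: number;
}

// Key mirrors the UNIQUE index columns (run_id, variant_id, case_id, repeat_idx).
const trialIdentity = (t: TrialRowSketch): string =>
  JSON.stringify([t.runId, t.variantId, t.caseId, t.repeatIdx]);

const experimentsTable = new KeyedTable<ExperimentRowSketch>((e) => e.id);
experimentsTable.insertOrIgnore({ id: "exp-1", name: "original" });
experimentsTable.insertOrIgnore({ id: "exp-1", name: "clobber attempt" }); // no-op

const trialsTable = new KeyedTable<TrialRowSketch>(trialIdentity);
const coordinate = { runId: "run-1", variantId: "baseline", caseId: "case-a", repeatIdx: 0 };
trialsTable.insertOrReplace({ ...coordinate, score: 0.4 });
trialsTable.insertOrReplace({ ...coordinate, score: 0.9 }); // re-grade overwrites in place
```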

Lifecycle ownership (M6): the ExperimentRunner owns run status only (createRun / completeRun); it does not mutate the experiments.status definition row. Callers derive a live experiment status from getExperimentStatus(experimentId) (the latest run's status) or own experiments.status at the create/command layer.
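Deriving the live status from the latest run can be sketched as follows. The `RunRowSketch` shape and the choice of `startedAt` as the ordering key are assumptions made for illustration; the real `getExperimentStatus` reads through the store.

```typescript
type ExperimentRunStatus = 'running' | 'complete' | 'cancelled' | 'error';

// Hypothetical row shape for illustration.
interface RunRowSketch {
  runId: string;
  experimentId: string;
  startedAt: number;
  status: ExperimentRunStatus;
}

// Return the status of the most recently started run, or null if the
// experiment has never been run.
function latestRunStatus(
  runs: readonly RunRowSketch[],
  experimentId: string,
): ExperimentRunStatus | null {
  let latest: RunRowSketch | null = null;
  for (const run of runs) {
    if (run.experimentId !== experimentId) continue;
    if (latest === null || run.startedAt > latest.startedAt) latest = run;
  }
  return latest === null ? null : latest.status;
}

const sampleRuns: RunRowSketch[] = [
  { runId: "r1", experimentId: "exp-1", startedAt: 100, status: "complete" },
  { runId: "r2", experimentId: "exp-1", startedAt: 200, status: "running" },
  { runId: "r3", experimentId: "exp-2", startedAt: 300, status: "error" },
];
```

With this data, `latestRunStatus(sampleRuns, "exp-1")` picks run `r2`, and an experiment with no runs yields `null`.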

ts
// Experiments
createExperiment(rec: ExperimentRecord): void        // INSERT OR IGNORE — never clobbers (H2)
getExperiment(id: string): ExperimentRecord | null
listExperiments(): ExperimentRecord[]                       // newest-first
listExperimentsBySuite(suiteId: string): ExperimentRecord[] // newest-first (M2)
updateExperimentStatus(id: string, status: ExperimentStatus): void
getExperimentStatus(experimentId: string): ExperimentRunStatus | null  // latest run's status (M6)

// Suites / cases
createTaskSuite(rec: TaskSuiteRecord): void
getTaskSuite(suiteId: string): TaskSuiteRecord | null
addCase(rec: CaseRecord): void
listCases(suiteId: string): CaseRecord[]

// Runs
createRun(rec: ExperimentRunRecord): void
completeRun(runId: string, status: ExperimentRunStatus, completedTs?: number): void
getRun(runId: string): ExperimentRunRecord | null
listRunsForExperiment(experimentId: string): ExperimentRunRecord[]   // newest-first

// Trials
insertTrial(rec: TrialRecord): void
listTrials(runId: string): TrialRecord[]                    // (variant, case, repeat) order
listTrialsByVariant(runId: string, variantId: string): TrialRecord[]

String-literal-union types (no enums):

ts
type ExperimentKind      = 'skill' | 'role' | 'crew';
type ExperimentStatus    = 'draft' | 'ready' | 'running' | 'complete' | 'cancelled';
type ExperimentRunStatus = 'running' | 'complete' | 'cancelled' | 'error';

IntelDatabase composes ExperimentStore next to the other stores and exposes a thin facade (createExperiment, recordExperimentTrial, listExperimentTrials, completeExperimentRun, …) mirroring how recordEvalRun is surfaced.


6. The Grader abstraction

A grader maps one trial to a verdict:

ts
interface Grader {
  readonly name: string;                       // persisted into trial.grader
  grade(input: {
    case: CaseRecord;
    output: string;
    durationMs: number;
  }): Promise<{ score: number; passed: boolean; metadata?: Record<string, unknown> }>;
}

Planned built-ins (P1), all composed from existing pieces:

  • exact / contains — deterministic string match against case.expected.
  • verifier — runs case.criteria.commands through VerifierNode; passed = all commands exit 0. Reuses the evolution verifier wholesale.
  • judge — delegates to EvalJudge for an LLM rubric verdict, mapping its JudgeVerdict to { score, passed }.
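As an illustration of the simplest built-in, a `contains` grader might look like the sketch below. The `CaseRecordSketch` fields are assumptions (the RFC only says a case carries a prompt plus free-form `expected` and `criteria`), and treating a non-string `expected` as a failure is a choice made for this example.

```typescript
// Hypothetical case shape; `expected` is free-form JSON in the RFC.
interface CaseRecordSketch {
  suiteId: string;
  caseId: string;
  prompt: string;
  expected?: unknown;
  criteria?: unknown;
}

interface GradeResult {
  score: number;
  passed: boolean;
  metadata?: Record<string, unknown>;
}

interface GraderSketch {
  readonly name: string;
  grade(input: {
    case: CaseRecordSketch;
    output: string;
    durationMs: number;
  }): Promise<GradeResult>;
}

// Deterministic substring match against a string answer key.
const containsGrader: GraderSketch = {
  name: "contains",
  async grade({ case: c, output }) {
    if (typeof c.expected !== "string") {
      // Assumption: a missing or non-string answer key counts as a failure.
      return { score: 0, passed: false, metadata: { reason: "expected is not a string" } };
    }
    const passed = output.includes(c.expected);
    return { score: passed ? 1 : 0, passed };
  },
};
```

An `exact` variant would differ only in the comparison, for example `output.trim() === c.expected`.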

grader is stored per-trial (not per-experiment) so a single run can mix graders (e.g. cheap deterministic gate first, LLM judge only on ties).

6.1 Shared Verdict type (arch rec M1)

The review noted that the framework carried four disconnected result shapes — TrialScore (runner grade), JudgeResult (EvalJudge), VerifierResult (VerifierNode), and GateResult (validation gates) — and that the Grader interface is a composition adapter, not a unification: it wraps each scorer behind grade() but never gives the four shapes a common type.

src/intel/experiment/verdict.ts supplies that common shape without breaking any producer:

ts
interface Verdict { passed: boolean; score: number; /* 0..1 */ evidence?: string }
  • TrialScore now extends Verdict (its { score, passed, evidence } fields are unchanged, so all existing usage compiles untouched).
  • Pure adapters map the other three shapes into a Verdict: verdictFromJudge(j, { aIsVariant? }), verdictFromVerifier(v), verdictFromGate(g).

The underlying JudgeResult / VerifierResult / GateResult source types are left unchanged — this is a lightweight, non-breaking unification that lets future framework code treat any grader output uniformly via Verdict without refactoring the producers today.
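To make the adapter idea concrete, here is a sketch of what a verifier-to-Verdict mapping could look like. The `VerifierResultSketch` shape is invented for this example (the real VerifierResult is not described in this RFC), and both the fraction-of-commands score and the handling of an empty command list are assumptions.

```typescript
interface Verdict {
  passed: boolean;
  score: number; // 0..1
  evidence?: string;
}

// Hypothetical stand-in for VerifierResult: one entry per command run.
interface VerifierResultSketch {
  commands: { command: string; exitCode: number }[];
}

function verdictFromVerifierSketch(v: VerifierResultSketch): Verdict {
  const total = v.commands.length;
  if (total === 0) {
    // Assumption: nothing ran, so nothing was verified.
    return { passed: false, score: 0, evidence: "no commands ran" };
  }
  const failed = v.commands.filter((c) => c.exitCode !== 0);
  // passed = all commands exit 0; score = fraction that exited 0 (assumed).
  const verdict: Verdict = {
    passed: failed.length === 0,
    score: (total - failed.length) / total,
  };
  if (failed.length > 0) {
    verdict.evidence = `failed: ${failed.map((c) => c.command).join(", ")}`;
  }
  return verdict;
}
```

Because the adapter is a pure function over the source shape, the producer does not need to change, which matches the non-breaking intent described above.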


7. CLI / console surface

/experiment slash command (P2)

/experiment create <file.experiment.md>     # register an experiment from spec
/experiment run <experimentId>              # execute (fan out → grade → persist)
/experiment list                            # table of experiments + status
/experiment report <runId>                  # per-variant comparison + winner
/experiment cancel <runId>                  # mark run cancelled, stop scheduling
/experiment promote <runId> <variantId>     # apply winner via existing promotion

*.experiment.md format (mirrors *.skill.md)

markdown
---
name: tighten-coder-skill
kind: skill
suite: core-coding-cases
grader: verifier
config: { repeats: 3, maxVariants: 4 }
variants:
  - id: baseline            # variant 0 = current skill
  - id: terse-v1
    patch: variants/terse-v1.md
  - id: checklist-v1
    patch: variants/checklist-v1.md
---
# Tighten coder skill
Free-form rationale / notes for the experiment.

The YAML front-matter is parsed into spec_json; the body is documentation.
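The first step of that parse, separating front-matter from the documentation body, can be sketched without any dependency. `splitFrontMatter` is a hypothetical helper; turning the front-matter text into an object would still need a YAML parser, which is deliberately left out here.

```typescript
// Split a *.experiment.md source into its YAML front-matter text and body.
// Returns null when the file does not start with a `---` delimited block.
function splitFrontMatter(
  source: string,
): { frontMatter: string; body: string } | null {
  const lines = source.split(/\r?\n/);
  if (lines[0]?.trim() !== "---") return null;
  // Find the closing delimiter, skipping the opening one at index 0.
  const end = lines.findIndex((line, i) => i > 0 && line.trim() === "---");
  if (end === -1) return null;
  return {
    frontMatter: lines.slice(1, end).join("\n"),
    body: lines.slice(end + 1).join("\n"),
  };
}

const sampleSpec = [
  "---",
  "name: tighten-coder-skill",
  "kind: skill",
  "---",
  "# Tighten coder skill",
  "Free-form rationale.",
].join("\n");

const splitResult = splitFrontMatter(sampleSpec);
// splitResult.frontMatter holds the two YAML lines; splitResult.body holds the markdown.
```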

FleetConsole "Experiments" channel (P3)

A dedicated console channel that:

  • lists active/recent runs with live status + progress (trials done / total);
  • renders a comparison BarChart of per-variant mean score (and a second of pass-rate), baseline highlighted, winner annotated with the delta;
  • surfaces confidence (n = repeats × cases) so a thin sample is visually flagged.

All of this reads exclusively through the ExperimentStore facade — no direct SQL in the console layer.
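The numbers behind the chart (per-variant mean score, pass-rate, and sample size n) can be computed with a pure function over trials. This is a sketch with hypothetical names, and skipping ungraded trials is an assumption rather than documented behavior of the stats helpers.

```typescript
interface TrialSample {
  variantId: string;
  score: number | null;        // null = ungraded
  passed: boolean | undefined; // undefined = ungraded
}

interface VariantAggregate {
  variantId: string;
  n: number;         // graded trials, i.e. up to repeats × cases
  meanScore: number;
  passRate: number;
}

function aggregateByVariant(trials: readonly TrialSample[]): VariantAggregate[] {
  const groups = new Map<string, { n: number; scoreSum: number; passes: number }>();
  for (const t of trials) {
    if (t.score === null || t.passed === undefined) continue; // skip ungraded
    const g = groups.get(t.variantId) ?? { n: 0, scoreSum: 0, passes: 0 };
    g.n += 1;
    g.scoreSum += t.score;
    if (t.passed) g.passes += 1;
    groups.set(t.variantId, g);
  }
  // Map preserves insertion order, so variants appear in first-seen order.
  return Array.from(groups.entries()).map(([variantId, g]) => ({
    variantId,
    n: g.n,
    meanScore: g.scoreSum / g.n,
    passRate: g.passes / g.n,
  }));
}

const aggregates = aggregateByVariant([
  { variantId: "baseline", score: 0.5, passed: true },
  { variantId: "baseline", score: 0.5, passed: false },
  { variantId: "terse-v1", score: 1, passed: true },
  { variantId: "terse-v1", score: 1, passed: true },
  { variantId: "terse-v1", score: null, passed: undefined }, // ungraded, ignored
]);
```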


8. Configuration

ExperimentConfig lives parallel to EvolutionConfig in src/config.ts, defaulted in DEFAULT_CONFIG, merged from disk in the same loadConfig block, and bounds-clamped like the eval-daemon knobs.

ts
interface ExperimentConfig {
  enabled: boolean;            // default false (master switch)
  defaultRepeats: number;      // default 3      (clamp 1..50)
  maxVariants: number;         // default 6      (clamp 1..20)
  perTrialTimeoutMs: number;   // default 120000 (clamp 1000..600000)
  maxTrialsPerRun: number;     // default 200    (clamp 1..5000)
}

maxTrialsPerRun is the safety valve: variants × cases × repeats is rejected by the Runner before launch if it exceeds the cap, preventing a runaway fan-out.
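A sketch of the clamp and the fan-out check, using the bounds listed above. `clampExperimentConfig` and `checkFanOut` are hypothetical names; how loadConfig and the Runner actually implement these steps is not shown in this RFC.

```typescript
interface ExperimentConfig {
  enabled: boolean;
  defaultRepeats: number;
  maxVariants: number;
  perTrialTimeoutMs: number;
  maxTrialsPerRun: number;
}

const clamp = (value: number, min: number, max: number): number =>
  Math.min(max, Math.max(min, value));

// Apply the documented bounds to a config merged from disk.
function clampExperimentConfig(raw: ExperimentConfig): ExperimentConfig {
  return {
    enabled: raw.enabled,
    defaultRepeats: clamp(raw.defaultRepeats, 1, 50),
    maxVariants: clamp(raw.maxVariants, 1, 20),
    perTrialTimeoutMs: clamp(raw.perTrialTimeoutMs, 1000, 600000),
    maxTrialsPerRun: clamp(raw.maxTrialsPerRun, 1, 5000),
  };
}

// Pre-launch guard: reject a run whose fan-out exceeds the cap.
function checkFanOut(
  variants: number,
  cases: number,
  repeats: number,
  config: ExperimentConfig,
): { allowed: boolean; total: number } {
  const total = variants * cases * repeats;
  return { allowed: total <= config.maxTrialsPerRun, total };
}

const clampedConfig = clampExperimentConfig({
  enabled: true,
  defaultRepeats: 500,   // above the 1..50 range
  maxVariants: 6,
  perTrialTimeoutMs: 10, // below the 1000..600000 range
  maxTrialsPerRun: 200,
});
const fanOut = checkFanOut(4, 20, 3, clampedConfig); // 4 × 20 × 3 = 240 trials
```

Here the out-of-range values are pulled back to 50 and 1000, and the 240-trial run is rejected against the default cap of 200.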


9. Phased plan

| Phase | Deliverable | State |
|---|---|---|
| P0-A | Deterministic core: DDL migration, ExperimentStore (+ rows), IntelDatabase facade, ExperimentConfig, tests, this RFC. | Landed |
| P0-B | stats aggregation helpers + report data shape (pure functions over listTrials). | Shipped |
| P1 | Real runs: ExperimentRunner + LiveVariantExecutor (wraps FleetManager.spawnWorker) + graders (heuristic, verifier, judge) + token cost caps. | Shipped |
| P2 | /experiment command (create/run/report/list/dataset export/promote) + *.experiment.md parser. | Shipped |
| P3 | FleetConsole Experiments channel + comparison BarChart (via broadcastFleetExperiment frame). | Shipped |
| P4 | Promotion (/experiment promote → existing promoteShadow/evolution pipeline) + optional ExperimentScheduler-driven background runs. | Shipped |

Status (2026-06-19): all phases P0–P4 shipped to master (PRs #573–#581 in agents-fleet, PR #2 in agents-fleet-console), CI green. Follow-ups also landed: hardening (#574), architect recs L3/L4/M1 + deep trial-tagging (#577), auto-labeled outcomes at collection + expectOutcome deterministic grading (#578). The live-LLM path is verified by the gated evolution e2e (real +0.54 quality lift).

Each phase is independently shippable; P0-A introduces zero behavior change (the framework is enabled: false by default and nothing dispatches to it yet).


10. Non-goals (for now)

  • Cross-session statistical significance testing beyond a transparent, variance-aware meetsImprovementThreshold heuristic (renamed from the earlier, over-claiming meetsSignificance — M3). It gates on lift > noiseFloor, lift >= minImprovement, and pooledStddev === 0 || lift >= significanceK × pooledStddev (default significanceK = 1.0); it is not a p-value / t-test. The report's winnerSignificant flag (M5) is computed by this same heuristic over the winner-vs-baseline scores.
  • Distributed / multi-host trial execution — runs are in-process via the existing worker/eval machinery.
  • Persisting derived Reports — they are recomputed from trials on demand; trials are the durable source of truth.
  • Auto-generating variants (a future tie-in with the evolution proposer).
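The three gates of the improvement heuristic named in the first non-goal can be sketched as below. Only the gate conditions and the default `significanceK = 1.0` come from this RFC. The definition of lift as a difference of means, the pooled sample standard deviation formula, and the function and option names are assumptions, and `noiseFloor` / `minImprovement` are left as required inputs because their defaults are not stated here.

```typescript
interface ThresholdOptions {
  noiseFloor: number;
  minImprovement: number;
  significanceK?: number; // default 1.0
}

function mean(xs: readonly number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function sumSquaredDeviations(xs: readonly number[]): number {
  const m = mean(xs);
  return xs.reduce((acc, x) => acc + (x - m) ** 2, 0);
}

// Assumed definition: the classic pooled sample standard deviation.
function pooledStddev(a: readonly number[], b: readonly number[]): number {
  const dof = a.length + b.length - 2;
  if (dof <= 0) return 0;
  return Math.sqrt((sumSquaredDeviations(a) + sumSquaredDeviations(b)) / dof);
}

// Heuristic gate, not a p-value or t-test.
function meetsImprovementThresholdSketch(
  baselineScores: readonly number[],
  candidateScores: readonly number[],
  opts: ThresholdOptions,
): boolean {
  if (baselineScores.length === 0 || candidateScores.length === 0) return false;
  const k = opts.significanceK ?? 1.0;
  const lift = mean(candidateScores) - mean(baselineScores);
  const spread = pooledStddev(baselineScores, candidateScores);
  return (
    lift > opts.noiseFloor &&
    lift >= opts.minImprovement &&
    (spread === 0 || lift >= k * spread)
  );
}

const gateOptions: ThresholdOptions = { noiseFloor: 0.01, minImprovement: 0.05 };
// Zero variance and a lift of 0.25 clears all three gates.
const steadyWin = meetsImprovementThresholdSketch([0.5, 0.5, 0.5], [0.75, 0.75, 0.75], gateOptions);
// Same lift of 0.25, but the pooled spread is larger than the lift.
const noisyWin = meetsImprovementThresholdSketch([0, 1], [0.5, 1], gateOptions);
```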