E2E testing

End-to-end tests run the built agents-fleet CLI from dist\index.js against isolated synthetic git repositories under .e2e-runs\. They validate process startup, slash-command routing, provider-matrix behavior, token usage reporting, token budget enforcement, FLEET_HOME isolation, artifact capture, and cleanup from outside the CLI.

Quick start
Scenario selection
Complete scenario inventory
Harness workspaces and cleanup
Token tracking and budgets
Provider matrix and auth
Writing a scenario
Cross-platform notes
Troubleshooting

Quick start

Build first; e2e runs always target the packaged binary, not tsx:

powershell

npm run build

Then use one of the four common loops:

powershell

npm run test:e2e:default   # no-LLM sanity: smoke + slash-command-help only
npm run test:e2e:real-llm  # opt-in: all scenarios, including provider-backed runs
npm run test:e2e:keep      # debug: all scenarios and preserve .e2e-runs\ artifacts

# opt-in: just the Tier 3 workflow scenarios
$env:AGENTS_FLEET_E2E_RUN_WORKFLOWS = '1'
npm run test:e2e

Do not use the real-LLM scripts unless you intentionally want token-spending provider runs. test:e2e:keep sets both AGENTS_FLEET_E2E_RUN_FEATURE=1 and KEEP_E2E_ARTIFACTS=1. The legacy AGENTS_FLEET_E2E_RUN_FEATURE=1 gate still works for /feature-specific scenarios.

Scenario selection

The base script still runs the e2e Vitest config directly:

powershell

npm run test:e2e

With no opt-in environment variables, real-LLM scenarios are discovered but skipped. Target a single file or test name with Vitest filters:

powershell

npm run test:e2e -- smoke.e2e
npm run test:e2e -- slash-command-help
npm test -- harness.test

Enable the opt-in scenarios manually when you need finer control than the convenience scripts:

powershell

$env:AGENTS_FLEET_E2E_RUN_FEATURE = '1'
npm run test:e2e -- feature-no-hang

Gate hierarchy: AGENTS_FLEET_E2E_RUN_FEATURE=1 is the legacy PR-4 gate for /feature scenarios. AGENTS_FLEET_E2E_RUN_WORKFLOWS=1 is the umbrella gate for all Tier 3 opt-in workflows, including /feature, /code-review over a git range, /crew generate, /brainstorm, and /init. Either gate currently enables every opt-in scenario; new workflow scenarios should check the WORKFLOWS gate.
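The "either gate enables everything" rule can be sketched as a small predicate. This is an illustrative sketch, not the harness's actual code: the helper name optInScenariosEnabled is hypothetical, and it assumes only the exact value '1' counts as enabled.

```typescript
// Hypothetical helper; the real harness may structure this check differently.
type Env = Record<string, string | undefined>;

const OPT_IN_GATES = [
  'AGENTS_FLEET_E2E_RUN_WORKFLOWS', // umbrella gate for Tier 3 workflows
  'AGENTS_FLEET_E2E_RUN_FEATURE', // legacy PR-4 gate, kept for compatibility
] as const;

// Assumption: a gate is "on" only when its value is exactly '1'.
function optInScenariosEnabled(env: Env): boolean {
  return OPT_IN_GATES.some((name) => env[name] === '1');
}
```

A scenario file could then pass the negated result to a Vitest conditional such as describe.skipIf(), so gated scenarios are discovered but skipped when neither variable is set.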

Complete scenario inventory

The suite shipped its v2 coverage expansion across PRs #160 (Tier 1 cheap security), #159 (Tier 2 cheap completeness), and the PR-C close-out (Tier 3 opt-in workflows). Coverage went from ~7% to ~80% of the user-visible feature surface.

Default-on, no LLM (15 scenarios)

Scenario	Validates	Source
`smoke`	Built binary starts and `/version` exits with a semver-like version.	`tests\e2e\scenarios\smoke.e2e.test.ts`
`slash-command-help`	`/help` and `/help /feature` route through single-shot slash-command dispatch for both provider flags without auth.	`tests\e2e\scenarios\slash-command-help.e2e.test.ts`
`diagnose-permission-posture`	`/diagnose` text + `/diagnose --json` permission posture contract from #88.	`tests\e2e\scenarios\diagnose-permission-posture.e2e.test.ts`
`yolo-ack-guard`	`--yolo` alone exits non-zero; `--acknowledge-all-sec` activates acks with source=`cli` (regression detector for the #88 breaking change).	`tests\e2e\scenarios\yolo-ack-guard.e2e.test.ts`
`yolo-env-legacy`	`AGENTS_FLEET_YOLO_LEGACY=1` escape hatch activates all acks with source=`env-legacy`.	`tests\e2e\scenarios\yolo-env-legacy.e2e.test.ts`
`permissions-config`	`~/.fleet/config.json` permissions block + `yoloAcks` + `allowPrivilegeEscalation` reflected in `/diagnose --json`.	`tests\e2e\scenarios\permissions-config.e2e.test.ts`
`crew-local-commands`	`/crew list/create/activate/stop` + `/crews` cheap empty-state path; project `.fleet\crews\` only.	`tests\e2e\scenarios\crew-local-commands.e2e.test.ts`
`crew-bundle-mcp`	`/crew mcp` add/remove + `/crew export`/`import` happy and deterministic negative path.	`tests\e2e\scenarios\crew-bundle-mcp.e2e.test.ts`
`local-registry-smoke`	A batch of cheap empty-state slash commands all exit 0 without an "Unknown command" error.	`tests\e2e\scenarios\local-registry-smoke.e2e.test.ts`
`cli-help-and-startup-flags`	`--help`, `--version`, `--no-coordinator`, `--name`, `--effort`, `--model`, `--max-workers` visible behavior.	`tests\e2e\scenarios\cli-help-and-startup-flags.e2e.test.ts`
`help-major-families`	`/help /crew`, `/help /diagnose`, `/help /init`, `/help /code-review`, `/help /mcp`, `/help /skill` usage text.	`tests\e2e\scenarios\help-major-families.e2e.test.ts`
`config-negative-cases`	Malformed `~/.fleet/config.json` (invalid JSON, wrong shape, wrong type) warns + defaults safely.	`tests\e2e\scenarios\config-negative-cases.e2e.test.ts`
`knowledge-local-commands`	`/skills`, `/skill create/list`, `agents-fleet skills migrate --dry-run`.	`tests\e2e\scenarios\knowledge-local-commands.e2e.test.ts`
`git-and-state-local-commands`	`/diff`, `/worktree`, `/context-status/scopes` (was `/init status/scopes`), `/loops`, `/compete` empty/fixture paths in an initialized git repo.	`tests\e2e\scenarios\git-and-state-local-commands.e2e.test.ts`
`fleet-home-not-polluted`	The harness's HOME/USERPROFILE/FLEET_HOME isolation actually prevents writes to the real `~/.fleet/`.	`tests\e2e\scenarios\fleet-home-not-polluted.e2e.test.ts`

Opt-in, real LLM (11 scenarios, gated by env vars)

Enable these with either AGENTS_FLEET_E2E_RUN_WORKFLOWS=1 (the umbrella gate for all opt-in workflow scenarios) or AGENTS_FLEET_E2E_RUN_FEATURE=1 (the legacy gate from PR-4). Both are kept for compatibility.

Scenario	Cost	Validates	Source
`code-review-basic`	Low/medium	`/code-review src/index.ts` dispatches and catches durability regressions.	`tests\e2e\scenarios\code-review-basic.e2e.test.ts`
`feature-planning-clean`	High	#131: `/feature --new` writes planning artifacts while leaving `src\` clean.	`tests\e2e\scenarios\feature-planning-clean.e2e.test.ts`
`feature-no-hang`	High	#132: unattended `/feature --new` exits after all required artifacts are durable.	`tests\e2e\scenarios\feature-no-hang.e2e.test.ts`
`feature-no-contradictions`	High	#133: later `/feature` artifacts preserve requirements unless an explicit amendment records a scope change.	`tests\e2e\scenarios\feature-no-contradictions.e2e.test.ts`
`code-review-git-range`	Low/medium	PR #149 (GAP-4): `/code-review HEAD~1..HEAD` expands the range and writes `CODE_REVIEW.md` referencing the changed file.	`tests\e2e\scenarios\code-review-git-range.e2e.test.ts`
`feature-resume`	Very high	`/feature --resume <slug>` reuses the existing slug (no new plan dir) and keeps artifacts durable.	`tests\e2e\scenarios\feature-resume.e2e.test.ts`
`feature-from-brainstorm`	High	`/feature --from-brainstorm <slug>` seeds Phase 1 from the brainstorm context.	`tests\e2e\scenarios\feature-from-brainstorm.e2e.test.ts`
`feature-deep-brainstorm`	Low	`/feature --deep-brainstorm` under `--unattended` is refused with a clear message.	`tests\e2e\scenarios\feature-deep-brainstorm.e2e.test.ts`
`crew-generate`	Low	#114 regression detector: `/crew generate <name> --prompt "x"` produces a `<name>.crew.md` shaped by the prompt.	`tests\e2e\scenarios\crew-generate.e2e.test.ts`
`brainstorm-standalone`	Low	`/brainstorm "x" --rounds 1 --agents 0` writes `00-brainstorming.md`.	`tests\e2e\scenarios\brainstorm-standalone.e2e.test.ts`
`init-investigation`	Medium	`/init` writes `.fleet/context/FLEET.md`.	`tests\e2e\scenarios\init-investigation.e2e.test.ts`

Tier 3 opt-in scenarios are pinned to the Copilot provider only, to limit token spend. The provider-parity unit and integration tests already validate provider equivalence on the permission-gate path. Tier 1 default-on scenarios that exercise local command routing (not LLM behavior) run with both the Copilot and Claude provider flags.

Harness workspaces and cleanup

Each scenario gets a repo-local workspace under .e2e-runs\<scenario-name>-<timestamp>\:

project\ — synthetic git repository used as the CLI cwd
artifacts\ — harness artifact directory and token result file location
home\.fleet\ — isolated fleet home exposed through FLEET_HOME

By default, workspaces are deleted whether the scenario passes or fails. Set KEEP_E2E_ARTIFACTS=1 (or use npm run test:e2e:keep) to retain them for inspection. Cleanup uses bounded retries and logs cleanup-retained-locked-files if Windows keeps a file locked.
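The bounded-retry behavior can be sketched as below. This is a minimal sketch, not the harness's cleanup code: cleanupWithBoundedRetries, the injected remove callback, and the default of five attempts are all assumptions. A real implementation would probably also pause between attempts.

```typescript
// Hypothetical sketch of bounded-retry cleanup; names and defaults are assumptions.
interface CleanupResult {
  removed: boolean;
  attempts: number;
  logCode?: 'cleanup-retained-locked-files';
}

function cleanupWithBoundedRetries(remove: () => void, maxAttempts = 5): CleanupResult {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      remove(); // e.g. a recursive delete of the scenario workspace
      return { removed: true, attempts: attempt };
    } catch {
      // Likely a locked file; try again until the attempt budget is spent.
    }
  }
  // Give up without throwing so a locked file does not fail the run.
  return { removed: false, attempts: maxAttempts, logCode: 'cleanup-retained-locked-files' };
}
```

Returning a result instead of throwing keeps a stuck file handle from masking the scenario's own pass/fail outcome.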

Token tracking and budgets

The harness attaches token usage to every RunResult:

result.tokens.totalTokens;
result.tokens.usageSource; // 'e2e-result-file', 'session-file', or 'none'

Budgets are unlimited by default. Configure caps through scenario tokenBudget values or environment variables:

AGENTS_FLEET_E2E_MAX_TOTAL_TOKENS
AGENTS_FLEET_E2E_MAX_INPUT_TOKENS
AGENTS_FLEET_E2E_MAX_OUTPUT_TOKENS
AGENTS_FLEET_E2E_<SCENARIO_NAME>_MAX_TOTAL_TOKENS

Provider-matrix runs enforce budgets per (scenario, provider) tuple by resolving names like feature-no-hang-copilot before applying overrides. Exceeding a configured cap fails with token-budget-exceeded and prints the cap, actual usage, and usage source.
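The resolution order described above can be sketched roughly as follows. Several details here are assumptions rather than documented behavior: the mapping from a resolved name to the <SCENARIO_NAME> segment (uppercased, hyphens replaced by underscores), the precedence of the scenario-specific variable over the global one, and the helper names. Scenario tokenBudget values and the input/output caps are omitted for brevity.

```typescript
// Hypothetical sketch; not the harness's actual budget code.
type Env = Record<string, string | undefined>;

interface TokenUsage {
  totalTokens: number;
  usageSource: 'e2e-result-file' | 'session-file' | 'none';
}

// Assumed mapping: "feature-no-hang-copilot" -> "FEATURE_NO_HANG_COPILOT".
function scenarioEnvKey(resolvedName: string): string {
  const segment = resolvedName.toUpperCase().replace(/[^A-Z0-9]+/g, '_');
  return `AGENTS_FLEET_E2E_${segment}_MAX_TOTAL_TOKENS`;
}

// Returns the applicable cap, or undefined when the budget is unlimited.
function resolveTotalTokenCap(resolvedName: string, env: Env): number | undefined {
  // Assumption: the scenario-specific override wins over the global cap.
  const raw = env[scenarioEnvKey(resolvedName)] ?? env['AGENTS_FLEET_E2E_MAX_TOTAL_TOKENS'];
  if (raw === undefined || raw.trim() === '') return undefined;
  const cap = Number(raw);
  return Number.isFinite(cap) && cap >= 0 ? cap : undefined;
}

// Returns a failure message when the cap is exceeded, otherwise undefined.
function checkTotalTokenBudget(resolvedName: string, usage: TokenUsage, env: Env): string | undefined {
  const cap = resolveTotalTokenCap(resolvedName, env);
  if (cap === undefined || usage.totalTokens <= cap) return undefined;
  return `token-budget-exceeded: cap=${cap} actual=${usage.totalTokens} source=${usage.usageSource}`;
}
```

Under these assumptions, setting AGENTS_FLEET_E2E_FEATURE_NO_HANG_COPILOT_MAX_TOTAL_TOKENS would cap only the Copilot row of feature-no-hang and leave other rows on the global cap.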

Provider matrix and auth

Provider-aware scenarios use runAgentsFleetProviderMatrix(). For each row the harness injects --provider <provider> using child_process.spawn() arg arrays, reports passed/failed/skipped, and prints a one-line summary with timeout and token budget.

Provider availability is checked before real provider rows run:

Copilot requires @github/copilot-sdk, gh on PATH, and gh auth status succeeding. Run gh auth login if needed.
Claude requires @anthropic-ai/claude-agent-sdk plus one of ANTHROPIC_API_KEY, CLAUDE_CODE_OAUTH_TOKEN, ANTHROPIC_AUTH_TOKEN, an existing Claude credentials file, or claude auth status succeeding.

Missing auth skips only that provider row unless the scenario explicitly overrides availability for no-LLM local command coverage.
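As a rough illustration of how rows might be planned, the sketch below builds one argument array per provider and marks rows without auth as skipped. It is not the implementation of runAgentsFleetProviderMatrix(): the planProviderRows name, the lowercase provider values, and the position of --provider at the front of the array are assumptions.

```typescript
// Hypothetical sketch of provider-matrix row planning.
type Provider = 'copilot' | 'claude'; // assumed flag values

interface PlannedRow {
  provider: Provider;
  args: string[]; // passed to spawn() as an array, so no shell quoting is involved
  skipped: boolean;
}

const PROVIDERS: readonly Provider[] = ['copilot', 'claude'];

function planProviderRows(
  baseArgs: readonly string[],
  available: Record<Provider, boolean>,
): PlannedRow[] {
  return PROVIDERS.map((provider) => ({
    provider,
    args: ['--provider', provider, ...baseArgs],
    skipped: !available[provider], // missing auth skips only this row
  }));
}
```

Keeping the arguments as an array rather than a joined string is what lets the same row definition run unchanged under PowerShell, Bash, and zsh.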

Writing a scenario

Add scenario files under tests\e2e\scenarios\ with the .e2e.test.ts suffix and import helpers from ../harness.js:

import { describe, expect, it } from 'vitest';
import { runAgentsFleet, withTempProject } from '../harness.js';

describe('my scenario', () => {
  it('does the thing', async () => {
    await withTempProject('my-scenario', async (ctx) => {
      const result = await runAgentsFleet({
        args: ['--unattended', '-p', '/version'],
        timeoutMs: 30_000,
      });

      expect(result.exitCode).toBe(0);
      expect(result.testDir).toBe(ctx.projectDir);
    });
  });
});

Use named withTempProject('scenario-name', ...) calls so artifact paths and token-budget environment overrides stay stable. New real-LLM scenarios must be opt-in; do not add default-on token-spending tests.

Cross-platform notes

Harness filesystem paths use path.join()/path.resolve() and native absolute paths.
CLI and git invocations use child_process.spawn(command, args) with no shell: true, avoiding PowerShell/Bash/zsh quoting differences.
Use normalizeOutputForAssertions() for stdout/stderr checks that compare paths or multiline output. It strips ANSI, normalizes EOLs, redacts scenario roots, and normalizes path separators in assertion strings.
Write fixtures with explicit LF (\n) and accept either LF or CRLF in generated files.
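To make the normalization steps concrete, here is a standalone sketch of the same four transformations. It is not the source of normalizeOutputForAssertions(): the function name, the <scenario-root> placeholder, and the order of the steps are assumptions, and it converts every backslash to a forward slash, which a real helper may do more selectively.

```typescript
// Hypothetical sketch of output normalization for cross-platform assertions.
function normalizeForAssertionsSketch(output: string, scenarioRoot: string): string {
  const toForwardSlashes = (s: string): string => s.replace(/\\/g, '/');

  // 1. Strip ANSI CSI sequences such as color codes.
  const withoutAnsi = output.replace(/\x1b\[[0-?]*[ -\/]*[@-~]/g, '');
  // 2. Normalize CRLF and bare CR to LF.
  const withLf = withoutAnsi.replace(/\r\n?/g, '\n');
  // 3. Normalize path separators so Windows and POSIX output compare equal.
  const forward = toForwardSlashes(withLf);
  // 4. Redact the scenario root, which changes on every run.
  return forward.split(toForwardSlashes(scenarioRoot)).join('<scenario-root>');
}
```

Separators are normalized before the root is redacted so that the root matches whether the CLI printed it with backslashes or forward slashes.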

Troubleshooting

Scenario discovered but skipped: opt-in scenarios require AGENTS_FLEET_E2E_RUN_WORKFLOWS=1 or AGENTS_FLEET_E2E_RUN_FEATURE=1. Use npm run test:e2e:real-llm or set either env var manually.
Scenario times out: increase AGENTS_FLEET_E2E_FEATURE_TIMEOUT_MS for /feature scenarios, for example $env:AGENTS_FLEET_E2E_FEATURE_TIMEOUT_MS = '1800000' for 30 minutes.
Real ~\.fleet\ was polluted: file a bug. E2e runs should isolate state through FLEET_HOME, HOME, and USERPROFILE inside .e2e-runs\.
[PLANNING-DEBUG] lines on stderr: diagnostic planning logging is enabled. The lines are harmless and useful when debugging permission gates.
Need to inspect failure artifacts: rerun with KEEP_E2E_ARTIFACTS=1 or npm run test:e2e:keep; failure output includes retained workspace/artifact paths when available.
