E2E testing
End-to-end tests run the built agents-fleet CLI from dist\index.js against isolated synthetic git repositories under .e2e-runs\. They validate process startup, slash-command routing, provider-matrix behavior, token usage reporting, token budget enforcement, FLEET_HOME isolation, artifact capture, and cleanup from outside the CLI.
Table of contents
- Quick start
- Scenario selection
- Complete scenario inventory
- Harness workspaces and cleanup
- Token tracking and budgets
- Provider matrix and auth
- Writing a scenario
- Cross-platform notes
- Troubleshooting
Quick start
Build first; e2e runs always target the packaged binary, not tsx:
npm run build
Then use one of the four common loops:
npm run test:e2e:default # no-LLM sanity: smoke + slash-command-help only
npm run test:e2e:real-llm # opt-in: all scenarios, including provider-backed runs
npm run test:e2e:keep # debug: all scenarios and preserve .e2e-runs\ artifacts
# opt-in: just the Tier 3 workflow scenarios
$env:AGENTS_FLEET_E2E_RUN_WORKFLOWS = '1'
npm run test:e2e
Do not use the real-LLM scripts unless you intentionally want token-spending provider runs. test:e2e:keep sets both AGENTS_FLEET_E2E_RUN_FEATURE=1 and KEEP_E2E_ARTIFACTS=1. The legacy AGENTS_FLEET_E2E_RUN_FEATURE=1 gate still works for /feature-specific scenarios.
Scenario selection
The base script still runs the e2e Vitest config directly:
npm run test:e2e
With no opt-in environment variables, real-LLM scenarios are discovered but skipped. Target a single file or test name with Vitest filters:
npm run test:e2e -- smoke.e2e
npm run test:e2e -- slash-command-help
npm test -- harness.test
Enable the opt-in scenarios manually when you need finer control than the convenience scripts:
$env:AGENTS_FLEET_E2E_RUN_FEATURE = '1'
npm run test:e2e -- feature-no-hang
Gate hierarchy: AGENTS_FLEET_E2E_RUN_FEATURE=1 is the legacy PR-4 gate for /feature scenarios. AGENTS_FLEET_E2E_RUN_WORKFLOWS=1 is the umbrella gate for all Tier 3 opt-in workflows, including /feature, /code-review-range, /crew generate, /brainstorm, and /init. Either gate enables all opt-in scenarios today; future workflow scenarios should opt into the WORKFLOWS gate.
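The gate hierarchy above can be sketched as a small predicate. This is a minimal illustration, not the harness's actual code: the helper name `optInScenariosEnabled` is hypothetical, and it assumes a gate only counts as enabled when its value is exactly `'1'`.

```typescript
// Hypothetical helper; the real harness may evaluate the gates differently.
type Env = Record<string, string | undefined>;

const OPT_IN_GATES = [
  'AGENTS_FLEET_E2E_RUN_WORKFLOWS', // umbrella gate for Tier 3 opt-in workflows
  'AGENTS_FLEET_E2E_RUN_FEATURE', // legacy gate for /feature scenarios
];

// Assumption: a gate is "on" only when set to the exact string '1'.
function optInScenariosEnabled(env: Env): boolean {
  return OPT_IN_GATES.some((name) => env[name] === '1');
}
```

A scenario file could plausibly combine this with Vitest's `it.skipIf(...)`, for example `it.skipIf(!optInScenariosEnabled(process.env))`, so that gated tests are discovered but skipped.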
Complete scenario inventory
The suite shipped its v2 coverage expansion across PRs #160 (Tier 1 cheap security), #159 (Tier 2 cheap completeness), and the PR-C close-out (Tier 3 opt-in workflows). Coverage went from ~7% to ~80% of the user-visible feature surface.
Default-on, no LLM (15 scenarios)
| Scenario | Validates | Source |
|---|---|---|
| smoke | Built binary starts and /version exits with a semver-like version. | tests\e2e\scenarios\smoke.e2e.test.ts |
| slash-command-help | /help and /help /feature route through single-shot slash-command dispatch for both provider flags without auth. | tests\e2e\scenarios\slash-command-help.e2e.test.ts |
| diagnose-permission-posture | /diagnose text + /diagnose --json permission posture contract from #88. | tests\e2e\scenarios\diagnose-permission-posture.e2e.test.ts |
| yolo-ack-guard | --yolo alone non-zero exit; --acknowledge-all-sec activates acks with source=cli (regression detector for #88 breaking change). | tests\e2e\scenarios\yolo-ack-guard.e2e.test.ts |
| yolo-env-legacy | AGENTS_FLEET_YOLO_LEGACY=1 escape hatch activates all acks with source=env-legacy. | tests\e2e\scenarios\yolo-env-legacy.e2e.test.ts |
| permissions-config | ~/.fleet/config.json permissions block + yoloAcks + allowPrivilegeEscalation reflected in /diagnose --json. | tests\e2e\scenarios\permissions-config.e2e.test.ts |
| crew-local-commands | /crew list/create/activate/stop + /crews cheap empty-state path; project .fleet\crews\ only. | tests\e2e\scenarios\crew-local-commands.e2e.test.ts |
| crew-bundle-mcp | /crew mcp add/remove + /crew export/import happy and deterministic negative path. | tests\e2e\scenarios\crew-bundle-mcp.e2e.test.ts |
| local-registry-smoke | Batch cheap empty-state slash commands all exit 0 with no Unknown command. | tests\e2e\scenarios\local-registry-smoke.e2e.test.ts |
| cli-help-and-startup-flags | --help, --version, --no-coordinator, --name, --effort, --model, --max-workers visible behavior. | tests\e2e\scenarios\cli-help-and-startup-flags.e2e.test.ts |
| help-major-families | /help /crew, /help /diagnose, /help /init, /help /code-review, /help /mcp, /help /skill usage text. | tests\e2e\scenarios\help-major-families.e2e.test.ts |
| config-negative-cases | Malformed ~/.fleet/config.json (invalid JSON, wrong shape, wrong type) warns + defaults safely. | tests\e2e\scenarios\config-negative-cases.e2e.test.ts |
| knowledge-local-commands | /skills, /skill create/list, agents-fleet skills migrate --dry-run. | tests\e2e\scenarios\knowledge-local-commands.e2e.test.ts |
| git-and-state-local-commands | /diff, /worktree, /context-status/scopes (was /init status/scopes), /loops, /compete empty/fixture paths in an init'd git repo. | tests\e2e\scenarios\git-and-state-local-commands.e2e.test.ts |
| fleet-home-not-polluted | The harness's HOME/USERPROFILE/FLEET_HOME isolation actually prevents writes to the real ~/.fleet/. | tests\e2e\scenarios\fleet-home-not-polluted.e2e.test.ts |
Opt-in, real LLM (11 scenarios, gated by env vars)
Gated by EITHER AGENTS_FLEET_E2E_RUN_WORKFLOWS=1 (umbrella for all opt-in workflow scenarios) OR AGENTS_FLEET_E2E_RUN_FEATURE=1 (legacy gate from PR-4; both kept for compatibility).
| Scenario | Cost | Validates | Source |
|---|---|---|---|
| code-review-basic | Low/medium | /code-review src/index.ts dispatches and catches durability regressions. | tests\e2e\scenarios\code-review-basic.e2e.test.ts |
| feature-planning-clean | High | #131: /feature --new writes planning artifacts while leaving src\ clean. | tests\e2e\scenarios\feature-planning-clean.e2e.test.ts |
| feature-no-hang | High | #132: unattended /feature --new exits after all required artifacts are durable. | tests\e2e\scenarios\feature-no-hang.e2e.test.ts |
| feature-no-contradictions | High | #133: later /feature artifacts preserve requirements unless an explicit amendment records a scope change. | tests\e2e\scenarios\feature-no-contradictions.e2e.test.ts |
| code-review-git-range | Low/medium | PR #149 (GAP-4): /code-review HEAD~1..HEAD expands the range and writes CODE_REVIEW.md referencing the changed file. | tests\e2e\scenarios\code-review-git-range.e2e.test.ts |
| feature-resume | Very high | /feature --resume <slug> reuses the existing slug (no new plan dir) and keeps artifacts durable. | tests\e2e\scenarios\feature-resume.e2e.test.ts |
| feature-from-brainstorm | High | /feature --from-brainstorm <slug> seeds Phase 1 from the brainstorm context. | tests\e2e\scenarios\feature-from-brainstorm.e2e.test.ts |
| feature-deep-brainstorm | Low | /feature --deep-brainstorm under --unattended is refused with a clear message. | tests\e2e\scenarios\feature-deep-brainstorm.e2e.test.ts |
| crew-generate | Low | #114 regression detector: /crew generate <name> --prompt "x" produces a <name>.crew.md shaped by the prompt. | tests\e2e\scenarios\crew-generate.e2e.test.ts |
| brainstorm-standalone | Low | /brainstorm "x" --rounds 1 --agents 0 writes 00-brainstorming.md. | tests\e2e\scenarios\brainstorm-standalone.e2e.test.ts |
| init-investigation | Medium | /init writes .fleet/context/FLEET.md. | tests\e2e\scenarios\init-investigation.e2e.test.ts |
Tier 3 opt-in scenarios pin to the Copilot provider only (to limit token spend). The provider-parity unit and integration tests already validate provider equivalence on the permission-gate path. Tier 1 default-on scenarios that exercise local command routing (not LLM behavior) run with both the Copilot and Claude provider flags.
Harness workspaces and cleanup
Each scenario gets a repo-local workspace under .e2e-runs\<scenario-name>-<timestamp>\:
- project\: synthetic git repository used as the CLI cwd
- artifacts\: harness artifact directory and token result file location
- home\.fleet\: isolated fleet home exposed through FLEET_HOME
Workspaces are deleted after both pass and failure by default. Set KEEP_E2E_ARTIFACTS=1 (or use npm run test:e2e:keep) to retain them for inspection. Cleanup uses bounded retries and logs cleanup-retained-locked-files if Windows keeps a file locked.
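The workspace layout described above can be sketched with plain `path.join()` calls. This is an illustrative sketch only: `workspaceLayout` and `isolatedEnv` are hypothetical names, the timestamp format is left to the caller, and pointing `HOME` and `USERPROFILE` at the workspace's home\ directory is an assumption about how the isolation is wired.

```typescript
import * as path from 'node:path';

interface WorkspaceLayout {
  root: string;
  projectDir: string;
  artifactsDir: string;
  fleetHome: string;
}

// Hypothetical helper: builds the per-scenario directory layout under .e2e-runs.
function workspaceLayout(repoRoot: string, scenario: string, timestamp: string): WorkspaceLayout {
  const root = path.join(repoRoot, '.e2e-runs', `${scenario}-${timestamp}`);
  return {
    root,
    projectDir: path.join(root, 'project'), // synthetic git repo, used as the CLI cwd
    artifactsDir: path.join(root, 'artifacts'), // artifacts and token result file
    fleetHome: path.join(root, 'home', '.fleet'), // isolated fleet home
  };
}

// Assumption: HOME and USERPROFILE both point at the workspace's home directory.
function isolatedEnv(layout: WorkspaceLayout): Record<string, string> {
  const home = path.dirname(layout.fleetHome);
  return { HOME: home, USERPROFILE: home, FLEET_HOME: layout.fleetHome };
}
```

Merging the result of `isolatedEnv()` over the parent environment before spawning the CLI would keep every state write inside the scenario workspace.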
Token tracking and budgets
The harness attaches token usage to every RunResult:
result.tokens.totalTokens;
result.tokens.usageSource; // 'e2e-result-file', 'session-file', or 'none'
Budgets are unlimited by default. Configure caps through scenario tokenBudget values or environment variables:
- AGENTS_FLEET_E2E_MAX_TOTAL_TOKENS
- AGENTS_FLEET_E2E_MAX_INPUT_TOKENS
- AGENTS_FLEET_E2E_MAX_OUTPUT_TOKENS
- AGENTS_FLEET_E2E_<SCENARIO_NAME>_MAX_TOTAL_TOKENS
Provider-matrix runs enforce budgets per (scenario, provider) tuple by resolving names like feature-no-hang-copilot before applying overrides. Exceeding a configured cap fails with token-budget-exceeded and prints the cap, actual usage, and usage source.
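A rough sketch of how a per-scenario cap might be resolved and checked follows. Several details are assumptions rather than documented behavior: the mapping from a scenario name to the `<SCENARIO_NAME>` segment (upper-casing and replacing non-alphanumerics with underscores), the precedence of the scenario-specific variable over the global one, and the exact failure message format. All function names are hypothetical.

```typescript
type Env = Record<string, string | undefined>;

interface TokenUsage {
  totalTokens: number;
  usageSource: 'e2e-result-file' | 'session-file' | 'none';
}

// Assumption: 'feature-no-hang-copilot' maps to FEATURE_NO_HANG_COPILOT.
function scenarioBudgetVar(scenarioName: string): string {
  const segment = scenarioName.toUpperCase().replace(/[^A-Z0-9]+/g, '_');
  return `AGENTS_FLEET_E2E_${segment}_MAX_TOTAL_TOKENS`;
}

// Assumption: the scenario-specific variable wins over the global cap.
// Returns undefined when no usable cap is configured (unlimited).
function resolveTotalCap(env: Env, scenarioName: string): number | undefined {
  const raw = env[scenarioBudgetVar(scenarioName)] ?? env['AGENTS_FLEET_E2E_MAX_TOTAL_TOKENS'];
  if (raw === undefined || raw.trim() === '') return undefined;
  const cap = Number(raw);
  return Number.isFinite(cap) && cap >= 0 ? cap : undefined;
}

// Returns a failure message when usage exceeds the cap, otherwise undefined.
function checkBudget(usage: TokenUsage, cap: number | undefined): string | undefined {
  if (cap === undefined || usage.totalTokens <= cap) return undefined;
  return `token-budget-exceeded: cap=${cap} actual=${usage.totalTokens} source=${usage.usageSource}`;
}
```

Resolving the name with the provider suffix first means a cap like the one for `feature-no-hang-copilot` applies to that matrix row alone.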
Provider matrix and auth
Provider-aware scenarios use runAgentsFleetProviderMatrix(). For each row the harness injects --provider <provider> using child_process.spawn() arg arrays, reports passed/failed/skipped, and prints a one-line summary with timeout and token budget.
Provider availability is checked before real provider rows run:
- Copilot requires @github/copilot-sdk, gh on PATH, and gh auth status succeeding. Run gh auth login if needed.
- Claude requires @anthropic-ai/claude-agent-sdk plus one of ANTHROPIC_API_KEY, CLAUDE_CODE_OAUTH_TOKEN, ANTHROPIC_AUTH_TOKEN, an existing Claude credentials file, or claude auth status succeeding.
Missing auth skips only that provider row unless the scenario explicitly overrides availability for no-LLM local command coverage.
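The row-by-row behavior described above could look roughly like the sketch below. It is not the implementation of `runAgentsFleetProviderMatrix()`: `runMatrix`, `isAvailable`, and `run` are hypothetical stand-ins, the provider values `'copilot'` and `'claude'` and the position of the injected flag are assumptions, and a row is treated as passed simply when the exit code is 0.

```typescript
type Provider = 'copilot' | 'claude';
type RowStatus = 'passed' | 'failed' | 'skipped';

// Inject the provider flag as separate argv entries, so no shell quoting is involved.
function withProvider(args: readonly string[], provider: Provider): string[] {
  return ['--provider', provider, ...args];
}

// Hypothetical row runner: `isAvailable` stands in for the auth/SDK checks and
// `run` stands in for spawning the CLI and resolving to its exit code.
async function runMatrix(
  providers: readonly Provider[],
  args: readonly string[],
  isAvailable: (provider: Provider) => Promise<boolean>,
  run: (argv: string[]) => Promise<number>,
): Promise<Record<string, RowStatus>> {
  const summary: Record<string, RowStatus> = {};
  for (const provider of providers) {
    if (!(await isAvailable(provider))) {
      summary[provider] = 'skipped'; // missing auth skips only this row
      continue;
    }
    const exitCode = await run(withProvider(args, provider));
    summary[provider] = exitCode === 0 ? 'passed' : 'failed';
  }
  return summary;
}
```

Running rows sequentially keeps the summary order stable and lets one missing credential skip a single row without affecting the others.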
Writing a scenario
Add scenario files under tests\e2e\scenarios\ with the .e2e.test.ts suffix and import helpers from ../harness.js:
import { describe, expect, it } from 'vitest';
import { runAgentsFleet, withTempProject } from '../harness.js';

describe('my scenario', () => {
  it('does the thing', async () => {
    await withTempProject('my-scenario', async (ctx) => {
      const result = await runAgentsFleet({
        args: ['--unattended', '-p', '/version'],
        timeoutMs: 30_000,
      });
      expect(result.exitCode).toBe(0);
      expect(result.testDir).toBe(ctx.projectDir);
    });
  });
});
Use named withTempProject('scenario-name', ...) calls so artifact paths and token-budget environment overrides stay stable. New real-LLM scenarios must be opt-in; do not add default-on token-spending tests.
Cross-platform notes
- Harness filesystem paths use path.join()/path.resolve() and native absolute paths.
- CLI and git invocations use child_process.spawn(command, args) with no shell: true, avoiding PowerShell/Bash/zsh quoting differences.
- Use normalizeOutputForAssertions() for stdout/stderr checks that compare paths or multiline output. It strips ANSI, normalizes EOLs, redacts scenario roots, and normalizes path separators in assertion strings.
- Write fixtures with explicit LF (\n) and accept either LF or CRLF in generated files.
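To make the four normalization steps concrete, here is a simplified stand-in. It is not the harness's `normalizeOutputForAssertions()`; the function name, the `<SCENARIO_ROOT>` placeholder, and the order of operations are assumptions, and it naively converts every backslash, which a real helper might avoid.

```typescript
// Hypothetical sketch of output normalization for cross-platform assertions.
function normalizeOutputSketch(output: string, scenarioRoot: string): string {
  // CSI escape sequences such as colors and cursor movement.
  const ansi = /\u001b\[[0-?]*[ -\/]*[@-~]/g;
  const text = output
    .replace(ansi, '')
    .replace(/\r\n?/g, '\n'); // CRLF or lone CR becomes LF
  // Redact the root before touching separators, so a native Windows root still matches.
  const redacted = scenarioRoot ? text.split(scenarioRoot).join('<SCENARIO_ROOT>') : text;
  // Simplification: every backslash becomes a forward slash.
  return redacted.replace(/\\/g, '/');
}
```

With this kind of helper, an assertion can compare against `<SCENARIO_ROOT>/project/a.txt` regardless of whether the run happened on Windows or a POSIX system.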
Troubleshooting
- Scenario discovered but skipped: opt-in scenarios require AGENTS_FLEET_E2E_RUN_WORKFLOWS=1 or AGENTS_FLEET_E2E_RUN_FEATURE=1. Use npm run test:e2e:real-llm or set either env var manually.
- Scenario times out: increase AGENTS_FLEET_E2E_FEATURE_TIMEOUT_MS for /feature scenarios, for example $env:AGENTS_FLEET_E2E_FEATURE_TIMEOUT_MS = '1800000' for 30 minutes.
- Real ~\.fleet\ was polluted: file a bug. E2e runs should isolate state through FLEET_HOME, HOME, and USERPROFILE inside .e2e-runs\.
- [PLANNING-DEBUG] lines on stderr: diagnostic planning logging is enabled. The lines are harmless and useful when debugging permission gates.
- Need to inspect failure artifacts: rerun with KEEP_E2E_ARTIFACTS=1 or npm run test:e2e:keep; failure output includes retained workspace/artifact paths when available.