E2E testing

End-to-end tests run the built agents-fleet CLI from dist\index.js against isolated synthetic git repositories under .e2e-runs\. They validate process startup, slash-command routing, provider-matrix behavior, token usage reporting, token budget enforcement, FLEET_HOME isolation, artifact capture, and cleanup from outside the CLI.

Quick start

Build first; e2e runs always target the packaged binary, not tsx:

powershell
npm run build

Then use one of the four common loops:

powershell
npm run test:e2e:default   # no-LLM sanity: smoke + slash-command-help only
npm run test:e2e:real-llm  # opt-in: all scenarios, including provider-backed runs
npm run test:e2e:keep      # debug: all scenarios and preserve .e2e-runs\ artifacts

# opt-in: just the Tier 3 workflow scenarios
$env:AGENTS_FLEET_E2E_RUN_WORKFLOWS = '1'
npm run test:e2e

Do not use the real-LLM scripts unless you intentionally want token-spending provider runs. test:e2e:keep sets both AGENTS_FLEET_E2E_RUN_FEATURE=1 and KEEP_E2E_ARTIFACTS=1. The legacy AGENTS_FLEET_E2E_RUN_FEATURE=1 gate still works for /feature-specific scenarios.

Scenario selection

The base script still runs the e2e Vitest config directly:

powershell
npm run test:e2e

With no opt-in environment variables, real-LLM scenarios are discovered but skipped. Target a single file or test name with Vitest filters:

powershell
npm run test:e2e -- smoke.e2e
npm run test:e2e -- slash-command-help
npm test -- harness.test

Enable the opt-in scenarios manually when you need finer control than the convenience scripts:

powershell
$env:AGENTS_FLEET_E2E_RUN_FEATURE = '1'
npm run test:e2e -- feature-no-hang

Gate hierarchy: AGENTS_FLEET_E2E_RUN_FEATURE=1 is the legacy PR-4 gate for /feature scenarios. AGENTS_FLEET_E2E_RUN_WORKFLOWS=1 is the umbrella gate for all Tier 3 opt-in workflows, including /feature, /code-review-range, /crew generate, /brainstorm, and /init. Either gate enables all opt-in scenarios today; future workflow scenarios should opt into the WORKFLOWS gate.
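The either-gate rule above can be sketched as a small predicate. This is only an illustration: `optInScenariosEnabled` and `OPT_IN_GATES` are hypothetical names, and the real harness may read the variables differently.

```typescript
// Hypothetical helper; not taken from the harness source.
const OPT_IN_GATES = [
  'AGENTS_FLEET_E2E_RUN_WORKFLOWS', // umbrella gate for Tier 3 opt-in workflows
  'AGENTS_FLEET_E2E_RUN_FEATURE', // legacy PR-4 gate for /feature scenarios
] as const;

// Returns true when either gate is set to '1', matching the rule that
// either gate enables all opt-in scenarios today.
function optInScenariosEnabled(env: Record<string, string | undefined>): boolean {
  return OPT_IN_GATES.some((name) => env[name] === '1');
}
```

In a scenario file, a predicate like this could feed Vitest's `it.skipIf(!optInScenariosEnabled(process.env))` so gated tests are discovered but skipped when neither variable is set.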

Complete scenario inventory

The suite shipped its v2 coverage expansion across PRs #160 (Tier 1 cheap security), #159 (Tier 2 cheap completeness), and the PR-C close-out (Tier 3 opt-in workflows). Coverage went from ~7% to ~80% of the user-visible feature surface.

Default-on, no LLM (15 scenarios)

| Scenario | Validates | Source |
| --- | --- | --- |
| smoke | Built binary starts and /version exits with a semver-like version. | tests\e2e\scenarios\smoke.e2e.test.ts |
| slash-command-help | /help and /help /feature route through single-shot slash-command dispatch for both provider flags without auth. | tests\e2e\scenarios\slash-command-help.e2e.test.ts |
| diagnose-permission-posture | /diagnose text + /diagnose --json permission posture contract from #88. | tests\e2e\scenarios\diagnose-permission-posture.e2e.test.ts |
| yolo-ack-guard | --yolo alone non-zero exit; --acknowledge-all-sec activates acks with source=cli (regression detector for #88 breaking change). | tests\e2e\scenarios\yolo-ack-guard.e2e.test.ts |
| yolo-env-legacy | AGENTS_FLEET_YOLO_LEGACY=1 escape hatch activates all acks with source=env-legacy. | tests\e2e\scenarios\yolo-env-legacy.e2e.test.ts |
| permissions-config | ~/.fleet/config.json permissions block + yoloAcks + allowPrivilegeEscalation reflected in /diagnose --json. | tests\e2e\scenarios\permissions-config.e2e.test.ts |
| crew-local-commands | /crew list/create/activate/stop + /crews cheap empty-state path; project .fleet\crews\ only. | tests\e2e\scenarios\crew-local-commands.e2e.test.ts |
| crew-bundle-mcp | /crew mcp add/remove + /crew export/import happy and deterministic negative path. | tests\e2e\scenarios\crew-bundle-mcp.e2e.test.ts |
| local-registry-smoke | Batch cheap empty-state slash commands all exit 0 with no Unknown command. | tests\e2e\scenarios\local-registry-smoke.e2e.test.ts |
| cli-help-and-startup-flags | --help, --version, --no-coordinator, --name, --effort, --model, --max-workers visible behavior. | tests\e2e\scenarios\cli-help-and-startup-flags.e2e.test.ts |
| help-major-families | /help /crew, /help /diagnose, /help /init, /help /code-review, /help /mcp, /help /skill usage text. | tests\e2e\scenarios\help-major-families.e2e.test.ts |
| config-negative-cases | Malformed ~/.fleet/config.json (invalid JSON, wrong shape, wrong type) warns + defaults safely. | tests\e2e\scenarios\config-negative-cases.e2e.test.ts |
| knowledge-local-commands | /skills, /skill create/list, agents-fleet skills migrate --dry-run. | tests\e2e\scenarios\knowledge-local-commands.e2e.test.ts |
| git-and-state-local-commands | /diff, /worktree, /context-status/scopes (was /init status/scopes), /loops, /compete empty/fixture paths in an init'd git repo. | tests\e2e\scenarios\git-and-state-local-commands.e2e.test.ts |
| fleet-home-not-polluted | The harness's HOME/USERPROFILE/FLEET_HOME isolation actually prevents writes to the real ~/.fleet/. | tests\e2e\scenarios\fleet-home-not-polluted.e2e.test.ts |

Opt-in, real LLM (11 scenarios, gated by env vars)

Gated by EITHER AGENTS_FLEET_E2E_RUN_WORKFLOWS=1 (umbrella for all opt-in workflow scenarios) OR AGENTS_FLEET_E2E_RUN_FEATURE=1 (legacy gate from PR-4; both kept for compatibility).

| Scenario | Cost | Validates | Source |
| --- | --- | --- | --- |
| code-review-basic | Low/medium | /code-review src/index.ts dispatches and catches durability regressions. | tests\e2e\scenarios\code-review-basic.e2e.test.ts |
| feature-planning-clean | High | #131: /feature --new writes planning artifacts while leaving src\ clean. | tests\e2e\scenarios\feature-planning-clean.e2e.test.ts |
| feature-no-hang | High | #132: unattended /feature --new exits after all required artifacts are durable. | tests\e2e\scenarios\feature-no-hang.e2e.test.ts |
| feature-no-contradictions | High | #133: later /feature artifacts preserve requirements unless an explicit amendment records a scope change. | tests\e2e\scenarios\feature-no-contradictions.e2e.test.ts |
| code-review-git-range | Low/medium | PR #149 (GAP-4): /code-review HEAD~1..HEAD expands the range and writes CODE_REVIEW.md referencing the changed file. | tests\e2e\scenarios\code-review-git-range.e2e.test.ts |
| feature-resume | Very high | /feature --resume <slug> reuses the existing slug (no new plan dir) and keeps artifacts durable. | tests\e2e\scenarios\feature-resume.e2e.test.ts |
| feature-from-brainstorm | High | /feature --from-brainstorm <slug> seeds Phase 1 from the brainstorm context. | tests\e2e\scenarios\feature-from-brainstorm.e2e.test.ts |
| feature-deep-brainstorm | Low | /feature --deep-brainstorm under --unattended is refused with a clear message. | tests\e2e\scenarios\feature-deep-brainstorm.e2e.test.ts |
| crew-generate | Low | #114 regression detector: /crew generate <name> --prompt "x" produces a <name>.crew.md shaped by the prompt. | tests\e2e\scenarios\crew-generate.e2e.test.ts |
| brainstorm-standalone | Low | /brainstorm "x" --rounds 1 --agents 0 writes 00-brainstorming.md. | tests\e2e\scenarios\brainstorm-standalone.e2e.test.ts |
| init-investigation | Medium | /init writes .fleet/context/FLEET.md. | tests\e2e\scenarios\init-investigation.e2e.test.ts |

Tier 3 opt-in scenarios pin to the Copilot provider only, to limit token spend. The provider-parity unit and integration tests already validate provider equivalence on the permission-gate path. Tier 1 default-on scenarios that exercise local command routing (not LLM behavior) run with both the Copilot and Claude provider flags.

Harness workspaces and cleanup

Each scenario gets a repo-local workspace under .e2e-runs\<scenario-name>-<timestamp>\:

  • project\ — synthetic git repository used as the CLI cwd
  • artifacts\ — harness artifact directory and token result file location
  • home\.fleet\ — isolated fleet home exposed through FLEET_HOME

Workspaces are deleted by default after both passing and failing runs. Set KEEP_E2E_ARTIFACTS=1 (or use npm run test:e2e:keep) to retain them for inspection. Cleanup uses bounded retries and logs cleanup-retained-locked-files if Windows keeps a file locked.
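The workspace layout and cleanup behavior described above might look roughly like the following. This is a sketch under assumptions, not the harness implementation: `createWorkspace`, `cleanupWorkspace`, the `Date.now()` timestamp format, and pointing HOME and USERPROFILE at `home\` are all illustrative choices. The bounded retry uses the `maxRetries` and `retryDelay` options of Node's `fs.rmSync`.

```typescript
import * as fs from 'node:fs';
import * as os from 'node:os';
import * as path from 'node:path';

interface Workspace {
  root: string;
  projectDir: string; // synthetic git repository used as the CLI cwd
  artifactsDir: string; // harness artifacts and token result file
  fleetHome: string; // isolated fleet home exposed through FLEET_HOME
  env: Record<string, string>;
}

// Hypothetical helper: lays out project, artifacts, and home/.fleet under runsRoot.
function createWorkspace(runsRoot: string, scenarioName: string): Workspace {
  // Assumption: the timestamp suffix is epoch milliseconds.
  const root = path.join(runsRoot, `${scenarioName}-${Date.now()}`);
  const projectDir = path.join(root, 'project');
  const artifactsDir = path.join(root, 'artifacts');
  const homeDir = path.join(root, 'home');
  const fleetHome = path.join(homeDir, '.fleet');
  for (const dir of [projectDir, artifactsDir, fleetHome]) {
    fs.mkdirSync(dir, { recursive: true });
  }
  // Redirect every home-like variable into the sandbox so the child process
  // cannot write to the real user profile.
  const env = { HOME: homeDir, USERPROFILE: homeDir, FLEET_HOME: fleetHome };
  return { root, projectDir, artifactsDir, fleetHome, env };
}

// Returns true when the workspace was removed, false when it was retained.
function cleanupWorkspace(ws: Workspace, keep: boolean): boolean {
  if (keep) return false; // the KEEP_E2E_ARTIFACTS=1 case
  // rmSync retries transient errors such as EBUSY or EPERM up to maxRetries times.
  fs.rmSync(ws.root, { recursive: true, force: true, maxRetries: 5, retryDelay: 100 });
  return !fs.existsSync(ws.root);
}
```

A caller could pass `process.env.KEEP_E2E_ARTIFACTS === '1'` as the `keep` argument and merge `ws.env` over the inherited environment before spawning the CLI.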

Token tracking and budgets

The harness attaches token usage to every RunResult:

ts
result.tokens.totalTokens;
result.tokens.usageSource; // 'e2e-result-file', 'session-file', or 'none'

Budgets are unlimited by default. Configure caps through scenario tokenBudget values or environment variables:

  • AGENTS_FLEET_E2E_MAX_TOTAL_TOKENS
  • AGENTS_FLEET_E2E_MAX_INPUT_TOKENS
  • AGENTS_FLEET_E2E_MAX_OUTPUT_TOKENS
  • AGENTS_FLEET_E2E_<SCENARIO_NAME>_MAX_TOTAL_TOKENS

Provider-matrix runs enforce budgets per (scenario, provider) tuple by resolving names like feature-no-hang-copilot before applying overrides. Exceeding a configured cap fails with token-budget-exceeded and prints the cap, actual usage, and usage source.
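As a rough illustration of how a per-scenario cap could be resolved and enforced, consider the sketch below. The helper names are hypothetical, and two details are assumptions rather than documented behavior: that scenario names map to environment keys by upper-casing and replacing non-alphanumeric characters with underscores, and that a scenario-specific override takes precedence over the global cap.

```typescript
interface TokenUsage {
  totalTokens: number;
  usageSource: 'e2e-result-file' | 'session-file' | 'none';
}

// Assumed mapping: 'feature-no-hang-copilot' -> 'FEATURE_NO_HANG_COPILOT'.
function scenarioEnvKey(scenarioName: string): string {
  const slug = scenarioName.toUpperCase().replace(/[^A-Z0-9]+/g, '_');
  return `AGENTS_FLEET_E2E_${slug}_MAX_TOTAL_TOKENS`;
}

// Assumed precedence: scenario override first, then the global cap.
// Returns undefined when no valid cap is configured (unlimited).
function resolveTotalCap(
  scenarioName: string,
  env: Record<string, string | undefined>,
): number | undefined {
  const raw = env[scenarioEnvKey(scenarioName)] ?? env.AGENTS_FLEET_E2E_MAX_TOTAL_TOKENS;
  if (raw === undefined || raw.trim() === '') return undefined;
  const cap = Number(raw);
  return Number.isFinite(cap) && cap >= 0 ? cap : undefined;
}

// Returns a failure message naming the cap, actual usage, and usage source,
// or undefined when the run is within budget.
function checkTotalBudget(
  scenarioName: string,
  usage: TokenUsage,
  env: Record<string, string | undefined>,
): string | undefined {
  const cap = resolveTotalCap(scenarioName, env);
  if (cap === undefined || usage.totalTokens <= cap) return undefined;
  return `token-budget-exceeded: cap=${cap} actual=${usage.totalTokens} source=${usage.usageSource}`;
}
```

Only the total-token cap is modeled here; input and output caps would follow the same pattern with their own variables.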

Provider matrix and auth

Provider-aware scenarios use runAgentsFleetProviderMatrix(). For each row the harness injects --provider <provider> using child_process.spawn() arg arrays, reports passed/failed/skipped, and prints a one-line summary with timeout and token budget.

Provider availability is checked before real provider rows run:

  • Copilot requires @github/copilot-sdk, gh on PATH, and gh auth status succeeding. Run gh auth login if needed.
  • Claude requires @anthropic-ai/claude-agent-sdk plus one of ANTHROPIC_API_KEY, CLAUDE_CODE_OAUTH_TOKEN, ANTHROPIC_AUTH_TOKEN, an existing Claude credentials file, or claude auth status succeeding.

Missing auth skips only that provider row unless the scenario explicitly overrides availability for no-LLM local command coverage.
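The row planning described above can be approximated with a few pure functions. This is a hedged sketch: `withProvider`, `claudeEnvAuthPresent`, and `planRows` are hypothetical names, prepending the flag to the argument list is an assumption, and only the environment-variable part of the Claude check is modeled.

```typescript
type Provider = 'copilot' | 'claude';
type PlannedStatus = 'pending' | 'skipped';

// Environment variables the text lists as acceptable Claude credentials.
const CLAUDE_AUTH_VARS = [
  'ANTHROPIC_API_KEY',
  'CLAUDE_CODE_OAUTH_TOKEN',
  'ANTHROPIC_AUTH_TOKEN',
] as const;

// Env-only part of the Claude availability check. The credentials file and
// `claude auth status` fallbacks are deliberately left out of this sketch.
function claudeEnvAuthPresent(env: Record<string, string | undefined>): boolean {
  return CLAUDE_AUTH_VARS.some((name) => (env[name] ?? '').trim() !== '');
}

// Builds argv as an array so it can be handed to child_process.spawn()
// without a shell. Placing the flag first is an assumption.
function withProvider(provider: Provider, args: readonly string[]): string[] {
  return ['--provider', provider, ...args];
}

// Rows for unavailable providers are marked skipped; the rest are left to run.
function planRows(
  providers: readonly Provider[],
  available: Record<Provider, boolean>,
): Array<{ provider: Provider; status: PlannedStatus }> {
  return providers.map((provider) => ({
    provider,
    status: available[provider] ? 'pending' : 'skipped',
  }));
}
```

A no-LLM scenario that only exercises local command routing could pass `{ copilot: true, claude: true }` as `available` to mirror the explicit availability override mentioned above.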

Writing a scenario

Add scenario files under tests\e2e\scenarios\ with the .e2e.test.ts suffix and import helpers from ../harness.js:

ts
import { describe, expect, it } from 'vitest';
import { runAgentsFleet, withTempProject } from '../harness.js';

describe('my scenario', () => {
  it('does the thing', async () => {
    // The name passed here keys artifact paths and token-budget overrides.
    await withTempProject('my-scenario', async (ctx) => {
      // Run the built CLI once with a 30-second timeout.
      const result = await runAgentsFleet({
        args: ['--unattended', '-p', '/version'],
        timeoutMs: 30_000,
      });

      expect(result.exitCode).toBe(0);
      // The CLI should have run inside this scenario's synthetic project.
      expect(result.testDir).toBe(ctx.projectDir);
    });
  });
});

Use named withTempProject('scenario-name', ...) calls so artifact paths and token-budget environment overrides stay stable. New real-LLM scenarios must be opt-in; do not add default-on token-spending tests.

Cross-platform notes

  • Harness filesystem paths use path.join()/path.resolve() and native absolute paths.
  • CLI and git invocations use child_process.spawn(command, args) with no shell: true, avoiding PowerShell/Bash/zsh quoting differences.
  • Use normalizeOutputForAssertions() for stdout/stderr checks that compare paths or multiline output. It strips ANSI, normalizes EOLs, redacts scenario roots, and normalizes path separators in assertion strings.
  • Write fixtures with explicit LF (\n) and accept either LF or CRLF in generated files.
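To make the four normalization steps concrete, here is a simplified stand-in for normalizeOutputForAssertions(). It is not the harness's implementation: the function name `normalizeOutputSketch`, the `<SCENARIO_ROOT>` placeholder, and the ANSI pattern (which covers only common CSI sequences such as colors) are assumptions for illustration.

```typescript
// Hypothetical re-implementation for illustration only.
function normalizeOutputSketch(output: string, scenarioRoot: string): string {
  const ansi = /\u001b\[[0-9;]*[A-Za-z]/g; // CSI sequences such as color codes
  return output
    .replace(ansi, '') // 1. strip ANSI escapes
    .replace(/\r\n?/g, '\n') // 2. normalize CRLF and lone CR to LF
    .split(scenarioRoot)
    .join('<SCENARIO_ROOT>') // 3. redact the scenario root before touching separators
    .replace(/\\/g, '/'); // 4. normalize path separators
}

const sample = '\u001b[32mok\u001b[0m C:\\runs\\x\\project\\a.txt\r\n';
console.log(JSON.stringify(normalizeOutputSketch(sample, 'C:\\runs\\x')));
// prints "ok <SCENARIO_ROOT>/project/a.txt\n"
```

Redacting the root before rewriting separators matters, because a Windows root containing backslashes would no longer match once the separators are converted. Replacing every backslash is a simplification that suits assertion strings but would also alter escaped text.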

Troubleshooting

  • Scenario discovered but skipped: opt-in scenarios require AGENTS_FLEET_E2E_RUN_WORKFLOWS=1 or AGENTS_FLEET_E2E_RUN_FEATURE=1. Use npm run test:e2e:real-llm or set either env var manually.
  • Scenario times out: increase AGENTS_FLEET_E2E_FEATURE_TIMEOUT_MS for /feature scenarios, for example $env:AGENTS_FLEET_E2E_FEATURE_TIMEOUT_MS = '1800000' for 30 minutes.
  • Real ~\.fleet\ was polluted: file a bug. E2e runs should isolate state through FLEET_HOME, HOME, and USERPROFILE inside .e2e-runs\.
  • [PLANNING-DEBUG] lines on stderr: diagnostic planning logging is enabled. The lines are harmless and useful when debugging permission gates.
  • Need to inspect failure artifacts: rerun with KEEP_E2E_ARTIFACTS=1 or npm run test:e2e:keep; failure output includes retained workspace/artifact paths when available.