Fleet Intelligence

Fleet Intelligence is the self-improvement infrastructure for Agents Fleet. It continuously learns from your sessions, identifies patterns, generates data-driven suggestions, and evolves skill prompts over time.

Overview

Fleet Intelligence operates as a multi-phase pipeline:

Observe → Analyze → Suggest → Evolve → Verify → Assist

Observe — Captures session telemetry (agent runs, token usage, errors, skill activations)
Analyze — Aggregates data across sessions to surface patterns and trends
Suggest — Generates actionable suggestions based on statistical analysis
Evolve — Shadow evolution proposes, evaluates, and promotes skill prompt improvements
Verify — Deterministic verification and outcome backfill validate changes
Assist — Provides auto-memory, steering extraction, learning dashboards, and prompt-level guidance

Key principles:

Local-only — All data stays on your machine at ~/.fleet/intel/. Nothing is sent externally.
Human-in-the-loop — Suggestions and shadow proposals require explicit approval before being applied.
Always-on — No configuration needed. Intel collection starts automatically with every session.

Architecture

mermaid

graph TD
    User -->|input| REPL
    REPL --> CoordinatorEngine
    CoordinatorEngine --> SDKSession["SDK Session"]
    SDKSession --> EventListeners
    EventListeners --> FleetStateStore["FleetStateStore (events)"]

    FleetStateStore --> UIDisplay["UI/Display"]
    FleetStateStore --> IntelCollector

    IntelCollector --> IntelDB["IntelDatabase<br/>(~/.fleet/intel/fleet-intel.db)"]
    IntelDB --> SuggestionEngine
    IntelDB --> OutcomeBackfiller
    IntelDB --> SkillEvolver
    IntelDB --> EvalRunner
    SuggestionEngine --> IntelDB
    SkillEvolver -->|shadow proposals| IntelDB
    EvalRunner -->|eval results| IntelDB
    OutcomeBackfiller -->|merge/revert data| IntelDB
    IntelDB --> formatIntelContext["formatIntelContext()"]
    formatIntelContext --> CoordinatorPrompt["Coordinator Prompt<br/>&lt;fleet-intelligence&gt; section"]

    style IntelCollector fill:#2d6a4f,stroke:#1b4332,color:#fff
    style SuggestionEngine fill:#2d6a4f,stroke:#1b4332,color:#fff
    style IntelDB fill:#264653,stroke:#2a9d8f,color:#fff
    style formatIntelContext fill:#e76f51,stroke:#f4a261,color:#fff
    style SkillEvolver fill:#e76f51,stroke:#f4a261,color:#fff
    style EvalRunner fill:#e76f51,stroke:#f4a261,color:#fff
    style OutcomeBackfiller fill:#e76f51,stroke:#f4a261,color:#fff

Storage

All intelligence data is stored in a single SQLite database at ~/.fleet/intel/fleet-intel.db, managed by the IntelDatabase class:

Engine: better-sqlite3 (synchronous, in-process)
Journal mode: WAL (concurrent reads during writes)
Schema: Foreign keys enabled, auto-migration on startup
Access pattern: All queries use prepared statements (40+ registered in prepareStatements())
Legacy migration: On first run, existing JSON files from ~/.fleet/intel/*.json are migrated automatically via migrateJsonToSqlite()

Data Model

All types are defined in src/intel/types.ts.

SessionRecord

One record per CLI session. Captures the full picture of what happened:

Field	Type	Description
`version`	`1`	Schema version
`sessionId`	`string`	Unique session identifier
`startedAt`	`number`	Session start timestamp (epoch ms)
`endedAt`	`number?`	Session end timestamp
`model`	`string`	Primary model used
`cwd`	`string`	Working directory basename (PII: no full paths)
`activeCrew`	`string?`	Active crew name, if any
`totalTokens`	`{ input: number; output: number }`	Aggregate token consumption
`taskCount`	`number`	Number of tasks created
`taskCompletedCount`	`number`	Tasks that completed successfully
`taskFailedCount`	`number`	Tasks that failed
`agentRuns`	`AgentRunRecord[]`	All agent executions in the session
`errors`	`ErrorRecord[]`	All errors encountered
`skillUsage`	`SkillUsageRecord[]?`	Crew/skill activations

AgentRunRecord

Per-agent-run telemetry with outcome tracking:

Field	Type	Description
`agentId`	`string`	Worker identifier
`agentType`	`string`	`explorer`, `coder`, `reviewer`, `tester`, `general-purpose`
`taskId`	`string?`	Task identifier
`startedAt`	`number`	Run start timestamp (epoch ms)
`endedAt`	`number?`	Run end timestamp
`durationMs`	`number`	Run duration in milliseconds
`status`	`string`	`completed`, `failed`
`tokens`	`{ input: number; output: number }`	Tokens consumed by this agent
`toolUseCount`	`number`	Number of tool invocations
`topTools`	`string[]`	Top 3 tools by invocation count
`errorSummary`	`string?`	Redacted error summary (max 200 chars)
`worktreePath`	`string?`	Git worktree path (if applicable)
`model`	`string?`	Model used for this agent
`taskSubject`	`string?`	What the agent was working on (PII-redacted)
`commitSha`	`string?`	Git commit SHA (for outcome correlation)
`branchName`	`string?`	Git branch name (for outcome correlation)
`skillName`	`string?`	Skill name used for this run
`crewName`	`string?`	Active crew during this run
`outcomeMerged`	`boolean?`	Whether changes were merged to main
`outcomeMergedAt`	`number?`	When the merge happened (epoch ms)
`outcomeRevertedWithinDays`	`number?`	Days until revert (undefined = not reverted)
`outcomeDodAllPassed`	`boolean?`	Whether all DoD items passed
`outcomeAttachedAt`	`number?`	When outcome was attached (epoch ms)

ErrorRecord

Classified error tracking:

Field	Type	Description
`timestamp`	`number`	When the error occurred
`agentId`	`string`	Which agent encountered the error
`agentType`	`string`	Agent type
`errorType`	`ErrorType`	Error classification
`message`	`string`	Redacted error message (max 200 chars)

Error types: rate_limit, tool_failure, timeout, permission, model_error, unknown
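
The error taxonomy above maps naturally onto a TypeScript union. The following is a minimal sketch, not the project's classifier: the `classifyError` and `toErrorRecord` helpers and their keyword heuristics are assumptions made for illustration. Only the six type names, the field names, and the 200-character cap come from the table above, and PII redaction is omitted here.

```typescript
// Error categories listed in the ErrorRecord section.
type ErrorType =
  | "rate_limit" | "tool_failure" | "timeout"
  | "permission" | "model_error" | "unknown";

// Mirrors the ErrorRecord table; not copied from src/intel/types.ts.
interface ErrorRecord {
  timestamp: number;   // epoch ms
  agentId: string;
  agentType: string;
  errorType: ErrorType;
  message: string;     // max 200 chars
}

// Hypothetical keyword heuristics; the real classifier may differ.
function classifyError(message: string): ErrorType {
  const m = message.toLowerCase();
  if (m.indexOf("rate limit") !== -1 || m.indexOf("429") !== -1) return "rate_limit";
  if (m.indexOf("timed out") !== -1 || m.indexOf("timeout") !== -1) return "timeout";
  if (m.indexOf("permission") !== -1 || m.indexOf("eacces") !== -1) return "permission";
  if (m.indexOf("tool") !== -1) return "tool_failure";
  if (m.indexOf("model") !== -1) return "model_error";
  return "unknown";
}

// Builds a record and truncates the message to the documented 200-char cap.
function toErrorRecord(agentId: string, agentType: string, message: string, now: number): ErrorRecord {
  return {
    timestamp: now,
    agentId,
    agentType,
    errorType: classifyError(message),
    message: message.slice(0, 200),
  };
}
```

Anything the heuristics do not recognize falls through to `unknown`, which matches the catch-all category in the list above.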

SkillUsageRecord

Tracks crew and skill activations:

Field	Type	Description
`skillName`	`string`	Name of the skill
`crewName`	`string?`	Crew name (if activated via crew)
`activatedAt`	`number`	Activation timestamp

Suggestion

Generated by the SuggestionEngine:

Field	Type	Description
`id`	`string`	Unique suggestion identifier
`type`	`SuggestionType`	`classifier`, `decomposition`, `resource`
`title`	`string`	Human-readable summary
`description`	`string`	Detailed explanation
`evidence`	`string`	Statistical backing data
`confidence`	`number`	0–100 confidence score
`createdAt`	`number`	When the suggestion was generated
`applied`	`boolean`	Whether the user has applied this suggestion
`appliedAt`	`number?`	When it was applied
`dismissed`	`boolean?`	Whether the user has dismissed this suggestion

ShadowRecord

Tracks shadow evolution candidates (see Shadow Evolution):

Field	Type	Description
`id`	`string`	Unique shadow identifier
`skillName`	`string`	Target skill
`proposedVersion`	`string`	Proposed new version
`currentVersion`	`string`	Current live version
`patch`	`string`	Text to append to skill prompt
`channel`	`EvolutionChannel`	`steering`, `error_pattern`, `success_replication`, `manual`
`confidence`	`number`	0–100 confidence score
`evidence`	`string[]`	Reasons for this proposal
`createdAt`	`number`	Creation timestamp
`promotedAt`	`number?`	When promoted to live
`rejectedAt`	`number?`	When rejected
`rejectionReason`	`string?`	Why it was rejected
`shadowRuns`	`number`	Total A/B evaluation runs
`shadowWins`	`number`	Runs where shadow outperformed current
`shadowLosses`	`number`	Runs where current outperformed shadow
`shadowTies`	`number`	Inconclusive runs
`evalScore`	`number?`	Aggregate eval score
`evalRuns`	`number`	Number of eval runs completed

SteeringInsight

Auto-extracted user preferences from conversation history:

Field	Type	Description
`id`	`string`	Unique identifier
`sessionId`	`string`	Source session
`category`	`SteeringCategory`	`preference`, `prohibition`, `correction`, `convention`, `tool_directive`
`rawMessage`	`string`	PII-redacted original user message
`extractedInsight`	`string`	The actionable learning
`confidence`	`number`	0–100 confidence score
`createdAt`	`number`	Epoch ms
`persisted`	`boolean`	Whether written to memory.md
`insightHash`	`string`	SHA-256 hash for dedup

Phase 1: Observe

The IntelCollector subscribes to FleetStateStore events and captures telemetry in real time.

What It Captures

Agent spawned — type, model, task subject, skill/crew context
Agent completed/failed — duration, tokens (input + output), status, error info, commit SHA
Token usage — per-agent and per-model consumption (split by input/output)
Errors — classified by type with redacted messages
Skill activations — crew/skill name, activation time

PII Redaction

All data passes through the PiiRedactor before storage:

API keys and tokens → [REDACTED_KEY]
File paths → normalized (home directory → ~)
Email addresses → [REDACTED_EMAIL]
URLs with credentials → credentials stripped
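
The four rules above can be sketched as a chain of string replacements. This is an illustration only: the `redactPii` function, its regular expressions, and the key prefixes it recognizes are assumptions, and the real PiiRedactor may use different patterns.

```typescript
// Hypothetical patterns; a production redactor would need a broader set.
const EMAIL = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
const URL_CREDENTIALS = /(\b[a-z][a-z0-9+.-]*:\/\/)[^\s\/@:]+:[^\s\/@]+@/gi;
const API_KEY = /\b(?:sk|ghp|pk)[-_][A-Za-z0-9_-]{16,}\b/g;

function redactPii(text: string, homeDir: string): string {
  const scrubbed = text
    .replace(URL_CREDENTIALS, "$1")      // strip user:pass@ first so it is not read as an email
    .replace(API_KEY, "[REDACTED_KEY]")
    .replace(EMAIL, "[REDACTED_EMAIL]");
  // Normalize the home directory to ~ (skipped when homeDir is empty).
  return homeDir ? scrubbed.split(homeDir).join("~") : scrubbed;
}
```

Order matters in this sketch: URL credentials are stripped before the email pattern runs, otherwise `user:secret@example.com` would be treated as an address.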

Storage

Periodic flush: every 30 seconds during active sessions
Finalize: full flush on session end
Database: SQLite at ~/.fleet/intel/fleet-intel.db
Pruning: automatic at startup — 90 days max age
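
As a rough illustration of the 90-day pruning rule, the sketch below filters an in-memory list of sessions. It assumes that age is measured from `startedAt`; the `pruneOldSessions` helper is hypothetical, and the real pruning presumably runs as a query against the SQLite database.

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Assumption: a session's age is measured from its startedAt timestamp (epoch ms).
function pruneOldSessions<T extends { startedAt: number }>(
  sessions: T[],
  now: number,
  maxAgeDays = 90,
): T[] {
  const cutoff = now - maxAgeDays * DAY_MS;
  // Keep only sessions that started at or after the cutoff.
  return sessions.filter((s) => s.startedAt >= cutoff);
}
```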

Example Session Record

json

{
  "version": 1,
  "sessionId": "sess_abc123",
  "startedAt": 1714400000000,
  "endedAt": 1714403600000,
  "model": "claude-opus-4.6",
  "cwd": "my-project",
  "totalTokens": { "input": 180000, "output": 65000 },
  "taskCount": 8,
  "taskCompletedCount": 7,
  "taskFailedCount": 1,
  "agentRuns": [
    {
      "agentId": "worker-1",
      "agentType": "coder",
      "startedAt": 1714400100000,
      "endedAt": 1714400145000,
      "durationMs": 45000,
      "status": "completed",
      "tokens": { "input": 24000, "output": 8000 },
      "toolUseCount": 12,
      "topTools": ["edit", "view", "powershell"],
      "model": "claude-sonnet-4.5",
      "taskSubject": "Implement auth middleware",
      "worktreePath": ".worktrees/worker-1",
      "commitSha": "a1b2c3d",
      "branchName": "worktree/worker-1"
    }
  ],
  "errors": [],
  "skillUsage": []
}

Phase 2: Analyze

The analyze phase provides commands to query and explore your fleet's historical data.

Commands

`/fleet-intel summary`

High-level overview across all recorded sessions:

📊 Fleet Intelligence Summary
─────────────────────────────
Sessions:     47
Success Rate: 89.4%
Total Tokens: 12,450,000
Avg Duration: 42m 15s
Top Errors:   rate_limit (12), tool_failure (8), timeout (3)

`/fleet-intel agents`

Per-agent-type statistics:

🤖 Agent Type Stats
───────────────────
Type         Runs  Failures  Avg Duration  Tokens
coder         134        8       3m 20s    2,100k
explorer       89        2       1m 45s      890k
reviewer       67        1       2m 10s      670k
tester         45        5       4m 30s      450k
general        23        3       5m 15s      340k

`/fleet-intel failures`

Top error types ranked by frequency:

❌ Failure Analysis
───────────────────
Type              Count  % of Total
rate_limit           12      40.0%
tool_failure          8      26.7%
timeout               3      10.0%
permission            2       6.7%
model_error           1       3.3%
unknown               4      13.3%

`/fleet-intel tokens`

Token usage sorted by agent type:

📈 Token Usage by Agent Type
────────────────────────────
Type         Total Tokens  Avg/Run   % of Total
coder          2,100,000    15,672      47.2%
explorer         890,000    10,000      20.0%
reviewer         670,000    10,000      15.1%
tester           450,000    10,000      10.1%
general          340,000    14,783       7.6%

`/fleet-intel stats`

Top 5 sessions by token usage plus usage trends:

📊 Session Stats
────────────────
Top 5 Sessions by Token Usage:
  1. sess_abc123  —  245,000 tokens  (42m, 8 tasks)
  2. sess_def456  —  198,000 tokens  (35m, 6 tasks)
  ...

Token Usage Trends:
  Recent (7d):   850,000 tokens across 12 sessions
  Previous (7d): 1,200,000 tokens across 15 sessions
  Change:        -29.2% ↓
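
The Change figure in the trends output is the relative difference between the two 7-day windows: (850,000 − 1,200,000) / 1,200,000 ≈ −29.2%. A minimal sketch of that arithmetic follows. The `formatChange` helper, the `+` prefix, and the `↑` and `→` arrows are assumptions; only the `↓` form appears in the sample output.

```typescript
// Formats the relative change between two token totals.
function formatChange(recent: number, previous: number): string {
  if (previous === 0) return "n/a";                 // avoid division by zero
  const pct = ((recent - previous) / previous) * 100;
  const arrow = pct < 0 ? "↓" : pct > 0 ? "↑" : "→";
  const sign = pct > 0 ? "+" : "";                  // negative values already carry "-"
  return `${sign}${pct.toFixed(1)}% ${arrow}`;
}

console.log(formatChange(850000, 1200000)); // prints "-29.2% ↓"
```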

`/fleet-intel search <query>`

Substring search across sessions — matches against errors, agent types, models, and task subjects:

🔍 Search: "rate_limit"
──────────────────────
Found 12 matches across 8 sessions:
  sess_abc123: 3 rate_limit errors (claude-opus-4.6)
  sess_def456: 2 rate_limit errors (claude-opus-4.6)
  ...

`/fleet-intel skills`

Skill and crew usage summary:

🎯 Skill Usage
──────────────
Skill              Uses  Last Used
code-review          15  2h ago
init-investigation    8  1d ago
feature-planning      6  3d ago
research              4  5d ago

Phase 3: Suggest

The SuggestionEngine runs three statistical analyzers against your session history to generate actionable suggestions.

Analyzers

Classifier Analyzer

Flags agent types with a success rate below 60% (minimum 10 runs required for statistical significance):
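
The rule can be sketched as a simple aggregation over agent runs. The `flagLowSuccess` function and its types are hypothetical; only the 60% threshold and the 10-run minimum come from the description above.

```typescript
interface RunSummary { agentType: string; status: "completed" | "failed"; }
interface LowSuccessFlag { agentType: string; runs: number; successRate: number; }

function flagLowSuccess(runs: RunSummary[], minRuns = 10, threshold = 0.6): LowSuccessFlag[] {
  // Count total and successful runs per agent type.
  const totals = new Map<string, { runs: number; ok: number }>();
  for (const r of runs) {
    const t = totals.get(r.agentType) ?? { runs: 0, ok: 0 };
    t.runs += 1;
    if (r.status === "completed") t.ok += 1;
    totals.set(r.agentType, t);
  }
  // Flag types with enough runs and a success rate below the threshold.
  const flags: LowSuccessFlag[] = [];
  totals.forEach((t, agentType) => {
    const successRate = t.ok / t.runs;
    if (t.runs >= minRuns && successRate < threshold) {
      flags.push({ agentType, runs: t.runs, successRate });
    }
  });
  return flags;
}
```

A flagged agent type would then be turned into a suggestion message such as this one: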

"Your tester agents have a 45% success rate across 22 runs. Consider breaking test tasks into smaller scopes or switching to a more capable model."

Decomposition Analyzer

Correlates the number of workers spawned with task completion rates:
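
One way to picture this correlation is to bucket sessions by worker count and compare completion rates per bucket. The sketch below is an assumption-heavy illustration: the bucket boundaries, the `workerCount` field (which would have to be derived from the session's agent runs), and both helper functions are hypothetical.

```typescript
interface SessionStats { workerCount: number; taskCount: number; taskCompletedCount: number; }

// Hypothetical bucket boundaries chosen for illustration.
function bucketOf(workers: number): string {
  if (workers <= 3) return "1-3";
  if (workers <= 7) return "4-7";
  return "8+";
}

// Returns the task completion rate (0..1) per worker-count bucket.
function completionByBucket(sessions: SessionStats[]): Map<string, number> {
  const sums = new Map<string, { done: number; total: number }>();
  for (const s of sessions) {
    const key = bucketOf(s.workerCount);
    const acc = sums.get(key) ?? { done: 0, total: 0 };
    acc.done += s.taskCompletedCount;
    acc.total += s.taskCount;
    sums.set(key, acc);
  }
  const rates = new Map<string, number>();
  sums.forEach((acc, key) => {
    if (acc.total > 0) rates.set(key, acc.done / acc.total);
  });
  return rates;
}
```

A large gap between buckets is what would produce a suggestion like the following: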

"Sessions with 4–6 workers have 92% task completion vs 71% for sessions with 8+ workers. Consider limiting parallelism for complex tasks."

Resource Analyzer

Identifies cases where expensive models achieve similar success rates to cheaper alternatives:

"claude-opus-4.6 and claude-sonnet-4.5 have similar success rates for explorer tasks (94% vs 91%), but Opus uses 2.3x more tokens. Consider using Sonnet for exploration."

Context Injection

The top 3 pending suggestions (by confidence) are automatically included in the coordinator's prompt within a <fleet-intelligence> section. This gives the coordinator awareness of patterns without requiring user action.
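
The selection step can be sketched as filter, sort, and slice. This is not the actual `formatIntelContext()` implementation: the `topPending` and `renderIntelSection` helpers, the bullet layout, and the closing tag are assumptions, as is the reading of "pending" as neither applied nor dismissed.

```typescript
interface PendingSuggestion {
  title: string;
  confidence: number;   // 0-100
  applied: boolean;
  dismissed?: boolean;
}

// Assumption: "pending" means neither applied nor dismissed.
function topPending(suggestions: PendingSuggestion[], limit = 3): PendingSuggestion[] {
  return suggestions
    .filter((s) => !s.applied && !s.dismissed)   // filter() copies, so sort() leaves the input intact
    .sort((a, b) => b.confidence - a.confidence)
    .slice(0, limit);
}

// Hypothetical rendering of the <fleet-intelligence> prompt section.
function renderIntelSection(suggestions: PendingSuggestion[]): string {
  const lines = topPending(suggestions).map((s) => `- ${s.title} (confidence ${s.confidence})`);
  return ["<fleet-intelligence>", ...lines, "</fleet-intelligence>"].join("\n");
}
```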

Shadow Evolution

Shadow evolution lets you safely test skill prompt improvements before committing them. Instead of directly modifying skill files, changes are proposed as shadow records that accumulate A/B evaluation data.

Pipeline

SkillEvolver → Shadow Proposal → EvalRunner A/B → EvalJudge → Human Decision

SkillEvolver analyzes telemetry for a crew's skills and proposes prompt improvements via four channels:
- steering — insights extracted from user corrections/preferences
- error_pattern — recurring error patterns suggest prompt additions
- success_replication — high-performing runs inform what works
- manual — user-initiated changes
Shadow proposals are stored in IntelDatabase with the proposed patch, confidence score, and evidence.
EvalRunner runs A/B comparisons: the current skill prompt and the shadow prompt are run on the same task. For coder agents, each version gets its own isolated worktree. Both outputs pass through VerifierNode (deterministic commands). If verification alone is inconclusive, EvalJudge spawns an LLM judge agent.
Results accumulate as shadowWins, shadowLosses, and shadowTies. Once enough data exists, the user can promote or reject the shadow.

Commands

/learn evolve <crew>       # Generate shadow proposals from telemetry
/learn shadows             # List active shadows with win/loss stats
/learn eval <skill>        # Run A/B evaluation (uses SelectMenu picker if no arg)
/learn promote <skill>     # Promote shadow to live (overwrites skill file)
/learn reject <skill> [reason]  # Reject and discard shadow (reason is optional)

Configuration

json

{
  "evolution": {
    "enabled": true,
    "autoEvolve": false,
    "minSampleSize": 5,
    "confidenceThreshold": 75,
    "rollbackAfterSessions": 5
  },
  "shadow": {
    "minShadowRuns": 5,
    "winRateThreshold": 0.6
  }
}
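
To show how the `shadow` settings might gate a promotion, here is a hedged sketch of a readiness check. The `isPromotable` helper is hypothetical, and the win-rate formula (wins divided by total runs, so ties count against the shadow) is an assumption; the source does not define it. Promotion itself still requires explicit approval via `/learn promote`.

```typescript
interface ShadowStats { shadowRuns: number; shadowWins: number; shadowLosses: number; shadowTies: number; }
interface ShadowConfig { minShadowRuns: number; winRateThreshold: number; }

// Defaults taken from the configuration shown above.
const DEFAULT_SHADOW_CONFIG: ShadowConfig = { minShadowRuns: 5, winRateThreshold: 0.6 };

// Assumption: win rate = wins / total runs.
function isPromotable(s: ShadowStats, cfg: ShadowConfig = DEFAULT_SHADOW_CONFIG): boolean {
  if (s.shadowRuns < cfg.minShadowRuns) return false;   // not enough A/B data yet
  return s.shadowWins / s.shadowRuns >= cfg.winRateThreshold;
}
```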

Outcome Backfill

The OutcomeBackfiller retroactively checks whether agent-produced changes stuck by querying git history:

Was the commit merged to main? → outcomeMerged
Was the commit reverted? → outcomeRevertedWithinDays
Did all DoD items pass? → outcomeDodAllPassed

This data feeds effectiveStatus: a run that "completed" but was later reverted is downgraded. The backfiller runs periodically with a configurable grace period (default 24h) and revert window (default 7 days).

Outcome data feeds back into the dashboard's gatePassedButReverted metric, helping identify skills that pass tests but produce low-quality changes.
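
The downgrade rule can be sketched as follows. The `"reverted"` label, the `effectiveStatus` function signature, and the counting helper are assumptions for illustration; the source names the concept and the `gatePassedButReverted` metric but does not specify how either is computed.

```typescript
type EffectiveStatus = "completed" | "failed" | "reverted";

interface RunOutcome {
  status: "completed" | "failed";
  outcomeMerged?: boolean;
  outcomeRevertedWithinDays?: number;   // undefined = not reverted
}

// A completed run that was reverted inside the window is downgraded.
function effectiveStatus(run: RunOutcome, revertWindowDays = 7): EffectiveStatus {
  if (run.status === "failed") return "failed";
  const days = run.outcomeRevertedWithinDays;
  if (days !== undefined && days <= revertWindowDays) return "reverted";
  return "completed";
}

// Hypothetical way to derive a gatePassedButReverted-style count.
function countGatePassedButReverted(runs: RunOutcome[]): number {
  return runs.filter((r) => effectiveStatus(r) === "reverted").length;
}
```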

Eval Framework

VerifierNode

Deterministic, non-LLM command executor. Runs whitelisted shell commands (npm, npx, vitest, jest, tsc, eslint, etc.) and returns structured results:

interface VerifierResult {
  command: string;
  args: string[];
  exitCode: number;
  stdout: string;    // truncated to 10KB
  stderr: string;
  durationMs: number;
  passed: boolean;   // exitCode === 0
}

Default timeout: 2 minutes. Output capped at 10KB.
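
The whitelist check and output cap can be illustrated without spawning a process. In this sketch the `isAllowed` and `buildResult` helpers are hypothetical, the whitelist contains only the commands named above (the source's list ends in "etc."), and the cap is applied per character to both streams, which is an approximation of the documented 10KB limit.

```typescript
interface VerifierResult {
  command: string;
  args: string[];
  exitCode: number;
  stdout: string;
  stderr: string;
  durationMs: number;
  passed: boolean;
}

// Only the commands named in the text; the real whitelist is longer.
const ALLOWED_COMMANDS = ["npm", "npx", "vitest", "jest", "tsc", "eslint"];
const MAX_OUTPUT_CHARS = 10 * 1024;   // assumption: characters stand in for bytes

function isAllowed(command: string): boolean {
  return ALLOWED_COMMANDS.indexOf(command) !== -1;
}

// Turns raw process output into a structured, size-capped result.
function buildResult(
  command: string, args: string[], exitCode: number,
  stdout: string, stderr: string, durationMs: number,
): VerifierResult {
  if (!isAllowed(command)) throw new Error(`command not whitelisted: ${command}`);
  return {
    command,
    args,
    exitCode,
    stdout: stdout.slice(0, MAX_OUTPUT_CHARS),
    stderr: stderr.slice(0, MAX_OUTPUT_CHARS),
    durationMs,
    passed: exitCode === 0,
  };
}
```

Keeping the pass/fail decision tied to the exit code is what makes the verifier deterministic: the same command output always yields the same verdict.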

EvalRunner

Orchestrates A/B comparisons between current and shadow skill prompts:

Both versions run the same task (coder agents get isolated worktrees)
Both outputs pass through VerifierNode
If verification is inconclusive, EvalJudge provides LLM-based comparison

Results: shadow_wins, current_wins, tie, both_fail, error.
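
One plausible mapping from verification results to these outcomes is sketched below. It rests on several assumptions: that "inconclusive" means both versions passed verification, that the judge's A is the current prompt and B is the shadow, and that the judge can be modeled as a synchronous callback (the real EvalJudge spawns an agent). The `decideOutcome` function is hypothetical.

```typescript
type EvalOutcome = "shadow_wins" | "current_wins" | "tie" | "both_fail" | "error";
type JudgeVerdict = "a_wins" | "b_wins" | "tie";

// Assumption: A = current prompt, B = shadow prompt.
function decideOutcome(
  currentPassed: boolean,
  shadowPassed: boolean,
  judge: () => JudgeVerdict,
): EvalOutcome {
  if (!currentPassed && !shadowPassed) return "both_fail";
  if (shadowPassed && !currentPassed) return "shadow_wins";
  if (currentPassed && !shadowPassed) return "current_wins";
  // Both passed: verification is inconclusive, so defer to the judge.
  try {
    const verdict = judge();
    if (verdict === "b_wins") return "shadow_wins";
    if (verdict === "a_wins") return "current_wins";
    return "tie";
  } catch {
    return "error";   // a judge failure is reported rather than guessed
  }
}
```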

EvalJudge

Spawns an impartial LLM judge agent that compares two outputs on correctness, quality, and completeness. Returns a_wins, b_wins, or tie with a one-sentence reason.

EvalScheduler

Periodic background evaluator. Runs every 30 minutes (configurable), iterates over all active shadows, and runs eval passes automatically. Integrates with the LoopScheduler lifecycle.

Steering Extraction

The SteeringExtractor uses LLM analysis at session end to discover persistent user preferences from conversation history.

Categories

Category	Example
`preference`	"Always use pnpm, not npm"
`prohibition`	"Never modify the migrations directory"
`correction`	"The API uses v2 endpoints, not v1"
`convention`	"Use kebab-case for file names"
`tool_directive`	"Run tests with --runInBand"

Pipeline

Last N messages (default 50) are sent to an LLM for analysis
Insights are extracted with category and confidence score
Duplicates are detected via SHA-256 hash of normalized text
Insights above the memory threshold (default 70%) are auto-appended to .fleet/context/memory.md
All insights are stored in IntelDatabase for querying
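
The dedup and threshold steps can be sketched with Node's built-in `crypto` module. The normalization rule (trim, lowercase, collapse whitespace), the strict "above" comparison, and both helper functions are assumptions; the source only states that a SHA-256 hash of normalized text is used and that the default threshold is 70.

```typescript
import { createHash } from "node:crypto";

// Assumed normalization: trim, lowercase, collapse runs of whitespace.
function insightHash(text: string): string {
  const normalized = text.trim().toLowerCase().replace(/\s+/g, " ");
  return createHash("sha256").update(normalized, "utf8").digest("hex");
}

interface Insight { extractedInsight: string; confidence: number; }

// Returns new (non-duplicate) insights whose confidence is above the threshold.
// `seen` holds hashes of insights that were already processed.
function selectForMemory(insights: Insight[], seen: Set<string>, memoryThreshold = 70): Insight[] {
  const fresh: Insight[] = [];
  for (const insight of insights) {
    const hash = insightHash(insight.extractedInsight);
    if (seen.has(hash)) continue;        // duplicate of an earlier insight
    seen.add(hash);
    if (insight.confidence > memoryThreshold) fresh.push(insight);
  }
  return fresh;
}
```

Hashing the normalized text rather than the raw message means that differences in case or spacing do not produce duplicate memory entries.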

Configuration

json

{
  "steering": {
    "enabled": true,
    "timeoutMs": 60000,
    "maxInsights": 10,
    "maxMessages": 50,
    "memoryThreshold": 70
  }
}

Phase 4: Assist

The assist phase provides active learning and memory features.

Auto-Memory

When a session completes with ≥3 tasks finished, a session summary is automatically appended to .fleet/context/memory.md:

markdown

## Session 2026-04-29 — Auth Module Refactor
- Completed 7/8 tasks in 42 minutes
- Key learnings: JWT middleware needed custom error handler
- Model: claude-opus-4.6, Tokens: 245,000

This memory file is auto-loaded into the coordinator prompt via ContextStore, giving future sessions access to past learnings.

Skill Memories (crew-scoped)

The SkillMemoryStore (src/intel/SkillMemoryStore.ts) persists short textual memories keyed by skill name and optionally scoped to a crew. When a crew is active, new memories are stored under that crew's name, and queries return both crew-matched rows and legacy unscoped rows (crew_name IS NULL), so pre-existing memories remain visible. When no crew is active, only unscoped rows are returned, which keeps the behavior backward compatible. Deduplication is per (skill, crew, content). Skill memories live in the intel DB at $FLEET_HOME/intel/fleet-intel.db, so they can be isolated per project via FLEET_HOME.
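
The visibility and dedup rules can be modeled with an in-memory array standing in for the database table. This is a sketch under that assumption: `visibleMemories` and `addMemory` are hypothetical helpers, and `crewName: null` plays the role of `crew_name IS NULL`.

```typescript
interface SkillMemory { skillName: string; crewName: string | null; content: string; }

// With an active crew: crew-matched rows plus unscoped rows.
// Without one: unscoped rows only.
function visibleMemories(rows: SkillMemory[], skillName: string, activeCrew: string | null): SkillMemory[] {
  return rows.filter((r) =>
    r.skillName === skillName &&
    (r.crewName === null || (activeCrew !== null && r.crewName === activeCrew)));
}

// Adds a memory unless an identical (skill, crew, content) row exists.
// Returns true when the row was inserted.
function addMemory(rows: SkillMemory[], memory: SkillMemory): boolean {
  const duplicate = rows.some((r) =>
    r.skillName === memory.skillName &&
    r.crewName === memory.crewName &&
    r.content === memory.content);
  if (!duplicate) rows.push(memory);
  return !duplicate;
}
```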

`/fleet-intel remember <note>`

Manually add notes to project memory:

/fleet-intel remember "This project uses pnpm, not npm — always use pnpm install"

`/learn` Command

Unified learning dashboard combining suggestions, memory, evolution, and stats:

Sub-command	Description
`/learn overview`	Session insights + pending suggestions (default)
`/learn steering`	View extracted steering insights
`/learn extract`	Run steering extraction on current conversation
`/learn evolve <crew>`	Propose skill prompt improvements from telemetry
`/learn dashboard <crew>`	Evolution metrics with before/after comparison
`/learn shadows`	List active shadow proposals with A/B stats
`/learn eval <skill>`	Run A/B evaluation of current vs shadow prompt
`/learn promote <skill>`	Promote a shadow to live
`/learn reject <skill> [reason]`	Reject and discard a shadow
`/learn apply <id>`	Apply a suggestion
`/learn dismiss <id>`	Dismiss a suggestion
`/learn clear-memory`	Reset the memory file

LoopScheduler Integration

The LoopScheduler triggers periodic memory consolidation prompts every 30 minutes during active sessions. This prompts the coordinator to reflect on what's been learned and update memory accordingly.

Skill Versioning

When a skill file is overwritten:

A backup of the previous version is saved automatically
The version number is auto-incremented
Rollback is possible by restoring the backup

`/skill create`

Generate skill templates:

bash

/skill create auth-checker --type reviewer --desc "Validates authentication patterns"

Creates a structured skill file with the specified type and description.

Configuration

Fleet Intelligence is always-on — no configuration needed to get started.

Storage Locations

Path	Purpose
`~/.fleet/intel/fleet-intel.db`	SQLite database (sessions, runs, suggestions, shadows, steering)
`.fleet/context/memory.md`	Auto-memory (per-project, loaded into coordinator prompt)

Memory Integration

The .fleet/context/memory.md file is automatically loaded into the coordinator prompt by the ContextStore system (same mechanism used by /init context). No manual configuration required.
