Fleet Intelligence

Fleet Intelligence is the self-improvement infrastructure for Agents Fleet. It continuously learns from your sessions, identifies patterns, generates data-driven suggestions, and evolves skill prompts over time.

Overview

Fleet Intelligence operates as a multi-phase pipeline:

Observe → Analyze → Suggest → Evolve → Verify → Assist
  1. Observe — Captures session telemetry (agent runs, token usage, errors, skill activations)
  2. Analyze — Aggregates data across sessions to surface patterns and trends
  3. Suggest — Generates actionable suggestions based on statistical analysis
  4. Evolve — Shadow evolution proposes, evaluates, and promotes skill prompt improvements
  5. Verify — Deterministic verification and outcome backfill validate changes
  6. Assist — Provides auto-memory, steering extraction, learning dashboards, and prompt-level guidance

Key principles:

  • Local-only — All data stays on your machine at ~/.fleet/intel/. Nothing is sent externally.
  • Human-in-the-loop — Suggestions and shadow proposals require explicit approval before being applied.
  • Always-on — No configuration needed. Intel collection starts automatically with every session.

Architecture

mermaid
graph TD
    User -->|input| REPL
    REPL --> CoordinatorEngine
    CoordinatorEngine --> SDKSession["SDK Session"]
    SDKSession --> EventListeners
    EventListeners --> FleetStateStore["FleetStateStore (events)"]

    FleetStateStore --> UIDisplay["UI/Display"]
    FleetStateStore --> IntelCollector

    IntelCollector --> IntelDB["IntelDatabase<br/>(~/.fleet/intel/fleet-intel.db)"]
    IntelDB --> SuggestionEngine
    IntelDB --> OutcomeBackfiller
    IntelDB --> SkillEvolver
    IntelDB --> EvalRunner
    SuggestionEngine --> IntelDB
    SkillEvolver -->|shadow proposals| IntelDB
    EvalRunner -->|eval results| IntelDB
    OutcomeBackfiller -->|merge/revert data| IntelDB
    IntelDB --> formatIntelContext["formatIntelContext()"]
    formatIntelContext --> CoordinatorPrompt["Coordinator Prompt<br/>&lt;fleet-intelligence&gt; section"]

    style IntelCollector fill:#2d6a4f,stroke:#1b4332,color:#fff
    style SuggestionEngine fill:#2d6a4f,stroke:#1b4332,color:#fff
    style IntelDB fill:#264653,stroke:#2a9d8f,color:#fff
    style formatIntelContext fill:#e76f51,stroke:#f4a261,color:#fff
    style SkillEvolver fill:#e76f51,stroke:#f4a261,color:#fff
    style EvalRunner fill:#e76f51,stroke:#f4a261,color:#fff
    style OutcomeBackfiller fill:#e76f51,stroke:#f4a261,color:#fff

Storage

All intelligence data is stored in a single SQLite database at ~/.fleet/intel/fleet-intel.db, managed by the IntelDatabase class:

  • Engine: better-sqlite3 (synchronous, in-process)
  • Journal mode: WAL (concurrent reads during writes)
  • Schema: Foreign keys enabled, auto-migration on startup
  • Access pattern: All queries use prepared statements (40+ registered in prepareStatements())
  • Legacy migration: On first run, existing JSON files from ~/.fleet/intel/*.json are migrated automatically via migrateJsonToSqlite()

Data Model

All types are defined in src/intel/types.ts.

SessionRecord

One record per CLI session. Captures the full picture of what happened:

| Field | Type | Description |
| --- | --- | --- |
| version | 1 | Schema version |
| sessionId | string | Unique session identifier |
| startedAt | number | Session start timestamp (epoch ms) |
| endedAt | number? | Session end timestamp |
| model | string | Primary model used |
| cwd | string | Working directory basename (PII: no full paths) |
| activeCrew | string? | Active crew name, if any |
| totalTokens | { input: number; output: number } | Aggregate token consumption |
| taskCount | number | Number of tasks created |
| taskCompletedCount | number | Tasks that completed successfully |
| taskFailedCount | number | Tasks that failed |
| agentRuns | AgentRunRecord[] | All agent executions in the session |
| errors | ErrorRecord[] | All errors encountered |
| skillUsage | SkillUsageRecord[]? | Crew/skill activations |

AgentRunRecord

Per-agent-run telemetry with outcome tracking:

| Field | Type | Description |
| --- | --- | --- |
| agentId | string | Worker identifier |
| agentType | string | explorer, coder, reviewer, tester, general-purpose |
| taskId | string? | Task identifier |
| startedAt | number | Run start timestamp (epoch ms) |
| endedAt | number? | Run end timestamp |
| durationMs | number | Run duration in milliseconds |
| status | string | completed, failed |
| tokens | { input: number; output: number } | Tokens consumed by this agent |
| toolUseCount | number | Number of tool invocations |
| topTools | string[] | Top 3 tools by invocation count |
| errorSummary | string? | Redacted error summary (max 200 chars) |
| worktreePath | string? | Git worktree path (if applicable) |
| model | string? | Model used for this agent |
| taskSubject | string? | What the agent was working on (PII-redacted) |
| commitSha | string? | Git commit SHA (for outcome correlation) |
| branchName | string? | Git branch name (for outcome correlation) |
| skillName | string? | Skill name used for this run |
| crewName | string? | Active crew during this run |
| outcomeMerged | boolean? | Whether changes were merged to main |
| outcomeMergedAt | number? | When the merge happened (epoch ms) |
| outcomeRevertedWithinDays | number? | Days until revert (undefined = not reverted) |
| outcomeDodAllPassed | boolean? | Whether all DoD items passed |
| outcomeAttachedAt | number? | When outcome was attached (epoch ms) |

ErrorRecord

Classified error tracking:

| Field | Type | Description |
| --- | --- | --- |
| timestamp | number | When the error occurred |
| agentId | string | Which agent encountered the error |
| agentType | string | Agent type |
| errorType | ErrorType | Error classification |
| message | string | Redacted error message (max 200 chars) |

Error types: rate_limit, tool_failure, timeout, permission, model_error, unknown

SkillUsageRecord

Tracks crew and skill activations:

| Field | Type | Description |
| --- | --- | --- |
| skillName | string | Name of the skill |
| crewName | string? | Crew name (if activated via crew) |
| activatedAt | number | Activation timestamp |

Suggestion

Generated by the SuggestionEngine:

| Field | Type | Description |
| --- | --- | --- |
| id | string | Unique suggestion identifier |
| type | SuggestionType | classifier, decomposition, resource |
| title | string | Human-readable summary |
| description | string | Detailed explanation |
| evidence | string | Statistical backing data |
| confidence | number | 0–100 confidence score |
| createdAt | number | When the suggestion was generated |
| applied | boolean | Whether the user has applied this suggestion |
| appliedAt | number? | When it was applied |
| dismissed | boolean? | Whether the user has dismissed this suggestion |

ShadowRecord

Tracks shadow evolution candidates (see Shadow Evolution):

| Field | Type | Description |
| --- | --- | --- |
| id | string | Unique shadow identifier |
| skillName | string | Target skill |
| proposedVersion | string | Proposed new version |
| currentVersion | string | Current live version |
| patch | string | Text to append to skill prompt |
| channel | EvolutionChannel | steering, error_pattern, success_replication, manual |
| confidence | number | 0–100 confidence score |
| evidence | string[] | Reasons for this proposal |
| createdAt | number | Creation timestamp |
| promotedAt | number? | When promoted to live |
| rejectedAt | number? | When rejected |
| rejectionReason | string? | Why it was rejected |
| shadowRuns | number | Total A/B evaluation runs |
| shadowWins | number | Runs where shadow outperformed current |
| shadowLosses | number | Runs where current outperformed shadow |
| shadowTies | number | Inconclusive runs |
| evalScore | number? | Aggregate eval score |
| evalRuns | number | Number of eval runs completed |

SteeringInsight

Auto-extracted user preferences from conversation history:

| Field | Type | Description |
| --- | --- | --- |
| id | string | Unique identifier |
| sessionId | string | Source session |
| category | SteeringCategory | preference, prohibition, correction, convention, tool_directive |
| rawMessage | string | PII-redacted original user message |
| extractedInsight | string | The actionable learning |
| confidence | number | 0–100 confidence score |
| createdAt | number | Epoch ms |
| persisted | boolean | Whether written to memory.md |
| insightHash | string | SHA-256 hash for dedup |

Phase 1: Observe

The IntelCollector subscribes to FleetStateStore events and captures telemetry in real time.

What It Captures

  • Agent spawned — type, model, task subject, skill/crew context
  • Agent completed/failed — duration, tokens (input + output), status, error info, commit SHA
  • Token usage — per-agent and per-model consumption (split by input/output)
  • Errors — classified by type with redacted messages
  • Skill activations — crew/skill name, activation time

PII Redaction

All data passes through the PiiRedactor before storage:

  • API keys and tokens → [REDACTED_KEY]
  • File paths → normalized (home directory → ~)
  • Email addresses → [REDACTED_EMAIL]
  • URLs with credentials → credentials stripped
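The redaction rules listed above could be sketched roughly as follows. This is a minimal illustration and not the actual PiiRedactor: the regular expressions, the key prefixes (`sk-`, `ghp_`), and the `redact` function name are all assumptions made for the example.

```typescript
// Hypothetical rules: the actual PiiRedactor patterns are not shown in this document.
interface RedactionRule {
  pattern: RegExp;
  replacement: string;
}

const RULES: RedactionRule[] = [
  // URLs with credentials: keep the scheme and host, drop "user:password@".
  { pattern: /(\b[a-z][a-z0-9+.-]*:\/\/)[^\s\/@:]+:[^\s\/@]+@/gi, replacement: "$1" },
  // Email addresses.
  { pattern: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g, replacement: "[REDACTED_EMAIL]" },
  // Key-like tokens. Assumed shape: an "sk-" or "ghp_" prefix plus 16+ characters.
  { pattern: /\b(?:sk-|ghp_)[A-Za-z0-9_-]{16,}/g, replacement: "[REDACTED_KEY]" },
];

function redact(text: string, homeDir: string): string {
  // Normalize the home directory prefix to "~" before applying the regex rules.
  let out = homeDir ? text.split(homeDir).join("~") : text;
  for (const rule of RULES) {
    out = out.replace(rule.pattern, rule.replacement);
  }
  return out;
}
```

The credential rule runs before the email rule so that `user:password@host` is stripped rather than being mistaken for an email address.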

Storage

  • Periodic flush: every 30 seconds during active sessions
  • Finalize: full flush on session end
  • Database: SQLite at ~/.fleet/intel/fleet-intel.db
  • Pruning: automatic at startup; records older than 90 days are removed
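As an illustration of the 90-day pruning rule, the sketch below filters an in-memory list of sessions. It assumes age is measured from `startedAt`. The `StoredSession` type and `pruneSessions` function are hypothetical, and the real pruning presumably runs as a SQL delete against the database.

```typescript
// Hypothetical in-memory stand-in for rows in the sessions table.
interface StoredSession {
  sessionId: string;
  startedAt: number; // epoch ms
}

const MAX_AGE_DAYS = 90;
const DAY_MS = 24 * 60 * 60 * 1000;

// Keeps only sessions that started within the last MAX_AGE_DAYS days.
// Assumption: age is measured from startedAt, not endedAt.
function pruneSessions(sessions: StoredSession[], now: number): StoredSession[] {
  const cutoff = now - MAX_AGE_DAYS * DAY_MS;
  return sessions.filter((s) => s.startedAt >= cutoff);
}
```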

Example Session Record

json
{
  "version": 1,
  "sessionId": "sess_abc123",
  "startedAt": 1714400000000,
  "endedAt": 1714403600000,
  "model": "claude-opus-4.6",
  "cwd": "my-project",
  "totalTokens": { "input": 180000, "output": 65000 },
  "taskCount": 8,
  "taskCompletedCount": 7,
  "taskFailedCount": 1,
  "agentRuns": [
    {
      "agentId": "worker-1",
      "agentType": "coder",
      "startedAt": 1714400100000,
      "endedAt": 1714400145000,
      "durationMs": 45000,
      "status": "completed",
      "tokens": { "input": 24000, "output": 8000 },
      "toolUseCount": 12,
      "topTools": ["edit", "view", "powershell"],
      "model": "claude-sonnet-4.5",
      "taskSubject": "Implement auth middleware",
      "worktreePath": ".worktrees/worker-1",
      "commitSha": "a1b2c3d",
      "branchName": "worktree/worker-1"
    }
  ],
  "errors": [],
  "skillUsage": []
}

Phase 2: Analyze

The analyze phase provides commands to query and explore your fleet's historical data.

Commands

/fleet-intel summary

High-level overview across all recorded sessions:

📊 Fleet Intelligence Summary
─────────────────────────────
Sessions:     47
Success Rate: 89.4%
Total Tokens: 12,450,000
Avg Duration: 42m 15s
Top Errors:   rate_limit (12), tool_failure (8), timeout (3)

/fleet-intel agents

Per agent-type statistics:

🤖 Agent Type Stats
───────────────────
Type         Runs  Failures  Avg Duration  Tokens
coder         134        8       3m 20s    2,100k
explorer       89        2       1m 45s      890k
reviewer       67        1       2m 10s      670k
tester         45        5       4m 30s      450k
general        23        3       5m 15s      340k

/fleet-intel failures

Top error types ranked by frequency:

❌ Failure Analysis
───────────────────
Type              Count  % of Total
rate_limit           12      40.0%
tool_failure          8      26.7%
unknown               4      13.3%
timeout               3      10.0%
permission            2       6.7%
model_error           1       3.3%
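A ranking like the one above can be derived from stored error records by counting per type, dividing by the total, and sorting by count. The following is a sketch only: `rankFailures` and `FailureRow` are hypothetical names, and the one-decimal rounding is an assumption.

```typescript
type ErrorType =
  | "rate_limit" | "tool_failure" | "timeout"
  | "permission" | "model_error" | "unknown";

interface FailureRow {
  type: ErrorType;
  count: number;
  percent: number; // share of all errors, rounded to one decimal place
}

// Counts errors per type and returns rows sorted by descending count.
function rankFailures(errors: Array<{ errorType: ErrorType }>): FailureRow[] {
  const counts: Partial<Record<ErrorType, number>> = {};
  for (const e of errors) {
    counts[e.errorType] = (counts[e.errorType] || 0) + 1;
  }
  const total = errors.length;
  return (Object.keys(counts) as ErrorType[])
    .map((type) => {
      const count = counts[type] || 0;
      return { type, count, percent: Math.round((count / total) * 1000) / 10 };
    })
    .sort((a, b) => b.count - a.count);
}
```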

/fleet-intel tokens

Token usage sorted by agent type:

📈 Token Usage by Agent Type
────────────────────────────
Type         Total Tokens  Avg/Run   % of Total
coder          2,100,000    15,672      47.2%
explorer         890,000    10,000      20.0%
reviewer         670,000    10,000      15.1%
tester           450,000    10,000      10.1%
general          340,000    14,783       7.6%

/fleet-intel stats

Top 5 sessions by token usage plus usage trends:

📊 Session Stats
────────────────
Top 5 Sessions by Token Usage:
  1. sess_abc123  —  245,000 tokens  (42m, 8 tasks)
  2. sess_def456  —  198,000 tokens  (35m, 6 tasks)
  ...

Token Usage Trends:
  Recent (7d):   850,000 tokens across 12 sessions
  Previous (7d): 1,200,000 tokens across 15 sessions
  Change:        -29.2% ↓

/fleet-intel search <query>

Substring search across sessions — matches against errors, agent types, models, and task subjects:

🔍 Search: "rate_limit"
──────────────────────
Found 12 matches across 8 sessions:
  sess_abc123: 3 rate_limit errors (claude-opus-4.6)
  sess_def456: 2 rate_limit errors (claude-opus-4.6)
  ...

/fleet-intel skills

Skill and crew usage summary:

🎯 Skill Usage
──────────────
Skill              Uses  Last Used
code-review          15  2h ago
init-investigation    8  1d ago
feature-planning      6  3d ago
research              4  5d ago

Phase 3: Suggest

The SuggestionEngine runs three statistical analyzers against your session history to generate actionable suggestions.

Analyzers

Classifier Analyzer

Flags agent types with a success rate below 60% (minimum 10 runs required for statistical significance):
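The rule can be sketched as a small aggregation over agent runs. This is an illustration under the thresholds stated above (60% and 10 runs). The `RunSummary` and `LowSuccessFlag` types and the `flagLowSuccess` function are hypothetical names, not the SuggestionEngine's actual API.

```typescript
interface RunSummary {
  agentType: string;
  status: "completed" | "failed";
}

interface LowSuccessFlag {
  agentType: string;
  runs: number;
  successRate: number; // 0..1
}

const MIN_RUNS = 10;           // minimum sample size from the text above
const SUCCESS_THRESHOLD = 0.6; // flag anything below 60%

function flagLowSuccess(runs: RunSummary[]): LowSuccessFlag[] {
  const totals: Record<string, { runs: number; completed: number }> = {};
  for (const run of runs) {
    const t = totals[run.agentType] || (totals[run.agentType] = { runs: 0, completed: 0 });
    t.runs += 1;
    if (run.status === "completed") t.completed += 1;
  }
  return Object.keys(totals)
    .map((agentType) => ({
      agentType,
      runs: totals[agentType].runs,
      successRate: totals[agentType].completed / totals[agentType].runs,
    }))
    .filter((f) => f.runs >= MIN_RUNS && f.successRate < SUCCESS_THRESHOLD);
}
```

Agent types with fewer than 10 runs are never flagged, however poorly they perform. A suggestion generated from such a flag might read: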

"Your tester agents have a 45% success rate across 22 runs. Consider breaking test tasks into smaller scopes or switching to a more capable model."

Decomposition Analyzer

Correlates the number of workers spawned with task completion rates:
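One way to picture this correlation is to bucket sessions by worker count and compare completion rates per bucket. The sketch below makes several assumptions: the bucket boundaries are invented for the example, `workerCount` is a hypothetical field (it could be derived from the number of agent runs), and `completionRateByBucket` is not a function named in this document.

```typescript
interface SessionStats {
  workerCount: number; // hypothetical: could be derived from agentRuns.length
  taskCount: number;
  taskCompletedCount: number;
}

// Assumed bucket boundaries, chosen only for illustration.
function workerBucket(workerCount: number): string {
  if (workerCount <= 3) return "1-3";
  if (workerCount <= 7) return "4-7";
  return "8+";
}

// Returns tasks completed / tasks created for each worker-count bucket.
function completionRateByBucket(sessions: SessionStats[]): Record<string, number> {
  const sums: Record<string, { tasks: number; completed: number }> = {};
  for (const s of sessions) {
    const key = workerBucket(s.workerCount);
    const bucket = sums[key] || (sums[key] = { tasks: 0, completed: 0 });
    bucket.tasks += s.taskCount;
    bucket.completed += s.taskCompletedCount;
  }
  const rates: Record<string, number> = {};
  for (const key of Object.keys(sums)) {
    if (sums[key].tasks > 0) rates[key] = sums[key].completed / sums[key].tasks;
  }
  return rates;
}
```

A suggestion built from such a comparison might read: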

"Sessions with 4–6 workers have 92% task completion vs 71% for sessions with 8+ workers. Consider limiting parallelism for complex tasks."

Resource Analyzer

Identifies cases where expensive models achieve similar success rates to cheaper alternatives:

"claude-opus-4.6 and claude-sonnet-4.5 have similar success rates for explorer tasks (94% vs 91%), but Opus uses 2.3x more tokens. Consider using Sonnet for exploration."

Context Injection

The top 3 pending suggestions (by confidence) are automatically included in the coordinator's prompt within a <fleet-intelligence> section. This gives the coordinator awareness of patterns without requiring user action.
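The selection step might look like the sketch below. It assumes "pending" means neither applied nor dismissed. The rendered line format is invented for the example, and `formatIntelSection` is a hypothetical helper, not the `formatIntelContext()` function shown in the architecture diagram.

```typescript
interface PendingSuggestion {
  id: string;
  title: string;
  confidence: number; // 0-100
  applied: boolean;
  dismissed?: boolean;
}

// Picks the highest-confidence pending suggestions and wraps them in a tagged section.
function formatIntelSection(suggestions: PendingSuggestion[], limit: number = 3): string {
  const top = suggestions
    .filter((s) => !s.applied && !s.dismissed) // assumption: "pending" = not applied, not dismissed
    .sort((a, b) => b.confidence - a.confidence)
    .slice(0, limit);
  if (top.length === 0) return "";
  const lines = top.map((s) => "- [" + s.confidence + "] " + s.title);
  return ["<fleet-intelligence>"].concat(lines, ["</fleet-intelligence>"]).join("\n");
}
```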

Shadow Evolution

Shadow evolution lets you safely test skill prompt improvements before committing them. Instead of directly modifying skill files, changes are proposed as shadow records that accumulate A/B evaluation data.

Pipeline

SkillEvolver → Shadow Proposal → EvalRunner A/B → EvalJudge → Human Decision
  1. SkillEvolver analyzes telemetry for a crew's skills and proposes prompt improvements via four channels:

    • steering — insights extracted from user corrections/preferences
    • error_pattern — recurring error patterns suggest prompt additions
    • success_replication — high-performing runs inform what works
    • manual — user-initiated changes
  2. Shadow proposals are stored in IntelDatabase with the proposed patch, confidence score, and evidence.

  3. EvalRunner runs A/B comparisons: current skill prompt vs shadow prompt on the same task. For coder agents, each version runs in its own isolated worktree. Both outputs pass through VerifierNode (deterministic commands). If verification alone is inconclusive, EvalJudge spawns an LLM judge agent.

  4. Results accumulate as shadowWins, shadowLosses, shadowTies. When enough data exists, the user can promote or reject.

Commands

/learn evolve <crew>       # Generate shadow proposals from telemetry
/learn shadows             # List active shadows with win/loss stats
/learn eval <skill>        # Run A/B evaluation (uses SelectMenu picker if no arg)
/learn promote <skill>     # Promote shadow to live (overwrites skill file)
/learn reject <skill> [reason]  # Reject and discard shadow

Configuration

json
{
  "evolution": {
    "enabled": true,
    "autoEvolve": false,
    "minSampleSize": 5,
    "confidenceThreshold": 75,
    "rollbackAfterSessions": 5
  },
  "shadow": {
    "minShadowRuns": 5,
    "winRateThreshold": 0.6
  }
}
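To show how `minShadowRuns` and `winRateThreshold` might interact with the win/loss counters, here is a minimal sketch. It assumes the win rate is computed over decisive runs only (ties excluded), which this document does not confirm. `readyForReview` is a hypothetical name, and the check only signals that enough data exists. Promotion itself still requires an explicit user decision.

```typescript
interface ShadowStats {
  shadowRuns: number;
  shadowWins: number;
  shadowLosses: number;
  shadowTies: number;
}

interface ShadowConfig {
  minShadowRuns: number;
  winRateThreshold: number; // 0..1
}

// True when the shadow has enough runs and a high enough win rate to be worth reviewing.
function readyForReview(stats: ShadowStats, config: ShadowConfig): boolean {
  if (stats.shadowRuns < config.minShadowRuns) return false;
  // Assumption: ties are excluded from the win-rate denominator.
  const decisive = stats.shadowWins + stats.shadowLosses;
  if (decisive === 0) return false;
  return stats.shadowWins / decisive >= config.winRateThreshold;
}
```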

Outcome Backfill

The OutcomeBackfiller retroactively checks whether agent-produced changes stuck by querying git history:

  • Was the commit merged to main? → outcomeMerged
  • Was the commit reverted? → outcomeRevertedWithinDays
  • Did all DoD items pass? → outcomeDodAllPassed

This data enables effectiveStatus: a run that "completed" but was later reverted is downgraded. The backfiller runs periodically with a configurable grace period (default 24h) and revert window (default 7 days).

Outcome data feeds back into the dashboard's gatePassedButReverted metric, helping identify skills that pass tests but produce low-quality changes.
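The downgrade idea can be sketched as a pure function over a run's outcome fields. The `"reverted"` label and the shape of the return type are assumptions, since this document only says such runs are downgraded. The 7-day window mirrors the default revert window mentioned above.

```typescript
type EffectiveStatus = "completed" | "failed" | "reverted";

interface RunOutcome {
  status: "completed" | "failed";
  outcomeRevertedWithinDays?: number; // undefined = not reverted
}

const REVERT_WINDOW_DAYS = 7; // default revert window

// Downgrades a nominally completed run if its commit was reverted inside the window.
function effectiveStatus(run: RunOutcome): EffectiveStatus {
  if (run.status === "failed") return "failed";
  const days = run.outcomeRevertedWithinDays;
  if (days !== undefined && days <= REVERT_WINDOW_DAYS) return "reverted";
  return "completed";
}
```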

Eval Framework

VerifierNode

Deterministic, non-LLM command executor. Runs whitelisted shell commands (npm, npx, vitest, jest, tsc, eslint, etc.) and returns structured results:

ts
interface VerifierResult {
  command: string;
  args: string[];
  exitCode: number;
  stdout: string;    // truncated to 10KB
  stderr: string;
  durationMs: number;
  passed: boolean;   // exitCode === 0
}

Default timeout: 2 minutes. Output capped at 10KB.

EvalRunner

Orchestrates A/B comparisons between current and shadow skill prompts:

  1. Both versions run the same task (coder agents get isolated worktrees)
  2. Both outputs pass through VerifierNode
  3. If verification is inconclusive, EvalJudge provides LLM-based comparison

Results: shadow_wins, current_wins, tie, both_fail, error.
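The mapping from verification results to these outcomes might look like the sketch below. Several things are assumed here: "inconclusive" is taken to mean both versions passed verification, output A is taken to be the current prompt and output B the shadow, and the judge is modeled as a synchronous callback for brevity (a real LLM judge would be asynchronous). `resolveOutcome` is a hypothetical name.

```typescript
type EvalOutcome = "shadow_wins" | "current_wins" | "tie" | "both_fail" | "error";
type JudgeVerdict = "a_wins" | "b_wins" | "tie";

function resolveOutcome(
  currentPassed: boolean,
  shadowPassed: boolean,
  judge: () => JudgeVerdict
): EvalOutcome {
  if (!currentPassed && !shadowPassed) return "both_fail";
  // Exactly one side passed verification: that side wins outright.
  if (shadowPassed !== currentPassed) return shadowPassed ? "shadow_wins" : "current_wins";
  // Both passed, so verification alone is inconclusive: defer to the judge.
  try {
    const verdict = judge();
    if (verdict === "a_wins") return "current_wins"; // assumption: A = current prompt
    if (verdict === "b_wins") return "shadow_wins";  // assumption: B = shadow prompt
    return "tie";
  } catch (_err) {
    return "error";
  }
}
```

In this sketch the judge is only consulted when deterministic verification cannot separate the two outputs, which keeps LLM calls off the common path.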

EvalJudge

Spawns an impartial LLM judge agent that compares two outputs on correctness, quality, and completeness. Returns a_wins, b_wins, or tie with a one-sentence reason.

EvalScheduler

Periodic background evaluator. Runs every 30 minutes (configurable), iterates over all active shadows, and runs eval passes automatically. Integrates with the LoopScheduler lifecycle.

Steering Extraction

The SteeringExtractor uses LLM analysis at session end to discover persistent user preferences from conversation history.

Categories

| Category | Example |
| --- | --- |
| preference | "Always use pnpm, not npm" |
| prohibition | "Never modify the migrations directory" |
| correction | "The API uses v2 endpoints, not v1" |
| convention | "Use kebab-case for file names" |
| tool_directive | "Run tests with --runInBand" |

Pipeline

  1. Last N messages (default 50) are sent to an LLM for analysis
  2. Insights are extracted with category and confidence score
  3. Duplicates are detected via SHA-256 hash of normalized text
  4. Insights above the memory threshold (default 70%) are auto-appended to .fleet/context/memory.md
  5. All insights are stored in IntelDatabase for querying
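Steps 3 and 4 of the pipeline above can be sketched as follows. The normalization (trim, lowercase, collapse whitespace) and the inclusive threshold comparison are assumptions, and `insightHash` and `selectForMemory` are hypothetical helper names. `createHash` comes from Node's built-in crypto module.

```typescript
import { createHash } from "node:crypto";

interface ExtractedInsight {
  extractedInsight: string;
  confidence: number; // 0-100
}

// SHA-256 over normalized text. Assumed normalization: trim, lowercase, collapse whitespace.
function insightHash(text: string): string {
  const normalized = text.trim().toLowerCase().replace(/\s+/g, " ");
  return createHash("sha256").update(normalized).digest("hex");
}

// Drops duplicates (by hash) and returns the insights that qualify for memory.md.
function selectForMemory(
  insights: ExtractedInsight[],
  seenHashes: Set<string>,
  memoryThreshold: number = 70
): ExtractedInsight[] {
  const selected: ExtractedInsight[] = [];
  for (const insight of insights) {
    const hash = insightHash(insight.extractedInsight);
    if (seenHashes.has(hash)) continue; // duplicate of an earlier insight
    seenHashes.add(hash);
    // Assumption: the threshold is inclusive.
    if (insight.confidence >= memoryThreshold) selected.push(insight);
  }
  return selected;
}
```

Passing in the set of previously seen hashes lets deduplication span sessions as well as a single extraction run.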

Configuration

json
{
  "steering": {
    "enabled": true,
    "timeoutMs": 60000,
    "maxInsights": 10,
    "maxMessages": 50,
    "memoryThreshold": 70
  }
}

Phase 4: Assist

The assist phase provides active learning and memory features.

Auto-Memory

When a session completes with ≥3 tasks finished, a session summary is automatically appended to .fleet/context/memory.md:

markdown
## Session 2026-04-29 — Auth Module Refactor
- Completed 7/8 tasks in 42 minutes
- Key learnings: JWT middleware needed custom error handler
- Model: claude-opus-4.6, Tokens: 245,000

This memory file is auto-loaded into the coordinator prompt via ContextStore, giving future sessions access to past learnings.

Skill Memories (crew-scoped)

The SkillMemoryStore (src/intel/SkillMemoryStore.ts) persists short textual memories keyed by skill name and optionally scoped to a crew. When a crew is active, new memories are stored under that crew's name, and queries return both crew-matched rows and legacy unscoped rows (crew_name IS NULL), so pre-existing memories remain visible. When no crew is active, only unscoped rows are returned, which keeps the behavior backward compatible. Deduplication is per (skill, crew, content). Skill memories live in the intel DB at $FLEET_HOME/intel/fleet-intel.db, so setting FLEET_HOME per project isolates them per project.
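The scoping and deduplication rules can be illustrated with an in-memory sketch. The real store is backed by SQLite. The `SkillMemory` type and the `addMemory` and `queryMemories` functions here are hypothetical stand-ins, not the SkillMemoryStore API.

```typescript
interface SkillMemory {
  skillName: string;
  crewName: string | null; // null = legacy unscoped row
  content: string;
}

// Adds a memory unless an identical (skill, crew, content) row already exists.
function addMemory(rows: SkillMemory[], memory: SkillMemory): boolean {
  const duplicate = rows.some(
    (r) =>
      r.skillName === memory.skillName &&
      r.crewName === memory.crewName &&
      r.content === memory.content
  );
  if (duplicate) return false;
  rows.push(memory);
  return true;
}

// With an active crew: crew-matched rows plus unscoped rows. Without one: unscoped rows only.
function queryMemories(
  rows: SkillMemory[],
  skillName: string,
  activeCrew: string | null
): SkillMemory[] {
  return rows.filter(
    (r) =>
      r.skillName === skillName &&
      (r.crewName === null || (activeCrew !== null && r.crewName === activeCrew))
  );
}
```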

/fleet-intel remember <note>

Manually add notes to project memory:

/fleet-intel remember "This project uses pnpm, not npm — always use pnpm install"

/learn Command

Unified learning dashboard combining suggestions, memory, evolution, and stats:

| Sub-command | Description |
| --- | --- |
| /learn overview | Session insights + pending suggestions (default) |
| /learn steering | View extracted steering insights |
| /learn extract | Run steering extraction on current conversation |
| /learn evolve <crew> | Propose skill prompt improvements from telemetry |
| /learn dashboard <crew> | Evolution metrics with before/after comparison |
| /learn shadows | List active shadow proposals with A/B stats |
| /learn eval <skill> | Run A/B evaluation of current vs shadow prompt |
| /learn promote <skill> | Promote a shadow to live |
| /learn reject <skill> [reason] | Reject and discard a shadow |
| /learn apply <id> | Apply a suggestion |
| /learn dismiss <id> | Dismiss a suggestion |
| /learn clear-memory | Reset the memory file |

LoopScheduler Integration

The LoopScheduler triggers periodic memory consolidation prompts every 30 minutes during active sessions. This prompts the coordinator to reflect on what's been learned and update memory accordingly.

Skill Versioning

When a skill file is overwritten:

  • A backup of the previous version is saved automatically
  • The version number is auto-incremented
  • Rollback is possible by restoring the backup

/skill create

Generate skill templates:

bash
/skill create auth-checker --type reviewer --desc "Validates authentication patterns"

Creates a structured skill file with the specified type and description.

Configuration

Fleet Intelligence is always-on — no configuration needed to get started.

Storage Locations

| Path | Purpose |
| --- | --- |
| ~/.fleet/intel/fleet-intel.db | SQLite database (sessions, runs, suggestions, shadows, steering) |
| .fleet/context/memory.md | Auto-memory (per-project, loaded into coordinator prompt) |

Memory Integration

The .fleet/context/memory.md file is automatically loaded into the coordinator prompt by the ContextStore system (same mechanism used by /init context). No manual configuration required.