Fleet Intelligence
Fleet Intelligence is the self-improvement infrastructure for Agents Fleet. It continuously learns from your sessions, identifies patterns, generates data-driven suggestions, and evolves skill prompts over time.
Overview
Fleet Intelligence operates as a multi-phase pipeline:
Observe → Analyze → Suggest → Evolve → Verify → Assist
- Observe: Captures session telemetry (agent runs, token usage, errors, skill activations)
- Analyze: Aggregates data across sessions to surface patterns and trends
- Suggest: Generates actionable suggestions based on statistical analysis
- Evolve: Shadow evolution proposes, evaluates, and promotes skill prompt improvements
- Verify: Deterministic verification and outcome backfill validate changes
- Assist: Provides auto-memory, steering extraction, learning dashboards, and prompt-level guidance
Key principles:
- Local-only: All data stays on your machine at ~/.fleet/intel/. Nothing is sent externally.
- Human-in-the-loop: Suggestions and shadow proposals require explicit approval before being applied.
- Always-on: No configuration needed. Intel collection starts automatically with every session.
Architecture
graph TD
User -->|input| REPL
REPL --> CoordinatorEngine
CoordinatorEngine --> SDKSession["SDK Session"]
SDKSession --> EventListeners
EventListeners --> FleetStateStore["FleetStateStore (events)"]
FleetStateStore --> UIDisplay["UI/Display"]
FleetStateStore --> IntelCollector
IntelCollector --> IntelDB["IntelDatabase<br/>(~/.fleet/intel/fleet-intel.db)"]
IntelDB --> SuggestionEngine
IntelDB --> OutcomeBackfiller
IntelDB --> SkillEvolver
IntelDB --> EvalRunner
SuggestionEngine --> IntelDB
SkillEvolver -->|shadow proposals| IntelDB
EvalRunner -->|eval results| IntelDB
OutcomeBackfiller -->|merge/revert data| IntelDB
IntelDB --> formatIntelContext["formatIntelContext()"]
formatIntelContext --> CoordinatorPrompt["Coordinator Prompt<br/><fleet-intelligence> section"]
style IntelCollector fill:#2d6a4f,stroke:#1b4332,color:#fff
style SuggestionEngine fill:#2d6a4f,stroke:#1b4332,color:#fff
style IntelDB fill:#264653,stroke:#2a9d8f,color:#fff
style formatIntelContext fill:#e76f51,stroke:#f4a261,color:#fff
style SkillEvolver fill:#e76f51,stroke:#f4a261,color:#fff
style EvalRunner fill:#e76f51,stroke:#f4a261,color:#fff
style OutcomeBackfiller fill:#e76f51,stroke:#f4a261,color:#fff
Storage
All intelligence data is stored in a single SQLite database at ~/.fleet/intel/fleet-intel.db, managed by the IntelDatabase class:
- Engine: better-sqlite3 (synchronous, in-process)
- Journal mode: WAL (concurrent reads during writes)
- Schema: Foreign keys enabled, auto-migration on startup
- Access pattern: All queries use prepared statements (40+ registered in prepareStatements())
- Legacy migration: On first run, existing JSON files matching ~/.fleet/intel/*.json are migrated automatically via migrateJsonToSqlite()
Data Model
All types are defined in src/intel/types.ts.
SessionRecord
One record per CLI session. Captures the full picture of what happened:
| Field | Type | Description |
|---|---|---|
version | 1 | Schema version |
sessionId | string | Unique session identifier |
startedAt | number | Session start timestamp (epoch ms) |
endedAt | number? | Session end timestamp |
model | string | Primary model used |
cwd | string | Working directory basename (PII: no full paths) |
activeCrew | string? | Active crew name, if any |
totalTokens | { input: number; output: number } | Aggregate token consumption |
taskCount | number | Number of tasks created |
taskCompletedCount | number | Tasks that completed successfully |
taskFailedCount | number | Tasks that failed |
agentRuns | AgentRunRecord[] | All agent executions in the session |
errors | ErrorRecord[] | All errors encountered |
skillUsage | SkillUsageRecord[]? | Crew/skill activations |
AgentRunRecord
Per-agent-run telemetry with outcome tracking:
| Field | Type | Description |
|---|---|---|
agentId | string | Worker identifier |
agentType | string | explorer, coder, reviewer, tester, general-purpose |
taskId | string? | Task identifier |
startedAt | number | Run start timestamp (epoch ms) |
endedAt | number? | Run end timestamp |
durationMs | number | Run duration in milliseconds |
status | string | completed, failed |
tokens | { input: number; output: number } | Tokens consumed by this agent |
toolUseCount | number | Number of tool invocations |
topTools | string[] | Top 3 tools by invocation count |
errorSummary | string? | Redacted error summary (max 200 chars) |
worktreePath | string? | Git worktree path (if applicable) |
model | string? | Model used for this agent |
taskSubject | string? | What the agent was working on (PII-redacted) |
commitSha | string? | Git commit SHA (for outcome correlation) |
branchName | string? | Git branch name (for outcome correlation) |
skillName | string? | Skill name used for this run |
crewName | string? | Active crew during this run |
outcomeMerged | boolean? | Whether changes were merged to main |
outcomeMergedAt | number? | When the merge happened (epoch ms) |
outcomeRevertedWithinDays | number? | Days until revert (undefined = not reverted) |
outcomeDodAllPassed | boolean? | Whether all DoD items passed |
outcomeAttachedAt | number? | When outcome was attached (epoch ms) |
ErrorRecord
Classified error tracking:
| Field | Type | Description |
|---|---|---|
timestamp | number | When the error occurred |
agentId | string | Which agent encountered the error |
agentType | string | Agent type |
errorType | ErrorType | Error classification |
message | string | Redacted error message (max 200 chars) |
Error types: rate_limit, tool_failure, timeout, permission, model_error, unknown
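The ErrorRecord table and the list of error types map directly onto TypeScript definitions. The sketch below is an assumption about their shape (the authoritative definitions are in src/intel/types.ts), and `countByType` is a hypothetical helper, not part of the documented API, that tallies errors per type with the most frequent first.

```typescript
// Sketch only: mirrors the ErrorRecord table; the real types live in src/intel/types.ts.
type ErrorType =
  | "rate_limit"
  | "tool_failure"
  | "timeout"
  | "permission"
  | "model_error"
  | "unknown";

interface ErrorRecord {
  timestamp: number; // assumed to be epoch ms, like the other timestamps
  agentId: string;
  agentType: string;
  errorType: ErrorType;
  message: string; // redacted, max 200 chars
}

// Hypothetical helper: count errors per type, most frequent first.
function countByType(errors: ErrorRecord[]): Array<[ErrorType, number]> {
  const counts = new Map<ErrorType, number>();
  for (const e of errors) {
    counts.set(e.errorType, (counts.get(e.errorType) ?? 0) + 1);
  }
  return Array.from(counts.entries()).sort((a, b) => b[1] - a[1]);
}
```

Using a string-literal union for `ErrorType` lets the compiler reject misspelled classifications at the call site.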
SkillUsageRecord
Tracks crew and skill activations:
| Field | Type | Description |
|---|---|---|
skillName | string | Name of the skill |
crewName | string? | Crew name (if activated via crew) |
activatedAt | number | Activation timestamp |
Suggestion
Generated by the SuggestionEngine:
| Field | Type | Description |
|---|---|---|
id | string | Unique suggestion identifier |
type | SuggestionType | classifier, decomposition, resource |
title | string | Human-readable summary |
description | string | Detailed explanation |
evidence | string | Statistical backing data |
confidence | number | 0–100 confidence score |
createdAt | number | When the suggestion was generated |
applied | boolean | Whether the user has applied this suggestion |
appliedAt | number? | When it was applied |
dismissed | boolean? | Whether the user has dismissed this suggestion |
ShadowRecord
Tracks shadow evolution candidates (see Shadow Evolution):
| Field | Type | Description |
|---|---|---|
id | string | Unique shadow identifier |
skillName | string | Target skill |
proposedVersion | string | Proposed new version |
currentVersion | string | Current live version |
patch | string | Text to append to skill prompt |
channel | EvolutionChannel | steering, error_pattern, success_replication, manual |
confidence | number | 0–100 confidence score |
evidence | string[] | Reasons for this proposal |
createdAt | number | Creation timestamp |
promotedAt | number? | When promoted to live |
rejectedAt | number? | When rejected |
rejectionReason | string? | Why it was rejected |
shadowRuns | number | Total A/B evaluation runs |
shadowWins | number | Runs where shadow outperformed current |
shadowLosses | number | Runs where current outperformed shadow |
shadowTies | number | Inconclusive runs |
evalScore | number? | Aggregate eval score |
evalRuns | number | Number of eval runs completed |
SteeringInsight
Auto-extracted user preferences from conversation history:
| Field | Type | Description |
|---|---|---|
id | string | Unique identifier |
sessionId | string | Source session |
category | SteeringCategory | preference, prohibition, correction, convention, tool_directive |
rawMessage | string | PII-redacted original user message |
extractedInsight | string | The actionable learning |
confidence | number | 0–100 confidence score |
createdAt | number | Epoch ms |
persisted | boolean | Whether written to memory.md |
insightHash | string | SHA-256 hash for dedup |
Phase 1: Observe
The IntelCollector subscribes to FleetStateStore events and captures telemetry in real time.
What It Captures
- Agent spawned — type, model, task subject, skill/crew context
- Agent completed/failed — duration, tokens (input + output), status, error info, commit SHA
- Token usage — per-agent and per-model consumption (split by input/output)
- Errors — classified by type with redacted messages
- Skill activations — crew/skill name, activation time
PII Redaction
All data passes through the PiiRedactor before storage:
- API keys and tokens → [REDACTED_KEY]
- File paths → normalized (home directory → ~)
- Email addresses → [REDACTED_EMAIL]
- URLs with credentials → credentials stripped
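To make the four rules concrete, here is a minimal sketch of a redaction pass. It is not the actual PiiRedactor: the regular expressions, the key prefixes (`sk`, `ghp`, `xox…`), and the `redact` function name are all illustrative assumptions.

```typescript
// Illustrative only: the real PiiRedactor patterns are not shown in this document.
function redact(text: string, home: string): string {
  return (
    text
      // 1. Strip credentials from URLs first, so "user:pw@host" is not mistaken for an email.
      .replace(/(\b[a-z][a-z0-9+.-]*:\/\/)[^\s\/@]+@/gi, "$1")
      // 2. Email addresses.
      .replace(/[\w.+-]+@[\w-]+(\.[\w-]+)+/g, "[REDACTED_EMAIL]")
      // 3. API keys and tokens (assumed prefixes, chosen for the example).
      .replace(/\b(?:sk|ghp|xox[abp])[-_][A-Za-z0-9_-]{16,}\b/g, "[REDACTED_KEY]")
      // 4. Normalize the home directory to "~" (split/join avoids regex escaping).
      .split(home)
      .join("~")
  );
}
```

The ordering matters in this sketch: running the email rule before the URL rule would turn the credential portion of a URL into an email placeholder instead of removing it.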
Storage
- Periodic flush: every 30 seconds during active sessions
- Finalize: full flush on session end
- Database: SQLite at ~/.fleet/intel/fleet-intel.db
- Pruning: automatic at startup (90 days max age)
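The pruning rule amounts to a cutoff comparison. The sketch below works on an in-memory array for illustration; the real implementation presumably deletes rows in SQLite, and measuring age from `startedAt` is an assumption.

```typescript
const MAX_AGE_DAYS = 90; // max age from the list above
const DAY_MS = 24 * 60 * 60 * 1000;

interface StoredSession {
  sessionId: string;
  startedAt: number; // epoch ms
}

// Keep only sessions that started within the last MAX_AGE_DAYS days.
// `now` is passed in so the function stays deterministic and testable.
function pruneOld(sessions: StoredSession[], now: number): StoredSession[] {
  const cutoff = now - MAX_AGE_DAYS * DAY_MS;
  return sessions.filter((s) => s.startedAt >= cutoff);
}
```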
Example Session Record
{
"version": 1,
"sessionId": "sess_abc123",
"startedAt": 1714400000000,
"endedAt": 1714403600000,
"model": "claude-opus-4.6",
"cwd": "my-project",
"totalTokens": { "input": 180000, "output": 65000 },
"taskCount": 8,
"taskCompletedCount": 7,
"taskFailedCount": 1,
"agentRuns": [
{
"agentId": "worker-1",
"agentType": "coder",
"startedAt": 1714400100000,
"endedAt": 1714400145000,
"durationMs": 45000,
"status": "completed",
"tokens": { "input": 24000, "output": 8000 },
"toolUseCount": 12,
"topTools": ["edit", "view", "powershell"],
"model": "claude-sonnet-4.5",
"taskSubject": "Implement auth middleware",
"worktreePath": ".worktrees/worker-1",
"commitSha": "a1b2c3d",
"branchName": "worktree/worker-1"
}
],
"errors": [],
"skillUsage": []
}
Phase 2: Analyze
The analyze phase provides commands to query and explore your fleet's historical data.
Commands
/fleet-intel summary
High-level overview across all recorded sessions:
📊 Fleet Intelligence Summary
─────────────────────────────
Sessions: 47
Success Rate: 89.4%
Total Tokens: 12,450,000
Avg Duration: 42m 15s
Top Errors: rate_limit (12), tool_failure (8), timeout (3)
/fleet-intel agents
Per agent-type statistics:
🤖 Agent Type Stats
───────────────────
Type Runs Failures Avg Duration Tokens
coder 134 8 3m 20s 2,100k
explorer 89 2 1m 45s 890k
reviewer 67 1 2m 10s 670k
tester 45 5 4m 30s 450k
general 23 3 5m 15s 340k
/fleet-intel failures
Top error types ranked by frequency:
❌ Failure Analysis
───────────────────
Type Count % of Total
rate_limit 12 37.5%
tool_failure 8 25.0%
timeout 3 9.4%
permission 2 6.3%
model_error 1 3.1%
unknown 4 12.5%
/fleet-intel tokens
Token usage sorted by agent type:
📈 Token Usage by Agent Type
────────────────────────────
Type Total Tokens Avg/Run % of Total
coder 2,100,000 15,672 47.2%
explorer 890,000 10,000 20.0%
reviewer 670,000 10,000 15.1%
tester 450,000 10,000 10.1%
general 340,000 14,783 7.6%
/fleet-intel stats
Top 5 sessions by token usage plus usage trends:
📊 Session Stats
────────────────
Top 5 Sessions by Token Usage:
1. sess_abc123 — 245,000 tokens (42m, 8 tasks)
2. sess_def456 — 198,000 tokens (35m, 6 tasks)
...
Token Usage Trends:
Recent (7d): 850,000 tokens across 12 sessions
Previous (7d): 1,200,000 tokens across 15 sessions
Change: -29.2% ↓
/fleet-intel search <query>
Substring search across sessions — matches against errors, agent types, models, and task subjects:
🔍 Search: "rate_limit"
──────────────────────
Found 12 matches across 8 sessions:
sess_abc123: 3 rate_limit errors (claude-opus-4.6)
sess_def456: 2 rate_limit errors (claude-opus-4.6)
...
/fleet-intel skills
Skill and crew usage summary:
🎯 Skill Usage
──────────────
Skill Uses Last Used
code-review 15 2h ago
init-investigation 8 1d ago
feature-planning 6 3d ago
research 4 5d ago
Phase 3: Suggest
The SuggestionEngine runs three statistical analyzers against your session history to generate actionable suggestions.
Analyzers
Classifier Analyzer
Flags agent types with a success rate below 60% (minimum 10 runs required for statistical significance):
"Your tester agents have a 45% success rate across 22 runs. Consider breaking test tasks into smaller scopes or switching to a more capable model."
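The classifier rule can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the SuggestionEngine's code: the `RunOutcome` and `Flag` shapes and the `flagLowSuccess` name are hypothetical, and success is assumed to mean `status === "completed"`.

```typescript
interface RunOutcome {
  agentType: string;
  status: "completed" | "failed";
}

interface Flag {
  agentType: string;
  runs: number;
  successRate: number; // 0..1
}

const MIN_RUNS = 10; // minimum sample size from the text
const MIN_SUCCESS_RATE = 0.6; // 60% threshold from the text

// Group runs by agent type, then flag types with enough runs and a low success rate.
function flagLowSuccess(runs: RunOutcome[]): Flag[] {
  const stats = new Map<string, { total: number; ok: number }>();
  for (const r of runs) {
    const s = stats.get(r.agentType) ?? { total: 0, ok: 0 };
    s.total += 1;
    if (r.status === "completed") s.ok += 1;
    stats.set(r.agentType, s);
  }
  const flags: Flag[] = [];
  stats.forEach((s, agentType) => {
    const successRate = s.ok / s.total;
    if (s.total >= MIN_RUNS && successRate < MIN_SUCCESS_RATE) {
      flags.push({ agentType, runs: s.total, successRate });
    }
  });
  return flags;
}
```

The minimum-run guard keeps an agent type with, say, five runs and five failures from being flagged before there is enough data to trust the rate.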
Decomposition Analyzer
Correlates the number of workers spawned with task completion rates:
"Sessions with 4–6 workers have 92% task completion vs 71% for sessions with 8+ workers. Consider limiting parallelism for complex tasks."
Resource Analyzer
Identifies cases where expensive models achieve similar success rates to cheaper alternatives:
"claude-opus-4.6 and claude-sonnet-4.5 have similar success rates for explorer tasks (94% vs 91%), but Opus uses 2.3x more tokens. Consider using Sonnet for exploration."
Context Injection
The top 3 pending suggestions (by confidence) are automatically included in the coordinator's prompt within a <fleet-intelligence> section. This gives the coordinator awareness of patterns without requiring user action.
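A sketch of the selection step follows. It assumes "pending" means neither applied nor dismissed, and the `renderSection` output format is hypothetical: this document does not show what formatIntelContext() actually emits.

```typescript
interface SuggestionLike {
  id: string;
  title: string;
  confidence: number; // 0-100
  applied: boolean;
  dismissed?: boolean;
}

// Assumption: "pending" = not applied and not dismissed.
function topPending(all: SuggestionLike[], limit = 3): SuggestionLike[] {
  return all
    .filter((s) => !s.applied && !s.dismissed) // filter() copies, so sort() does not mutate the input
    .sort((a, b) => b.confidence - a.confidence)
    .slice(0, limit);
}

// Hypothetical rendering of the prompt section.
function renderSection(items: SuggestionLike[]): string {
  const lines = items.map((s) => `- ${s.title} (confidence ${s.confidence})`);
  return ["<fleet-intelligence>", ...lines, "</fleet-intelligence>"].join("\n");
}
```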
Shadow Evolution
Shadow evolution lets you safely test skill prompt improvements before committing them. Instead of directly modifying skill files, changes are proposed as shadow records that accumulate A/B evaluation data.
Pipeline
SkillEvolver → Shadow Proposal → EvalRunner A/B → EvalJudge → Human Decision
SkillEvolver analyzes telemetry for a crew's skills and proposes prompt improvements via four channels:
- steering: insights extracted from user corrections/preferences
- error_pattern: recurring error patterns suggest prompt additions
- success_replication: high-performing runs inform what works
- manual: user-initiated changes
Shadow proposals are stored in IntelDatabase with the proposed patch, confidence score, and evidence.
EvalRunner runs A/B comparisons: current skill prompt vs shadow prompt on the same task. For coders, each gets an isolated worktree. Both outputs pass through VerifierNode (deterministic commands). If verification alone is inconclusive, EvalJudge spawns an LLM judge agent.
Results accumulate as shadowWins, shadowLosses, shadowTies. When enough data exists, the user can promote or reject.
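One way to read "enough data" is in terms of the `minShadowRuns` and `winRateThreshold` settings listed under Configuration. The sketch below is an assumption about how those two values might combine: the document does not define the win-rate formula, so dividing wins by total runs (which counts ties against the shadow) is a guess, and `hasEnoughEvidence` is a hypothetical name. The result is advisory only, since promotion remains a human decision.

```typescript
interface ShadowStats {
  shadowRuns: number;
  shadowWins: number;
  shadowLosses: number;
  shadowTies: number;
}

interface ShadowConfig {
  minShadowRuns: number; // e.g. 5
  winRateThreshold: number; // e.g. 0.6
}

// Assumption: win rate = wins / total runs, so ties count against the shadow.
function hasEnoughEvidence(s: ShadowStats, cfg: ShadowConfig): boolean {
  if (s.shadowRuns === 0 || s.shadowRuns < cfg.minShadowRuns) return false;
  return s.shadowWins / s.shadowRuns >= cfg.winRateThreshold;
}
```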
Commands
/learn evolve <crew> # Generate shadow proposals from telemetry
/learn shadows # List active shadows with win/loss stats
/learn eval <skill> # Run A/B evaluation (uses SelectMenu picker if no arg)
/learn promote <skill> # Promote shadow to live (overwrites skill file)
/learn reject <skill> # Reject and discard shadow
Configuration
{
"evolution": {
"enabled": true,
"autoEvolve": false,
"minSampleSize": 5,
"confidenceThreshold": 75,
"rollbackAfterSessions": 5
},
"shadow": {
"minShadowRuns": 5,
"winRateThreshold": 0.6
}
}
Outcome Backfill
The OutcomeBackfiller retroactively checks whether agent-produced changes stuck by querying git history:
- Was the commit merged to main? → outcomeMerged
- Was the commit reverted? → outcomeRevertedWithinDays
- Did all DoD items pass? → outcomeDodAllPassed
This data enables effectiveStatus: a run that "completed" but was later reverted is downgraded. The backfiller runs periodically with a configurable grace period (default 24h) and revert window (default 7 days).
Outcome data feeds back into the dashboard's gatePassedButReverted metric, helping identify skills that pass tests but produce low-quality changes.
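The downgrade rule described above can be sketched as a small pure function. The `"reverted"` label and the function signature are assumptions made for illustration; the document only states that such runs are downgraded, not what value they are downgraded to.

```typescript
interface RunOutcomeFields {
  status: "completed" | "failed";
  outcomeRevertedWithinDays?: number; // undefined = not reverted
}

// "reverted" is a hypothetical label for a downgraded run.
type EffectiveStatus = "completed" | "failed" | "reverted";

const REVERT_WINDOW_DAYS = 7; // default revert window from the text

function effectiveStatus(
  run: RunOutcomeFields,
  windowDays: number = REVERT_WINDOW_DAYS,
): EffectiveStatus {
  if (run.status === "failed") return "failed";
  const days = run.outcomeRevertedWithinDays;
  // A completed run reverted inside the window no longer counts as a success.
  return days !== undefined && days <= windowDays ? "reverted" : "completed";
}
```

Counting runs where the raw status is `completed` but the effective status is not gives a figure in the spirit of the gatePassedButReverted metric.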
Eval Framework
VerifierNode
Deterministic, non-LLM command executor. Runs whitelisted shell commands (npm, npx, vitest, jest, tsc, eslint, etc.) and returns structured results:
interface VerifierResult {
command: string;
args: string[];
exitCode: number;
stdout: string; // truncated to 10KB
stderr: string;
durationMs: number;
passed: boolean; // exitCode === 0
}
Default timeout: 2 minutes. Output capped at 10KB.
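Two of the guards mentioned here, the command whitelist and the output cap, are easy to sketch. Everything below is illustrative: the set contains only the commands named in the text (the full list is elided there), matching on the bare command name is an assumption, and the cap is applied to string length as an approximation of a byte limit.

```typescript
// Only the commands named in the text; the real whitelist is longer ("etc.").
const ALLOWED_COMMANDS = new Set(["npm", "npx", "vitest", "jest", "tsc", "eslint"]);
const MAX_OUTPUT_CHARS = 10 * 1024; // approximates the 10KB cap

// Assumption: the whitelist is matched against the bare command name.
function isAllowed(command: string): boolean {
  return ALLOWED_COMMANDS.has(command);
}

// Truncate captured output so one noisy command cannot bloat the stored result.
function capOutput(text: string, max: number = MAX_OUTPUT_CHARS): string {
  return text.length > max ? text.slice(0, max) : text;
}
```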
EvalRunner
Orchestrates A/B comparisons between current and shadow skill prompts:
- Both versions run the same task (coder agents get isolated worktrees)
- Both outputs pass through VerifierNode
- If verification is inconclusive, EvalJudge provides LLM-based comparison
Results: shadow_wins, current_wins, tie, both_fail, error.
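The decision flow can be sketched as follows. Several points are assumptions: "inconclusive" is taken to mean that both versions passed verification, the judge's `a` is taken to be the current prompt and `b` the shadow, and the judge is modeled as a synchronous callback for brevity even though a real judge agent would almost certainly be asynchronous.

```typescript
type EvalOutcome = "shadow_wins" | "current_wins" | "tie" | "both_fail" | "error";
type JudgeVerdict = "a_wins" | "b_wins" | "tie";

// Assumptions: a = current prompt, b = shadow prompt; the judge is only
// consulted when verification cannot separate the two (both passed).
function decide(
  currentPassed: boolean,
  shadowPassed: boolean,
  judge: () => JudgeVerdict,
): EvalOutcome {
  if (!currentPassed && !shadowPassed) return "both_fail";
  if (shadowPassed && !currentPassed) return "shadow_wins";
  if (currentPassed && !shadowPassed) return "current_wins";
  try {
    const verdict = judge();
    if (verdict === "b_wins") return "shadow_wins";
    if (verdict === "a_wins") return "current_wins";
    return "tie";
  } catch {
    return "error"; // judge failure is reported, not silently counted as a tie
  }
}
```

Consulting the deterministic verifier first means the LLM judge, which is slower and less repeatable, only runs when the cheap check cannot settle the comparison.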
EvalJudge
Spawns an impartial LLM judge agent that compares two outputs on correctness, quality, and completeness. Returns a_wins, b_wins, or tie with a one-sentence reason.
EvalScheduler
Periodic background evaluator. Runs every 30 minutes (configurable), iterates over all active shadows, and runs eval passes automatically. Integrates with the LoopScheduler lifecycle.
Steering Extraction
The SteeringExtractor uses LLM analysis at session end to discover persistent user preferences from conversation history.
Categories
| Category | Example |
|---|---|
preference | "Always use pnpm, not npm" |
prohibition | "Never modify the migrations directory" |
correction | "The API uses v2 endpoints, not v1" |
convention | "Use kebab-case for file names" |
tool_directive | "Run tests with --runInBand" |
Pipeline
- Last N messages (default 50) are sent to an LLM for analysis
- Insights are extracted with category and confidence score
- Duplicates are detected via SHA-256 hash of normalized text
- Insights above the memory threshold (default 70%) are auto-appended to .fleet/context/memory.md
- All insights are stored in IntelDatabase for querying
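The dedup and threshold steps can be sketched with Node's built-in `crypto` module. The normalization rule (trim, lowercase, collapse whitespace), the `>=` comparison against the threshold, and the function names are assumptions; the document only says that a SHA-256 hash of normalized text is used.

```typescript
import { createHash } from "node:crypto";

// Assumed normalization: trim, lowercase, collapse runs of whitespace.
function insightHash(text: string): string {
  const normalized = text.trim().toLowerCase().replace(/\s+/g, " ");
  return createHash("sha256").update(normalized, "utf8").digest("hex");
}

interface Candidate {
  extractedInsight: string;
  confidence: number; // 0-100
}

// Returns the candidates that are new and confident enough for memory.md.
// `seen` holds hashes of insights stored earlier and is updated in place.
function selectForMemory(
  candidates: Candidate[],
  seen: Set<string>,
  threshold = 70,
): Candidate[] {
  const selected: Candidate[] = [];
  for (const c of candidates) {
    const hash = insightHash(c.extractedInsight);
    if (seen.has(hash)) continue; // duplicate of an earlier insight
    seen.add(hash);
    if (c.confidence >= threshold) selected.push(c);
  }
  return selected;
}
```

Hashing the normalized text rather than the raw message means that trivial differences in casing or spacing do not produce duplicate memory entries.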
Configuration
{
"steering": {
"enabled": true,
"timeoutMs": 60000,
"maxInsights": 10,
"maxMessages": 50,
"memoryThreshold": 70
}
}
Phase 4: Assist
The assist phase provides active learning and memory features.
Auto-Memory
When a session completes with ≥3 tasks finished, a session summary is automatically appended to .fleet/context/memory.md:
## Session 2026-04-29 — Auth Module Refactor
- Completed 7/8 tasks in 42 minutes
- Key learnings: JWT middleware needed custom error handler
- Model: claude-opus-4.6, Tokens: 245,000
This memory file is auto-loaded into the coordinator prompt via ContextStore, giving future sessions access to past learnings.
Skill Memories (crew-scoped)
The SkillMemoryStore (src/intel/SkillMemoryStore.ts) persists short textual memories keyed by skill name and optionally scoped to a crew. When a crew is active, memories are stored with that crew's name, and queries return crew-matched rows plus legacy unscoped rows (crew_name IS NULL), so pre-existing memories remain visible. When no crew is active, only unscoped rows are returned (backward compatible). Deduplication is per (skill, crew, content). Skill memories are stored in the intel DB at $FLEET_HOME/intel/fleet-intel.db, so they can be isolated per project through FLEET_HOME.
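The visibility rule is easiest to see as a filter. The sketch below works on an in-memory array for illustration (the real store queries SQLite), and the row shape and function names are assumptions rather than the SkillMemoryStore API.

```typescript
interface SkillMemoryRow {
  skillName: string;
  crewName: string | null; // null = legacy unscoped row
  content: string;
}

// With an active crew: crew-matched rows plus unscoped rows.
// Without one: unscoped rows only.
function visibleMemories(
  rows: SkillMemoryRow[],
  skill: string,
  activeCrew: string | null,
): SkillMemoryRow[] {
  return rows.filter(
    (r) =>
      r.skillName === skill &&
      (r.crewName === null || (activeCrew !== null && r.crewName === activeCrew)),
  );
}

// Dedup key per (skill, crew, content). JSON.stringify avoids collisions that a
// plain separator-joined string could produce.
function dedupKey(r: SkillMemoryRow): string {
  return JSON.stringify([r.skillName, r.crewName, r.content]);
}
```

Because the crew is part of the dedup key, the same note can exist once per crew and once unscoped without being treated as a duplicate.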
/fleet-intel remember <note>
Manually add notes to project memory:
/fleet-intel remember "This project uses pnpm, not npm: always use pnpm install"
/learn Command
Unified learning dashboard combining suggestions, memory, evolution, and stats:
| Sub-command | Description |
|---|---|
/learn overview | Session insights + pending suggestions (default) |
/learn steering | View extracted steering insights |
/learn extract | Run steering extraction on current conversation |
/learn evolve <crew> | Propose skill prompt improvements from telemetry |
/learn dashboard <crew> | Evolution metrics with before/after comparison |
/learn shadows | List active shadow proposals with A/B stats |
/learn eval <skill> | Run A/B evaluation of current vs shadow prompt |
/learn promote <skill> | Promote a shadow to live |
/learn reject <skill> [reason] | Reject and discard a shadow |
/learn apply <id> | Apply a suggestion |
/learn dismiss <id> | Dismiss a suggestion |
/learn clear-memory | Reset the memory file |
LoopScheduler Integration
The LoopScheduler triggers periodic memory consolidation prompts every 30 minutes during active sessions. This prompts the coordinator to reflect on what's been learned and update memory accordingly.
Skill Versioning
When a skill file is overwritten:
- A backup of the previous version is saved automatically
- The version number is auto-incremented
- Rollback is possible by restoring the backup
/skill create
Generate skill templates:
/skill create auth-checker --type reviewer --desc "Validates authentication patterns"
Creates a structured skill file with the specified type and description.
Configuration
Fleet Intelligence is always-on — no configuration needed to get started.
Storage Locations
| Path | Purpose |
|---|---|
~/.fleet/intel/fleet-intel.db | SQLite database (sessions, runs, suggestions, shadows, steering) |
.fleet/context/memory.md | Auto-memory (per-project, loaded into coordinator prompt) |
Memory Integration
The .fleet/context/memory.md file is automatically loaded into the coordinator prompt by the ContextStore system (same mechanism used by /init context). No manual configuration required.