DEC-0018: Agent Project Management Framework

Decision

Adopt a project management framework purpose-built for AI agent teams, using prompts as the atomic work unit and layering adapted PM abstractions on top. The framework draws from Kanban (WIP limits, pull-based flow), Shape Up (appetite-based budgeting), Mission Command (intent + constraints), and high-reliability organization handoff protocols (medicine, aviation) rather than copying human-team Scrum/Jira models directly.

Context

The mystery-schools repo uses a four-agent model (Cursor, Claude Code, Perplexity, Codex) coordinated through a prompt relay system. As of PR-0134, the system has 125+ prompts, a telemetry harvester, and operational rules. The prompt relay already functions as a task management system — it has status lifecycle, ownership, priority, dependencies, timestamps, agent attribution, work classification, and cost estimation — but lacks the aggregation layer that turns individual prompts into legible project health, velocity, and trajectory.

The human (the "Operator") observes that agent teams behave similarly to human sprint teams — they work, they find interesting things, they sometimes underperform, they hand off and coordinate — but the underlying mechanics are fundamentally different, and the PM framework must account for this.

Fundamental Asymmetries Between Agent and Human Teams

1. Context Windows vs. Energy Curves

Humans get tired gradually. Agents hit hard context ceilings. An agent at 80% context is measurably degraded; a human at 80% energy can push through. Context is non-renewable within a session. This means capacity planning uses context budgets, not hours.
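
As a minimal sketch of capacity planning by context budget rather than hours, the check below asks whether a session can take on more work without crossing the degradation point. The `Session` class, its field names, the 200,000-token window, and the use of the 80% figure as a ceiling are all illustrative assumptions, not part of the existing relay.

```python
from dataclasses import dataclass

# Assumption: the 80% figure from the text is treated as a soft ceiling.
DEGRADATION_THRESHOLD = 0.80


@dataclass
class Session:
    """Hypothetical record of one agent session's context usage."""
    context_window_tokens: int  # model limit (placeholder value below)
    used_tokens: int = 0

    def utilization(self) -> float:
        """Fraction of the context window already consumed."""
        return self.used_tokens / self.context_window_tokens

    def can_accept(self, estimated_tokens: int) -> bool:
        """True if taking the work keeps the session at or under the ceiling."""
        projected = (self.used_tokens + estimated_tokens) / self.context_window_tokens
        return projected <= DEGRADATION_THRESHOLD


session = Session(context_window_tokens=200_000, used_tokens=120_000)
print(session.can_accept(30_000))  # projected utilization 0.75
print(session.can_accept(50_000))  # projected utilization 0.85
```

Because context is non-renewable within a session, a failed check suggests starting a fresh session rather than pushing on.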

2. Cold Starts vs. Persistent Memory

A developer on day 60 has internalized the codebase. An agent on session 60 reconstructs understanding from written artifacts every time. The prompt relay is not coordination overhead — it is the agent's external memory. If the relay degrades, the agent degrades.

3. Perfect Audit Trails, Zero Intuition

Every tool call is logged. But an agent will never say "this feels wrong." It follows rules as written, not rules as internalized. Quality depends on rule quality, not developer maturity.

4. Cost Proportional to Consumption, Not Time

A developer costs $X/hour regardless of output. An agent costs per token. This inverts estimation: you don't estimate hours, you estimate context consumption.
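
To make the inversion concrete, here is a rough sketch of estimating a prompt's cost from expected token consumption. The function name and the per-million-token prices are placeholders chosen for illustration; real prices depend on the model and provider.

```python
def estimate_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Estimate cost in USD from token counts and per-million-token prices."""
    total = input_tokens * input_price_per_mtok + output_tokens * output_price_per_mtok
    return total / 1_000_000


# Placeholder prices, not taken from any real price list.
cost = estimate_cost_usd(
    input_tokens=150_000,
    output_tokens=10_000,
    input_price_per_mtok=3.0,
    output_price_per_mtok=15.0,
)
print(f"${cost:.2f}")
```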

5. Prompt Quality Determines Output Quality (Less Slack)

A vague Jira ticket can produce good work if the developer has domain knowledge. A vague prompt produces vague output. Prompt specification quality is a first-class project health metric.

6. Handoff Is Survival, Not Ceremony

When a human "hands off," they transfer ownership. When an agent hands off, it creates the only record that allows the next session to function. This is closer to a hospital shift-change chart than a Jira ticket reassignment.

7. Rules Are Enforceable, Not Aspirational

Adding a rule to the accountability compact produces immediate compliance. Process improvement is configuration, not culture change. But bad rules produce bad behavior with no pushback.

8. Instant Parallelism, Non-Linear Coordination Costs

You can spin up 4 agents in worktrees in seconds. But 4 agents touching adjacent files produce merge conflicts that cost more than the parallelism saved.
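
One way to catch this before spinning up worktrees is to compare the directories each planned prompt expects to touch. The sketch below is hypothetical: the prompt IDs, file paths, and the choice of "same directory" as the definition of adjacent are assumptions made for illustration.

```python
from itertools import combinations
from pathlib import PurePosixPath


def touched_dirs(files: set[str]) -> set[str]:
    """Directories containing the files a prompt plans to edit."""
    return {str(PurePosixPath(f).parent) for f in files}


def find_overlaps(planned: dict[str, set[str]]) -> list[tuple[str, str, set[str]]]:
    """Return (prompt_a, prompt_b, shared_dirs) for each pair sharing a directory."""
    overlaps = []
    for (a, files_a), (b, files_b) in combinations(planned.items(), 2):
        shared = touched_dirs(files_a) & touched_dirs(files_b)
        if shared:
            overlaps.append((a, b, shared))
    return overlaps


# Hypothetical plan: prompt IDs and paths are invented for the example.
planned = {
    "prompt-a": {"app/internal/health/page.tsx", "lib/telemetry.ts"},
    "prompt-b": {"app/internal/health/chart.tsx"},
    "prompt-c": {"docs/README.md"},
}
print(find_overlaps(planned))
```

Pairs that show up here are candidates for sequencing rather than parallel execution.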

Adapted PM Abstractions

| Traditional | Agent Adaptation | Key Difference |
| --- | --- | --- |
| Sprint | Context cycle: bounded by aggregate context budget, not calendar | Constraint is physical (context windows), not temporal |
| Story points | Context budget: estimated % of context window consumed | Directly maps to cost and degradation risk |
| Velocity | Prompts completed per cycle, weighted by context consumption | Quality-adjusted: must exclude rejected prompts |
| Backlog | Prompt queue (already exists) | Prompt quality = backlog item quality; no slack for vague specs |
| Epic | Prompt group: explicit `epic` field or tag-based grouping | Needs first-class entity; `references` is too loose |
| Standup | Session pre-flight + context health report | Asynchronous, durable, data-driven |
| Retrospective | Friction analysis (`make telemetry-friction`) | Fully automatable from telemetry data |
| Sprint review | Verdict assessment (routing_verdict + output_verdict) | Aggregatable: "did the right agents get the right work?" |
| Scrum master | Rules system (.mdc files, AGENTS.md, accountability compact) | Static enforcement, not social persuasion |
| Sprint planning | Decompose backlog into context-window-sized prompts | The planning question: "does this fit in one session?" |
| Capacity planning | Available context windows × agent count per time period | Model-dependent: expensive models = fewer available sessions |
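
The velocity row can be read in more than one way. As one possible interpretation, the sketch below counts only non-rejected prompts and sums their context budgets per cycle. The `CompletedPrompt` class and the verdict values `"accepted"` and `"rejected"` are assumptions; only the `output_verdict` field name comes from the relay.

```python
from dataclasses import dataclass


@dataclass
class CompletedPrompt:
    """Hypothetical summary of one finished prompt."""
    prompt_id: str
    context_pct: float   # estimated % of context window consumed
    output_verdict: str  # assumed values: "accepted" or "rejected"


def velocity(prompts: list[CompletedPrompt]) -> tuple[int, float]:
    """Return (accepted prompt count, total context % of accepted prompts)."""
    accepted = [p for p in prompts if p.output_verdict != "rejected"]
    return len(accepted), sum(p.context_pct for p in accepted)


cycle = [
    CompletedPrompt("p1", 30.0, "accepted"),
    CompletedPrompt("p2", 50.0, "rejected"),  # excluded from velocity
    CompletedPrompt("p3", 20.0, "accepted"),
]
count, weighted = velocity(cycle)
print(count, weighted)
```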

Models Worth Adapting

Shape Up (Appetite Model)

Set a context budget before starting, not an estimate after scoping. If work exceeds its appetite, reshape or kill it: the problem is the specification, not the agent's speed. The "cooldown" period between execution cycles is when the operator reviews, grooms, and writes the next batch of prompts.
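
A small sketch of how an appetite check might look in practice, assuming the appetite is expressed in tokens and that a remaining-work estimate is available mid-session. The names `Decision` and `check_appetite` are hypothetical.

```python
from enum import Enum


class Decision(Enum):
    CONTINUE = "continue"
    RESHAPE_OR_KILL = "reshape_or_kill"


def check_appetite(
    appetite_tokens: int, consumed_tokens: int, remaining_estimate_tokens: int
) -> Decision:
    """Compare projected total consumption against the budget fixed up front."""
    projected = consumed_tokens + remaining_estimate_tokens
    if projected <= appetite_tokens:
        return Decision.CONTINUE
    # Over appetite: the spec needs reshaping; extending the budget is not an option here.
    return Decision.RESHAPE_OR_KILL


print(check_appetite(60_000, 40_000, 15_000))
print(check_appetite(60_000, 40_000, 30_000))
```

Note that the function never returns an "extend budget" outcome; that omission is the point of the appetite model.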

Kanban (WIP Limits + Pull Flow)

WIP limits are physically enforced by agent architecture (one prompt in-progress per agent, one context window per session). Pull-based flow is natural: agents pull from the queue when ready. Lead time is the key metric.
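
The pull model and the lead-time metric can be sketched as follows. This is an illustrative in-memory model, not the relay's actual implementation; the `Board` class, the agent names used as keys, and the timestamps are assumptions.

```python
from collections import deque
from datetime import datetime, timedelta
from typing import Optional


class Board:
    """Pull-based queue with a WIP limit of one prompt per agent."""

    def __init__(self, queue: list[str]) -> None:
        self.queue = deque(queue)
        self.in_progress: dict[str, str] = {}

    def pull(self, agent: str) -> Optional[str]:
        """Agent pulls the next prompt, unless it is busy or the queue is empty."""
        if agent in self.in_progress or not self.queue:
            return None
        prompt = self.queue.popleft()
        self.in_progress[agent] = prompt
        return prompt

    def complete(self, agent: str) -> str:
        """Mark the agent's current prompt done and free its WIP slot."""
        return self.in_progress.pop(agent)


def lead_time(created_at: datetime, completed_at: datetime) -> timedelta:
    """Elapsed time from prompt creation to completion."""
    return completed_at - created_at


board = Board(["p1", "p2"])
print(board.pull("cursor"))  # takes the first prompt
print(board.pull("cursor"))  # blocked by the WIP limit
print(lead_time(datetime(2025, 1, 1, 9, 0), datetime(2025, 1, 1, 15, 30)))
```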

Mission Command

Give agents clear intent and constraints; let them determine how to execute. A well-written prompt is a mission order. The accountability compact is the rules of engagement.

High-Reliability Organization Handoff Protocols

Medical SBAR (Situation, Background, Assessment, Recommendation) maps to prompt relay sections (Context, Task, Acceptance Criteria, Results). Research on structured handoffs in medicine, aviation, and nuclear operations shows that format consistency and degradation detection reduce errors. Agent context health tracking is the vitals check.
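
Format consistency is the easiest part of this to check mechanically. The sketch below flags relay sections that are missing or blank in a handoff prompt. The section names come from the list above; representing a prompt as a dictionary of section name to text, and the function name `missing_sections`, are assumptions.

```python
# Section names as listed in the text; the dict representation is an assumption.
REQUIRED_SECTIONS = ("Context", "Task", "Acceptance Criteria", "Results")


def missing_sections(handoff: dict[str, str]) -> list[str]:
    """Return required sections that are absent or contain only whitespace."""
    return [
        name for name in REQUIRED_SECTIONS
        if not handoff.get(name, "").strip()
    ]


handoff = {
    "Context": "Harvester now emits per-session context health.",
    "Task": "Render the health data as a read-only view.",
    "Acceptance Criteria": "   ",  # present but blank
}
print(missing_sections(handoff))
```

A non-empty result is a signal to reject or repair the handoff before the next session starts from it.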

Metrics Unique to Agent PM

Context efficiency: output quality per unit of context consumed
Prompt specification quality: acceptance criteria count, context file coverage → correlated with output verdict
Handoff fidelity: does the receiving session reconstruct context accurately? (measurable via rejection rate on handoff prompts)
Routing accuracy: did the right agent/model get the work? (routing_verdict aggregated over time)
Session density: useful output per session vs. overhead (pre-prompting %, context reconstruction time)
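
Two of these metrics reduce to simple ratios over telemetry records. The sketch below computes routing accuracy and the rejection rate on handoff prompts. The record shape, the `is_handoff` flag, and the verdict values (`"correct"`, `"rejected"`) are assumptions; only the `routing_verdict` and `output_verdict` field names appear in the source.

```python
def routing_accuracy(records: list[dict]) -> float:
    """Share of prompts whose routing_verdict is 'correct' (assumed value)."""
    if not records:
        return 0.0
    correct = sum(1 for r in records if r["routing_verdict"] == "correct")
    return correct / len(records)


def handoff_rejection_rate(records: list[dict]) -> float:
    """Share of handoff prompts whose output was rejected; a proxy for handoff fidelity."""
    handoffs = [r for r in records if r["is_handoff"]]
    if not handoffs:
        return 0.0
    rejected = sum(1 for r in handoffs if r["output_verdict"] == "rejected")
    return rejected / len(handoffs)


# Invented sample records for illustration only.
records = [
    {"routing_verdict": "correct", "output_verdict": "accepted", "is_handoff": True},
    {"routing_verdict": "correct", "output_verdict": "rejected", "is_handoff": True},
    {"routing_verdict": "wrong", "output_verdict": "accepted", "is_handoff": False},
    {"routing_verdict": "correct", "output_verdict": "accepted", "is_handoff": False},
]
print(routing_accuracy(records), handoff_rejection_rate(records))
```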

Implementation Path

Perplexity researches HRO handoff protocols and PM model adaptation for agent teams (PR-0137)
Cursor builds the agent operations center as /internal/* routes in the existing Next.js site (PR-0138, depends on PR-0137)

Risks

Over-engineering the PM layer before the project needs it. The current prompt relay + telemetry harvester may be sufficient for months.
Abstracting too early: "sprints" and "epics" are only useful if the project has enough volume to make grouping valuable. With 4 agents and ~125 prompts, the overhead of formal sprint ceremonies may exceed the benefit.
The dashboard becoming the work instead of the work being the work.

Mitigation: build the operations center incrementally. Start with read-only views (health, flow metrics, prompt browser). Add management actions (claim/complete from the UI) only when the CLI becomes friction.