human | Status: accepted

DEC-0018: Agent Project Management Framework

Decision

Adopt a project management framework purpose-built for AI agent teams, using prompts as the atomic work unit and layering adapted PM abstractions on top. The framework draws from Kanban (WIP limits, pull-based flow), Shape Up (appetite-based budgeting), Mission Command (intent + constraints), and high-reliability organization handoff protocols (medicine, aviation) rather than copying human-team Scrum/Jira models directly.

Context

The mystery-schools repo uses a four-agent model (Cursor, Claude Code, Perplexity, Codex) coordinated through a prompt relay system. As of PR-0134, the system has 125+ prompts, a telemetry harvester, and operational rules. The prompt relay already functions as a task management system — it has status lifecycle, ownership, priority, dependencies, timestamps, agent attribution, work classification, and cost estimation — but lacks the aggregation layer that turns individual prompts into legible project health, velocity, and trajectory.
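The fields listed above could be modeled roughly as follows. This is a minimal sketch, not the relay's actual schema: the class name `PromptRecord`, the field names, and the status values are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

# Hypothetical lifecycle values; the relay's real status names are not given here.
STATUSES = ("queued", "in_progress", "completed", "rejected")


@dataclass
class PromptRecord:
    """Illustrative shape of one prompt in the relay (assumed, not the real schema)."""
    prompt_id: str                       # e.g. "PR-0137"
    status: str = "queued"               # status lifecycle
    owner: Optional[str] = None          # ownership / agent attribution
    priority: int = 3                    # lower number = more urgent (assumption)
    depends_on: list[str] = field(default_factory=list)
    work_class: str = "unclassified"     # work classification
    estimated_context_pct: float = 0.0   # cost estimate as % of a context window
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def is_ready(self, completed_ids: set[str]) -> bool:
        """A prompt can be pulled once it is queued and all dependencies are done."""
        return self.status == "queued" and all(
            dep in completed_ids for dep in self.depends_on
        )


research = PromptRecord("PR-0137", owner="Perplexity")
build = PromptRecord("PR-0138", owner="Cursor", depends_on=["PR-0137"])
print(build.is_ready(set()), build.is_ready({"PR-0137"}))
```

An aggregation layer would then be a set of functions over a list of such records, rather than new state stored somewhere else.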

The human (the "Operator") observes that agent teams behave similarly to human sprint teams — they work, they find interesting things, they sometimes underperform, they hand off and coordinate — but the underlying mechanics are fundamentally different, and the PM framework must account for this.

Fundamental Asymmetries Between Agent and Human Teams

1. Context Windows vs. Energy Curves

Humans get tired gradually. Agents hit hard context ceilings. An agent at 80% context is measurably degraded; a human at 80% energy can push through. Context is non-renewable within a session. This means capacity planning uses context budgets, not hours.

2. Cold Starts vs. Persistent Memory

A developer on day 60 has internalized the codebase. An agent on session 60 reconstructs understanding from written artifacts every time. The prompt relay is not coordination overhead — it is the agent's external memory. If the relay degrades, the agent degrades.

3. Perfect Audit Trails, Zero Intuition

Every tool call is logged. But an agent will never say "this feels wrong." It follows rules as written, not rules as internalized. Quality depends on rule quality, not developer maturity.

4. Cost Proportional to Consumption, Not Time

A developer costs $X/hour regardless of output. An agent costs per token. This inverts estimation: you don't estimate hours, you estimate context consumption.
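As a rough illustration of estimating consumption instead of hours, the sketch below turns token estimates into a context percentage and a cost. The function name, the prices, and the window size in the usage lines are hypothetical; the 80% threshold is taken from the degradation point mentioned in asymmetry 1.

```python
# Threshold from asymmetry 1: an agent at 80% context is treated as degraded.
DEGRADATION_THRESHOLD_PCT = 80.0


def estimate_session(prompt_tokens, output_tokens, context_window,
                     price_in_per_mtok, price_out_per_mtok):
    """Return (context %, cost, fits) for one planned session.

    All inputs are estimates supplied by the planner; prices are per
    million tokens and are placeholders, not real vendor rates.
    """
    context_pct = 100.0 * (prompt_tokens + output_tokens) / context_window
    cost = (prompt_tokens / 1e6) * price_in_per_mtok \
        + (output_tokens / 1e6) * price_out_per_mtok
    fits = context_pct < DEGRADATION_THRESHOLD_PCT
    return context_pct, cost, fits


# Hypothetical numbers: 40k tokens in, 10k out, 200k window.
pct, cost, fits = estimate_session(40_000, 10_000, 200_000, 3.0, 15.0)
print(f"{pct:.1f}% of context, est. cost {cost:.2f}, fits: {fits}")
```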

5. Prompt Quality Determines Output Quality (Less Slack)

A vague Jira ticket can produce good work if the developer has domain knowledge. A vague prompt produces vague output. Prompt specification quality is a first-class project health metric.

6. Handoff Is Survival, Not Ceremony

When a human "hands off," they transfer ownership. When an agent hands off, it creates the only record that allows the next session to function. This is closer to a hospital shift-change chart than to a Jira ticket reassignment.

7. Rules Are Enforceable, Not Aspirational

Adding a rule to the accountability compact produces immediate compliance. Process improvement is configuration, not culture change. But bad rules produce bad behavior with no pushback.

8. Instant Parallelism, Non-Linear Coordination Costs

You can spin up 4 agents in worktrees in seconds. But 4 agents touching adjacent files produce merge conflicts that cost more than the parallelism saved.
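One way to check this before fanning work out is to compare the files each prompt is expected to touch. The sketch below treats "adjacent" as "in the same directory", which is an assumption, and the prompt IDs and paths are made up for illustration.

```python
from itertools import combinations
from pathlib import PurePosixPath


def directories(files):
    """Directories containing the given files (a crude proxy for adjacency)."""
    return {str(PurePosixPath(f).parent) for f in files}


def conflicting_pairs(planned):
    """Return (prompt_a, prompt_b, shared_dirs) for prompts that overlap.

    `planned` maps a prompt ID to the file paths it is expected to touch.
    """
    conflicts = []
    for a, b in combinations(sorted(planned), 2):
        shared = directories(planned[a]) & directories(planned[b])
        if shared:
            conflicts.append((a, b, sorted(shared)))
    return conflicts


# Hypothetical plan: P-1 and P-2 both work under app/internal.
planned = {
    "P-1": ["app/internal/page.tsx"],
    "P-2": ["app/internal/layout.tsx"],
    "P-3": ["docs/notes.md"],
}
print(conflicting_pairs(planned))
```

Prompts that appear in a conflicting pair are candidates for sequential execution; the rest can run in parallel worktrees.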

Adapted PM Abstractions

| Traditional | Agent Adaptation | Key Difference |
| --- | --- | --- |
| Sprint | Context cycle: bounded by aggregate context budget, not calendar | Constraint is physical (context windows), not temporal |
| Story points | Context budget: estimated % of context window consumed | Directly maps to cost and degradation risk |
| Velocity | Prompts completed per cycle, weighted by context consumption | Quality-adjusted: must exclude rejected prompts |
| Backlog | Prompt queue (already exists) | Prompt quality = backlog item quality; no slack for vague specs |
| Epic | Prompt group: explicit epic field or tag-based grouping | Needs first-class entity; references is too loose |
| Standup | Session pre-flight + context health report | Asynchronous, durable, data-driven |
| Retrospective | Friction analysis (make telemetry-friction) | Fully automatable from telemetry data |
| Sprint review | Verdict assessment (routing_verdict + output_verdict) | Aggregatable: "did the right agents get the right work?" |
| Scrum master | Rules system (.mdc files, AGENTS.md, accountability compact) | Static enforcement, not social persuasion |
| Sprint planning | Decompose backlog into context-window-sized prompts | The planning question: "does this fit in one session?" |
| Capacity planning | Available context windows × agent count per time period | Model-dependent: expensive models = fewer available sessions |
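The velocity and capacity rows can be made concrete with a small calculation. This is a sketch under assumptions: the record keys (`status`, `context_pct`) and the verdict values are hypothetical, and only the `output_verdict` field name comes from this document.

```python
def weighted_velocity(prompts):
    """Context-weighted velocity for one cycle.

    Sums the context consumed by completed prompts and excludes any
    prompt whose output was rejected, so rework does not inflate the number.
    """
    return sum(
        p["context_pct"]
        for p in prompts
        if p["status"] == "completed" and p["output_verdict"] != "rejected"
    )


def session_capacity(agent_count, sessions_per_agent):
    """Available context windows in a period: windows per agent × agent count."""
    return agent_count * sessions_per_agent


# Hypothetical cycle: one of three completed prompts was rejected.
cycle = [
    {"status": "completed", "output_verdict": "accepted", "context_pct": 30.0},
    {"status": "completed", "output_verdict": "rejected", "context_pct": 50.0},
    {"status": "completed", "output_verdict": "accepted", "context_pct": 20.0},
]
print(weighted_velocity(cycle), session_capacity(4, 3))
```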

Models Worth Adapting

Shape Up (Appetite Model)

Set a context budget before starting, not an estimate after scoping. If work exceeds its appetite, reshape or kill it: the problem is the specification, not the agent's speed. "Cooldown" periods between execution cycles correspond to the operator reviewing, grooming, and writing the next batch of prompts.
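The appetite rule reduces to a simple check at the end of each session. The function and its return labels below are illustrative names, not part of the existing tooling.

```python
def appetite_decision(appetite_pct, consumed_pct, done):
    """Decide what to do with a piece of work given its context appetite.

    appetite_pct: budget fixed before work starts (% of context windows).
    consumed_pct: context actually consumed so far.
    done:         whether the acceptance criteria are met.
    """
    if done:
        return "ship"
    if consumed_pct < appetite_pct:
        return "continue"
    # Over budget and not done: the spec is the problem, so do not extend.
    return "reshape_or_kill"


print(appetite_decision(appetite_pct=150, consumed_pct=160, done=False))
```

The key design choice is that there is no "extend the budget" branch: exceeding the appetite always sends the work back to the operator.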

Kanban (WIP Limits + Pull Flow)

WIP limits are physically enforced by agent architecture (one prompt in-progress per agent, one context window per session). Pull-based flow is natural: agents pull from the queue when ready. Lead time is the key metric.
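A pull queue with a WIP limit of one per agent can be sketched in a few lines. The `Board` class and the prompt IDs are hypothetical, and timestamps are passed in explicitly so that lead time is easy to compute.

```python
from collections import deque
from datetime import datetime, timedelta


class Board:
    """Minimal pull-based board: at most one in-progress prompt per agent."""

    def __init__(self, queue):
        self.queue = deque(queue)   # prompt IDs in priority order
        self.in_progress = {}       # agent name -> prompt ID

    def pull(self, agent):
        """Agent pulls the next prompt, or gets None if busy or queue is empty."""
        if agent in self.in_progress or not self.queue:
            return None
        prompt_id = self.queue.popleft()
        self.in_progress[agent] = prompt_id
        return prompt_id

    def complete(self, agent):
        """Mark the agent's current prompt done and free its WIP slot."""
        return self.in_progress.pop(agent)


def lead_time(created_at, completed_at):
    """Lead time: elapsed time from entering the queue to completion."""
    return completed_at - created_at


board = Board(["P-1", "P-2"])
print(board.pull("Cursor"), board.pull("Cursor"), board.pull("Codex"))
created = datetime(2025, 1, 1, 9, 0)
print(lead_time(created, created + timedelta(hours=5)))
```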

Mission Command

Give agents clear intent and constraints; let them determine how to execute. A well-written prompt is a mission order. The accountability compact is the rules of engagement.

High-Reliability Organization Handoff Protocols

Medical SBAR (Situation, Background, Assessment, Recommendation) maps to prompt relay sections (Context, Task, Acceptance Criteria, Results). Research on structured handoffs in medicine, aviation, and nuclear operations shows that format consistency and degradation detection reduce errors. Agent context health tracking is the vitals check.
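Format consistency is the part of this that is easiest to enforce mechanically. The sketch below checks that a handoff contains the four relay sections named above; representing a handoff as a dictionary of section name to text is an assumption about how a prompt file might be parsed.

```python
# Section names as listed for the prompt relay in this document.
REQUIRED_SECTIONS = ("Context", "Task", "Acceptance Criteria", "Results")


def missing_sections(handoff):
    """Return required sections that are absent or blank in a handoff.

    `handoff` maps section name -> section text (hypothetical parsed form).
    """
    return [
        name for name in REQUIRED_SECTIONS
        if not handoff.get(name, "").strip()
    ]


handoff = {
    "Context": "Relay has 125+ prompts; dashboard not yet built.",
    "Task": "Research HRO handoff protocols.",
    "Acceptance Criteria": "",
}
print(missing_sections(handoff))
```

A check like this could run before a prompt is marked complete, so an incomplete chart never reaches the next session.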

Metrics Unique to Agent PM

  • Context efficiency: output quality per unit of context consumed
  • Prompt specification quality: acceptance criteria count, context file coverage → correlated with output verdict
  • Handoff fidelity: does the receiving session reconstruct context accurately? (measurable via rejection rate on handoff prompts)
  • Routing accuracy: did the right agent/model get the work? (routing_verdict aggregated over time)
  • Session density: useful output per session vs. overhead (pre-prompting %, context reconstruction time)
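Two of these metrics are simple ratios over prompt records. The sketch below assumes each record carries `routing_verdict` and `output_verdict` (field names from this document) plus a hypothetical `is_handoff` flag; the verdict values "correct" and "rejected" are placeholders.

```python
def fraction(records, predicate):
    """Share of records matching the predicate, or None if there are none."""
    records = list(records)
    if not records:
        return None
    return sum(1 for r in records if predicate(r)) / len(records)


def handoff_rejection_rate(records):
    """Handoff fidelity proxy: rejection rate among handoff prompts only."""
    handoffs = [r for r in records if r.get("is_handoff")]
    return fraction(handoffs, lambda r: r["output_verdict"] == "rejected")


def routing_accuracy(records):
    """Share of prompts whose routing was judged correct."""
    return fraction(records, lambda r: r["routing_verdict"] == "correct")


# Hypothetical records for one cycle.
records = [
    {"is_handoff": True, "routing_verdict": "correct", "output_verdict": "accepted"},
    {"is_handoff": True, "routing_verdict": "correct", "output_verdict": "rejected"},
    {"is_handoff": False, "routing_verdict": "wrong", "output_verdict": "accepted"},
    {"is_handoff": False, "routing_verdict": "correct", "output_verdict": "accepted"},
]
print(handoff_rejection_rate(records), routing_accuracy(records))
```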

Implementation Path

  1. Perplexity researches HRO handoff protocols and PM model adaptation for agent teams (PR-0137)
  2. Cursor builds the agent operations center as /internal/* routes in the existing Next.js site (PR-0138, depends on PR-0137)

Risks

  • Over-engineering the PM layer before the project needs it. The current prompt relay + telemetry harvester may be sufficient for months.
  • Abstracting too early: "sprints" and "epics" are only useful if the project has enough volume to make grouping valuable. With 4 agents and ~125 prompts, the overhead of formal sprint ceremonies may exceed the benefit.
  • The dashboard becoming the work instead of the work being the work.

Mitigation: build the operations center incrementally. Start with read-only views (health, flow metrics, prompt browser). Add management actions (claim/complete from the UI) only when the CLI becomes friction.
