Convoy Stability Roadmap
How to get from where we are to the target UX, while preserving existing workflows and fixing the reliability problems people actually hit.
Current state
Milestone 0 complete -- all foundation PRs merged.
Workflows to preserve
Workflow A: Manual bead creation + batch sling
The most common pattern today:
bd create --type=task "Fix auth timeout" → sh-task-1
bd create --type=task "Add validation" → sh-task-2
bd create --type=task "Integration tests" → sh-task-3
bd dep add sh-task-2 sh-task-1 --type=blocks
gt sling sh-task-1 sh-task-2 sh-task-3 gastown
What happens today (with PR #1759):
- Batch sling creates one convoy tracking all 3 tasks
- Rig is auto-resolved from bead prefixes (explicit rig is deprecated)
- Tasks sling sequentially with 2s delays, sharing 1 convoy
blocksdeps are respected by the daemon feeder — sh-task-2 won't be fed by the daemon until sh-task-1 closes (but initial dispatch sends all tasks regardless of deps)
What people expect:
- Tasks dispatch in dependency order
- Tasks that are blocked don't get slung until their blockers close
- Completed tasks land on the target branch through the refinery
Workflow B: design-to-beads + manual sling
/design-to-beads PRD.md
→ creates: root epic, sub-epics, leaf tasks
→ adds: parent-child deps (organizational hierarchy)
→ adds: blocks deps (execution ordering between tasks)
gt sling <task1> <task2> <task3> gastown
Same outcome as Workflow A: one shared convoy, blocks deps respected
by the daemon feeder. The epic and sub-epic structure exists in beads
and affects daemon-driven feeding (epics are filtered by IsSlingableType,
blocked tasks wait for their blockers to close).
Workflow C: Manual convoy creation
gt convoy create "Auth overhaul" sh-task-1 sh-task-2 sh-task-3
gt sling sh-task-1 gastown
→ witness feeds sh-task-2 when sh-task-1 closes (serial)
→ witness feeds sh-task-3 when sh-task-2 closes (serial)
→ convoy auto-closes when all 3 are done
This works on upstream/main but is serial (one task at a time) and the witness feed ignores blocks deps, type filters, and rig capacity.
Target UX
The ideal experience, achievable at the end of this roadmap:
/design-to-beads PRD.md
→ creates: root epic → sub-epics → leaf tasks
→ adds: parent-child (hierarchy) + blocks (ordering) deps
→ sub-epics get integration branches
gt convoy stage <epic-id>
→ walks DAG, validates structure, displays route plan (tree + waves)
→ creates staged convoy tracking all beads
gt convoy launch <convoy-id>
→ activates convoy, dispatches Wave 1 tasks
→ daemon feeds subsequent waves as tasks close
→ sub-epic status auto-managed (open → in_progress → closed)
→ when sub-epic closes: sling sub-epic with review formula
→ review formula examines accumulated changes on integration branch
→ on approval: integration branch lands to main/parent branch
→ convoy closes when root epic closes
What people actually report as broken
The most common complaint: tasks don't make it through the refinery and land on the target branch. This is NOT a convoy problem — it's a sling→done→refinery pipeline reliability problem. The convoy system layers on top of this pipeline.
Critical failure points (independent of convoys)
| # | Failure | Where | Severity | Recovery |
|---|---|---|---|---|
| 1 | done.go | Resolved | Eliminated by all-on-main architecture (no per-polecat Dolt branches). | |
| 2 | Push fails (all 3 tiers) | done.go:531-572 | Critical | Commits local-only. Worktree preserved. Manual recovery required. |
| 3 | MR bead creation fails | done.go:744-752 | High | Branch pushed but no MR. Witness notified. No auto-recovery. |
| 4 | Refinery never wakes (agent stall) | Agent-level | High | Heartbeat restarts, but gap can be minutes. |
| 5 | Merge conflict blocks MR indefinitely | engineer.go:764-786 | Medium | Conflict task must be dispatched + resolved. Stalls if rig at capacity. |
| 6 | Orphaned MR (branch deleted, MR still open) | engineer.go:1086-1198 | Medium | Anomaly detection finds it. Agent must act. |
These failures affect ALL polecat work, not just convoy-tracked work. Fixing them benefits the entire system.
Convoy-specific failure points
| # | Failure | Fixed by | Status |
|---|---|---|---|
| 7 | Blocked tasks get slung (blocks deps ignored) | isIssueBlocked | PR #1759 (open) |
| 8 | Epics get slung to polecats (no type filter) | IsSlingableType | PR #1759 (open) |
| 9 | Cross-rig close events invisible to daemon | Multi-rig SDK polling | Merged |
| 10 | Daemon doesn't feed next task after close | Continuation feeding | Merged |
| 11 | Refinery convoy check passes wrong path (never works) | Call removed | Merged |
| 12 | First dispatch failure abandons entire convoy | Dispatch failure iteration | PR #1759 (open) |
| 13 | Stranded scan is reporting-only, doesn't auto-dispatch | feedFirstReady | Merged |
Phased plan
Milestone 0: Land the foundation
Status: Complete.
Milestone 1: Pipeline reliability (independent of convoys)
Goal: Fix the sling→done→refinery pipeline failures that cause "tasks don't land" complaints.
This is the highest-impact work for user-reported problems. Convoys can't deliver if the underlying pipeline drops tasks.
Work items:
| # | Problem | Proposed fix | Complexity |
|---|---|---|---|
| 1a | Resolved — all-on-main eliminates per-polecat Dolt branches. | N/A | |
| 1b | Resolved — no per-polecat Dolt branches to strand on. | N/A | |
| 1c | Refinery agent stall | Harden refinery heartbeat. Add a daemon-level MR queue monitor that nudges (or restarts) the refinery when MRs sit unprocessed beyond a threshold. | Medium |
| 1d | Merge conflicts block indefinitely | Track conflict task age. If unresolved after N hours, escalate to Mayor/owner with the specific conflict details. | Low |
This milestone is independent of convoy work. It can be done in parallel by a different contributor, or sequenced after Milestone 0.
Milestone 2: Stage and launch (gt convoy stage, gt convoy launch)
Goal: Enable the /design-to-beads → gt convoy stage → gt convoy launch workflow.
Depends on: Milestone 0 (the feeder must respect blocks deps and filter types for staged convoys to work correctly).
What ships (from Phase 2 PRD):
gt convoy stage <bead-id>— DAG walking, validation, wave computation, tree + wave route plan displaygt convoy launch <convoy-id>— activates convoy, dispatches Wave 1- Epic status management (open → in_progress → closed)
- Integration branch awareness (warnings when missing)
- Staged status transitions (staged_ready ↔ staged_warnings → open)
Key design decisions already made:
parent-childis organizational only, never blocking (aligned withbd readyand beads SDK)- Execution ordering is via explicit
blocksdeps - Wave computation is informational (display only), runtime dispatch uses
per-cycle
isIssueBlockedchecks - Integration branch creation and landing remain manual (or refinery auto-land)
What this enables for Workflow B:
/design-to-beads PRD.md
gt convoy stage <root-epic-id>
→ see tree view + wave view
→ see warnings (missing integration branch, parked rigs, etc.)
gt convoy launch <convoy-id>
→ Wave 1 tasks dispatched automatically
→ subsequent waves fed by daemon as tasks close
→ epic statuses update as children progress
→ convoy closes when root epic closes
What it does NOT enable yet:
- Sub-epic review formula (see Milestone 3)
- Auto-formula detection for epic slinging (Phase 3)
- Coordinator polecat (Phase 3)
Milestone 3: Sub-epic review gate
Goal: When all tasks under a sub-epic complete and merge into the sub-epic's integration branch, automatically trigger a comprehensive review of the accumulated changes before landing.
This is the missing piece between "tasks merge to integration branch" and "integration branch lands to main."
Current state: Integration branch landing is purely mechanical — all children closed + all MRs merged = ready to land. There is no review step that examines the combined diff.
Proposed mechanism:
-
Sub-epic completion trigger: When the convoy's epic status management (Milestone 2 US-014) closes a sub-epic, instead of (or before) auto-landing, sling the sub-epic itself with a review formula.
-
Review formula: A new formula (e.g.,
mol-integration-reviewor adaptcode-review.formula.toml) that:- Checks out the integration branch
- Computes the full diff against the base branch
- Reviews the accumulated changes for:
- Cross-task consistency
- API contract violations between tasks
- Missing tests for combined functionality
- Merge conflict residue
- Produces a review report
- If approved: runs
gt mq integration land <sub-epic-id> - If rejected: creates a fix task, blocks the sub-epic on it
-
Convoy awareness: The convoy stays open while the review runs. The review polecat's completion triggers the next sub-epic (if the root epic has
blocksdeps between sub-epics) or the root epic closure.
Integration points:
internal/convoy/operations.go— after closing an epic, check if it has an integration branch. If yes, sling with review formula instead of callinggt mq integration land.internal/daemon/convoy_manager.go— the event poll detects the review polecat's bead close, feeds the next sub-epic or closes the root epic.- New formula:
mol-integration-review.formula.toml
design-to-beads changes needed:
- Ensure sub-epics get integration branches (either design-to-beads
creates them, or
gt convoy stagecreates them at stage time) - Ensure
blocksdeps exist between sub-epics if sequential ordering is desired
Milestone 4: Advanced dispatch (Phase 3 PRD)
Goal: Pluggable dispatch strategies and coordinator polecats.
What ships:
FeederStrategyinterface- Hierarchy depth validation (opt-in)
- Auto-generate
blocksdeps from hierarchy (--infer-blocks) - Auto-formula detection in
gt sling(epic → coordinator formula) - Coordinator polecat strategy
- Dynamic DAG decomposition
This milestone is the furthest out and the least urgent. The default dispatch strategy (Phase 1 feeder with blocks checking) covers the common case. The coordinator polecat is for complex epics where AI-driven task selection outperforms static dependency ordering.
Dependency graph
Milestone 0: Foundation ← rewrite MERGED, safety guards in PR [#1759](https://github.com/steveyegge/gastown/pull/1759)
│
├──────────────────────────┐
│ │
v v
Milestone 1: Pipeline Milestone 2: Stage/Launch
(done/refinery fixes) (gt convoy stage/launch)
│ │
│ v
│ Milestone 3: Sub-epic review gate
│ │
└──────────┬───────────────┘
│
v
Milestone 4: Advanced dispatch
Milestones 1 and 2 are independent and can run in parallel. Milestone 3 depends on Milestone 2 (needs epic status management). Milestone 4 depends on both 2 and 3 being stable.
What design-to-beads needs to change
The current design-to-beads plugin creates the right structure (epics with parent-child deps, tasks with blocks deps). For the staged convoy workflow, it needs:
| Change | When needed | Who |
|---|---|---|
Create blocks deps between sub-epics (not just between tasks) | Milestone 2 | design-to-beads plugin |
| Create integration branches for sub-epics | Milestone 3 | design-to-beads plugin or gt convoy stage |
Output the root epic ID for gt convoy stage input | Milestone 2 | design-to-beads plugin |
The current plugin already creates blocks deps between tasks. The gap is
inter-sub-epic ordering: if Sub-Epic A should complete before Sub-Epic B
starts, a blocks dep between them (or between A's last task and B's
first task) must exist.
If design-to-beads doesn't create inter-sub-epic blocks deps, gt convoy stage will show them dispatching in parallel (Wave 1), which may or may
not be desired. The --infer-blocks flag (Milestone 4) can auto-generate
these from creation order, but explicit deps from the PRD structure are
more reliable.
Summary: what to do next
-
Now: Get PR #1759 (feeder safety guards) reviewed and merged to complete Milestone 0.
-
Next: Start Milestone 1 (pipeline reliability) and/or Milestone 2 (stage/launch) depending on priorities. Milestone 1 has broader impact (fixes "tasks don't land" for everyone). Milestone 2 enables the staged convoy UX. These can run in parallel.
-
After M2: Milestone 3 (sub-epic review gate) is the key piece connecting design-to-beads output to the full automated workflow.
-
Later: Milestone 4 (advanced dispatch) when the common case is stable.