Question 35
Domain 5
A multi-phase coordinator agent crashes after completing phase 2 of a 4-phase research pipeline. The pipeline must resume without restarting from the beginning. Which architecture enables reliable crash recovery?
Correct answer: C
Explanation
A structured state manifest preserves the pipeline’s progress by recording “completed work, partial results, [and] next phase inputs,” so a restart can resume from the last finished phase instead of rerunning earlier phases. Loading that manifest on startup provides durable checkpointing and crash recovery for multi-phase workflows.
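As a rough illustration of what such a manifest might look like, here is a minimal Python sketch. The file layout, the field names (`completed_phases`, `partial_results`, `next_phase_inputs`), and the helper names `save_manifest` and `load_manifest` are assumptions made for this example; they are not part of any particular agent SDK.

```python
import json
import os
import tempfile
from pathlib import Path


def save_manifest(path: Path, manifest: dict) -> None:
    """Write the manifest via a temp file so a crash mid-write leaves the old copy intact."""
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as tmp:
        json.dump(manifest, tmp, indent=2)
        tmp.flush()
        os.fsync(tmp.fileno())  # push the bytes to disk before the rename
    os.replace(tmp_name, str(path))  # swap the finished file into place


def load_manifest(path: Path) -> dict:
    """Return the saved manifest, or an empty one if no phase has finished yet."""
    if not path.exists():
        # Hypothetical field names mirroring the three items named in option C.
        return {"completed_phases": [], "partial_results": {}, "next_phase_inputs": {}}
    return json.loads(path.read_text(encoding="utf-8"))
```

Writing to a temporary file and then renaming it is one common way to keep the checkpoint itself from being corrupted by the very crash it is meant to survive; other durable stores (a database row, an object store key) would serve the same purpose.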
Why each option is right or wrong
A. Rely on `--resume` to reload the conversation history and infer progress from prior messages
Conversation history is not a reliable execution checkpoint and may not unambiguously encode phase completion.
B. Implement session checkpointing with `fork_session` at the end of each phase
Session forking creates a branch of interaction context, not durable workflow state for restart recovery.
C. Have each phase export a structured state manifest (completed work, partial results, next phase inputs) to a known location, and have the coordinator load that manifest on startup to determine where to resume
This is checkpoint-based persistence: each phase writes durable state before the pipeline advances, so a restart can reconstruct progress after a crash. Because the phase-completion record, intermediate outputs, and the inputs for the next step are stored in a known location, the coordinator can reload that checkpoint on startup and continue at phase 3 rather than re-executing phases 1–2.
D. Configure the coordinator with a high `max_tokens` budget so phases are less likely to be interrupted
Larger token budgets may reduce truncation risk but do not provide persisted progress after a crash.
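To make the resume behavior of option C concrete, the following Python sketch simulates the scenario in the question: a crash after phase 2, followed by a restart. The phase names, the `run_pipeline` function, and the `crash_after` parameter are all hypothetical and exist only for this demonstration; the phase "work" is a placeholder string, and a production version would write the manifest atomically rather than with a plain `write_text`.

```python
import json
import tempfile
from pathlib import Path
from typing import List, Optional

PHASES = ["collect", "analyze", "synthesize", "report"]  # hypothetical phase names


def run_pipeline(manifest_path: Path, crash_after: Optional[int] = None) -> List[str]:
    """Run the phases not yet recorded as complete; return the names run this time."""
    if manifest_path.exists():
        state = json.loads(manifest_path.read_text())
    else:
        state = {"completed": 0, "next_phase_inputs": "initial query"}

    executed = []
    for index in range(state["completed"], len(PHASES)):
        name = PHASES[index]
        output = f"{name}({state['next_phase_inputs']})"  # stand-in for real phase work
        executed.append(name)
        # Checkpoint before advancing: record progress and the next phase's inputs.
        state = {"completed": index + 1, "next_phase_inputs": output}
        manifest_path.write_text(json.dumps(state))
        if crash_after == index + 1:
            raise RuntimeError(f"simulated crash after phase {index + 1}")
    return executed


if __name__ == "__main__":
    path = Path(tempfile.mkdtemp()) / "manifest.json"
    try:
        run_pipeline(path, crash_after=2)  # first run dies after phase 2
    except RuntimeError:
        pass
    print(run_pipeline(path))  # restart runs only phases 3 and 4
```

The restarted run never consults conversation history or token budgets; it reads only the manifest, which is why options A and D do not address the recovery requirement.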