Question 3
Domain 1: Developing Code for Data Processing using Python and SQL
A streaming pipeline failed after checkpointing batch 87. Which restart strategy best preserves progress and minimizes duplicate processing?
Correct answer: A
Explanation
Using the same checkpoint location lets the query resume from the last committed state, so it continues after batch 87 instead of starting over. Idempotent sink behavior prevents duplicate writes if any records are reprocessed, which preserves progress and minimizes duplication.
Why each option is right or wrong
A. Restart the query with the same checkpoint location and idempotent sink behavior
Under Apache Spark Structured Streaming, the checkpoint stores the committed offsets and state for each micro-batch, so restarting with the same checkpoint directory lets the engine resume from the last successful commit rather than replay the entire stream. Because batch 87 had already been checkpointed, the restart picks up after that point. An idempotent sink is the matching safeguard: if recovery re-emits any records, such as a batch that was written but not yet committed, rewriting them does not produce duplicate output. Checkpointing alone does not guarantee exactly-once delivery to the sink.
B. Delete the checkpoint and replay the source from the beginning
Deleting the checkpoint discards the saved progress and forces a full replay, which reprocesses batches that were already written and raises the risk of duplicates.
C. Change both the checkpoint location and output table name
A new checkpoint location and a new output table start a fresh stream state instead of resuming prior progress.
D. Convert the streaming job to an all-purpose interactive cluster first
Changing the cluster type changes the compute environment, not the stream’s recovery state or checkpoint continuity.
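
The difference between options A and B can be illustrated with a small pure-Python toy model. This is only a sketch of the idea, not the Spark API: `Checkpoint`, `IdempotentSink`, and `run_query` are hypothetical names, and the failure is assumed to occur in batch 88, after the sink write but before the commit.

```python
# Toy model of checkpointed micro-batch recovery. Not the Spark API:
# Checkpoint, IdempotentSink, and run_query are hypothetical names.

class Checkpoint:
    """Stands in for a checkpoint directory: remembers the last committed batch."""
    def __init__(self):
        self.last_committed = -1  # nothing committed yet


class IdempotentSink:
    """Keyed by batch id, so rewriting a batch overwrites it instead of appending."""
    def __init__(self):
        self.rows_by_batch = {}
        self.write_calls = 0

    def write(self, batch_id, rows):
        self.write_calls += 1
        self.rows_by_batch[batch_id] = list(rows)


def run_query(source, checkpoint, sink, fail_after_write=None):
    """Process every batch after the last committed one.

    If fail_after_write is set, crash between the sink write and the
    checkpoint commit for that batch (an assumed failure point).
    """
    for batch_id in range(checkpoint.last_committed + 1, len(source)):
        sink.write(batch_id, source[batch_id])
        if batch_id == fail_after_write:
            raise RuntimeError(f"simulated failure in batch {batch_id}")
        checkpoint.last_committed = batch_id


# 100 batches of 3 records each.
source = [[f"event-{b}-{i}" for i in range(3)] for b in range(100)]

checkpoint, sink = Checkpoint(), IdempotentSink()
try:
    run_query(source, checkpoint, sink, fail_after_write=88)
except RuntimeError:
    pass  # batch 87 is committed; batch 88 was written but never committed

writes_before_restart = sink.write_calls  # batches 0..88

# Option A: restart with the same checkpoint and the same idempotent sink.
run_query(source, checkpoint, sink)
writes_on_resume = sink.write_calls - writes_before_restart  # batches 88..99
total_rows = sum(len(rows) for rows in sink.rows_by_batch.values())

# Option B: discard the checkpoint and replay from the beginning.
writes_before_replay = sink.write_calls
run_query(source, Checkpoint(), sink)
writes_on_full_replay = sink.write_calls - writes_before_replay  # batches 0..99
```

In this model, option A rewrites only the 12 batches from 88 onward, and the one rewritten batch (88) replaces its earlier copy, so the sink still holds each record once. Option B rewrites all 100 batches; only the idempotent sink keeps that from producing duplicates, and a plain append sink would not.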