DE Professional Practice Q3

A. Restart the query with the same checkpoint location and idempotent sink behavior

In Apache Spark Structured Streaming, the checkpoint records the offsets planned for each micro-batch, the commits that mark completed batches, and any operator state. Restarting with the same checkpoint location therefore lets the engine resume from the last committed batch instead of replaying the entire stream. Because batch 87 had already been checkpointed, the restart picks up after that point. A batch that was in flight when the query stopped may still be re-run, so its records can reach the sink a second time. An idempotent sink is the correct safeguard against that, since checkpointing alone does not guarantee end-to-end exactly-once output.
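The recovery behavior described above can be illustrated with a toy model in plain Python. This is a minimal sketch, not Spark's implementation: `run_stream`, `IdempotentSink`, `AppendSink`, the `commits.json` file, and the `crash_before_commit_of` parameter are all hypothetical names invented for the illustration. In real Structured Streaming the corresponding pieces are roughly the `checkpointLocation` option and the batch ID passed to a `foreachBatch` function.

```python
import json
import os
import tempfile


class IdempotentSink:
    """Toy sink keyed by batch ID: rewriting a batch overwrites it."""

    def __init__(self):
        self.batches = {}

    def write(self, batch_id, records):
        self.batches[batch_id] = list(records)

    def rows(self):
        return [r for b in sorted(self.batches) for r in self.batches[b]]


class AppendSink:
    """Toy sink that blindly appends, so a re-run batch is duplicated."""

    def __init__(self):
        self.data = []

    def write(self, batch_id, records):
        self.data.extend(records)

    def rows(self):
        return list(self.data)


def run_stream(source, checkpoint_dir, sink, batch_size=2,
               crash_before_commit_of=None):
    """Process `source` in micro-batches, resuming from the checkpoint."""
    path = os.path.join(checkpoint_dir, "commits.json")  # hypothetical file
    state = {"next_batch": 0, "next_offset": 0}
    if os.path.exists(path):
        with open(path) as f:
            state = json.load(f)  # resume after the last committed batch

    while state["next_offset"] < len(source):
        batch_id = state["next_batch"]
        start = state["next_offset"]
        end = min(start + batch_size, len(source))
        sink.write(batch_id, source[start:end])
        if batch_id == crash_before_commit_of:
            # Output was written, but the commit below never happens.
            raise RuntimeError("simulated crash before commit")
        state = {"next_batch": batch_id + 1, "next_offset": end}
        with open(path, "w") as f:
            json.dump(state, f)


def crash_and_restart(sink):
    """Crash while batch 1 is in flight, then restart on the same checkpoint."""
    source = ["a", "b", "c", "d", "e", "f"]
    checkpoint_dir = tempfile.mkdtemp()
    try:
        run_stream(source, checkpoint_dir, sink, crash_before_commit_of=1)
    except RuntimeError:
        pass
    run_stream(source, checkpoint_dir, sink)
    return sink.rows()


print(crash_and_restart(IdempotentSink()))  # ['a', 'b', 'c', 'd', 'e', 'f']
print(crash_and_restart(AppendSink()))      # ['a', 'b', 'c', 'd', 'c', 'd', 'e', 'f']
```

In both runs the restart skips batch 0 because its commit was saved. Only the uncommitted batch is re-run, and only the append-style sink turns that re-run into duplicate rows.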

B. Delete the checkpoint and replay the source from the beginning

Deleting the checkpoint discards the saved progress and forces a full replay of the source, which reprocesses data that was already written and raises the risk of duplicate output.

C. Change both the checkpoint location and output table name

A new checkpoint location and a new output table start a fresh stream with no saved state, so the query does not resume from its prior progress.

D. Convert the streaming job to an all-purpose interactive cluster first

Changing the cluster type changes the compute environment; it has no effect on the stream’s recovery state or checkpoint continuity.

Question 3

Explanation

Why each option is right or wrong