Question 9
Domain 1: Developing Code for Data Processing using Python and SQL
A Structured Streaming job writes to Delta through `foreachBatch`. The job may be restarted after failures, and duplicate output must be avoided. Which design is best?
Correct answer: D
Explanation
Structured Streaming can reprocess a micro-batch after a restart, so duplicate writes must be prevented by making the sink idempotent. A stable checkpoint preserves progress across restarts, and using `MERGE` on a business key ensures that a replayed micro-batch updates the same rows instead of inserting duplicates.
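The difference between an append sink and a keyed upsert under replay can be modeled in plain Python. This is only a toy sketch, not Spark code: the `order_id` key, the sample rows, and the helper names `append_batch` and `merge_batch` are hypothetical and chosen for illustration.

```python
def append_batch(target: list, batch: list) -> None:
    """Non-idempotent sink: every call adds the rows again."""
    target.extend(batch)


def merge_batch(target: dict, batch: list, key: str = "order_id") -> None:
    """Idempotent sink: rows are upserted by business key."""
    for row in batch:
        target[row[key]] = row


# Hypothetical micro-batch with two rows.
batch = [
    {"order_id": "A1", "amount": 10},
    {"order_id": "B2", "amount": 25},
]

appended = []  # models an append-only target
merged = {}    # models a target keyed on the business key

# Simulate a restart that replays the same micro-batch a second time.
for _ in range(2):
    append_batch(appended, batch)
    merge_batch(merged, batch)

print(len(appended))  # 4: the replay inserted duplicates
print(len(merged))    # 2: the replay overwrote the same keys
```

Applying the batch twice leaves the keyed target unchanged, which is the property the sink needs when a micro-batch is replayed.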
Why each option is right or wrong
A. Write in append mode without a checkpoint
Appending without a checkpoint discards progress tracking, so a restart reprocesses source data and reinserts already-written rows as duplicates.
B. Collect the micro-batch to the driver before writing
Collecting the micro-batch to the driver only changes where the data is processed; it does nothing for duplicate handling or restart safety, and it risks exhausting driver memory.
C. Delete the target and replay the full source after every restart
Full delete-and-replay is expensive and unnecessary when idempotent incremental processing is available.
D. Use a stable checkpoint and make each micro-batch idempotent, for example with `MERGE` on a business key
Under Structured Streaming’s exactly-once processing model, the checkpoint directory stores the committed offsets and batch progress; if the job restarts with the same checkpoint, Spark resumes from the last successful micro-batch rather than starting over. In `foreachBatch`, however, the sink is user-managed and can be re-invoked for the same `batchId` after a failure, so Delta writes must be made idempotent; using `MERGE` keyed on a business key prevents duplicate inserts by updating the same target row when the same batch is replayed.
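Option D might look roughly like the following in PySpark. This is a minimal sketch under stated assumptions, not a definitive implementation: the table name `orders_target`, the key column `order_id`, the checkpoint path, and the helper names `upsert_batch` and `start_stream` are all hypothetical; the target is assumed to be an existing Delta table; and `DataFrame.sparkSession` assumes PySpark 3.3 or later.

```python
# Hypothetical names, chosen for illustration only.
TARGET_TABLE = "orders_target"  # assumed to be an existing Delta table
KEY_COLUMN = "order_id"         # assumed business key
CHECKPOINT_PATH = "/tmp/checkpoints/orders_target"  # keep identical across restarts


def upsert_batch(batch_df, batch_id):
    """Upsert one micro-batch; calling it again with the same rows is harmless."""
    # MERGE can fail if several source rows match one target row, so keep one
    # row per key. dropDuplicates keeps an arbitrary row per key; a real job
    # would more likely pick the latest row by an event timestamp.
    deduped = batch_df.dropDuplicates([KEY_COLUMN])
    deduped.createOrReplaceTempView("updates")
    deduped.sparkSession.sql(f"""
        MERGE INTO {TARGET_TABLE} AS t
        USING updates AS s
        ON t.{KEY_COLUMN} = s.{KEY_COLUMN}
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)


def start_stream(source_df):
    """Attach the idempotent sink to a streaming DataFrame."""
    return (
        source_df.writeStream
        .foreachBatch(upsert_batch)
        .option("checkpointLocation", CHECKPOINT_PATH)
        .start()
    )
```

If `upsert_batch` is invoked twice for the same `batchId`, the second call matches the rows written by the first and updates them in place, so the target ends up in the same state.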