Question 10
Domain 1: Developing Code for Data Processing using Python and SQL
For a high-volume event stream, which combination best deduplicates records while keeping state bounded?
Correct answer: B
Explanation
A unique event ID lets the system identify and drop repeated records, since each incoming event can be matched against previously seen IDs. A watermark on event time bounds state by allowing old IDs to expire once the stream has advanced past the late-arrival tolerance, so deduplication state does not grow without limit.
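The interaction between the ID lookup and the watermark can be illustrated with a toy model in plain Python. This is only a sketch under simplifying assumptions, not how any particular engine implements it: the class name `WatermarkDeduplicator`, its methods, and the per-event watermark update are all hypothetical (real engines typically advance the watermark at batch or checkpoint boundaries).

```python
from datetime import datetime, timedelta


class WatermarkDeduplicator:
    """Toy model of ID-based deduplication with watermark-driven state eviction."""

    def __init__(self, allowed_lateness: timedelta):
        self.allowed_lateness = allowed_lateness
        self.max_event_time = None  # latest event time observed so far
        self.seen = {}              # event_id -> event time of first occurrence

    @property
    def watermark(self):
        """Event time before which records are considered too late."""
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.allowed_lateness

    def accept(self, event_id: str, event_time: datetime) -> bool:
        """Return True if the event should be emitted, False if it is dropped."""
        wm = self.watermark
        # Records older than the watermark are dropped: their IDs may already
        # have been evicted, so they can no longer be checked reliably.
        if wm is not None and event_time < wm:
            return False
        # A known ID inside the lateness window is a duplicate.
        if event_id in self.seen:
            return False
        self.seen[event_id] = event_time
        # Advancing the maximum event time advances the watermark,
        # which in turn allows old IDs to be evicted.
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
            self._evict()
        return True

    def _evict(self):
        """Forget IDs whose event time has fallen behind the watermark."""
        wm = self.watermark
        self.seen = {k: t for k, t in self.seen.items() if t >= wm}
```

In this model the size of `seen` depends on how many distinct IDs fall inside the lateness window, not on how long the stream has been running, which is the sense in which state is bounded.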
Why each option is right or wrong
A. dropDuplicates() with no event-time logic
Without a watermark on event time, the engine must retain every key it has seen, so deduplication state can grow indefinitely in a long-running stream.
B. A unique event ID plus a watermark on event time
Deduplication needs a stable per-record key so the engine can compare each incoming event against those it has already seen. A unique event ID provides that key, whereas payload-only matching is unreliable when fields legitimately repeat. The event-time watermark keeps state bounded: it tracks how far event time has progressed and lets the engine evict IDs older than the allowed-lateness window, so the deduplication store does not grow indefinitely as the stream runs.
C. A nightly VACUUM plus OPTIMIZE
VACUUM and OPTIMIZE are storage maintenance tasks, not streaming duplicate-detection mechanisms.
D. Randomly repartitioning the stream every micro-batch
Repartitioning changes data distribution, but does not identify or remove duplicate events.
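For reference, the pattern in option B might look like the following in PySpark Structured Streaming. This is a minimal sketch, not a definitive implementation: the function name `dedupe_events`, the column names `event_id` and `event_time`, and the `10 minutes` threshold are illustrative assumptions, and `events` is assumed to be a streaming DataFrame created elsewhere.

```python
def dedupe_events(events, id_col="event_id", time_col="event_time",
                  lateness="10 minutes"):
    """Deduplicate a streaming DataFrame on event ID with bounded state.

    `events` is assumed to be a PySpark streaming DataFrame; the column
    names and lateness threshold are hypothetical placeholders.
    """
    return (
        events
        # Declare how late records may arrive relative to the maximum
        # event time seen; older state becomes eligible for cleanup.
        .withWatermark(time_col, lateness)
        # Including the event-time column in the key is what allows
        # the engine to expire old deduplication state.
        .dropDuplicates([id_col, time_col])
    )
```

Because the event-time column is part of the deduplication key here, two copies of an event count as duplicates only if they carry the same event time. Newer Spark releases (3.5 and later) also provide `dropDuplicatesWithinWatermark`, which deduplicates on the ID alone within the watermark delay.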