DE Professional Practice Q10

A. dropDuplicates() with no event-time logic

Without a watermark on an event-time column, dropDuplicates() has to remember every key it has seen, so deduplication state can grow indefinitely in a long-running stream.

B. A unique event ID plus a watermark on event time

Under standard stream-processing semantics, deduplication needs a stable per-record key so the engine can compare each incoming event against those it has already seen. A unique event ID satisfies that requirement, whereas payload-only matching is unreliable because distinct events can legitimately carry identical field values. An event-time watermark keeps the state bounded: as the watermark advances, the engine can evict IDs older than the allowed-lateness window, so the deduplication store does not grow indefinitely as the stream keeps running.
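The interaction between the event ID and the watermark can be illustrated with a small engine-independent sketch. This is a toy model in plain Python, not any streaming engine's actual implementation: the class name WatermarkDeduplicator, the allowed_lateness parameter, and the integer timestamps are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class WatermarkDeduplicator:
    """Toy model of ID-based deduplication bounded by an event-time watermark."""

    allowed_lateness: int  # hypothetical lateness window, in seconds
    max_event_time: int = 0  # highest event time observed so far
    seen: dict = field(default_factory=dict)  # event_id -> event_time

    @property
    def watermark(self) -> int:
        # Assumption: the watermark trails the max event time by the lateness window.
        return self.max_event_time - self.allowed_lateness

    def process(self, event_id: str, event_time: int) -> bool:
        """Return True if the event is emitted, False if it is dropped."""
        if event_time < self.watermark:
            return False  # too late: its ID may already have been evicted
        if event_id in self.seen:
            return False  # duplicate within the retained window
        self.seen[event_id] = event_time
        self.max_event_time = max(self.max_event_time, event_time)
        # Evict IDs that fell behind the watermark so state stays bounded.
        cutoff = self.watermark
        self.seen = {k: t for k, t in self.seen.items() if t >= cutoff}
        return True


dedup = WatermarkDeduplicator(allowed_lateness=10)
events = [("a", 100), ("b", 105), ("a", 100), ("c", 120), ("a", 100)]
emitted = [dedup.process(event_id, t) for event_id, t in events]
```

In this run the first repeat of "a" is caught by the ID lookup, while the second repeat arrives after the watermark has passed its event time and is dropped as late data. After the last event only "c" remains in dedup.seen, which is the bounded-state property that option A lacks.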

C. A nightly VACUUM plus OPTIMIZE

VACUUM and OPTIMIZE are storage maintenance tasks (removing files the table no longer references and compacting small files, respectively), not streaming duplicate-detection mechanisms.

D. Randomly repartitioning the stream every micro-batch

Repartitioning changes data distribution, but does not identify or remove duplicate events.

Question 10

Explanation

Why each option is right or wrong