Question 11
Domain 1: Data Preparation for Machine Learning (ML)
A machine learning team is preparing a training dataset and wants to evaluate whether required fields are populated, values follow expected formats across records, recorded values correctly reflect the real-world source, and duplicate records are identified. Which set of data quality dimensions BEST matches these goals?
Correct answer: A
Explanation
Data quality checks commonly map to four distinct dimensions: completeness verifies required data is present, consistency verifies uniformity across records, accuracy verifies correctness against reality, and deduplication identifies repeated records. (Source material: Data quality: completeness, consistency, accuracy, deduplication. AWS Glue Data Quality.)
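To make the four dimensions concrete, here is a minimal sketch in plain Python of one check per dimension. It is not AWS Glue Data Quality rule syntax. The records, field names, ISO date format, and the source_of_truth lookup are all hypothetical and exist only for illustration.

```python
import re

# Hypothetical records; field names and values are illustrative only.
records = [
    {"id": 1, "email": "ana@example.com", "signup_date": "2024-01-15", "country": "US"},
    {"id": 2, "email": None, "signup_date": "2024-02-01", "country": "US"},
    {"id": 3, "email": "li@example.com", "signup_date": "03/05/2024", "country": "CA"},
    {"id": 4, "email": "sam@example.com", "signup_date": "2024-03-09", "country": "DE"},
    {"id": 4, "email": "sam@example.com", "signup_date": "2024-03-09", "country": "DE"},
]

# Hypothetical trusted reference (id -> real-world country) used to judge accuracy.
source_of_truth = {1: "US", 2: "US", 3: "CA", 4: "FR"}

REQUIRED_FIELDS = ("id", "email", "signup_date")  # assumed required fields
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")       # assumed expected date format


def check_quality(rows, truth):
    """Return the row indexes that fail each of the four dimensions."""
    # Completeness: every required field must be populated.
    incomplete = [
        i for i, r in enumerate(rows)
        if any(r.get(f) in (None, "") for f in REQUIRED_FIELDS)
    ]
    # Consistency: values follow the same expected format across records.
    inconsistent = [
        i for i, r in enumerate(rows)
        if not ISO_DATE.fullmatch(r.get("signup_date") or "")
    ]
    # Accuracy: recorded values agree with the trusted real-world source.
    inaccurate = [
        i for i, r in enumerate(rows)
        if truth.get(r["id"]) != r["country"]
    ]
    # Deduplication: flag any row identical to one already seen.
    seen, duplicates = set(), []
    for i, r in enumerate(rows):
        key = tuple(r[f] for f in sorted(r))
        if key in seen:
            duplicates.append(i)
        else:
            seen.add(key)
    return {
        "completeness": incomplete,
        "consistency": inconsistent,
        "accuracy": inaccurate,
        "deduplication": duplicates,
    }


report = check_quality(records, source_of_truth)
print(report)
```

Each key in the report corresponds to one dimension from the explanation, so a single row can fail several dimensions at once. In this sample, the repeated row for id 4 is flagged under both accuracy and deduplication.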
Why each option is right or wrong
A. Completeness, consistency, accuracy, and deduplication
The stated goals align directly with the dimensions named in the source: populated required fields map to completeness, expected formats across records map to consistency, correctness relative to the real-world source maps to accuracy, and identifying repeated records maps to deduplication.
B. Scalability, consistency, latency, and deduplication
Scalability and latency are system performance characteristics, not the listed data quality dimensions.
C. Completeness, availability, accuracy, and encryption
Availability and encryption are service reliability and security concerns, not the listed data quality dimensions.
D. Consistency, normalization, partitioning, and deduplication
Normalization and partitioning are data design or storage techniques, not the listed data quality dimensions.