Question 38
Domain 2: Data Processing
Why is one-hot encoding usually a poor choice for a ZIP-code column with tens of thousands of distinct values?
Correct answer: A
Explanation
One-hot encoding suits low-cardinality categorical features. A ZIP-code column with tens of thousands of distinct values would expand into tens of thousands of indicator columns, almost all of them zero in any given row. The resulting wide, sparse matrix is expensive to store and process, and most columns are backed by too few rows to give a model much to learn from.
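To make the width and sparsity concrete, here is a minimal sketch in plain Python. The helper name `one_hot_shape_and_density` and the synthetic ZIP-code pool are assumptions made for illustration only; they are not part of the exam material or of any Spark API. The sketch counts how many indicator columns a one-hot encoding would need and what fraction of the cells would be nonzero.

```python
import random


def one_hot_shape_and_density(values):
    """Return (rows, columns, density) for a one-hot encoding of `values`.

    Hypothetical helper: it only computes the shape, it does not build the matrix.
    """
    if not values:
        raise ValueError("values must be non-empty")
    n_rows = len(values)
    n_cols = len(set(values))  # one indicator column per distinct category
    # Every row sets exactly one indicator to 1, so non-zero cells == n_rows.
    density = n_rows / (n_rows * n_cols)
    return n_rows, n_cols, density


random.seed(0)
# Assumed synthetic data: 100,000 rows drawn from 30,000 made-up 5-digit codes.
zip_pool = [f"{z:05d}" for z in range(10000, 40000)]
zips = [random.choice(zip_pool) for _ in range(100_000)]

rows, cols, density = one_hot_shape_and_density(zips)
print(f"rows={rows}, indicator columns={cols}, non-zero fraction={density:.6f}")
```

Because each row contributes a single 1, the non-zero fraction is always one divided by the number of distinct categories, so it shrinks as cardinality grows.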
Why each option is right or wrong
A. It creates a very wide sparse representation with limited modeling value
The exam objective flags one-hot encoding as appropriate only when the categorical domain is small enough to be represented efficiently. A ZIP-code field with tens of thousands of distinct categories expands into tens of thousands of indicator columns. In practice that produces a sparse matrix that is almost entirely zeros, which raises memory and compute cost while adding little predictive signal: each indicator is nonzero for only a small share of rows, and treating every ZIP code as an unrelated category discards any relationship between nearby codes.
B. It prevents Spark from reading Delta tables
One-hot encoding affects feature representation, not Delta table reading.
C. It converts the column into a continuous feature automatically
One-hot encoding produces binary (0/1) indicator columns, one per category; it does not turn the column into a continuous feature.
D. It makes missing values impossible to detect
Missing-value detection is separate from categorical encoding.