Question 28
Content Domain 2: Exploratory Data Analysis
An online reseller has a large, multi-column dataset with one column missing 30% of its data. A Machine Learning Specialist believes that certain columns in the dataset could be used to reconstruct the missing data. Which reconstruction approach should the Specialist use to preserve the integrity of the dataset?
Correct answer: C
Explanation
Multiple imputation is used when a dataset has missing values and other columns can help estimate them while preserving uncertainty. It creates several plausible replacements for the missing data and combines the results, which helps maintain the dataset’s integrity better than a single fill-in method.
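The workflow described above (impute several times, analyze each completed dataset, combine the results) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production method: the data are synthetic, the imputation model is a simple one-predictor linear regression refit on a bootstrap sample, and the names fit_line, multiply_impute, and pool_mean are hypothetical helpers introduced only for this example.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, residual_sd)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    resid = [y - (a + b * x) for x, y in zip(xs, ys)]
    sd = (sum(r * r for r in resid) / (len(xs) - 2)) ** 0.5
    return a, b, sd


def multiply_impute(xs, ys, m=20, seed=0):
    """Return m completed copies of ys, where None marks a missing value."""
    rng = random.Random(seed)
    observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
    completed = []
    for _ in range(m):
        # Bootstrap the observed rows so each imputation model differs,
        # reflecting uncertainty in the fitted coefficients.
        sample = [rng.choice(observed) for _ in observed]
        a, b, sd = fit_line([p[0] for p in sample], [p[1] for p in sample])
        # Fill each gap with a prediction plus random residual noise,
        # so imputed values vary from one completed dataset to the next.
        completed.append([
            y if y is not None else a + b * x + rng.gauss(0, sd)
            for x, y in zip(xs, ys)
        ])
    return completed


def pool_mean(completed):
    """Pool the column mean across completed datasets with Rubin's rules."""
    m = len(completed)
    estimates = [statistics.fmean(c) for c in completed]
    # Within-imputation variance: average sampling variance of the mean.
    within = statistics.fmean([statistics.variance(c) / len(c) for c in completed])
    # Between-imputation variance: spread of the estimates themselves.
    between = statistics.variance(estimates)
    total = within + (1 + 1 / m) * between
    return statistics.fmean(estimates), total


# Synthetic data (an assumption for illustration): y depends on x,
# and about 30% of y is removed completely at random.
rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(1000)]
truth = [3 + 2 * x + rng.gauss(0, 1) for x in xs]
ys = [None if rng.random() < 0.3 else y for y in truth]

completed = multiply_impute(xs, ys)
estimate, total_variance = pool_mean(completed)
print("pooled mean:", round(estimate, 3))
print("pooled standard error:", round(total_variance ** 0.5, 4))
```

The pooled variance adds a between-imputation term to the usual sampling variance, which is how the extra uncertainty from the missing values ends up in the final standard error.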
Why each option is right or wrong
A. Listwise deletion.
Listwise deletion removes every row that has a missing value, which here would discard about 30% of the rows and could bias results if the values are not missing completely at random.
B. Last observation carried forward.
Last observation carried forward fits ordered time-series data, not general multi-column tabular reconstruction.
C. Multiple imputation.
Multiple imputation is the appropriate method when missingness is substantial and other variables can inform the absent values, because it generates several plausible replacements rather than a single deterministic fill-in. Under the standard multiple-imputation workflow, the analyst creates multiple completed datasets, analyzes each, and then pools the estimates using Rubin’s rules. The pooled variance combines the within-imputation and between-imputation variance, so standard errors reflect the uncertainty about the missing values. A single imputation treats the filled-in values as if they were observed and understates that uncertainty, which matters when about 30% of a column is missing.
D. Mean substitution.
Mean substitution inserts one average value, reducing variance and weakening relationships between features.
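The drawbacks attributed to options A and D can be checked numerically. The sketch below is illustrative only: it uses synthetic data with values missing completely at random, and pearson is a hypothetical helper written for this example. It counts the rows that listwise deletion would keep and compares the variance and correlation of a column before and after mean substitution.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# Synthetic data (an assumption for illustration): y depends strongly on x.
rng = random.Random(7)
xs = [rng.uniform(0, 10) for _ in range(1000)]
ys = [3 + 2 * x + rng.gauss(0, 1) for x in xs]

# Flag about 30% of y as missing, completely at random.
missing = [rng.random() < 0.3 for _ in xs]

# Listwise deletion keeps only the rows where y was observed.
observed = [y for y, gone in zip(ys, missing) if not gone]
kept_rows = len(observed)

# Mean substitution replaces every missing y with the observed mean.
fill = statistics.fmean(observed)
mean_filled = [fill if gone else y for y, gone in zip(ys, missing)]

print("rows kept by listwise deletion:", kept_rows, "of", len(ys))
print("variance, full data:", round(statistics.variance(ys), 2))
print("variance, mean-filled:", round(statistics.variance(mean_filled), 2))
print("correlation with x, full data:", round(pearson(xs, ys), 3))
print("correlation with x, mean-filled:", round(pearson(xs, mean_filled), 3))
```

Because every filled-in value sits exactly at the mean, the filled column has a smaller variance and a weaker correlation with x than the complete data.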