Question 4
Domain 1: Data Preparation for Machine Learning (ML)
Case study: An ML engineer is developing a fraud detection model on AWS. The training dataset includes transaction logs, customer profiles, and tables from an on-premises MySQL database. The transaction logs and customer profiles are stored in Amazon S3. The dataset has a class imbalance that affects the learning of the model's algorithm. Additionally, many of the features have interdependencies. The algorithm is not capturing all the desired underlying patterns in the data.
Before the ML engineer trains the model, the ML engineer must resolve the issue of the imbalanced data. Which solution will meet this requirement with the LEAST operational effort?
Correct answer: D
Explanation
Amazon SageMaker Data Wrangler includes a built-in balance data operation that rebalances a class-imbalanced dataset before training. It supports random oversampling, random undersampling, and SMOTE. Oversampling the minority class this way takes little operational effort because it is a configuration step inside the data preparation flow, not custom preprocessing code that the ML engineer has to write and maintain.
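To make the idea concrete, here is a minimal sketch of random oversampling in plain Python. It illustrates the general technique only and is not Data Wrangler's implementation; the function name random_oversample, the is_fraud label, and the toy transaction data are assumptions made for this example.

```python
import random
from collections import Counter


def random_oversample(rows, label_key, seed=0):
    """Duplicate minority-class rows at random until every class
    has as many rows as the largest class."""
    rng = random.Random(seed)

    # Group rows by their class label.
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)

    target = max(len(group) for group in by_class.values())

    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap up to the majority size.
        balanced.extend(rng.choices(group, k=target - len(group)))

    rng.shuffle(balanced)
    return balanced


# Hypothetical toy data: 95 legitimate transactions and 5 fraudulent ones.
transactions = (
    [{"amount": 20.0 + i, "is_fraud": 0} for i in range(95)]
    + [{"amount": 900.0 + i, "is_fraud": 1} for i in range(5)]
)

balanced = random_oversample(transactions, "is_fraud")
counts = Counter(row["is_fraud"] for row in balanced)
print(counts)
```

After the call, both classes have 95 rows. Note that the extra fraud rows are exact copies of existing ones, which is why oversampling is normally applied to the training split only.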
Why each option is right or wrong
A. Use Amazon Athena to identify patterns that contribute to the imbalance. Adjust the dataset accordingly.
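Athena is a SQL query service. It can show how skewed the classes are, but it has no built-in rebalancing operation, so the ML engineer would still have to write and run custom queries to adjust the dataset.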
B. Use Amazon SageMaker Studio Classic built-in algorithms to process the imbalanced dataset.
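Built-in algorithms train models; they do not rebalance the dataset. Some expose class-weighting hyperparameters (for example, scale_pos_weight in XGBoost), but that changes how the model is trained instead of resolving the imbalance in the data before training, which is what the question requires.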
C. Use AWS Glue DataBrew built-in features to oversample the minority class.
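DataBrew is a general-purpose visual data preparation tool. It does not provide a dedicated transform for oversampling a minority class, so this option would still require a custom workaround. Data Wrangler is the service with a purpose-built balancing operation.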
D. Use the Amazon SageMaker Data Wrangler balance data operation to oversample the minority class.
Amazon SageMaker Data Wrangler provides a built-in balance data operation that rebalances a skewed target distribution before training, so no custom ETL or separate preprocessing job is needed. Oversampling the minority class is the least-effort fix here because it directly increases the representation of the rare fraud examples without discarding any majority-class records. The operation runs on the data already imported into the Data Wrangler flow, so the existing data sources do not need to change.
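Oversampling does not have to mean exact duplication. SMOTE, one of the methods the balance data operation offers, creates synthetic minority rows by interpolating between a minority point and one of its nearest minority neighbors. The sketch below shows that idea in plain Python under simplifying assumptions (numeric features only, brute-force neighbor search); the function name smote_like and the sample fraud_points are hypothetical, and this is not the code Data Wrangler runs.

```python
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each lying on the line segment
    between a random minority point and one of its k nearest minority
    neighbors. Requires at least two minority points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Brute-force k nearest neighbors by squared Euclidean distance,
        # excluding the base point itself.
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbors)
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(base, other)))
    return synthetic


# Hypothetical minority-class rows: (transaction amount, hours since last login).
fraud_points = [(900.0, 1.0), (950.0, 2.0), (1000.0, 1.5), (980.0, 3.0)]

new_points = smote_like(fraud_points, n_new=6)
for point in new_points:
    print(point)
```

Because each synthetic point is a blend of two real fraud rows, it stays inside the range of the observed minority data while adding more variety than plain duplication.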