Question 4
Content Domain 1: Data Engineering
A Machine Learning Specialist is developing a daily ETL workflow containing multiple ETL jobs. The workflow consists of the following processes:
- Start the workflow as soon as data is uploaded to Amazon S3.
- When all the datasets are available in Amazon S3, start an ETL job to join the uploaded datasets with multiple terabyte-sized datasets already stored in Amazon S3.
- Store the results of joining datasets in Amazon S3.
- If one of the jobs fails, send a notification to the Administrator.
Which configuration will meet these requirements?
Correct answer: A
Explanation
AWS Lambda can start the workflow from an Amazon S3 upload event, and AWS Step Functions can coordinate the ETL steps and wait until all datasets are available. AWS Glue is built for ETL on large data in Amazon S3, and an Amazon CloudWatch alarm can trigger SNS to notify the Administrator if a job fails.
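As a rough illustration of the first step, the sketch below shows a Lambda handler that receives an S3 event notification and starts a Step Functions execution with the uploaded object's location as input. The state machine ARN and the `sfn_client` parameter are hypothetical and exist only for this example; `start_execution` is the boto3 Step Functions call, and the rest is one possible arrangement rather than a prescribed design.

```python
import json
from urllib.parse import unquote_plus

# Hypothetical ARN for illustration only; substitute a real state machine ARN.
STATE_MACHINE_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:daily-etl"


def build_execution_input(event):
    """Collect the bucket and key of each object named in an S3 event notification."""
    uploads = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        uploads.append({
            "bucket": s3["bucket"]["name"],
            # Object keys arrive URL-encoded in S3 event notifications.
            "key": unquote_plus(s3["object"]["key"]),
        })
    return {"uploads": uploads}


def lambda_handler(event, context, sfn_client=None):
    """Start one Step Functions execution per S3 event (assumed design)."""
    if sfn_client is None:
        # Imported lazily so build_execution_input can be exercised without AWS.
        import boto3
        sfn_client = boto3.client("stepfunctions")
    response = sfn_client.start_execution(
        stateMachineArn=STATE_MACHINE_ARN,
        input=json.dumps(build_execution_input(event)),
    )
    return {"executionArn": response["executionArn"]}
```

In this sketch every upload starts its own execution; deciding how to avoid duplicate runs when several datasets arrive on the same day is left out.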
Why each option is right or wrong
A. Use AWS Lambda to trigger an AWS Step Functions workflow to wait for dataset uploads to complete in Amazon S3. Use AWS Glue to join the datasets. Use an Amazon CloudWatch alarm to send an SNS notification to the Administrator in the case of a failure.
AWS Lambda can be invoked directly by an Amazon S3 event notification, so it satisfies the requirement to start the workflow immediately on upload. AWS Step Functions is the service that can orchestrate the multi-step ETL and include a wait/polling state until all datasets are present, while AWS Glue is the managed ETL service designed to process and join multi-terabyte data in S3. For failure handling, Amazon CloudWatch alarms can monitor the Glue job state and publish to Amazon SNS, which delivers the Administrator notification; SNS is the standard alerting path when a monitored job enters a failed state.
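B. Develop the ETL workflow using AWS Lambda to start an Amazon SageMaker notebook instance. Use a lifecycle configuration script to join the datasets and persist the results in Amazon S3. Use an Amazon CloudWatch alarm to send an SNS notification to the Administrator in the case of a failure.
SageMaker notebook instances are for interactive development, not production-scale ETL orchestration. A lifecycle configuration script runs only when the instance is created or started and must finish within 5 minutes, so it cannot carry a multi-terabyte join, and nothing in this design waits for all datasets to arrive.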
C. Develop the ETL workflow using AWS Batch to trigger the start of ETL jobs when data is uploaded to Amazon S3. Use AWS Glue to join the datasets in Amazon S3. Use an Amazon CloudWatch alarm to send an SNS notification to the Administrator in the case of a failure.
AWS Batch schedules and runs batch compute jobs, but it is not a destination for S3 event notifications and has no built-in way to wait until every expected dataset has arrived, so on its own it covers neither the upload trigger nor the dependency check.
D. Use AWS Lambda to chain other Lambda functions to read and join the datasets in Amazon S3 as soon as the data is uploaded to Amazon S3. Use an Amazon CloudWatch alarm to send an SNS notification to the Administrator in the case of a failure.
Lambda is not suited for multi-terabyte joins: each invocation is capped at 15 minutes and 10 GB of memory, so ETL at that scale needs a dedicated data processing service such as AWS Glue.
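To make the orchestration in option A more concrete, the following sketch builds an Amazon States Language definition as a Python dictionary: a check step, a Choice state, a Wait state that loops back, and a Glue task that uses the `glue:startJobRun.sync` service integration. The function ARN, the Glue job name, the 300-second wait, and the `dataset_status` helper (standing in for what the check function might return) are all assumptions made for illustration. Failure notification is deliberately left outside the state machine, since option A handles it with a CloudWatch alarm and SNS.

```python
import json

# Hypothetical names used only for illustration.
CHECK_FUNCTION_ARN = "arn:aws:lambda:us-east-1:123456789012:function:check-datasets"
GLUE_JOB_NAME = "join-daily-datasets"


def dataset_status(expected_keys, present_keys):
    """Report whether every expected dataset key has been uploaded."""
    missing = sorted(set(expected_keys) - set(present_keys))
    return {"complete": not missing, "missing": missing}


definition = {
    "Comment": "Wait for all datasets, then run the Glue join (sketch)",
    "StartAt": "CheckDatasets",
    "States": {
        # Assumed Lambda task that returns the dataset_status() result.
        "CheckDatasets": {
            "Type": "Task",
            "Resource": CHECK_FUNCTION_ARN,
            "ResultPath": "$.check",
            "Next": "AllDatasetsPresent",
        },
        # Proceed only when the check reports that nothing is missing.
        "AllDatasetsPresent": {
            "Type": "Choice",
            "Choices": [
                {
                    "Variable": "$.check.complete",
                    "BooleanEquals": True,
                    "Next": "RunGlueJoin",
                }
            ],
            "Default": "WaitForUploads",
        },
        # Poll again after a pause; the interval is an arbitrary choice.
        "WaitForUploads": {"Type": "Wait", "Seconds": 300, "Next": "CheckDatasets"},
        # The .sync integration makes the state wait for the Glue job to finish.
        "RunGlueJoin": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": GLUE_JOB_NAME},
            "End": True,
        },
    },
}

if __name__ == "__main__":
    print(json.dumps(definition, indent=2))
```

The serialized JSON is what would be supplied as the state machine definition; a real workflow would likely also bound the number of polling iterations so that a missing dataset does not keep the execution waiting indefinitely.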