Question 32
Content Domain 1: Data Engineering
A financial services company is building a robust serverless data lake on Amazon S3. The data lake should be flexible and meet the following requirements:
Support querying old and new data on Amazon S3 through Amazon Athena and Amazon Redshift Spectrum.
Support event-driven ETL pipelines.
Provide a quick and easy way to understand metadata.
Which approach meets these requirements?
Correct answer: A
Explanation
AWS Glue Data Catalog provides a central metadata store that Athena and Redshift Spectrum use to query both old and new S3 data, and a crawler can "crawl S3 data" to keep table definitions current. AWS Lambda can "trigger an AWS Glue ETL job," enabling event-driven pipelines, while the Data Catalog gives a "quick and easy way to understand metadata" through searchable schemas and tables.
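As a rough illustration of how the Data Catalog supports metadata discovery, the sketch below summarizes the tables in one catalog database by paging through the Glue `get_tables` API. The function name `summarize_tables` and the database name used in the note after it are hypothetical; the Glue client is passed in so the logic can be exercised without AWS credentials.

```python
def summarize_tables(glue_client, database_name):
    """Return {table_name: {"columns": [...], "partition_keys": [...]}}.

    glue_client is expected to behave like boto3.client("glue").
    """
    summary = {}
    # get_tables is paginated, so walk every page of results.
    paginator = glue_client.get_paginator("get_tables")
    for page in paginator.paginate(DatabaseName=database_name):
        for table in page["TableList"]:
            columns = table.get("StorageDescriptor", {}).get("Columns", [])
            summary[table["Name"]] = {
                "columns": [(c["Name"], c["Type"]) for c in columns],
                "partition_keys": [k["Name"] for k in table.get("PartitionKeys", [])],
            }
    return summary
```

With boto3 installed and credentials configured, a call might look like `summarize_tables(boto3.client("glue"), "example_lake_db")`, where `example_lake_db` is a placeholder for a database populated by the crawler.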
Why each option is right or wrong
A. Use an AWS Glue crawler to crawl S3 data, an AWS Lambda function to trigger an AWS Glue ETL job, and an AWS Glue Data Catalog to search and discover metadata.
Amazon Athena and Amazon Redshift Spectrum can both use the AWS Glue Data Catalog as a shared metastore for table definitions and partitions, so using a crawler to update that catalog lets queries continue to work across both historical and newly added S3 objects without manual schema maintenance. AWS Glue crawlers are designed to infer schema and partition metadata from S3 and populate the Data Catalog, while AWS Lambda can start Glue jobs in response to S3 or EventBridge events for event-driven ETL. The Data Catalog also provides searchable metadata for databases, tables, and columns, which satisfies the requirement for a quick and easy way to understand the lake’s structure.
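The event-driven part of option A can be sketched as a Lambda handler that starts a Glue job when an S3 object-created notification arrives. This is a minimal sketch, not a definitive implementation: the job name `example-etl-job` and the argument names `--source_bucket` and `--source_key` are hypothetical, and the optional `glue_client` parameter exists only so the handler can be tested without AWS access.

```python
import urllib.parse

GLUE_JOB_NAME = "example-etl-job"  # hypothetical Glue job name


def build_job_arguments(event):
    """Extract bucket and key from the first record of an S3 event notification."""
    s3_info = event["Records"][0]["s3"]
    bucket = s3_info["bucket"]["name"]
    # Object keys arrive URL-encoded in S3 event notifications.
    key = urllib.parse.unquote_plus(s3_info["object"]["key"])
    return {"--source_bucket": bucket, "--source_key": key}


def handler(event, context, glue_client=None):
    """Lambda entry point: start one Glue job run for the new object."""
    if glue_client is None:
        import boto3  # available in the Lambda Python runtime

        glue_client = boto3.client("glue")
    response = glue_client.start_job_run(
        JobName=GLUE_JOB_NAME,
        Arguments=build_job_arguments(event),
    )
    return {"JobRunId": response["JobRunId"]}
```

Passing the bucket and key as job arguments lets the Glue job process only the newly arrived object rather than rescanning the whole prefix.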
B. Use an AWS Glue crawler to crawl S3 data, an AWS Lambda function to trigger an AWS Batch job, and an external Apache Hive metastore to search and discover metadata.
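AWS Batch is a general-purpose batch compute service rather than a purpose-built serverless ETL service, so it is a less direct fit than an AWS Glue ETL job. An external Apache Hive metastore must also be provisioned and maintained separately, unlike the managed AWS Glue Data Catalog that the crawler already populates.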
C. Use an AWS Glue crawler to crawl S3 data, an Amazon CloudWatch alarm to trigger an AWS Batch job, and an AWS Glue Data Catalog to search and discover metadata.
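CloudWatch alarms fire when a metric crosses a threshold, not when data arrives (for example, when a new object lands in S3), so they are a poor trigger for event-driven ETL. AWS Batch is also a general batch compute service rather than the serverless ETL service the scenario calls for.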
D. Use an AWS Glue crawler to crawl S3 data, an Amazon CloudWatch alarm to trigger an AWS Glue ETL job, and an external Apache Hive metastore to search and discover metadata.
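An external Apache Hive metastore adds infrastructure to manage and does not provide the quick, easy metadata discovery that the managed AWS Glue Data Catalog offers. A CloudWatch alarm is also a metric-based trigger, not a data-arrival event, so it does not give a truly event-driven pipeline.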