Question 2
Domain 1: Data Preparation for Machine Learning (ML)
A company has an Amazon S3 bucket that contains 1 TB of files from different sources. The S3 bucket contains the following file types in the same S3 folder: CSV, JSON, XLSX, and Apache Parquet. An ML engineer must implement a solution that uses AWS Glue DataBrew to process the data. The ML engineer also must store the final output in Amazon S3 so that AWS Glue can consume the output in the future. Which solution will meet these requirements?
Correct answer: C
Explanation
AWS Glue DataBrew works on a single dataset schema at a time, so files with different structures should be separated into different folders and processed individually. Storing the result in Apache Parquet meets the requirement that the output be in a format AWS Glue can consume later, since Parquet is a supported columnar format for Glue ETL and cataloging.
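The separation step described above can be sketched as a small planning function. This is only an illustration: the `by-type/...` prefixes, the example keys, and the `plan_moves` helper are hypothetical names, and the code only computes source and destination keys rather than touching S3.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Hypothetical mapping from file extension to a per-type destination prefix.
EXT_TO_PREFIX = {
    ".csv": "by-type/csv/",
    ".json": "by-type/json/",
    ".xlsx": "by-type/xlsx/",
    ".parquet": "by-type/parquet/",
}


def plan_moves(keys):
    """Group S3 object keys by file type.

    Returns {destination_prefix: [(source_key, destination_key), ...]}.
    Keys with an unrecognized extension are skipped.
    """
    plan = defaultdict(list)
    for key in keys:
        path = PurePosixPath(key)
        prefix = EXT_TO_PREFIX.get(path.suffix.lower())
        if prefix is None:
            continue  # leave unrecognized files where they are
        plan[prefix].append((key, prefix + path.name))
    return dict(plan)


# Made-up keys standing in for the mixed folder in the question.
example_keys = [
    "raw/sales.csv",
    "raw/events.JSON",
    "raw/budget.xlsx",
    "raw/clicks.parquet",
    "raw/readme.txt",
]
moves = plan_moves(example_keys)
```

In practice each `(source_key, destination_key)` pair would then be copied with an S3 client (for example boto3's `copy_object`), after which every prefix holds a single file type that DataBrew can treat as one dataset.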
Why each option is right or wrong
A. Use DataBrew to process the existing S3 folder. Store the output in Apache Parquet format.
Mixed file types in one folder do not provide a single consistent dataset for DataBrew processing.
B. Use DataBrew to process the existing S3 folder. Store the output in AWS Glue Parquet format.
This option fails for the same reason as option A: the input formats are mixed in one folder. The “AWS Glue Parquet” output format does not address that problem.
C. Separate the data into a different folder for each file type. Use DataBrew to process each folder individually. Store the output in Apache Parquet format.
A DataBrew dataset has one format and one schema, and a recipe runs against one dataset at a time. With CSV, JSON, XLSX, and Parquet files mixed in the same folder, DataBrew cannot infer a single consistent dataset, so the files must be separated by type and ingested as distinct datasets. Writing the results to Apache Parquet satisfies the downstream requirement because AWS Glue natively reads Parquet, a columnar format, for both ETL jobs and catalog-based processing.
D. Separate the data into a different folder for each file type. Use DataBrew to process each folder individually. Store the output in AWS Glue Parquet format.
Separating by file type is right, but standard Apache Parquet is the appropriate output format choice.
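To make the chosen option concrete, here is a rough sketch of the request parameters for one DataBrew dataset and one recipe job per folder. It assumes the parameter shapes of the boto3 `databrew` client's `create_dataset` and `create_recipe_job` calls as I understand them; verify them against the current boto3 documentation before use. The bucket name, folder prefixes, role ARN, recipe name, and both helper functions are hypothetical.

```python
# Assumed DataBrew dataset "Format" value for each per-type folder
# (hypothetical prefixes; XLSX files use the EXCEL dataset format).
FORMAT_BY_FOLDER = {
    "by-type/csv/": "CSV",
    "by-type/json/": "JSON",
    "by-type/xlsx/": "EXCEL",
    "by-type/parquet/": "PARQUET",
}


def dataset_request(bucket, prefix):
    """Build keyword arguments for one dataset per folder."""
    fmt = FORMAT_BY_FOLDER[prefix]
    return {
        "Name": f"{fmt.lower()}-dataset",
        "Format": fmt,
        "Input": {"S3InputDefinition": {"Bucket": bucket, "Key": prefix}},
    }


def job_request(dataset_name, bucket, role_arn, recipe_name):
    """Build keyword arguments for a recipe job that writes Parquet."""
    return {
        "Name": f"{dataset_name}-job",
        "DatasetName": dataset_name,
        "RecipeReference": {"Name": recipe_name},
        "RoleArn": role_arn,
        "Outputs": [
            {
                "Location": {"Bucket": bucket, "Key": f"output/{dataset_name}/"},
                "Format": "PARQUET",
            }
        ],
    }


# Placeholder values for illustration only.
requests = []
for prefix in FORMAT_BY_FOLDER:
    ds = dataset_request("example-bucket", prefix)
    job = job_request(
        ds["Name"],
        "example-bucket",
        "arn:aws:iam::123456789012:role/ExampleDataBrewRole",
        "example-recipe",
    )
    requests.append((ds, job))

# With credentials configured, these dictionaries would be passed as:
#   client = boto3.client("databrew")
#   client.create_dataset(**ds)
#   client.create_recipe_job(**job)
```

Building the requests as plain dictionaries keeps the per-folder structure easy to inspect: four datasets, four jobs, and every job writing Parquet to its own output prefix.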