Question 15
Domain 3: Data Transformation, Cleansing, and Quality
A batch transformation needs reference data from a REST API for every product code. Which design is most scalable on Databricks?
Correct answer: A
Explanation
Snapshotting the REST API data to Delta avoids calling the API for every row, which is more scalable for batch processing. Databricks emphasizes using Delta Lake for efficient joins and large-scale transformations, and Spark can join the persisted reference table directly instead of repeatedly hitting an external service.
Why each option is right or wrong
A. Snapshot the API data to Delta and join it in Spark
Batch lookups against a REST endpoint would force one external call per product code, which is not a scalable Spark pattern for large batch jobs and creates avoidable latency and rate-limit risk. The Databricks exam guide emphasizes using Spark SQL/PySpark joins on large datasets and Delta Lake for scalable transformations; persisting the reference set to Delta first lets Spark perform an in-cluster join instead of repeatedly invoking the API during the job.
B. Call the API from a Python UDF for each row
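A Python UDF is invoked once per row, so this design issues one HTTP request per row from the executors. That adds network latency to every record, repeats lookups for duplicate product codes, and risks hitting API rate limits, so it does not scale for external lookups.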
C. Use `requests` inside `mapPartitions` for every row in every batch
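`mapPartitions` lets each partition reuse one HTTP session, but this option still sends a request for every row in every batch. The call volume, latency, and rate-limit exposure are therefore the same as with a per-row UDF.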
D. Query the API from the driver whenever a task completes
Driver-side API calls funnel every lookup through a single process, so the driver becomes a bottleneck and the lookups do not scale with distributed task execution.
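The pattern in option A can be sketched as follows. This is a minimal illustration, not a reference implementation: `fetch_page`, the `product_code` and `category` field names, and the table name `ref.product_reference` are all hypothetical, and the sketch assumes the reference set is small enough to pull through the driver once per run.

```python
from typing import Callable, Iterable, Iterator


def fetch_reference_rows(
    fetch_page: Callable[[int], list[dict]],
) -> Iterator[tuple[str, str]]:
    """Pull the whole reference set once, page by page.

    fetch_page is a hypothetical wrapper around the REST API: it takes a
    1-based page number and returns a list of dicts, or an empty list
    when there are no more pages.
    """
    page = 1
    while True:
        items = fetch_page(page)
        if not items:
            break
        for item in items:
            # Field names are assumptions made for this sketch.
            yield (item["product_code"], item["category"])
        page += 1


def snapshot_and_join(
    spark,
    rows: Iterable[tuple[str, str]],
    facts_df,
    table_name: str = "ref.product_reference",  # hypothetical table name
):
    """Persist the reference rows to a Delta table, then join in Spark."""
    ref_df = spark.createDataFrame(
        list(rows), schema="product_code string, category string"
    )
    # Overwrite the snapshot so each batch run sees one consistent copy.
    ref_df.write.format("delta").mode("overwrite").saveAsTable(table_name)

    # The join runs inside the cluster; no API call happens per row.
    reference = spark.table(table_name)
    return facts_df.join(reference, on="product_code", how="left")
```

With this shape, the number of API calls depends on the number of pages in the reference set rather than on the number of rows in the batch. The left join keeps rows whose product code is missing from the snapshot, which makes gaps in the reference data visible downstream.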