DE Professional Practice Q15

A. Snapshot the API data to Delta and join it in Spark

Looking up each product code against a REST endpoint while the batch job runs would force one external call per product code. That pattern does not scale in Spark for large batch jobs, and it adds avoidable latency and rate-limit risk. The Databricks exam guide emphasizes Spark SQL/PySpark joins on large datasets and Delta Lake for scalable transformations. Persisting the reference set to Delta first lets Spark perform an in-cluster join instead of repeatedly invoking the API during the job.
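The snapshot-then-join pattern can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: `fetch_page`, the `{"items": ..., "next": ...}` response shape, the `product_code` join column, and `table_path` are all assumptions made for the example, and `spark` is assumed to be an existing SparkSession with Delta Lake available.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, str]


def fetch_reference_rows(fetch_page: Callable[[Optional[str]], dict]) -> List[Row]:
    """Pull the whole reference set with a few paginated calls.

    fetch_page is a hypothetical wrapper around the REST endpoint. It is
    assumed to return {"items": [...], "next": <cursor or None>}.
    """
    rows: List[Row] = []
    cursor: Optional[str] = None
    while True:
        page = fetch_page(cursor)
        rows.extend(page["items"])
        cursor = page.get("next")
        if cursor is None:
            return rows


def snapshot_and_join(spark, orders_df, rows: List[Row], table_path: str):
    """Persist the snapshot to Delta, then join it inside the cluster."""
    ref_df = spark.createDataFrame(rows)
    # Overwrite the previous snapshot (assumed refresh policy).
    ref_df.write.format("delta").mode("overwrite").save(table_path)
    snapshot = spark.read.format("delta").load(table_path)
    # Assumed join key: both DataFrames carry a product_code column.
    return orders_df.join(snapshot, on="product_code", how="left")
```

With this shape, the number of API calls depends on the number of pages in the reference set, not on the number of rows in the batch.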

B. Call the API from a Python UDF for each row

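A Python UDF runs once per row, so calling the API inside it issues one external request per row. That adds network latency to every record and risks hitting rate limits, so it is not a scalable pattern for external API lookups.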

C. Use `requests` inside `mapPartitions` for every row in every batch

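`mapPartitions` lets you set up a connection once per partition, but this option still issues a request for every row in every batch. The number of external calls is unchanged, so the same latency and rate-limit problems remain.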
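A small sketch makes the point concrete. It is illustrative only: `open_session` is a hypothetical factory that returns a `lookup(code)` callable (in practice it might wrap a `requests.Session`), and the returned function has the iterator-in, iterator-out shape that `rdd.mapPartitions` expects.

```python
from typing import Callable, Iterable, Iterator, Tuple


def make_partition_lookup(open_session: Callable[[], Callable[[str], str]]):
    """Build a per-partition function in the shape mapPartitions expects.

    open_session is a hypothetical factory returning lookup(code) -> name.
    """
    def lookup_partition(codes: Iterable[str]) -> Iterator[Tuple[str, str]]:
        lookup = open_session()        # setup happens once per partition
        for code in codes:
            yield code, lookup(code)   # still one external call per row
    return lookup_partition
```

The session setup is amortized across the partition, but the call count still equals the row count, which is why this option does not fix the underlying problem.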

D. Query the API from the driver whenever a task completes

The driver is a single process, so routing API calls through it serializes the lookups and creates a bottleneck. The work does not scale with the number of executors or tasks.
