Databricks Certified Data Engineer Professional Study Guide
Use the domain outline below to connect the exam topics, including developing code for data processing using Python and SQL; data ingestion and acquisition; data transformation, cleansing, and quality; and data sharing and federation, to scenario-based questions and explanations.
How the Exam Is Structured
The Databricks Certified Data Engineer Professional (DE Professional) exam validates skills across ten domains, including developing code for data processing using Python and SQL; data ingestion and acquisition; data transformation, cleansing, and quality; and data sharing and federation. The ExamPal practice bank includes 291 premium questions and 40 free questions mapped across the official blueprint.
| Domain | Weight | Focus |
|---|---|---|
| Domain 1: Developing Code for Data Processing using Python and SQL | 20% | Using Python and Tools for development; Design and implement a scalable Python project structure optimized for Databricks Asset Bundles (DABs), enabling modular development, deployment automation, and CI/CD integration |
| Domain 2: Data Ingestion & Acquisition | 10% | Design and implement data ingestion pipelines to efficiently ingest a variety of data formats including Delta Lake, Parquet, ORC, AVRO, JSON, CSV, XML, Text and Binary from diverse sources such as message buses and cloud storage; Create an append-only data pipeline capable of handling both batch and streaming data using Delta |
| Domain 3: Data Transformation, Cleansing, and Quality | 10% | Write efficient Spark SQL and PySpark code to apply advanced data transformations, including window functions, joins, and aggregations, to manipulate and analyze large datasets; Use window functions |
| Domain 4: Data Sharing and Federation | 5% | Demonstrate Delta Sharing securely between Databricks deployments using Databricks-to-Databricks Sharing (D2D), or to external platforms using the open sharing protocol (D2O); Secure sharing between Databricks deployments |
| Domain 5: Monitoring and Alerting | 10% | Monitoring; Use system tables for observability |
| Domain 6: Cost & Performance Optimisation | 5% | Understand how and why using Unity Catalog managed tables reduces operational overhead and maintenance burden; Managed tables reduce overhead |
| Domain 7: Ensuring Data Security and Compliance | 10% | Applying Data Security mechanisms; Data security mechanisms |
| Domain 8: Data Governance | 5% | Create and add descriptions/metadata about enterprise data to make it more discoverable; Demonstrate understanding of Unity Catalog permission inheritance model |
| Domain 9: Debugging and Deploying | 15% | Debugging and Troubleshooting; Identify pertinent diagnostic information using Spark UI, cluster logs, system tables, and query profiles to troubleshoot errors |
| Domain 10: Data Modelling | 10% | Design and implement scalable data models using Delta Lake to manage large datasets; Simplify data layout decisions and optimize query performance using Liquid Clustering |
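
As a rough way to prioritise study time, the weights in the table can be tallied in a few lines of Python. This is only a sketch: the percentages are copied from the table above, the short domain labels are illustrative shorthand, and nothing here comes from an official scoring tool.

```python
# Domain weights (percent of exam) copied from the table above.
# The short labels are illustrative shorthand, not official names.
weights = {
    "Python/SQL processing": 20,
    "Ingestion": 10,
    "Transformation/quality": 10,
    "Sharing/federation": 5,
    "Monitoring": 10,
    "Cost/performance": 5,
    "Security/compliance": 10,
    "Governance": 5,
    "Debugging/deploying": 15,
    "Modelling": 10,
}

# Sanity check: the listed weights should cover the whole exam.
total = sum(weights.values())

# Rank domains from heaviest to lightest; ties keep the table order
# because Python's sort is stable.
ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)

# Share of the exam covered by the two heaviest domains.
top_two_share = sum(weight for _, weight in ranked[:2])

print(f"total weight: {total}%")
for name, weight in ranked:
    print(f"{weight:>3}%  {name}")
print(f"two heaviest domains: {top_two_share}%")
```

With the weights as listed, the two heaviest domains (Python/SQL processing and debugging/deploying) account for 35% of the exam.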
20% of exam
Domain 1: Developing Code for Data Processing using Python and SQL
This section covers building data-processing code in Python and SQL for the Databricks Lakehouse Platform. It emphasizes scalable project structure, dependency management, UDFs, ETL pipeline development, orchestration, environment configuration, and testing for production-grade data engineering solutions.
10% of exam
Domain 2: Data Ingestion & Acquisition
Covers designing and implementing data ingestion pipelines for efficiently ingesting a variety of data formats from diverse sources. It also includes building append-only pipelines that can handle both batch and streaming data using Delta.
10% of exam
Domain 3: Data Transformation, Cleansing, and Quality
Covers advanced data transformation, cleansing, and quality practices for working with large datasets. The section emphasizes efficient Spark SQL and PySpark implementations, including window functions, joins, and aggregations, as well as processes for isolating bad data using Lakeflow Declarative Pipelines or Auto Loader in classic jobs.
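One common cleansing pattern built on a window function is keeping only the latest record per key. The sketch below is a minimal illustration under stated assumptions: it uses Python's built-in `sqlite3` purely as a stand-in engine so it runs anywhere, and the `raw_customers` table and its columns are hypothetical. Spark SQL accepts the same `ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)` form.

```python
import sqlite3

# In-memory database used only as a stand-in for a SQL engine.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_customers (customer_id INTEGER, email TEXT, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO raw_customers VALUES (?, ?, ?)",
    [
        (1, "old@example.com", "2024-01-01"),
        (1, "new@example.com", "2024-03-01"),  # newer duplicate of customer 1
        (2, "only@example.com", "2024-02-15"),
    ],
)

# Number the rows within each customer_id from newest to oldest,
# then keep only the newest row (rn = 1).
dedup_sql = """
SELECT customer_id, email, updated_at
FROM (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY customer_id
               ORDER BY updated_at DESC
           ) AS rn
    FROM raw_customers
) AS ranked
WHERE rn = 1
ORDER BY customer_id
"""
latest_rows = conn.execute(dedup_sql).fetchall()
print(latest_rows)
```

Each customer appears once in `latest_rows`, with the older duplicate for customer 1 dropped.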
5% of exam
Domain 4: Data Sharing and Federation
This section covers secure data sharing between Databricks deployments and with external platforms, as well as federation across supported source systems. It emphasizes Delta Sharing, Databricks-to-Databricks sharing, open sharing protocols, and Lakehouse Federation governance.
10% of exam
Domain 5: Monitoring and Alerting
This section covers observability and alerting practices for Databricks workloads, including how to monitor resource utilization, cost, auditing, and workload performance. It also covers the tools and interfaces used to create alerts for data quality and job or pipeline issues.
5% of exam
Domain 6: Cost & Performance Optimisation
Covers techniques for reducing operational overhead and improving query performance in Databricks and Unity Catalog environments. The section emphasizes managed tables, Delta optimization features, query execution tuning, and the use of query profiles to diagnose bottlenecks on large datasets.
10% of exam
Domain 7: Ensuring Data Security and Compliance
Covers security controls and compliance practices for protecting workspace objects and sensitive table data. The section includes access control, row and column-level protection, anonymization techniques, PII masking, and data retention/purging requirements.
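Two of the techniques named above, pseudonymization by hashing and partial masking, can be sketched in plain Python. This is a minimal illustration, not how Databricks implements them: the function names `pseudonymize` and `mask_email` and the salt value are hypothetical, and on Databricks the equivalent logic would more likely live in SQL functions or in Unity Catalog row filters and column masks.

```python
import hashlib


def pseudonymize(value: str, salt: str) -> str:
    """Replace a value with a salted SHA-256 token (hypothetical helper).

    The same input and salt always give the same token, so joins on the
    token still work; this is pseudonymization, not full anonymization.
    """
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def mask_email(email: str) -> str:
    """Keep the first character and the domain, hide the rest (hypothetical helper)."""
    local, sep, domain = email.partition("@")
    if not sep or not local:
        # Not a recognisable address: hide it completely.
        return "***"
    return f"{local[0]}***@{domain}"


token = pseudonymize("alice@example.com", salt="demo-salt")
print(len(token))                        # SHA-256 hex digest length
print(mask_email("alice@example.com"))   # a***@example.com
```

Hashing keeps values joinable across tables, while masking keeps them partly readable for support staff; which one fits depends on who needs to see what.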
5% of exam
Domain 8: Data Governance
Covers the governance of enterprise data, including how metadata and descriptions improve discoverability and how permissions are inherited in Unity Catalog. The section focuses on making data easier to find and on understanding access control behavior within the catalog.
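The inheritance behaviour can be pictured with a toy model in Python. This sketch is a deliberate simplification and not a Databricks API: grants are plain tuples, the helper names are hypothetical, and it models only two rules, namely that a privilege granted on a catalog or schema also applies to the objects beneath it, and that reading a table requires `USE CATALOG`, `USE SCHEMA`, and `SELECT`.

```python
def ancestors(full_name):
    """Yield the object and each parent: 'c.s.t' -> 'c.s.t', 'c.s', 'c'."""
    parts = full_name.split(".")
    for end in range(len(parts), 0, -1):
        yield ".".join(parts[:end])


def has_privilege(grants, principal, privilege, full_name):
    """True if the privilege was granted on the object or on any parent."""
    return any(
        (principal, privilege, name) in grants for name in ancestors(full_name)
    )


def can_select(grants, principal, table):
    """Check the three privileges needed to read a three-level table name."""
    catalog, schema, _ = table.split(".")
    return (
        has_privilege(grants, principal, "USE CATALOG", catalog)
        and has_privilege(grants, principal, "USE SCHEMA", f"{catalog}.{schema}")
        and has_privilege(grants, principal, "SELECT", table)
    )


# Hypothetical grants: SELECT is given on the schema, not on any table.
grants = {
    ("analysts", "USE CATALOG", "main"),
    ("analysts", "USE SCHEMA", "main.sales"),
    ("analysts", "SELECT", "main.sales"),
}

print(can_select(grants, "analysts", "main.sales.orders"))  # True: inherited
print(can_select(grants, "analysts", "main.hr.salaries"))   # False: no grant on main.hr
```

The `analysts` group can read every table in `main.sales` without a per-table grant, but gains nothing in the sibling schema `main.hr`.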
15% of exam
Domain 9: Debugging and Deploying
This section covers troubleshooting failed jobs and pipelines using Databricks diagnostic tools, then deploying Databricks resources through CI/CD workflows. It includes both debugging operational issues and implementing deployment automation with Databricks-native tooling.
10% of exam
Domain 10: Data Modelling
Covers designing and implementing scalable data models using Delta Lake for large datasets. It also includes data layout optimization with Liquid Clustering, comparing it with partitioning and Z-ordering, and designing dimensional models for analytical workloads.
Key Terms to Know
These terms are loaded from the shared terminology pack and appear across the question explanations.
- ACID
  - Atomicity, consistency, isolation, and durability; the exam references ACID transaction behavior for Delta Lake operations.
- ACL
  - Access control list; a security mechanism for controlling access to workspace objects and data assets.
- ACLs
  - Plural of ACL; access control lists used to secure workspace objects and enforce least-privilege access.
- APPLY CHANGES APIs
  - APIs used in Lakeflow Declarative Pipelines to simplify change data capture (CDC).
- Apache Spark
  - The distributed processing engine used on Databricks for ETL, streaming, SQL, and large-scale data transformations.
- Auto Loader
  - A Databricks ingestion capability used to build reliable batch and streaming data pipelines that efficiently ingest new files from sources such as cloud storage.
- CDC
  - Change data capture; the exam references APPLY CHANGES APIs as a way to simplify CDC in Lakeflow Declarative Pipelines.
- CDF
  - Change Data Feed; the Delta Lake feature that exposes data changes (see Change Data Feed).
- CI/CD
  - Continuous integration and continuous delivery/deployment; the exam covers integrating Databricks development and deployment workflows with CI/CD.
- CTAS
  - CREATE TABLE AS SELECT; a statement that creates a derivative table from a query result, for example creating a sales table from a marketing table.
- Change Data Feed
  - A Delta Lake feature that exposes row-level changes (inserts, updates, and deletes); exam scenarios use it to address limitations of streaming tables and to improve latency.
- D2D
  - Databricks-to-Databricks Sharing; sharing data securely between Databricks deployments.
- D2O
  - Databricks-to-Open sharing; sharing data from Databricks to external platforms using an open sharing protocol.
- DABs
  - Databricks Asset Bundles; shorthand for Databricks deployment packaging and automation.
- DBFS
  - Databricks File System; a storage layer. In one exam scenario it appears in an answer choice as the place where an encoded password would be saved.
- DEEP CLONE
  - A Delta Lake cloning feature that creates a new table by copying both the data and the metadata of a source table. The clone is an independent table and can be re-synced incrementally with changes committed to the source.
- DataFrame.transform
  - A DataFrame method used in testing and transformation workflows to apply a function to a DataFrame.
- Databricks Asset Bundles
  - A Databricks deployment mechanism used to package resources for modular development, deployment automation, and CI/CD integration.
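
CTAS from the list above can be tried without a cluster. The sketch below uses Python's built-in `sqlite3` purely as a stand-in engine; the `marketing_leads` and `sales_leads` tables and their columns are hypothetical. On Databricks the same `CREATE TABLE ... AS SELECT` statement would run as Spark SQL and, by default, create a Delta table.

```python
import sqlite3

# In-memory database used only as a stand-in for a SQL engine.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE marketing_leads (lead_id INTEGER, region TEXT, qualified INTEGER)"
)
conn.executemany(
    "INSERT INTO marketing_leads VALUES (?, ?, ?)",
    [(1, "EMEA", 1), (2, "APAC", 0), (3, "EMEA", 1)],
)

# CTAS: the new table's columns and rows both come from the query result.
conn.execute(
    """
    CREATE TABLE sales_leads AS
    SELECT lead_id, region
    FROM marketing_leads
    WHERE qualified = 1
    """
)

sales_rows = conn.execute(
    "SELECT lead_id, region FROM sales_leads ORDER BY lead_id"
).fetchall()
print(sales_rows)
```

The derived table is a snapshot taken when the statement runs; later changes to the source table are not reflected in it unless the statement is run again.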
Official Materials and Guidance
This page is built from official Databricks materials and the ExamPal shared release pack: the shared syllabus, topic tree, terminology pack, free pack, and premium pack.
- Databricks DE Professional Exam Guide
- Guidance: Official Databricks exam guide PDF with sample questions
- Domain outline: The official guide lists the sections, but the saved copy of the official guide does not publish percentages. Sections: Python/SQL processing; ingestion; transformation/quality; sharing/federation; monitoring; cost/performance; security/compliance; governance; debugging/deploying; modelling.