Databricks Certified Data Engineer Professional Study Guide

Use the saved domain outline to connect the exam topics to scenario-based questions and explanations. The topics include developing code for data processing using Python and SQL; data ingestion and acquisition; data transformation, cleansing, and quality; and data sharing and federation.

How the Exam Is Structured

Databricks Certified Data Engineer Professional (DE Professional) validates skills across ten domains, including developing code for data processing using Python and SQL; data ingestion and acquisition; data transformation, cleansing, and quality; and data sharing and federation. The ExamPal practice bank includes 291 premium questions and 40 free questions mapped across the official blueprint.

Domain | Weight | Focus
Domain 1: Developing Code for Data Processing using Python and SQL | 20% | Using Python and Tools for development; Design and implement a scalable Python project structure optimized for Databricks Asset Bundles (DABs), enabling modular development, deployment automation, and CI/CD integration
Domain 2: Data Ingestion & Acquisition | 10% | Design and implement data ingestion pipelines to efficiently ingest a variety of data formats including Delta Lake, Parquet, ORC, AVRO, JSON, CSV, XML, Text and Binary from diverse sources such as message buses and cloud storage; Create an append-only data pipeline capable of handling both batch and streaming data using Delta
Domain 3: Data Transformation, Cleansing, and Quality | 10% | Write efficient Spark SQL and PySpark code to apply advanced data transformations, including window functions, joins, and aggregations, to manipulate and analyze large datasets; Use window functions
Domain 4: Data Sharing and Federation | 5% | Demonstrate Delta Sharing securely between Databricks deployments using Databricks-to-Databricks Sharing (D2D) or to external platforms using the open sharing protocol (D2O); Secure sharing between Databricks deployments
Domain 5: Monitoring and Alerting | 10% | Monitoring; Use system tables for observability
Domain 6: Cost & Performance Optimisation | 5% | Understand how and why using Unity Catalog managed tables reduces operational overhead and maintenance burden; Managed tables reduce overhead
Domain 7: Ensuring Data Security and Compliance | 10% | Applying Data Security mechanisms; Data security mechanisms
Domain 8: Data Governance | 5% | Create and add descriptions/metadata about enterprise data to make it more discoverable; Demonstrate understanding of Unity Catalog permission inheritance model
Domain 9: Debugging and Deploying | 15% | Debugging and Troubleshooting; Identify pertinent diagnostic information using Spark UI, cluster logs, system tables, and query profiles to troubleshoot errors
Domain 10: Data Modelling | 10% | Design and implement scalable data models using Delta Lake to manage large datasets; Simplify data layout decisions and optimize query performance using Liquid Clustering

20% of exam

Domain 1: Developing Code for Data Processing using Python and SQL

This section covers building data-processing code in Python and SQL for the Databricks Lakehouse Platform. It emphasizes scalable project structure, dependency management, UDFs, ETL pipeline development, orchestration, environment configuration, and testing for production-grade data engineering solutions.

Using Python and Tools for development
Design and implement a scalable Python project structure optimized for Databricks Asset Bundles (DABs), enabling modular development, deployment automation, and CI/CD integration
Manage third-party library installations
Develop User-Defined Functions
Building and Testing an ETL pipeline with Lakeflow Declarative Pipelines, SQL, and Apache Spark on the Databricks platform
Build and manage reliable, production-ready data pipelines
Create and Automate ETL workloads using Jobs via UI/APIs/CLI
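
The UDF and testing objectives above can be practiced without a cluster by writing the UDF logic as a plain Python function first and unit-testing it directly. The following is a minimal sketch under that assumption; the function name `normalize_email` and its rule are hypothetical, and the Spark wrapping step is shown only as a comment because it needs a Spark session.

```python
from typing import Optional


def normalize_email(raw: Optional[str]) -> Optional[str]:
    """Hypothetical cleansing rule: trim, lowercase, and reject malformed values."""
    if raw is None:
        return None
    cleaned = raw.strip().lower()
    # Accept only "local@domain" with exactly one "@"; anything else becomes
    # None, mirroring how a SQL NULL would flow through a pipeline.
    local, sep, domain = cleaned.partition("@")
    if not sep or not local or not domain or "@" in domain:
        return None
    return cleaned


# On a cluster, the same function could be wrapped as a UDF, for example:
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import StringType
#   normalize_email_udf = udf(normalize_email, StringType())
# That step is left out so this sketch runs without Spark.
```

Keeping the rule in a plain function means the same code can be imported by a unit test in CI and by the pipeline that registers it as a UDF.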

10% of exam

Domain 2: Data Ingestion & Acquisition

Covers designing and implementing pipelines that efficiently ingest a variety of data formats from diverse sources. It also includes building append-only pipelines that can handle both batch and streaming data using Delta.

Design and implement data ingestion pipelines to efficiently ingest a variety of data formats including Delta Lake, Parquet, ORC, AVRO, JSON, CSV, XML, Text and Binary from diverse sources such as message buses and cloud storage
Create an append-only data pipeline capable of handling both batch and streaming data using Delta

10% of exam

Domain 3: Data Transformation, Cleansing, and Quality

Covers advanced data transformation, cleansing, and quality practices for working with large datasets. The section emphasizes efficient Spark SQL and PySpark implementations, including window functions, joins, and aggregations, as well as processes for isolating bad data using Lakeflow Declarative Pipelines or Auto Loader in classic jobs.

Write efficient Spark SQL and PySpark code to apply advanced data transformations, including window functions, joins, and aggregations, to manipulate and analyze large datasets
Use window functions
Use joins
Use aggregations
Develop a quarantining process for bad data with Lakeflow Declarative Pipelines or Auto Loader in classic jobs
Quarantine bad data
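
The quarantining objective above comes down to one pattern: evaluate each record against named quality rules, send passing records to the clean output, and send failing records to a separate quarantine output along with the reason. The sketch below shows that pattern in plain Python only. It is not the Lakeflow Declarative Pipelines or Auto Loader API, and the rule names and record fields are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]
Rule = Callable[[Record], bool]

# Hypothetical data-quality rules: each returns True when the record passes.
RULES: Dict[str, Rule] = {
    "order_id_present": lambda r: r.get("order_id") is not None,
    "amount_non_negative": lambda r: isinstance(r.get("amount"), (int, float))
    and r["amount"] >= 0,
}


def split_valid_and_quarantine(
    records: List[Record],
) -> Tuple[List[Record], List[Record]]:
    """Route each record to the clean output or to quarantine.

    Quarantined records carry a "failed_rules" list so they can be
    inspected and replayed later.
    """
    valid: List[Record] = []
    quarantined: List[Record] = []
    for record in records:
        failed = [name for name, rule in RULES.items() if not rule(record)]
        if failed:
            quarantined.append({**record, "failed_rules": failed})
        else:
            valid.append(record)
    return valid, quarantined
```

Recording which rule failed, rather than silently dropping the row, is what separates quarantining from plain filtering.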

5% of exam

Domain 4: Data Sharing and Federation

This section covers secure data sharing between Databricks deployments and with external platforms, as well as federation across supported source systems. It emphasizes Delta Sharing, Databricks-to-Databricks sharing, open sharing protocols, and Lakehouse Federation governance.

Demonstrate Delta Sharing securely between Databricks deployments using Databricks-to-Databricks Sharing (D2D) or to external platforms using the open sharing protocol (D2O)
Secure sharing between Databricks deployments
Configure Lakehouse Federation with proper governance across supported source systems
Lakehouse Federation governance
Use Delta Sharing to share live data from the Lakehouse to any computing platform
Share live Lakehouse data

10% of exam

Domain 5: Monitoring and Alerting

This section covers observability and alerting practices for Databricks workloads, including how to monitor resource utilization, cost, auditing, and workload performance. It also covers the tools and interfaces used to create alerts for data quality and job or pipeline issues.

Monitoring
Use system tables for observability
Use Query Profiler UI and Spark UI
Use the Databricks REST APIs/Databricks CLI
Use Lakeflow Declarative Pipelines Event Logs
Alerting
Use SQL Alerts to monitor data quality

5% of exam

Domain 6: Cost & Performance Optimisation

Covers techniques for reducing operational overhead and improving query performance in Databricks and Unity Catalog environments. The section emphasizes managed tables, Delta optimization features, query execution tuning, and the use of query profiles to diagnose bottlenecks on large datasets.

Understand how and why using Unity Catalog managed tables reduces operational overhead and maintenance burden
Managed tables reduce overhead
Understand Delta optimization techniques, such as deletion vectors and liquid clustering
Deletion vectors and liquid clustering
Understand the optimization techniques used by Databricks to ensure the performance of queries on large datasets (data skipping, file pruning, etc)
Data skipping and file pruning
Apply Change Data Feed (CDF) to address specific limitations of streaming tables and improve latency
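
The data skipping and file pruning objective above rests on a simple idea: if per-file minimum and maximum values are known for a column, a range filter can rule out whole files without reading them. The sketch below is a toy model of that idea, not Delta Lake's implementation; the `FileStats` type and file names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file statistics for one column (min and max only)."""

    path: str
    min_value: int
    max_value: int


def files_to_scan(files: List[FileStats], lower: int, upper: int) -> List[str]:
    """Return only files whose [min, max] range overlaps the filter [lower, upper].

    A file is skipped when every value in it must fall outside the filter.
    """
    return [
        f.path
        for f in files
        if f.max_value >= lower and f.min_value <= upper
    ]


if __name__ == "__main__":
    files = [
        FileStats("part-0", 1, 100),
        FileStats("part-1", 101, 200),
        FileStats("part-2", 201, 300),
    ]
    # A filter on values between 150 and 160 only needs the middle file.
    print(files_to_scan(files, 150, 160))
```

The model also suggests why data layout matters: when related values are stored together, each file covers a narrow range and more files can be skipped for a given filter.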

10% of exam

Domain 7: Ensuring Data Security and Compliance

Covers security controls and compliance practices for protecting workspace objects and sensitive table data. The section includes access control, row and column-level protection, anonymization techniques, PII masking, and data retention/purging requirements.

Applying Data Security mechanisms
Data security mechanisms
Use ACLs to secure workspace objects, enforcing principles such as least privilege and policy enforcement
ACLs for workspace objects
Use row filters and column masks to filter and mask sensitive table data
Row filters and column masks
Apply anonymization and pseudonymization methods such as Hashing, Tokenization, Suppression, and Generalization to confidential data
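
The four methods named in the last objective differ in what they preserve. The sketch below illustrates each one with the Python standard library, as a study aid rather than a production design. The key, token format, and bucket size are all hypothetical, and a real key would come from a secret store rather than source code.

```python
import hashlib
import hmac
import secrets
from typing import Dict

# Hypothetical key for illustration only; never hard-code a real secret.
PSEUDONYM_KEY = b"example-only-key"


def hash_value(value: str) -> str:
    """Hashing: keyed SHA-256 (HMAC), so equal inputs give equal digests
    that cannot be reversed to the original value."""
    return hmac.new(PSEUDONYM_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()


class Tokenizer:
    """Tokenization: swap a value for a random token and keep the mapping
    in a vault, so authorized systems can look the original up again."""

    def __init__(self) -> None:
        self._vault: Dict[str, str] = {}

    def tokenize(self, value: str) -> str:
        if value not in self._vault:
            self._vault[value] = "tok_" + secrets.token_hex(8)
        return self._vault[value]


def suppress(_value: str) -> None:
    """Suppression: remove the value entirely."""
    return None


def generalize_age(age: int, bucket: int = 10) -> str:
    """Generalization: report a range instead of the exact value."""
    low = (age // bucket) * bucket
    return f"{low}-{low + bucket - 1}"
```

Hashing and tokenization keep values joinable across tables (pseudonymization), while suppression and generalization deliberately lose detail (anonymization).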

5% of exam

Domain 8: Data Governance

Covers the governance of enterprise data, including how metadata and descriptions improve discoverability and how permissions are inherited in Unity Catalog. The section focuses on making data easier to find and on understanding access control behavior within the catalog.

Create and add descriptions/metadata about enterprise data to make it more discoverable
Demonstrate understanding of Unity Catalog permission inheritance model
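
The permission inheritance objective can be rehearsed with a small model: securables form a catalog, schema, table hierarchy, and a privilege granted on a parent applies to the objects beneath it. The sketch below is a simplified toy model for study purposes, not the Unity Catalog API. The `Grants` structure and the principal and object names are hypothetical, and the model ignores ownership, groups, and other securable types.

```python
from typing import Dict, List, Set, Tuple

# (principal, securable) -> privileges granted directly on that securable.
# Securables are named "catalog", "catalog.schema", or "catalog.schema.table".
Grants = Dict[Tuple[str, str], Set[str]]


def ancestors_and_self(securable: str) -> List[str]:
    """'main.sales.orders' -> ['main', 'main.sales', 'main.sales.orders']."""
    parts = securable.split(".")
    return [".".join(parts[: i + 1]) for i in range(len(parts))]


def has_privilege(grants: Grants, principal: str, privilege: str, securable: str) -> bool:
    """A privilege counts if granted on the securable itself or on any parent."""
    return any(
        privilege in grants.get((principal, s), set())
        for s in ancestors_and_self(securable)
    )


def can_select(grants: Grants, principal: str, table: str) -> bool:
    """Reading a table needs SELECT plus USE CATALOG and USE SCHEMA on its parents.

    Assumes a three-level name of the form catalog.schema.table.
    """
    catalog, schema, _ = ancestors_and_self(table)
    return (
        has_privilege(grants, principal, "USE CATALOG", catalog)
        and has_privilege(grants, principal, "USE SCHEMA", schema)
        and has_privilege(grants, principal, "SELECT", table)
    )
```

In this model, granting SELECT on a schema covers every table in it, but the principal still cannot read anything without the usage privileges on the containing catalog and schema.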

15% of exam

Domain 9: Debugging and Deploying

This section covers troubleshooting failed jobs and pipelines using Databricks diagnostic tools, then deploying Databricks resources through CI/CD workflows. It includes both debugging operational issues and implementing deployment automation with Databricks-native tooling.

Debugging and Troubleshooting
Identify pertinent diagnostic information using Spark UI, cluster logs, system tables, and query profiles to troubleshoot errors
Analyze the errors and remediate the failed job runs with job repairs and parameter overrides
Use Lakeflow Declarative Pipelines event logs & the Spark UI to debug Lakeflow Declarative Pipelines and Spark pipelines
Deploying CI/CD
Build and Deploy Databricks resources using Databricks Asset Bundles
Configure and integrate with Git-based CI/CD workflows using Databricks Git Folders for notebook and code deployment

10% of exam

Domain 10: Data Modelling

Covers designing and implementing scalable data models using Delta Lake for large datasets. It also includes data layout optimization with Liquid Clustering, comparing it to Partitioning and ZOrder, and designing dimensional models for analytical workloads.

Design and implement scalable data models using Delta Lake to manage large datasets
Simplify data layout decisions and optimize query performance using Liquid Clustering
Identify the benefits of using Liquid Clustering over Partitioning and ZOrder
Design Dimensional Models for analytical workloads, ensuring efficient querying and aggregation

Key Terms to Know

These terms are loaded from the shared terminology pack and appear across the question explanations.

ACID
Atomicity, consistency, isolation, and durability; the exam references ACID transaction behavior for Delta Lake operations.
ACL
Access control list; a security mechanism for controlling access to workspace objects and data assets.
ACLs
Access control lists used to secure workspace objects and enforce least-privilege access.
APPLY CHANGES APIs
APIs used in Lakeflow Declarative Pipelines to simplify change data capture (CDC).
Apache Spark
The distributed processing engine used on Databricks for ETL, streaming, SQL, and large-scale data transformations.
Auto Loader
A Databricks ingestion capability used to build reliable batch and streaming data pipelines that efficiently ingest new files from sources such as cloud storage.
CDC
Change Data Capture; the exam references APPLY CHANGES APIs as a way to simplify CDC in Lakeflow Declarative Pipelines.
CDF
Change Data Feed; the Delta Lake feature that exposes row-level data changes.
CI/CD
Continuous integration and continuous delivery/deployment; the exam covers integrating Databricks development and deployment workflows with CI/CD.
CTAS
CREATE TABLE AS SELECT; a statement that creates a derivative table from a query result, for example as a possible solution for creating a sales table from a marketing table.
Change Data Feed
A Delta Lake feature that exposes row-level changes and is used here to address limitations of streaming tables and improve latency.
D2D
Databricks-to-Databricks Sharing; sharing data securely between Databricks deployments.
D2O
Databricks-to-Open sharing; sharing data from Databricks to external platforms using an open sharing protocol.
DABs
Databricks Asset Bundles; a shorthand used for Databricks deployment packaging and automation.
DBFS
Databricks File System; a storage layer. In one answer choice it appears as the place where an encoded password would be saved.
DEEP CLONE
A Delta Lake cloning feature that creates a new table and copies both data and metadata from the source. Re-running the clone brings the copy back in sync with changes committed to the source table.
DataFrame.transform
A DataFrame method used in testing and transformation workflows to apply a function to a DataFrame.
Databricks Asset Bundles
A Databricks deployment mechanism used to package resources for modular development, deployment automation, and CI/CD integration.

Official Materials and Guidance

This page is built from official Databricks materials and the ExamPal shared release pack: the shared syllabus, topic tree, terminology pack, free pack, and premium pack.

  • Databricks DE Professional Exam Guide
  • Guidance: Official Databricks exam guide PDF with sample questions
  • Domain outline: The official guide lists the sections, but the saved copy of the guide does not publish percentages: Python/SQL processing; ingestion; transformation/quality; sharing/federation; monitoring; cost/performance; security/compliance; governance; debugging/deploying; modelling.