Databricks Certified Machine Learning Associate Study Guide
Use the domain outline to connect Databricks machine learning, data processing, model development, and model deployment to scenario-based questions and explanations.
How the Exam Is Structured
The Databricks Certified Machine Learning Associate (ML Associate) exam validates skills in Databricks machine learning, data processing, model development, and model deployment. The ExamPal practice bank includes 372 premium questions and 40 free questions mapped across the official blueprint.
| Domain | Weight | Focus |
|---|---|---|
| Domain 1: Databricks Machine Learning | 29% | Identify the best practices of an MLOps strategy |
| Domain 2: Data Processing | 29% | Compute summary statistics on a Spark DataFrame using .summary() or dbutils data summaries; Remove outliers from a Spark DataFrame based on standard deviation or IQR |
| Domain 3: Model Development | 31% | Use ML foundations to select the appropriate algorithm for a given model scenario; Identify methods to mitigate data imbalance in training data |
| Domain 4: Model Deployment | 11% | Identify the differences and advantages of model serving approaches: batch, realtime, and streaming; Deploy a custom model to a model endpoint |
29% of exam
Domain 1: Databricks Machine Learning
This section covers the core Databricks machine learning workflow for the Databricks Certified Machine Learning Associate exam. It emphasizes MLOps strategy, AutoML, Unity Catalog feature store usage, MLflow tracking and model registry, and model promotion practices.
29% of exam
Domain 2: Data Processing
This section covers practical data preparation and exploratory analysis tasks in Spark. It includes summary statistics, outlier handling, visualization, feature comparison, missing-value imputation, one-hot encoding, and log-scale transformation decisions.
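The IQR-based outlier rule mentioned above can be sketched in plain Python. This is only the arithmetic, not Spark code: the function name `remove_outliers_iqr`, the multiplier `k=1.5`, and the sample data are illustrative assumptions. On a Spark DataFrame the quartiles would typically come from something like `DataFrame.approxQuantile`, followed by a filter on the resulting bounds.

```python
import statistics


def remove_outliers_iqr(values, k=1.5):
    """Keep only values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    # quantiles(n=4) returns the three quartile cut points: Q1, median, Q3.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lower <= v <= upper]


# Hypothetical sample: six typical readings and one extreme value.
readings = [10, 12, 11, 13, 12, 11, 95]
cleaned = remove_outliers_iqr(readings)
print(cleaned)  # [10, 12, 11, 13, 12, 11]
```

The multiplier 1.5 is a common convention, not a fixed rule; a larger `k` keeps more of the tails.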
31% of exam
Domain 3: Model Development
This section covers the practical skills needed to develop, tune, and evaluate machine learning models. It emphasizes selecting appropriate algorithms, handling imbalanced data, building training pipelines, and using validation strategies and metrics to assess model performance.
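One common way to handle imbalanced training data is to weight each class inversely to its frequency. The sketch below uses the "balanced" heuristic, `n_samples / (n_classes * class_count)`, which is the same formula scikit-learn applies for `class_weight="balanced"`. The helper name `balanced_class_weights` and the toy labels are assumptions for illustration; other mitigations, such as resampling, may also be tested.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return a weight per class: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical labels: 8 negatives and 2 positives (an 80/20 imbalance).
labels = [0] * 8 + [1] * 2
weights = balanced_class_weights(labels)
print(weights)  # {0: 0.625, 1: 2.5}
```

The minority class receives the larger weight, so each of its errors counts more during training.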
11% of exam
Domain 4: Model Deployment
This section covers how to deploy machine learning models for batch, realtime, and streaming inference. It also includes using pandas for batch inference, Delta Live Tables for streaming inference, and deploying and querying models for realtime inference, including splitting data between endpoints for realtime inference.
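The idea of splitting realtime traffic between model versions can be illustrated with a small simulation. This is a conceptual sketch only: on Databricks the split is set in the serving endpoint configuration rather than in user code, and the names `route_requests`, `model_v1`, and `model_v2`, as well as the 90/10 split, are hypothetical.

```python
import random
from collections import Counter


def route_requests(n_requests, traffic_split, seed=0):
    """Simulate percentage-based routing of requests to served models."""
    rng = random.Random(seed)  # seeded so the simulation is repeatable
    names = list(traffic_split)
    weights = [traffic_split[name] for name in names]
    # Each request is routed independently, in proportion to the weights.
    return Counter(rng.choices(names, weights=weights, k=n_requests))


# Hypothetical split: 90% of traffic to the current model, 10% to a candidate.
split = {"model_v1": 90, "model_v2": 10}
counts = route_requests(1000, split)
print(counts)
```

Because routing is random per request, the observed counts only approximate the configured percentages; the approximation tightens as the number of requests grows.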
Key Terms to Know
These terms are loaded from the shared terminology pack and appear across the question explanations.
- AutoML: A Databricks capability that facilitates model and feature selection and improves the model development process.
- Databricks: A platform used in the exam context to perform machine learning tasks and work with tools such as AutoML, Unity Catalog, MLflow, and Delta Live Tables.
- Databricks Certified Machine Learning Associate: A Databricks certification exam that assesses the ability to use Databricks for basic machine learning tasks, including data exploration, feature engineering, model training, tuning, evaluation, and deployment.
- Delta Live Tables: A Databricks data pipeline feature used in the exam for data management and for streaming inference.
- F1: A classification metric used in the exam; the harmonic mean of precision and recall.
- GridSearchCV: A scikit-learn tool for exhaustive hyperparameter search with cross-validation.
- Hyperopt: A hyperparameter tuning library referenced in the exam, including its fmin operation.
- IQR: Interquartile range, used in the exam as one method for removing outliers from a Spark DataFrame.
- Log Loss: A classification metric used in the exam; it scores predicted probabilities and penalizes confident wrong predictions most heavily.
- MAE: Mean absolute error, a regression metric used in the exam.
- ML runtimes: Databricks machine learning runtime environments whose advantages are tested in the exam.
- MLOps: A machine learning operations strategy whose best practices are identified in the exam.
- MLflow: A machine learning lifecycle tool referenced in the exam for logging metrics, artifacts, and models, inspecting runs in the UI, and registering models through its client API.
- MLflow Client API: The programmatic interface used to identify the best run, log metrics, artifacts, and models, and register models in the Unity Catalog registry.
- MLflow Run: A single execution record in MLflow where metrics, artifacts, and models can be logged manually.
- MLflow UI: The user interface in MLflow where information about runs and related model-development details can be viewed.
- R-squared: A regression metric used in the exam; the proportion of variance in the target that the model explains.
- RMSE: Root mean squared error, a regression metric used in the exam.
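The evaluation metrics in the list above (MAE, RMSE, R-squared, F1, and log loss) follow standard textbook definitions, sketched here in plain Python. The function names and toy data are illustrative assumptions; in practice these values would come from a library evaluator rather than hand-written code.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def f1(y_true, y_pred):
    """Harmonic mean of precision and recall for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def log_loss(y_true, probs, eps=1e-15):
    """Average negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += t * math.log(p) + (1 - t) * math.log(1 - p)
    return -total / len(y_true)


# Hypothetical regression targets and predictions.
actual, predicted = [3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0]
print(mae(actual, predicted), rmse(actual, predicted), r_squared(actual, predicted))

# Hypothetical binary labels, hard predictions, and predicted probabilities.
labels, preds = [1, 1, 1, 0, 0], [1, 1, 0, 1, 0]
print(f1(labels, preds), log_loss(labels, [0.5] * 5))
```

For MAE, RMSE, and log loss, lower is better; for R-squared and F1, higher is better.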
Official Materials and Guidance
This page is built from official Databricks materials and the ExamPal shared release pack: the shared syllabus, topic tree, terminology pack, free pack, and premium pack.
- Databricks ML Associate Exam Guide
- Guidance: Official Databricks exam guide PDF with sample questions
- Domain outline: Databricks Machine Learning 38%; ML Workflows 19%; Model Development 31%; Model Deployment 12%.