Databricks Certified Machine Learning Associate Study Guide

Use the domain outline to connect the four exam domains (Databricks Machine Learning, Data Processing, Model Development, and Model Deployment) to scenario-based questions and explanations.

How the Exam Is Structured

The Databricks Certified Machine Learning Associate (ML Associate) exam validates skills in Databricks machine learning, data processing, model development, and model deployment. The ExamPal practice bank includes 372 premium questions and 40 free questions mapped across the official blueprint.

Domain | Weight | Focus
Domain 1: Databricks Machine Learning | 29% | Identify the best practices of an MLOps strategy
Domain 2: Data Processing | 29% | Compute summary statistics on a Spark DataFrame using .summary() or dbutils data summaries; Remove outliers from a Spark DataFrame based on standard deviation or IQR
Domain 3: Model Development | 31% | Use ML foundations to select the appropriate algorithm for a given model scenario; Identify methods to mitigate data imbalance in training data
Domain 4: Model Deployment | 11% | Identify the differences and advantages of model serving approaches: batch, realtime, and streaming; Deploy a custom model to a model endpoint

29% of exam

Domain 1: Databricks Machine Learning

This section covers the core Databricks machine learning workflow for the Databricks Certified Machine Learning Associate exam. It emphasizes MLOps strategy, AutoML, Unity Catalog feature store usage, MLflow tracking and model registry, and model promotion practices.

Identify the best practices of an MLOps strategy
Identify the advantages of using ML runtimes
Identify how AutoML facilitates model/feature selection
Identify the advantages AutoML brings to the model development process

29% of exam

Domain 2: Data Processing

This section covers practical data preparation and exploratory analysis tasks in Spark. It includes summary statistics, outlier handling, visualization, feature comparison, missing-value imputation, one-hot encoding, and log-scale transformation decisions.

Compute summary statistics on a Spark DataFrame using .summary() or dbutils data summaries
Remove outliers from a Spark DataFrame based on standard deviation or IQR
Create visualizations for categorical or continuous features
Compare two categorical or two continuous features using the appropriate method
Compare and contrast imputing missing values with the mean or median or mode value
Impute missing values with the mode, mean, or median value
Use one-hot encoding for categorical features
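
The outlier and imputation objectives above come down to a small amount of arithmetic. The following is a minimal sketch in plain Python (standard library only) rather than Spark code; the function names `iqr_bounds`, `remove_outliers_iqr`, and `impute` are hypothetical helpers, and the 1.5 multiplier is the conventional IQR fence, assumed here rather than taken from the exam guide. On a Spark DataFrame the same logic would be applied with DataFrame operations instead of lists.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (low, high) fences at k * IQR below Q1 and above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def remove_outliers_iqr(values, k=1.5):
    """Keep only the values that fall inside the IQR fences."""
    low, high = iqr_bounds(values, k)
    return [v for v in values if low <= v <= high]


def impute(values, strategy="median"):
    """Replace None with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    fill_functions = {
        "mean": statistics.mean,
        "median": statistics.median,
        "mode": statistics.mode,
    }
    fill = fill_functions[strategy](observed)
    return [fill if v is None else v for v in values]


# Illustrative data (made up): one extreme value in each list.
readings = [10, 12, 11, 13, 12, 11, 95]
cleaned = remove_outliers_iqr(readings)

with_gap = [1, 2, None, 2, 100]
filled_mean = impute(with_gap, "mean")      # pulled upward by the value 100
filled_median = impute(with_gap, "median")  # robust to the value 100
filled_mode = impute(with_gap, "mode")      # most frequent observed value
```

Comparing `filled_mean` with `filled_median` shows why the median is usually preferred for skewed numeric features, while the mode is the natural choice for categorical ones.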

31% of exam

Domain 3: Model Development

This section covers the practical skills needed to develop, tune, and evaluate machine learning models. It emphasizes selecting appropriate algorithms, handling imbalanced data, building training pipelines, and using validation strategies and metrics to assess model performance.

Use ML foundations to select the appropriate algorithm for a given model scenario
Identify methods to mitigate data imbalance in training data
Compare estimators and transformers
Develop a training pipeline
Use Hyperopt's fmin operation to tune a model's hyperparameters
Perform random or grid search or Bayesian search as a method for tuning hyperparameters
Parallelize single node models for hyperparameter tuning
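
The difference between grid search and random search in the objectives above can be sketched without any ML library. This is an illustration under stated assumptions: `validation_loss` is a hypothetical stand-in for "train a model with these hyperparameters and return its validation loss", and the parameter names and ranges are made up. Hyperopt's `fmin` fills the same role as the two search functions here, but chooses each new candidate adaptively instead of exhaustively or uniformly at random.

```python
import itertools
import random


def validation_loss(params):
    """Hypothetical objective: lowest at max_depth=6, learning_rate=0.1."""
    return (params["max_depth"] - 6) ** 2 + 10 * (params["learning_rate"] - 0.1) ** 2


def grid_search(grid):
    """Evaluate every combination of the listed values and keep the best."""
    names = list(grid)
    candidates = [
        dict(zip(names, combo)) for combo in itertools.product(*grid.values())
    ]
    return min(candidates, key=validation_loss)


def random_search(space, n_trials, seed=0):
    """Sample n_trials random candidates from the space and keep the best."""
    rng = random.Random(seed)
    candidates = [
        {name: sample(rng) for name, sample in space.items()}
        for _ in range(n_trials)
    ]
    return min(candidates, key=validation_loss)


# Grid search: 4 x 3 = 12 evaluations, limited to the listed values.
grid = {"max_depth": [2, 4, 6, 8], "learning_rate": [0.01, 0.1, 0.3]}
best_grid = grid_search(grid)

# Random search: a fixed budget of 20 evaluations over continuous ranges.
space = {
    "max_depth": lambda rng: rng.randint(2, 10),
    "learning_rate": lambda rng: rng.uniform(0.01, 0.3),
}
best_random = random_search(space, n_trials=20)
```

Grid search cost grows multiplicatively with each added hyperparameter, whereas random search keeps a fixed trial budget; that trade-off is the usual basis for choosing between them.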

11% of exam

Domain 4: Model Deployment

This section covers how to deploy machine learning models for batch, realtime, and streaming inference. It also includes using pandas for batch inference, Delta Live Tables for streaming inference, and deploying and querying models for realtime inference, including splitting data between endpoints for realtime inference.

Identify the differences and advantages of model serving approaches: batch, realtime, and streaming
Deploy a custom model to a model endpoint
Use pandas to perform batch inference
Identify how streaming inference is performed with Delta Live Tables
Deploy and query a model for realtime inference
Split data between endpoints for realtime inference
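
To make the idea of splitting traffic between endpoints concrete, here is a minimal, library-free sketch of weighted routing. It is not how Databricks implements traffic splitting; `make_router` and the endpoint names `model-v1` and `model-v2` are hypothetical, and the 90/10 split is an arbitrary example.

```python
import random
from collections import Counter


def make_router(traffic_split, seed=None):
    """Build a function that picks an endpoint name according to percentages.

    traffic_split maps endpoint name -> integer percentage; must sum to 100.
    """
    if sum(traffic_split.values()) != 100:
        raise ValueError("traffic percentages must sum to 100")
    rng = random.Random(seed)
    names = list(traffic_split)
    weights = list(traffic_split.values())

    def route():
        # Weighted random choice: each request is routed independently.
        return rng.choices(names, weights=weights, k=1)[0]

    return route


# Send roughly 90% of requests to the current model and 10% to a candidate.
route = make_router({"model-v1": 90, "model-v2": 10}, seed=42)
counts = Counter(route() for _ in range(10_000))
```

Because each request is routed independently, the observed shares in `counts` only approximate the configured percentages, and the approximation tightens as request volume grows.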

Key Terms to Know

These terms are loaded from the shared terminology pack and appear across the question explanations.

AutoML
A Databricks capability that facilitates model and feature selection and streamlines the model development process.
Databricks
A platform used in the exam context to perform machine learning tasks and work with tools such as AutoML, Unity Catalog, MLflow, and Delta Live Tables.
Databricks Certified Machine Learning Associate
A Databricks certification exam that assesses the ability to use Databricks for basic machine learning tasks, including data exploration, feature engineering, model training, tuning, evaluation, and deployment.
Delta Live Tables
A Databricks data pipeline feature used in the exam for data management and for streaming inference.
F1
A classification metric used in the exam: the harmonic mean of precision and recall.
GridSearchCV
A scikit-learn tool for exhaustive hyperparameter search with cross-validation.
Hyperopt
A hyperparameter tuning library referenced in the exam, including its fmin operation.
IQR
Interquartile range, used in the exam as one method for removing outliers from a Spark DataFrame.
Log Loss
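A classification metric used in the exam that penalizes predicted probabilities according to how far they are from the true labels; lower is better.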
MAE
Mean absolute error, a regression metric used in the exam.
ML runtimes
Databricks machine learning runtime environments whose advantages are tested in the exam.
MLOps
A machine learning operations strategy whose best practices are identified in the exam.
MLflow
A machine learning lifecycle tool referenced in the exam for logging metrics, artifacts, and models, inspecting runs in the UI, and registering models through its client API.
MLflow Client API
The programmatic interface used to identify the best run, log metrics, artifacts, and models, and register models in the Unity Catalog registry.
MLflow Run
A single execution record in MLflow where metrics, artifacts, and models can be logged manually.
MLflow UI
The user interface in MLflow where information about runs and related model-development details can be viewed.
R-squared
A regression metric used in the exam: the proportion of variance in the target that the model explains.
RMSE
Root mean squared error, a regression metric used in the exam.
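
The metrics in this glossary (MAE, RMSE, R-squared, F1, and Log Loss) can each be written in a few lines from their standard definitions. The sketch below uses plain Python for illustration; the function names are hypothetical, the classification helpers assume binary labels encoded as 0 and 1, and `f1_score` assumes at least one predicted and one actual positive. In practice a library implementation would be used instead.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: average size of the errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors more than MAE."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


def r_squared(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def f1_score(labels, preds):
    """Harmonic mean of precision and recall for binary 0/1 labels."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def log_loss(labels, probs, eps=1e-15):
    """Binary cross-entropy; probs are predicted probabilities of class 1."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(labels)


# Illustrative values (made up) for trying the functions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 11.0]
labels = [1, 1, 1, 0, 0, 0]
preds = [1, 1, 0, 1, 0, 0]
```

MAE and RMSE are in the units of the target and lower is better; R-squared and F1 are unitless and higher is better; Log Loss is computed from probabilities rather than hard class predictions.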

Official Materials and Guidance

This page is built from official Databricks materials and the ExamPal shared release pack: the shared syllabus, topic tree, terminology pack, free pack, and premium pack.

  • Databricks ML Associate Exam Guide
  • Guidance: Official Databricks exam guide PDF with sample questions
  • Domain outline: Databricks Machine Learning 38%; ML Workflows 19%; Model Development 31%; Model Deployment 12%.