Databricks Certified Machine Learning Associate Study Guide

Use the domain outline to connect Databricks machine learning, data processing, model development, and model deployment to scenario-based questions and explanations.

How the Exam Is Structured

The Databricks Certified Machine Learning Associate (ML Associate) exam validates skills in Databricks machine learning, data processing, model development, and model deployment. The ExamPal practice bank includes 372 premium questions and 40 free questions mapped across the official blueprint.

Domain	Weight	Focus
Domain 1: Databricks Machine Learning	29%	Identify the best practices of an MLOps strategy; Identify the advantages of using ML runtimes
Domain 2: Data Processing	29%	Compute summary statistics on a Spark DataFrame using .summary() or dbutils data summaries; Remove outliers from a Spark DataFrame based on standard deviation or IQR
Domain 3: Model Development	31%	Use ML foundations to select the appropriate algorithm for a given model scenario; Identify methods to mitigate data imbalance in training data
Domain 4: Model Deployment	11%	Identify the differences and advantages of model serving approaches: batch, realtime, and streaming; Deploy a custom model to a model endpoint

29% of exam

Domain 1: Databricks Machine Learning

This section covers the core Databricks machine learning workflow for the Databricks Certified Machine Learning Associate exam. It emphasizes MLOps strategy, AutoML, Unity Catalog feature store usage, MLflow tracking and model registry, and model promotion practices.

Identify the best practices of an MLOps strategy

Identify the advantages of using ML runtimes

Identify how AutoML facilitates model/feature selection

Identify the advantages AutoML brings to the model development process

29% of exam

Domain 2: Data Processing

This section covers practical data preparation and exploratory analysis tasks in Spark. It includes summary statistics, outlier handling, visualization, feature comparison, missing-value imputation, one-hot encoding, and log-scale transformation decisions.

Compute summary statistics on a Spark DataFrame using .summary() or dbutils data summaries

Remove outliers from a Spark DataFrame based on standard deviation or IQR

Create visualizations for categorical or continuous features

Compare two categorical or two continuous features using the appropriate method

Compare and contrast imputing missing values with the mean or median or mode value

Impute missing values with the mode, mean, or median value

Use one-hot encoding for categorical features
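
The IQR outlier rule and the mean/median/mode imputation choices listed above can be illustrated with a minimal sketch in plain Python. This is not the Spark API: on a Spark DataFrame the quartiles would typically come from something like `approxQuantile`, and imputation from an imputer stage. The function names, the `k=1.5` fence multiplier, and the sample data are assumptions chosen for illustration.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def remove_outliers(values, k=1.5):
    """Keep only the values that fall inside the IQR fences."""
    lower, upper = iqr_bounds(values, k)
    return [v for v in values if lower <= v <= upper]

def impute(values, strategy="median"):
    """Replace None with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    fill_functions = {
        "mean": statistics.mean,
        "median": statistics.median,
        "mode": statistics.mode,
    }
    fill = fill_functions[strategy](observed)
    return [fill if v is None else v for v in values]

# Hypothetical column with one extreme value.
incomes = [30, 32, 35, 36, 38, 40, 400]
cleaned = remove_outliers(incomes)

# The same extreme value drags the mean far more than the median.
with_gap = [30, None, 35, 36, 400]
mean_filled = impute(with_gap, "mean")
median_filled = impute(with_gap, "median")

# Mode is the usual choice for categorical columns.
colors = impute(["red", "blue", None, "blue"], "mode")
```

The comparison between `mean_filled` and `median_filled` shows why the median is usually preferred for skewed continuous features, while the mode is the natural fill value for categorical ones.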

31% of exam

Domain 3: Model Development

This section covers the practical skills needed to develop, tune, and evaluate machine learning models. It emphasizes selecting appropriate algorithms, handling imbalanced data, building training pipelines, and using validation strategies and metrics to assess model performance.

Use ML foundations to select the appropriate algorithm for a given model scenario

Identify methods to mitigate data imbalance in training data

Compare estimators and transformers

Develop a training pipeline

Use Hyperopt's fmin operation to tune a model's hyperparameters

Perform random or grid search or Bayesian search as a method for tuning hyperparameters

Parallelize single node models for hyperparameter tuning
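
The difference between grid search and random search from the objectives above can be sketched in plain Python. This is a conceptual illustration only and does not use Hyperopt: Hyperopt's `fmin` likewise minimizes an objective over a search space, but its Bayesian-style search picks new trials based on earlier results, which is not shown here. The `validation_loss` function is a hypothetical stand-in for "train a model and return its validation loss", and the parameter names and ranges are assumptions.

```python
import itertools
import random

def validation_loss(params):
    """Hypothetical objective: stands in for training and scoring a model."""
    return (params["max_depth"] - 6) ** 2 + (params["learning_rate"] - 0.1) ** 2

def grid_search(objective, grid):
    """Evaluate every combination in the grid and return the best one."""
    names = list(grid)
    candidates = [
        dict(zip(names, combo)) for combo in itertools.product(*grid.values())
    ]
    return min(candidates, key=objective)

def random_search(objective, space, n_trials, seed=0):
    """Sample n_trials points from the ranges in `space`; return the best one."""
    rng = random.Random(seed)
    trials = [
        {
            "max_depth": rng.randint(*space["max_depth"]),
            "learning_rate": rng.uniform(*space["learning_rate"]),
        }
        for _ in range(n_trials)
    ]
    return min(trials, key=objective)

grid = {"max_depth": [2, 4, 6, 8], "learning_rate": [0.01, 0.1, 0.3]}
space = {"max_depth": (2, 8), "learning_rate": (0.01, 0.3)}

best_grid = grid_search(validation_loss, grid)               # 12 evaluations
best_random = random_search(validation_loss, space, n_trials=20)
```

Grid search cost grows multiplicatively with each added hyperparameter, whereas random search keeps a fixed trial budget. Because each trial is independent, both are easy to parallelize across workers.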

11% of exam

Domain 4: Model Deployment

This section covers how to deploy machine learning models for batch, realtime, and streaming inference. It also includes using pandas for batch inference, Delta Live Tables for streaming inference, and deploying and querying models for realtime inference, including splitting data between endpoints for realtime inference.

Identify the differences and advantages of model serving approaches: batch, realtime, and streaming

Deploy a custom model to a model endpoint

Use pandas to perform batch inference

Identify how streaming inference is performed with Delta Live Tables

Deploy and query a model for realtime inference

Split data between endpoints for realtime inference
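
The idea behind splitting requests between endpoints can be sketched as weighted routing in plain Python. This is a conceptual illustration, not the Databricks serving configuration: the route names, the 90/10 split, and the `make_router` helper are all hypothetical.

```python
import random
from collections import Counter

def make_router(weights, seed=None):
    """Return a function that picks a route name by percentage weight."""
    if sum(weights.values()) != 100:
        raise ValueError("traffic percentages must sum to 100")
    names = list(weights)
    percentages = [weights[name] for name in names]
    rng = random.Random(seed)

    def route(_request):
        # Each request is routed independently of its content.
        return rng.choices(names, weights=percentages, k=1)[0]

    return route

# Hypothetical split: most traffic stays on the current model,
# a small share goes to the candidate.
route = make_router({"model_v1": 90, "model_v2": 10}, seed=42)
counts = Counter(route({"id": i}) for i in range(10_000))
share_v1 = counts["model_v1"] / 10_000
```

Sending only a small share of live traffic to a new model version limits the impact of a bad release while still producing enough requests to compare the two versions.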

Key Terms to Know

These terms are loaded from the shared terminology pack and appear across the question explanations.

AutoML: A Databricks capability that automatically trains and tunes candidate models on a dataset, which facilitates model and feature selection and speeds up the model development process.
Databricks: A platform used in the exam context to perform machine learning tasks and work with tools such as AutoML, Unity Catalog, MLflow, and Delta Live Tables.
Databricks Certified Machine Learning Associate: A Databricks certification exam that assesses the ability to use Databricks for basic machine learning tasks, including data exploration, feature engineering, model training, tuning, evaluation, and deployment.
Delta Live Tables: A Databricks data pipeline feature used in the exam for data management and for streaming inference.
F1: The harmonic mean of precision and recall, a classification metric used in the exam.
GridSearchCV: A scikit-learn tool for exhaustive hyperparameter search with cross-validation.
Hyperopt: A hyperparameter tuning library referenced in the exam, including its fmin operation.
IQR: Interquartile range, used in the exam as one method for removing outliers from a Spark DataFrame.
Log Loss: The negative mean log-likelihood of the true labels under the predicted probabilities, a classification metric used in the exam.
MAE: Mean absolute error, a regression metric used in the exam.
ML runtimes: Databricks machine learning runtime environments whose advantages are tested in the exam.
MLOps: A machine learning operations strategy whose best practices are identified in the exam.
MLflow: A machine learning lifecycle tool referenced in the exam for logging metrics, artifacts, and models, inspecting runs in the UI, and registering models through its client API.
MLflow Client API: The programmatic interface used to identify the best run, log metrics, artifacts, and models, and register models in the Unity Catalog registry.
MLflow Run: A single execution record in MLflow where metrics, artifacts, and models can be logged manually.
MLflow UI: The user interface in MLflow where information about runs and related model-development details can be viewed.
R-squared: The proportion of variance in the target that the model explains, a regression metric used in the exam.
RMSE: Root mean squared error, a regression metric used in the exam.
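
The metrics named in this glossary (MAE, RMSE, R-squared, F1, and Log Loss) can be written out from their standard definitions. The following is a minimal plain-Python sketch for study purposes; the function names and sample values are assumptions, and in practice a library implementation would be used instead.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)

def r_squared(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def f1(y_true, y_pred):
    """Harmonic mean of precision and recall for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def log_loss(y_true, p_pred, eps=1e-15):
    """Negative mean log-likelihood for binary labels and P(label == 1)."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += t * math.log(p) + (1 - t) * math.log(1 - p)
    return -total / len(y_true)

# Hypothetical regression and classification examples.
targets, preds = [3, 5, 7, 9], [2, 5, 8, 11]
labels, predicted = [1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0]
```

RMSE penalizes large errors more heavily than MAE because errors are squared before averaging, which is why the two can rank models differently on data with outliers.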

Official Materials and Guidance

This page is built from official Databricks materials and the ExamPal shared release pack: the shared syllabus, topic tree, terminology pack, free pack, and premium pack.

- Databricks ML Associate Exam Guide
- Guidance: Official Databricks exam guide PDF with sample questions
- Domain outline: Databricks Machine Learning 38%; ML Workflows 19%; Model Development 31%; Model Deployment 12%.
