DP-100 Exam Prep
DP-100 Exam Glossary - 40 Terms
Terminology for the DP-100 exam, Designing and Implementing a Data Science Solution on Azure. Use these definitions with the study guide and practice questions.
A
- artifact
- A file or folder produced or used by an ML run, such as images, models, or output datasets.
- autoscaling
- The ability of a compute resource to automatically increase or decrease the number of nodes based on workload demand.
- Azure Machine Learning Designer
- A visual interface in Azure ML used to build, configure, and run machine learning pipelines without extensive coding.
- Azure ML component
- A reusable, versionable unit of work in Azure ML pipelines that encapsulates code, environment, inputs, and outputs.
- Azure ML Python SDK v2
- The version 2 Python software development kit (the azure-ai-ml package) used to interact programmatically with Azure Machine Learning resources.
- Azure ML workspace
- The central Azure Machine Learning resource that stores assets, runs, compute targets, and configuration for ML projects.
B
- binary classification
- A supervised learning task in which a model predicts one of two possible classes.
C
- compute cluster
- An Azure ML compute target made up of multiple nodes that can run training or batch workloads.
- Conda configuration file
- A YAML file that specifies Conda packages and dependencies required for an ML environment.
- conda_file
- A parameter of the Environment class that points to the Conda YAML specification used to build an Azure ML environment.
- config.json
- A workspace configuration file that stores the subscription ID, resource group, and workspace name that SDK code needs to connect to Azure Machine Learning.
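As an illustration of the config.json entry above, the following minimal sketch writes and reads a file with the three keys such a file normally holds (subscription_id, resource_group, workspace_name). The values are placeholders and the helper name load_workspace_config is hypothetical. In SDK v2 code, MLClient.from_config would typically read this file for you.

```python
import json
import tempfile
from pathlib import Path

# Keys that a workspace config.json is expected to contain.
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def load_workspace_config(path):
    """Read a config.json file and check that the expected keys exist.

    Hypothetical helper for illustration; it is not part of the Azure ML SDK.
    """
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"config.json is missing: {', '.join(missing)}")
    return config


# Placeholder values only; a real file holds your own workspace details.
example = {
    "subscription_id": "00000000-0000-0000-0000-000000000000",
    "resource_group": "example-resource-group",
    "workspace_name": "example-workspace",
}

with tempfile.TemporaryDirectory() as folder:
    config_path = Path(folder) / "config.json"
    config_path.write_text(json.dumps(example, indent=2), encoding="utf-8")
    loaded = load_workspace_config(config_path)

print(loaded["workspace_name"])
```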
D
- data leakage
- The unintended use of information from evaluation or future data during training, leading to overly optimistic performance.
- differential privacy
- A privacy-preserving technique that adds statistical noise to outputs to protect individual records in a dataset.
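To make the differential privacy entry concrete, here is a minimal sketch of one common approach, the Laplace mechanism, applied to a counting query. The glossary does not name a specific mechanism, so the mechanism, the function names, and the sample data are assumptions for illustration only.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise.

    The difference of two independent exponential draws with mean `scale`
    follows a Laplace distribution centered at zero with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng):
    """Return a noisy count of the records that match `predicate`.

    A counting query changes by at most 1 when one record is added or
    removed, so its sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical sample data: ages of eight people.
ages = [23, 35, 41, 29, 52, 61, 38, 47]
rng = random.Random(0)
noisy = private_count(ages, lambda age: age >= 40, epsilon=1.0, rng=rng)
print(f"noisy count of people aged 40 or over: {noisy:.2f}")
```

A smaller epsilon gives a larger noise scale, which means stronger privacy but a less accurate answer.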
E
- Environment
- An Azure ML SDK v2 class used to define the software environment, including dependencies, for training or inference.
F
- fairness
- The principle of ensuring that model outcomes and performance do not systematically disadvantage protected groups.
G
- generalization
- A model’s ability to perform well on unseen data rather than only on data used during training.
H
- hold-out set
- A portion of data kept separate from training and used only for evaluation to measure how well a model generalizes.
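The hold-out set entry above can be sketched as a simple shuffled split. The function name split_hold_out, the 20% fraction, and the seed are assumptions chosen for illustration; real projects often use a library routine for this.

```python
import random


def split_hold_out(rows, hold_out_fraction=0.2, seed=42):
    """Shuffle the rows and return (training_rows, hold_out_rows)."""
    shuffled = list(rows)
    # A fixed seed makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_hold_out = int(round(len(shuffled) * hold_out_fraction))
    return shuffled[n_hold_out:], shuffled[:n_hold_out]


train_rows, hold_out_rows = split_hold_out(range(10))
print(len(train_rows), len(hold_out_rows))
```

Keeping the hold-out rows out of every training and tuning step is what guards against the data leakage described earlier in this glossary.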
I
- image classification
- A machine learning task in which a model predicts the category of an input image.
- Import Data
- A Designer component used to bring external data, such as a CSV file from a website, into a pipeline.
- init()
- A required function in a ParallelRunStep entry script used to initialize resources, such as loading the model, once before processing begins.
- input tokens
- The individual text units, such as words or subwords, that are processed by a language model.
M
- machine translation
- A natural language processing task where a model translates text from one language into another.
- ml_client.data.get
- A Python SDK v2 method used to retrieve a registered data asset from an Azure ML workspace by name and version (or label).
- MLflow
- An open-source platform for tracking experiments, logging metrics and artifacts, and managing ML lifecycle tasks.
- mlflow.log_artifact
- An MLflow function used to log a file or directory, such as a folder of images, as an experiment artifact.
- mlflow.log_dict
- An MLflow function used to log a dictionary, such as RGB values, as a JSON or YAML artifact file during a run.
- mlflow.log_metric
- An MLflow function used to record numeric metrics, including custom telemetry values, during a run.
- MLOps
- A set of practices for automating, managing, deploying, monitoring, and retraining machine learning systems.
- MLTable
- An Azure Machine Learning data asset format used to define tabular or file-based datasets for ML workflows.
N
- node
- An individual virtual machine within a compute cluster.
P
- ParallelRunStep
- An Azure ML pipeline step used for scalable parallel batch inference over large datasets.
- performance metrics
- Quantitative measures used to assess how well a machine learning model performs.
- pipeline
- A sequence of connected ML workflow steps used to automate data preparation, training, evaluation, or deployment.
- protected groups
- Population groups defined by sensitive attributes such as ethnicity or gender that are monitored for fairness.
R
- registered data asset
- A dataset or other data resource stored and versioned in an Azure ML workspace for reuse.
- run()
- A required function in a ParallelRunStep entry script that is called for each mini-batch of input data and returns the results.
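The init() and run() entries in this glossary describe the two functions of a ParallelRunStep entry script. The sketch below shows their general shape, assuming file-based input where each mini-batch arrives as a list of file paths. The "model" is a hypothetical stand-in that labels a file by its extension; a real script would load a trained model in init().

```python
# Shared state that init() prepares once and run() reuses for each mini-batch.
model = None


def init():
    """Called once per worker process before any mini-batch is handled."""
    global model

    # Hypothetical stand-in for loading a real model from disk.
    def label_by_extension(file_path):
        return "image" if file_path.endswith(".png") else "other"

    model = label_by_extension


def run(mini_batch):
    """Called for each mini-batch; returns one result per input item."""
    results = []
    for file_path in mini_batch:
        results.append(f"{file_path}: {model(file_path)}")
    return results


# Local usage example; in a pipeline the service calls these functions.
init()
print(run(["a.png", "b.txt"]))
```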
S
- selection rate
- A fairness metric that measures the fraction of a group's members to whom a model assigns a favorable outcome.
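The selection rate entry above can be worked through by hand on a tiny example. The function name selection_rates, the group labels, and the predictions are made up for illustration; fairness toolkits provide equivalent metrics.

```python
from collections import defaultdict


def selection_rates(predictions, groups, favorable=1):
    """Return the fraction of favorable predictions for each group."""
    selected = defaultdict(int)
    total = defaultdict(int)
    for prediction, group in zip(predictions, groups):
        total[group] += 1
        if prediction == favorable:
            selected[group] += 1
    return {group: selected[group] / total[group] for group in total}


# Hypothetical predictions (1 = favorable) for two groups of four people.
predictions = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

rates = selection_rates(predictions, groups)
print(rates)  # {'a': 0.75, 'b': 0.25}
```

A large gap between groups, as in this example, is the kind of signal a fairness review would investigate further.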
T
- telemetry
- Operational or custom logging data collected from ML systems to monitor behavior and performance.
V
- versioning
- The practice of tracking changes to assets like components, data, or models so specific versions can be reused reliably.
Y
- YAML
- A human-readable configuration format commonly used to define Azure ML components and pipeline settings.