NCP-AII Exam Glossary - 43 Terms

Terminology for the NVIDIA Certified Professional: AI Infrastructure (NCP-AII) exam. Use these definitions with the study guide and practice questions.


#

--gpus flag: A docker run option that selects which GPUs a container can access, for example --gpus all.
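
As a rough illustration of the --gpus flag, the sketch below builds a docker run command line in Python. The helper name docker_gpu_command and the image name are hypothetical; only "all" and a positive GPU count are handled here, and device-list selectors are left out.

```python
def docker_gpu_command(image, gpus="all", command=("nvidia-smi",)):
    """Build an argv list for `docker run` that exposes GPUs via --gpus.

    Hypothetical helper: `gpus` is either "all" or a positive GPU count.
    """
    if gpus != "all" and not (isinstance(gpus, int) and gpus > 0):
        raise ValueError("gpus must be 'all' or a positive integer")
    # --rm removes the container after it exits.
    return ["docker", "run", "--rm", "--gpus", str(gpus), image, *command]


if __name__ == "__main__":
    # The image name is a placeholder, not a real published image.
    print(" ".join(docker_gpu_command("example/cuda-image:latest", gpus=2)))
```

Passing the list to subprocess.run would launch the container on a host with Docker and the NVIDIA container tooling installed.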

A

algBW: In NCCL tests, algorithm bandwidth: the data size of the collective operation divided by the time taken to complete it.

B

busBW: In NCCL tests, bus bandwidth: algBW multiplied by a collective-specific correction factor so that the result reflects traffic on the interconnect and can be compared with hardware peak bandwidth.
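
A minimal sketch of how the two NCCL test figures relate, assuming the all-reduce correction factor of 2(n-1)/n described in the nccl-tests performance notes. The function names are illustrative, and other collectives use different factors.

```python
def alg_bw_gbps(size_bytes, time_s):
    """Algorithm bandwidth in GB/s: data size divided by elapsed time."""
    return size_bytes / time_s / 1e9


def allreduce_bus_bw_gbps(size_bytes, time_s, n_ranks):
    """Bus bandwidth in GB/s for all-reduce: algBW * 2 * (n - 1) / n."""
    if n_ranks < 1:
        raise ValueError("n_ranks must be at least 1")
    return alg_bw_gbps(size_bytes, time_s) * 2 * (n_ranks - 1) / n_ranks


if __name__ == "__main__":
    # Illustrative numbers only: 1 GB reduced in 10 ms across 8 ranks.
    print(round(alg_bw_gbps(1e9, 0.01), 1))
    print(round(allreduce_bus_bw_gbps(1e9, 0.01, 8), 1))
```

With these illustrative inputs the algorithm bandwidth is about 100 GB/s and the factor 2 * 7 / 8 = 1.75 gives a bus bandwidth of about 175 GB/s.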

C

clocks_throttle_reasons.hw_thermal_slowdown: nvidia-smi telemetry field indicating hardware-triggered clock reduction due to thermal limits.
clocks_throttle_reasons.sw_thermal_slowdown: nvidia-smi telemetry field indicating software-triggered clock reduction due to thermal conditions.
Communication hang: A condition where distributed communication stalls and processes stop making progress, often due to transport or synchronization issues.
Container runtime GPU configuration: Settings that control whether and how GPUs are exposed to containers, including runtime flags and environment variables.
CUDA_ERROR_NO_DEVICE: A CUDA driver API error indicating that no CUDA-capable device was detected; the runtime API reports the same condition as cudaErrorNoDevice.
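
The two thermal slowdown fields above can be read with nvidia-smi --query-gpu and --format=csv,noheader. The sketch below parses that CSV text; it assumes each field is reported as "Active" or "Not Active", and the sample text in the example is made up rather than captured from a real system.

```python
import csv
import io

# Fields to pass to: nvidia-smi --query-gpu=<fields> --format=csv,noheader
QUERY_FIELDS = [
    "index",
    "clocks_throttle_reasons.hw_thermal_slowdown",
    "clocks_throttle_reasons.sw_thermal_slowdown",
]


def parse_thermal_throttle(csv_text):
    """Map GPU index -> {"hw": bool, "sw": bool} from nvidia-smi CSV text."""
    result = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        index, hw, sw = (cell.strip() for cell in row)
        # Assumption: an active throttle reason is reported as "Active".
        result[int(index)] = {"hw": hw == "Active", "sw": sw == "Active"}
    return result


if __name__ == "__main__":
    sample = "0, Not Active, Not Active\n1, Active, Not Active\n"  # made-up sample
    print(parse_thermal_throttle(sample))
```

On a GPU host, the CSV text would come from running nvidia-smi with the fields in QUERY_FIELDS joined by commas.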

D

DCGM: NVIDIA Data Center GPU Manager, a toolset for discovering, monitoring, diagnosing, and managing GPUs in data center environments.
DCGM diagnostic Level 2: The minimum DCGM diagnostic level that includes memory stress testing in addition to basic health checks.
dcgmi discovery -l: DCGM CLI command that lists all GPUs discovered in the system.
dcgmi profile: DCGM CLI subcommand for working with GPU profiling metrics, such as listing the metrics a GPU supports and pausing or resuming their collection.
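
A small sketch of how the DCGM commands above might be wrapped in Python. The dictionary keys and the run_dcgm helper are hypothetical; the command lines are dcgmi discovery -l and dcgmi diag -r 2, and the helper returns None when dcgmi is not installed.

```python
import shutil
import subprocess

# Command lines for the DCGM entries in this glossary.
DCGM_COMMANDS = {
    "discovery": ["dcgmi", "discovery", "-l"],    # list discovered GPUs
    "diag_level_2": ["dcgmi", "diag", "-r", "2"],  # run diagnostic level 2
}


def run_dcgm(name):
    """Run a named DCGM command and return its stdout, or None if dcgmi is absent."""
    argv = DCGM_COMMANDS[name]
    if shutil.which(argv[0]) is None:
        return None  # dcgmi is not on PATH on this machine
    completed = subprocess.run(argv, capture_output=True, text=True, check=False)
    return completed.stdout


if __name__ == "__main__":
    print(run_dcgm("discovery"))
```

Keeping the argv lists in one table makes it easy to log exactly which command was run during a diagnostic session.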

E

ECC page retirement: The permanent removal of faulty GPU memory pages from use after ECC detects repeated or severe errors.
Enroot: An unprivileged container runtime commonly used on HPC systems to run containerized workloads without Docker or root access.

G

GPU discovery: The process of detecting and enumerating GPU devices available on a system.
Graphics engine exception: A fault reported by the GPU graphics or compute engine when executing invalid or problematic workload instructions.
GRES: Generic Resource scheduling in Slurm, a mechanism for scheduling specialized resources such as GPUs.
gres.conf: The Slurm configuration file that defines GPU and other GRES resource mappings on compute nodes.
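
To make the GRES entries concrete, the sketch below renders a gres.conf line and a matching --gres job request. The node name, GPU type label, and helper names are hypothetical, and the device files are assumed to be numbered /dev/nvidia0 upward.

```python
def gres_conf_line(node, gpu_type, gpu_count):
    """Render one gres.conf line mapping GPUs on a node to device files."""
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    # Assumption: devices are /dev/nvidia0 .. /dev/nvidia<count-1>.
    devices = "0" if gpu_count == 1 else f"[0-{gpu_count - 1}]"
    return f"NodeName={node} Name=gpu Type={gpu_type} File=/dev/nvidia{devices}"


def gres_request(gpu_count, gpu_type=None):
    """Render the --gres option a job would use to request GPUs."""
    spec = f"gpu:{gpu_type}:{gpu_count}" if gpu_type else f"gpu:{gpu_count}"
    return f"--gres={spec}"


if __name__ == "__main__":
    print(gres_conf_line("node01", "h100", 8))
    print(gres_request(2, "h100"))
```

The type label in the job request has to match the Type value in gres.conf for Slurm to schedule against it.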

H

H100 SXM: An NVIDIA Hopper-generation SXM-form-factor GPU accelerator designed for high-performance AI and HPC workloads.
HBM3: High Bandwidth Memory generation 3, a stacked memory technology providing very high throughput for GPUs.

I

InfiniBand: A high-performance networking technology commonly used in HPC and AI clusters for low-latency, high-throughput communication.
ipmitool chassis power cycle: IPMI command that performs a full chassis power reset by powering the system off and then back on.

M

Memory bandwidth: The rate at which data can be read from or written to GPU memory, typically expressed in GB/s or TB/s.
Memory stress testing: A diagnostic procedure that exercises GPU memory heavily to detect stability or reliability issues.
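
As a worked example of the memory bandwidth definition, theoretical peak bandwidth can be estimated as bus width times per-pin data rate, divided by 8 bits per byte. The function name and the input numbers below are illustrative assumptions, not vendor specifications.

```python
def peak_memory_bandwidth_gbs(bus_width_bits, data_rate_gbps_per_pin):
    """Theoretical peak bandwidth in GB/s: bits per transfer * Gb/s per pin / 8."""
    return bus_width_bits * data_rate_gbps_per_pin / 8


if __name__ == "__main__":
    # Illustrative inputs: a 5120-bit interface at 5.2 Gb/s per pin.
    gbs = peak_memory_bandwidth_gbs(5120, 5.2)
    print(f"{gbs:.0f} GB/s, about {gbs / 1000:.2f} TB/s")
```

Measured bandwidth from a stress test or benchmark is normally lower than this theoretical peak.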

N

NCCL: NVIDIA Collective Communications Library used for multi-GPU and multi-node communication primitives such as all-reduce and broadcast.
NCCL debug environment variables: Configuration variables such as NCCL_DEBUG and related settings used to troubleshoot communication failures and hangs.
NCCL_SOCKET_IFNAME: NCCL environment variable that forces NCCL to use a specified network interface instead of auto-selecting one.
Network interface: A system communication endpoint, such as an Ethernet or InfiniBand device, used for data transfer between hosts.
NGC container: A container image distributed through NVIDIA GPU Cloud, typically prebuilt for AI, HPC, and accelerated computing workloads.
NodeFail: A Slurm job state (NODE_FAIL) indicating that the job terminated because one or more of its allocated nodes failed.
NVIDIA_VISIBLE_DEVICES: Environment variable used in NVIDIA container environments to specify which GPU devices are visible inside a container.
nvidia-smi: NVIDIA System Management Interface command-line tool for querying and controlling GPU status and settings.
nvidia-smi -pm 1: Command that enables GPU persistence mode using NVIDIA SMI.
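
The NCCL variables above are usually set in the environment of the launched job. A minimal sketch, assuming a hypothetical helper nccl_debug_env and an interface name such as eth0 that would differ per cluster:

```python
import os

# Log levels accepted by NCCL_DEBUG.
NCCL_DEBUG_LEVELS = ("VERSION", "WARN", "INFO", "TRACE")


def nccl_debug_env(interface, level="INFO", base_env=None):
    """Return a copy of the environment with NCCL debugging variables set."""
    if level not in NCCL_DEBUG_LEVELS:
        raise ValueError(f"level must be one of {NCCL_DEBUG_LEVELS}")
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = level              # verbosity of NCCL log output
    env["NCCL_SOCKET_IFNAME"] = interface  # interface NCCL should use for sockets
    return env


if __name__ == "__main__":
    env = nccl_debug_env("eth0")  # "eth0" is a placeholder interface name
    print(env["NCCL_DEBUG"], env["NCCL_SOCKET_IFNAME"])
```

The returned dictionary can be passed as the env argument of subprocess.run when launching a training job or an NCCL test.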

O

Out of memory: A failure condition where available GPU memory is insufficient for the requested CUDA or NCCL operation.

P

PCIe: Peripheral Component Interconnect Express, a high-speed interface used to connect GPUs and other devices to the system.
Persistence mode: A GPU driver mode that keeps the NVIDIA driver loaded and devices initialized between application runs.
Port error counters: InfiniBand or network interface statistics that track link-level errors and other communication faults on a port.

R

Retired memory pages: GPU memory pages that have been removed from service after a double-bit ECC error or repeated single-bit ECC errors at the same address.

S

Slurm: An open-source workload manager and job scheduler widely used in HPC and AI clusters.

T

Thermal throttling: Performance reduction caused by temperature limits being reached, forcing the GPU to slow down to protect hardware.

X

Xid 13: An NVIDIA GPU error code typically indicating a graphics engine exception, often associated with an application-side fault.
Xid 63: An NVIDIA GPU error code indicating that an ECC page retirement or row remapping event was recorded because of memory errors.
Xid 79: An NVIDIA GPU error code indicating the GPU has fallen off the bus, usually due to a severe PCIe or hardware issue.
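
Xid codes appear in the kernel log on lines emitted by the NVIDIA driver. The sketch below pulls the code out of such a line and looks it up against the three codes in this glossary. The regular expression assumes the common "NVRM: Xid (PCI:...): N" layout, and the sample line is made up for illustration.

```python
import re

# Meanings for the Xid codes covered in this glossary only.
XID_MEANINGS = {
    13: "Graphics engine exception",
    63: "ECC page retirement or row remapping event",
    79: "GPU has fallen off the bus",
}

# Assumed log layout: "NVRM: Xid (PCI:0000:3b:00): 79, ..."
XID_PATTERN = re.compile(r"NVRM: Xid \((?P<device>[^)]+)\): (?P<code>\d+)")


def parse_xid(line):
    """Return (device, code, meaning) for a kernel log line, or None if no Xid."""
    match = XID_PATTERN.search(line)
    if match is None:
        return None
    code = int(match.group("code"))
    return match.group("device"), code, XID_MEANINGS.get(code, "not in this glossary")


if __name__ == "__main__":
    sample = "kernel: NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus."
    print(parse_xid(sample))
```

Feeding each line of dmesg or journal output through parse_xid gives a quick list of which GPUs reported which errors.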

About These Definitions

These definitions are loaded from the shared release pack. Use them with the study guide and practice questions to connect vocabulary to exam scenarios.
