NCP-AII Exam Prep

NCP-AII Exam Glossary - 43 Terms

Terminology for the NVIDIA Certified Professional: AI Infrastructure (NCP-AII) exam. Use these definitions with the study guide and practice questions.

#

--gpus flag
A container runtime option used to assign GPU resources to a container.
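As a rough illustration of the --gpus flag, the sketch below builds a `docker run` argument list that requests either all GPUs or a fixed count. The helper name `docker_gpu_command` and the image name are hypothetical, and it assumes Docker with the NVIDIA Container Toolkit installed; selecting specific devices (the `device=` form) needs extra quoting and is left out here.

```python
def docker_gpu_command(image, gpus="all", command=("nvidia-smi",)):
    """Build an argv list for `docker run` that exposes GPUs via --gpus.

    gpus is either the string "all" or a positive integer GPU count.
    """
    if gpus != "all":
        if isinstance(gpus, bool) or not isinstance(gpus, int) or gpus < 1:
            raise ValueError("gpus must be 'all' or a positive integer")
    return ["docker", "run", "--rm", "--gpus", str(gpus), image, *command]


if __name__ == "__main__":
    # "example/cuda-image:latest" is a placeholder, not a real image tag.
    print(" ".join(docker_gpu_command("example/cuda-image:latest", gpus=2)))
```

Passing the list to `subprocess.run` avoids shell quoting issues; printing it first is a cheap way to check what will be executed.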

A

algBW
In NCCL tests, the algorithm bandwidth: the data size of the collective operation divided by the time it took to complete.

B

busBW
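In NCCL tests, the bus bandwidth: algBW scaled by a collective-specific factor so that the figure reflects traffic over the communication fabric and can be compared with hardware peak bandwidth regardless of the number of ranks.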
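The relationship between algBW and busBW can be sketched for the all-reduce case. The functions below assume the convention documented for the NCCL tests, where algBW is bytes divided by time and all-reduce busBW is algBW multiplied by 2(n-1)/n for n ranks; other collectives use different factors, and the function names are illustrative only.

```python
def alg_bw_gbps(message_bytes, seconds):
    """Algorithm bandwidth in GB/s: data size divided by elapsed time."""
    return message_bytes / seconds / 1e9


def allreduce_bus_bw_gbps(message_bytes, seconds, n_ranks):
    """Bus bandwidth in GB/s for all-reduce: algBW scaled by 2(n-1)/n."""
    if n_ranks < 2:
        raise ValueError("all-reduce bus bandwidth needs at least 2 ranks")
    factor = 2 * (n_ranks - 1) / n_ranks
    return alg_bw_gbps(message_bytes, seconds) * factor


if __name__ == "__main__":
    # Made-up numbers: 1 GB reduced across 8 ranks in 10 ms.
    print(alg_bw_gbps(1e9, 0.01))
    print(allreduce_bus_bw_gbps(1e9, 0.01, 8))
```

With these made-up inputs, algBW is about 100 GB/s and busBW about 175 GB/s, which shows why busBW rather than algBW is the number to compare against link speed.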

C

clocks_throttle_reasons.hw_thermal_slowdown
nvidia-smi telemetry field indicating hardware-triggered clock reduction due to thermal limits.
clocks_throttle_reasons.sw_thermal_slowdown
nvidia-smi telemetry field indicating software-triggered clock reduction due to thermal conditions.
Communication hang
A condition where distributed communication stalls and processes stop making progress, often due to transport or synchronization issues.
Container runtime GPU configuration
Settings that control whether and how GPUs are exposed to containers, including runtime flags and environment variables.
CUDA_ERROR_NO_DEVICE
A CUDA driver API error code (the runtime API equivalent is cudaErrorNoDevice) indicating that no CUDA-capable GPU device is visible or usable by the application.
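For the two clocks_throttle_reasons fields defined in this section, a minimal sketch of reading them programmatically might look like the following. It assumes nvidia-smi is asked for CSV output without a header and reports each field as "Active" or "Not Active"; the sample text in the usage section is made up, and newer drivers may expose the same information under different field names.

```python
import csv
import io

QUERY_FIELDS = [
    "index",
    "clocks_throttle_reasons.hw_thermal_slowdown",
    "clocks_throttle_reasons.sw_thermal_slowdown",
]


def thermal_query_command():
    """Argv list for querying the two thermal slowdown fields per GPU."""
    return ["nvidia-smi", "--query-gpu=" + ",".join(QUERY_FIELDS),
            "--format=csv,noheader"]


def parse_thermal_throttle(csv_text):
    """Map GPU index to {'hw': bool, 'sw': bool} from CSV query output."""
    result = {}
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) != 3:
            continue  # skip blank or unexpected lines
        index, hw, sw = (cell.strip() for cell in row)
        result[int(index)] = {"hw": hw == "Active", "sw": sw == "Active"}
    return result


if __name__ == "__main__":
    # Made-up sample output for two GPUs.
    sample = "0, Not Active, Not Active\n1, Active, Not Active\n"
    print(parse_thermal_throttle(sample))
```

Keeping the parser separate from the command makes it possible to check the logic on saved output from a node that is no longer reachable.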

D

DCGM
NVIDIA Data Center GPU Manager, a toolset for discovering, monitoring, diagnosing, and managing GPUs in data center environments.
DCGM diagnostic Level 2
The minimum DCGM diagnostic level that includes memory stress testing in addition to basic health checks.
dcgmi discovery -l
DCGM CLI command that lists all GPUs discovered in the system.
dcgmi profile
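DCGM command used to list the GPU profiling metrics and counters a device supports and to pause or resume profiling; the metrics themselves are typically sampled with dcgmi dmon while workloads are running.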
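The DCGM entries in this section can be tied together with a small sketch that builds the discovery and diagnostic commands. It assumes the `dcgmi` CLI is installed and that diagnostic run levels are selected with `-r`; which levels exist depends on the DCGM version, so the helper only checks that the level is a positive integer. The function names are hypothetical.

```python
import shutil
import subprocess


def dcgmi_discovery_command():
    """Argv list that lists the GPUs DCGM has discovered."""
    return ["dcgmi", "discovery", "-l"]


def dcgmi_diag_command(level=2):
    """Argv list that runs DCGM diagnostics at the given run level."""
    if isinstance(level, bool) or not isinstance(level, int) or level < 1:
        raise ValueError("level must be a positive integer")
    return ["dcgmi", "diag", "-r", str(level)]


if __name__ == "__main__":
    if shutil.which("dcgmi") is None:
        # Without DCGM installed, just show what would be run.
        print(" ".join(dcgmi_discovery_command()))
        print(" ".join(dcgmi_diag_command(2)))
    else:
        subprocess.run(dcgmi_discovery_command(), check=False)
        subprocess.run(dcgmi_diag_command(2), check=False)
```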

E

ECC page retirement
The permanent removal of faulty GPU memory pages from use after ECC detects repeated or severe errors.
Enroot
An unprivileged container runtime commonly used on HPC systems to run containerized workloads without Docker or root access.

G

GPU discovery
The process of detecting and enumerating GPU devices available on a system.
Graphics engine exception
A fault reported by the GPU graphics or compute engine when executing invalid or problematic workload instructions.
GRES
Generic RESources in Slurm, a mechanism for scheduling specialized resources such as GPUs.
gres.conf
The Slurm configuration file that defines GPU and other GRES resource mappings on compute nodes.
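To make the gres.conf mapping more concrete, the sketch below parses one illustrative line and counts the device files it names. The node name, GPU type, and device path in the sample are assumptions, and only a simple `[start-end]` range in `File=` is handled; real files can use richer syntax.

```python
import re


def parse_gres_line(line):
    """Split a gres.conf line into a dict of Key=Value settings."""
    line = line.split("#", 1)[0].strip()  # drop trailing comments
    return dict(tok.split("=", 1) for tok in line.split() if "=" in tok)


def count_device_files(file_spec):
    """Count devices in a File= value such as /dev/nvidia[0-7]."""
    match = re.search(r"\[(\d+)-(\d+)\]$", file_spec)
    if match:
        return int(match.group(2)) - int(match.group(1)) + 1
    return 1  # a single device path with no range


if __name__ == "__main__":
    # Illustrative line only; names and paths are assumptions.
    sample = "NodeName=node01 Name=gpu Type=h100 File=/dev/nvidia[0-7]"
    entry = parse_gres_line(sample)
    print(entry["Name"], entry["Type"], count_device_files(entry["File"]))
```

A check like this can help spot a mismatch between the GPU count Slurm is configured to schedule and the count the node actually reports.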

H

H100 SXM
An NVIDIA Hopper-generation SXM-form-factor GPU accelerator designed for high-performance AI and HPC workloads.
HBM3
High Bandwidth Memory generation 3, a stacked memory technology providing very high throughput for GPUs.

I

InfiniBand
A high-performance networking technology commonly used in HPC and AI clusters for low-latency, high-throughput communication.
ipmitool chassis power cycle
IPMI command that performs a full chassis power reset by powering the system off and then back on.

M

Memory bandwidth
The rate at which data can be read from or written to GPU memory, typically expressed in GB/s or TB/s.
Memory stress testing
A diagnostic procedure that exercises GPU memory heavily to detect stability or reliability issues.

N

NCCL
NVIDIA Collective Communications Library used for multi-GPU and multi-node communication primitives such as all-reduce and broadcast.
NCCL debug environment variables
Configuration variables such as NCCL_DEBUG and related settings used to troubleshoot communication failures and hangs.
NCCL_SOCKET_IFNAME
NCCL environment variable that forces NCCL to use a specified network interface instead of auto-selecting one.
Network interface
A system communication endpoint, such as an Ethernet or InfiniBand device, used for data transfer between hosts.
NGC container
A container image distributed through NVIDIA GPU Cloud, typically prebuilt for AI, HPC, and accelerated computing workloads.
NodeFail
A Slurm job state (reported as NODE_FAIL) indicating that the job terminated because one or more of its allocated nodes failed or became unusable during execution.
NVIDIA_VISIBLE_DEVICES
Environment variable used in NVIDIA container environments to specify which GPU devices are visible inside a container.
nvidia-smi
NVIDIA System Management Interface command-line tool for querying and controlling GPU status and settings.
nvidia-smi -pm 1
Command that enables GPU persistence mode using NVIDIA SMI.
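For the NCCL debug variables and NCCL_SOCKET_IFNAME defined in this section, one way to apply them is to build an environment for the launched job. This is a minimal sketch: the helper name is hypothetical, the interface name "eth0" is only an example, and the chosen NCCL_DEBUG_SUBSYS value is one reasonable setting rather than a required one.

```python
import os


def nccl_debug_env(interface, base_env=None, subsystems="INIT,NET"):
    """Return a copy of the environment with NCCL debugging enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = "INFO"                # verbose NCCL logging
    env["NCCL_DEBUG_SUBSYS"] = subsystems     # limit logs to these subsystems
    env["NCCL_SOCKET_IFNAME"] = interface     # pin the socket interface
    return env


if __name__ == "__main__":
    # "eth0" is an example; use the interface that exists on the node.
    env = nccl_debug_env("eth0", base_env={})
    for key in sorted(env):
        print(f"{key}={env[key]}")
```

The returned dictionary can be passed as `env=` to `subprocess.run` when launching the training or NCCL test command, which keeps the settings scoped to that one process.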

O

Out of memory
A failure condition where available GPU memory is insufficient for the requested CUDA or NCCL operation.

P

PCIe
Peripheral Component Interconnect Express, a high-speed interface used to connect GPUs and other devices to the system.
Persistence mode
A GPU driver mode that keeps the NVIDIA driver loaded and devices initialized between application runs.
Port error counters
InfiniBand or network interface statistics that track link-level errors and other communication faults on a port.

R

Retired memory pages
GPU memory pages that have been removed from service after a double-bit ECC error or repeated single-bit ECC errors at the same location.

S

Slurm
An open-source workload manager and job scheduler widely used in HPC and AI clusters.

T

Thermal throttling
Performance reduction caused by temperature limits being reached, forcing the GPU to slow down to protect hardware.

X

Xid 13
An NVIDIA GPU error code typically indicating a graphics engine exception, often associated with an application-side fault.
Xid 63
An NVIDIA GPU error code recording an ECC page retirement event or, on GPUs that use row remapping, a row remapping event caused by memory errors.
Xid 79
An NVIDIA GPU error code indicating the GPU has fallen off the bus, usually due to a severe PCIe or hardware issue.
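The three Xid codes above usually appear in the kernel log, so a small sketch for pulling them out of log text may help connect the codes to a troubleshooting workflow. The regular expression assumes the common "NVRM: Xid (PCI:...): <code>, ..." line layout, and the sample line is made up; the meanings table covers only the codes in this glossary.

```python
import re

XID_MEANINGS = {
    13: "Graphics engine exception",
    63: "ECC page retirement",
    79: "GPU has fallen off the bus",
}

# Captures the PCI address and the numeric Xid code.
XID_PATTERN = re.compile(r"NVRM: Xid \(PCI:([0-9A-Fa-f:.]+)\): (\d+)")


def find_xids(log_text):
    """Return (pci_address, code, meaning) for each Xid line found."""
    events = []
    for line in log_text.splitlines():
        match = XID_PATTERN.search(line)
        if match:
            code = int(match.group(2))
            meaning = XID_MEANINGS.get(code, "not in this glossary")
            events.append((match.group(1), code, meaning))
    return events


if __name__ == "__main__":
    # Made-up sample line in the assumed format.
    sample = ("[ 123.456] NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, "
              "GPU has fallen off the bus.")
    print(find_xids(sample))
```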
