NCP-AAI Practice Q22

A. Deploy all models on CPU and optimize with multithreading to process modalities in parallel.

B. Deploy each model as a separate TensorRT-optimized microservice on NVIDIA GPUs using Triton Inference Server, with parallel processing of independent modalities and efficient GPU sharing across models.

NVIDIA Triton Inference Server is designed for multi-model, multi-framework inference serving on GPUs. It supports concurrent model execution, dynamic batching, and multiple model instances scheduled across shared NVIDIA GPU resources. Pairing each model with TensorRT fits because TensorRT is NVIDIA’s inference optimizer and runtime for low-latency deployment on supported GPUs. Independent modalities can therefore run in parallel as separate services while still sharing the same GPU pool.
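The latency argument for running independent modalities in parallel, as opposed to one after another on the same device, can be illustrated with a small simulation. This is a minimal sketch and not Triton client code: the modality names, the latency figures, and the `infer` helper are all hypothetical, and `asyncio.sleep` stands in for a real request to a model service.

```python
import asyncio
import time

# Hypothetical per-modality latencies in seconds (assumed values, not measurements).
STAGE_LATENCY = {"vision": 0.05, "speech": 0.08, "text": 0.03}


async def infer(modality: str) -> str:
    """Placeholder for one inference request to a separate model service."""
    await asyncio.sleep(STAGE_LATENCY[modality])
    return f"{modality}-result"


async def run_sequential() -> tuple[list[str], float]:
    """Call each model one after another; total time is roughly the sum."""
    start = time.perf_counter()
    results = [await infer(m) for m in STAGE_LATENCY]
    return results, time.perf_counter() - start


async def run_parallel() -> tuple[list[str], float]:
    """Issue all independent requests at once; total time is roughly the slowest one."""
    start = time.perf_counter()
    results = await asyncio.gather(*(infer(m) for m in STAGE_LATENCY))
    return list(results), time.perf_counter() - start


def compare() -> tuple[list[str], float, list[str], float]:
    """Run both strategies and return their results and wall-clock times."""
    seq_results, seq_time = asyncio.run(run_sequential())
    par_results, par_time = asyncio.run(run_parallel())
    return seq_results, seq_time, par_results, par_time


if __name__ == "__main__":
    seq_results, seq_time, par_results, par_time = compare()
    print(f"sequential: {seq_time:.3f}s, parallel: {par_time:.3f}s")
```

Under these assumed latencies the sequential path takes about the sum of the three stages, while the parallel path is bounded by the slowest stage. In a real deployment the gain also depends on how much GPU capacity the models contend for.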

C. Fine-tune a single multimodal model that handles all inputs, deploying on a single GPU.

D. Deploy all models on a single GPU sequentially, using TensorRT optimization for each model.

Question 22

Explanation

Why each option is right or wrong