Question 22
Domain 1: Agent Architecture, Design, and Development
What NVIDIA platform architecture would best optimize this multimodal pipeline?
Correct answer: B
Explanation
This architecture fits because Triton Inference Server is designed to serve multiple models on NVIDIA GPUs, and TensorRT optimizes each model for low-latency inference. Splitting the pipeline into separate microservices lets independent modalities run in parallel while enabling efficient GPU sharing across models.
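The latency benefit of running independent modalities in parallel can be illustrated with a minimal sketch. Everything in it is an assumption for illustration: the modality names, the 0.2-second latencies, and the run_model function, which is a hypothetical stand-in that sleeps instead of calling a served model. It only shows the scheduling pattern, not a real Triton client.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

# Assumed modalities and per-request latencies in seconds (illustrative only).
MODALITIES = {"image": 0.2, "audio": 0.2, "text": 0.2}


def run_model(modality: str, latency_s: float) -> str:
    """Hypothetical stand-in for a request to one separately served model."""
    time.sleep(latency_s)  # simulate inference latency
    return f"{modality}:done"


def run_sequential() -> Tuple[List[str], float]:
    """Call each model one after another; total time is roughly the sum."""
    start = time.perf_counter()
    results = [run_model(m, t) for m, t in MODALITIES.items()]
    return results, time.perf_counter() - start


def run_parallel() -> Tuple[List[str], float]:
    """Call all models concurrently; total time is roughly the slowest one."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(MODALITIES)) as pool:
        futures = [pool.submit(run_model, m, t) for m, t in MODALITIES.items()]
        results = [f.result() for f in futures]  # preserves submission order
    return results, time.perf_counter() - start


if __name__ == "__main__":
    seq_results, seq_time = run_sequential()
    par_results, par_time = run_parallel()
    print(f"sequential: {seq_time:.2f}s, parallel: {par_time:.2f}s")
```

With independent stages, the parallel path is bounded by the slowest stage rather than by the sum of all stages, which is the advantage option B has over a sequential deployment.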
Why each option is right or wrong
A. Deploy all models on CPU and optimize with multithreading to process modalities in parallel.
B. Deploy each model as a separate TensorRT-optimized microservice on NVIDIA GPUs using Triton Inference Server, with parallel processing of independent modalities and efficient GPU sharing across models.
NVIDIA Triton Inference Server is the intended serving layer for multi-model, multi-framework inference on GPUs. It supports concurrent model execution, dynamic batching, and scheduling of multiple model instances across shared NVIDIA GPU resources. Pairing each stage with TensorRT is appropriate because TensorRT is NVIDIA’s inference optimizer and compiler for low-latency deployment on supported GPUs. Independent modalities can therefore run in parallel as separate services while still sharing the same GPU pool efficiently.
C. Fine-tune a single multimodal model that handles all inputs, deploying on a single GPU.
D. Deploy all models on a single GPU sequentially, using TensorRT optimization for each model.
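The Triton features cited for option B (multiple model instances on a GPU and dynamic batching) are normally set per model in a config.pbtxt file. The sketch below renders such a fragment from Python. The helper name render_triton_config, the model name, and all numeric values are assumptions chosen for illustration, and the fragment omits input and output tensor definitions, so treat it as a starting point rather than a complete configuration.

```python
def render_triton_config(
    name: str,
    max_batch_size: int,
    gpu_instances: int,
    max_queue_delay_us: int,
) -> str:
    """Render a partial Triton config.pbtxt for a TensorRT engine.

    Hypothetical helper: it only formats text and does not validate the
    result against a running Triton server.
    """
    lines = [
        f'name: "{name}"',
        'platform: "tensorrt_plan"',  # TensorRT engine backend
        f"max_batch_size: {max_batch_size}",
        # Several instances of the model can execute concurrently on a GPU.
        "instance_group [",
        "  {",
        f"    count: {gpu_instances}",
        "    kind: KIND_GPU",
        "  }",
        "]",
        # Dynamic batching groups queued requests into larger batches.
        "dynamic_batching {",
        f"  max_queue_delay_microseconds: {max_queue_delay_us}",
        "}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # "image_encoder" and the values below are illustrative assumptions.
    print(render_triton_config("image_encoder", 8, 2, 100))
```

One such file per model keeps each modality independently tunable, for example by giving a heavier model more instances or a longer batching delay.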