Question 25
Domain 4
You have deployed a scikit-learn model to a Vertex AI endpoint using a custom model server. You enabled autoscaling; however, the deployed model fails to scale beyond one replica, leading to dropped requests. You notice that CPU utilization remains low even during periods of high load. What should you do?
Correct answer: B
Explanation
Vertex AI autoscaling for custom containers commonly uses CPU utilization as the scaling signal, so low CPU can prevent additional replicas even when requests are queued. Increasing the number of workers in the model server lets the container handle more concurrent requests and raise CPU usage, which triggers autoscaling and reduces dropped requests.
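To see why low CPU keeps the deployment at one replica, here is a minimal sketch of a proportional scaling rule. It is a simplified model for illustration, not Vertex AI's actual algorithm: the function name `desired_replicas`, the 60% target, and the replica bounds are all assumptions chosen for the example.

```python
import math


def desired_replicas(current_replicas, cpu_utilization,
                     target_utilization=0.60,  # assumed target, for illustration
                     min_replicas=1, max_replicas=5):
    """Simplified proportional rule: scale so utilization approaches the target.

    This is an illustrative model only, not Vertex AI's exact autoscaler.
    """
    raw = math.ceil(current_replicas * cpu_utilization / target_utilization)
    # Clamp the result to the configured replica bounds.
    return max(min_replicas, min(max_replicas, raw))


# One replica with a single worker: CPU sits at 25% while requests queue.
print(desired_replicas(1, 0.25))  # 1

# Same replica after adding workers: CPU climbs to 90%, so scale-out is requested.
print(desired_replicas(1, 0.90))  # 2
```

Under this model the request queue never enters the calculation, so a server that cannot drive CPU above the target never asks for a second replica.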
Why each option is right or wrong
A. Attach a GPU to the prediction nodes.
B. Increase the number of workers in your model server.
Vertex AI’s autoscaler for custom containers scales on the configured utilization metric; with CPU-based scaling, it adds replicas only once the container’s CPU utilization crosses the target threshold. A model server running a single worker typically processes one request at a time, so it can leave most of the node’s CPU idle and stay below that threshold even while requests queue and are dropped. Increasing the worker count raises in-container concurrency and CPU consumption, which lets autoscaling move past one replica and stops the request drops.
C. Schedule scaling of the nodes to match expected demand.
D. Increase the minReplicaCount in your DeployedModel configuration.
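For option B, the worker count is usually set in the model server's own configuration. The sketch below assumes a Gunicorn-style server, which the question does not specify, and uses Gunicorn's suggested starting point of (2 x cores) + 1 workers. The helper name `suggested_workers` is hypothetical.

```python
import os


def suggested_workers(cpu_count=None):
    """Starting worker count using the (2 x cores) + 1 heuristic.

    Hypothetical helper; tune the result against real load tests.
    """
    cores = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return 2 * cores + 1


# Settings in the style of a gunicorn.conf.py file (assumed server, see above).
workers = suggested_workers()
print(f"workers = {workers}")
```

Two caveats apply. `os.cpu_count()` may report the host's cores rather than the container's CPU limit, so passing the machine type's vCPU count explicitly can be safer. Each worker process also loads its own copy of the model, so memory use grows with the worker count.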