NCP-AAI Practice Q21

A. Select the model with the fastest average response time across tasks, since latency dominates user-perceived quality in interactive agents.

B. Use task-specific evaluation metrics and compare per-task performance.

No single aggregate score is defensible here because the tasks are heterogeneous: classification is typically scored with accuracy, precision/recall, or F1; question answering with exact match and token-level F1; and summarization with ROUGE-style overlap metrics. The correct comparison is therefore to score each candidate separately on each task using the metric matched to that task, then compare those per-task results rather than collapsing unlike outputs into one number.
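The per-task comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the toy data, and the choice of unigram-overlap recall as a stand-in for a ROUGE-style metric are all assumptions made for the example, and a real evaluation would normally use an established metrics library and a proper test set.

```python
from collections import Counter


def accuracy(preds, golds):
    """Classification: fraction of predicted labels that match the gold labels."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def exact_match(pred, gold):
    """QA: 1.0 if the normalized answer strings are identical, else 0.0."""
    return float(pred.strip().lower() == gold.strip().lower())


def _overlap(pred_tokens, gold_tokens):
    """Number of tokens shared by both sequences, counting duplicates."""
    return sum((Counter(pred_tokens) & Counter(gold_tokens)).values())


def token_f1(pred, gold):
    """QA: harmonic mean of token-level precision and recall."""
    p, g = pred.lower().split(), gold.lower().split()
    shared = _overlap(p, g)
    if shared == 0:
        return 0.0
    precision, recall = shared / len(p), shared / len(g)
    return 2 * precision * recall / (precision + recall)


def unigram_recall(pred, gold):
    """Summarization: simplified ROUGE-1-style recall (assumed stand-in)."""
    p, g = pred.lower().split(), gold.lower().split()
    return _overlap(p, g) / len(g) if g else 0.0


def mean(values):
    return sum(values) / len(values)


def evaluate(outputs, references):
    """Score one candidate on each task with the metric matched to that task."""
    qa_pairs = list(zip(outputs["qa"], references["qa"]))
    sum_pairs = list(zip(outputs["summarization"], references["summarization"]))
    return {
        "classification/accuracy": accuracy(
            outputs["classification"], references["classification"]
        ),
        "qa/exact_match": mean([exact_match(p, g) for p, g in qa_pairs]),
        "qa/token_f1": mean([token_f1(p, g) for p, g in qa_pairs]),
        "summarization/unigram_recall": mean(
            [unigram_recall(p, g) for p, g in sum_pairs]
        ),
    }


# Hypothetical toy data, invented purely for illustration.
references = {
    "classification": ["pos", "neg", "pos", "neg"],
    "qa": ["paris", "4"],
    "summarization": ["the cat sat on the mat"],
}
model_a = {
    "classification": ["pos", "neg", "neg", "neg"],
    "qa": ["Paris", "four"],
    "summarization": ["the cat sat"],
}
model_b = {
    "classification": ["pos", "neg", "pos", "neg"],
    "qa": ["London", "4"],
    "summarization": ["a cat"],
}

scores_a = evaluate(model_a, references)
scores_b = evaluate(model_b, references)

# Report the candidates side by side, one row per task metric,
# without collapsing the rows into a single aggregate number.
for metric in scores_a:
    print(f"{metric:32s} A={scores_a[metric]:.2f}  B={scores_b[metric]:.2f}")
```

Keeping one row per task metric makes trade-offs visible: in this toy data one candidate is stronger on classification while the other is stronger on summarization, which a single averaged score would hide.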

C. Count the number of tokens each agent consumes across all tasks.

D. Choose the model with the highest accuracy on any single task as a strong general-purpose proxy for overall performance.

Question 21

Explanation

Why each option is right or wrong