Question 21
Domain 2: Evaluation, Tuning, and Quality Optimization
You’re tasked with selecting the better agent between two candidates based on performance across multiple tasks, including classification, question answering, and summarization. What is the best strategy to compare their performance?
Correct answer: B
Explanation
Use task-specific evaluation metrics, because classification, question answering, and summarization exercise different abilities and are scored against different standards. Comparing per-task performance lets you judge each agent on the right criterion instead of averaging unlike tasks into one score.
Why each option is right or wrong
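A. Select the model with the fastest average response time across tasks, since latency dominates user-perceived quality in interactive agents.
Latency measures speed, not whether the outputs are correct. A fast agent can still misclassify, answer wrongly, or summarize poorly, so response time cannot stand in for task performance.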
B. Use task-specific evaluation metrics and compare per-task performance.
No single aggregate score is defensible here because the tasks are heterogeneous: classification is typically scored with accuracy, precision/recall, or F1; question answering with exact match and token-level F1; and summarization with ROUGE-style overlap metrics. The correct comparison is therefore to score each candidate separately on each task using the metric matched to that task, then compare those per-task results rather than collapsing unlike outputs into one number.
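C. Count the number of tokens each agent consumes across all tasks.
Token consumption reflects cost and verbosity, not output quality. It says nothing about whether a classification, answer, or summary is correct.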
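D. Choose the model with the highest accuracy on any single task as a strong general-purpose proxy for overall performance.
Strong results on one task do not guarantee strong results on the others, and accuracy is not the appropriate metric for question answering or summarization in the first place.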
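The per-task comparison described under option B can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names, the toy references, and the two agents' outputs are hypothetical. `unigram_recall` is a simplified ROUGE-1-style recall, not the official ROUGE implementation. The point is only that each task gets its own metric and the results are compared metric by metric rather than averaged.

```python
from collections import Counter


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def accuracy(preds, golds):
    """Classification: fraction of predictions equal to the gold label."""
    return mean([float(p == g) for p, g in zip(preds, golds)])


def exact_match(pred, gold):
    """QA: 1.0 if the answers match after trimming and lowercasing."""
    return float(pred.strip().lower() == gold.strip().lower())


def token_f1(pred, gold):
    """QA: token-level F1 over whitespace tokens."""
    p, g = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def unigram_recall(pred, gold):
    """Summarization: simplified ROUGE-1-style recall (an approximation)."""
    p, g = Counter(pred.lower().split()), Counter(gold.lower().split())
    return sum((p & g).values()) / max(sum(g.values()), 1)


def evaluate(outputs, refs):
    """Score one agent with the metric matched to each task."""
    qa = list(zip(outputs["qa"], refs["qa"]))
    summ = list(zip(outputs["summarization"], refs["summarization"]))
    return {
        "classification_accuracy": accuracy(
            outputs["classification"], refs["classification"]
        ),
        "qa_exact_match": mean([exact_match(p, g) for p, g in qa]),
        "qa_token_f1": mean([token_f1(p, g) for p, g in qa]),
        "summarization_unigram_recall": mean(
            [unigram_recall(p, g) for p, g in summ]
        ),
    }


def compare(scores_a, scores_b):
    """Report the better agent per metric; deliberately no aggregate score."""
    return {
        m: "A" if scores_a[m] > scores_b[m]
        else "B" if scores_b[m] > scores_a[m]
        else "tie"
        for m in scores_a
    }


# Hypothetical toy data, for illustration only.
references = {
    "classification": ["spam", "ham", "spam", "ham"],
    "qa": ["Paris", "1969"],
    "summarization": ["the cat sat on the mat"],
}
agent_a = {
    "classification": ["spam", "ham", "spam", "spam"],
    "qa": ["Paris", "in 1969"],
    "summarization": ["the cat sat"],
}
agent_b = {
    "classification": ["spam", "ham", "spam", "ham"],
    "qa": ["paris", "1970"],
    "summarization": ["the cat sat on the mat"],
}

scores_a = evaluate(agent_a, references)
scores_b = evaluate(agent_b, references)
winners = compare(scores_a, scores_b)
for metric, winner in winners.items():
    print(f"{metric}: A={scores_a[metric]:.2f} B={scores_b[metric]:.2f} -> {winner}")
```

On this toy data the agents split the metrics: one wins on QA token-level F1 while the other wins on classification and summarization. A single averaged score would hide that split, which is why the per-task view is kept intact.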