Question 11
Section 4
A product team wants to compare two GenAI prompt designs for support-answer quality. Which evaluation approach is strongest?
Correct answer: C
Explanation
Representative test cases ensure the prompts are measured on realistic inputs, while quality criteria define what “support-answer quality” means. Human review and error tracking add judgment and feedback on failures, which is why this approach is strongest for comparing GenAI prompt designs.
Why each option is right or wrong
A. Pick the prompt with the longest answers
Longer answers are not necessarily more accurate, helpful, or safer for support use.
B. Choose the prompt that uses the most technical words
Technical wording measures style, not whether the answer actually solves the customer problem.
C. Use representative test cases, quality criteria, human review, and error tracking
This is the strongest design because it combines a realistic test set, explicit scoring criteria, and human-in-the-loop review. Prompt quality is not a single numeric metric; it must be judged against the actual support scenarios the model will face. In practice, this means running both prompts on the same representative cases, scoring each answer on defined quality dimensions, and tracking failure modes over time, so the comparison rests on observed performance rather than impressions. No statute governs this choice, so the controlling standard is methodological rigor rather than a legal rule.
D. Avoid evaluation because prompts cannot be tested
Prompts can be tested: their outputs can be evaluated with test sets, rubrics, and human reviewers, like other AI outputs.
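Illustrative sketch
The approach in option C can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method. The criteria names, the 1-5 scale, the prompt and case identifiers, the error tags, and every score below are hypothetical placeholders, not real evaluation data. It assumes human reviewers have already scored each answer against the rubric.

```python
from collections import Counter
from dataclasses import dataclass, field
from statistics import mean

# Hypothetical quality criteria; a real rubric would come from the support team.
CRITERIA = ("accuracy", "helpfulness", "safety")


@dataclass
class Review:
    """One human reviewer's judgment of one answer to one test case."""
    prompt_id: str
    case_id: str
    scores: dict  # criterion -> score on an assumed 1-5 scale
    error_tags: list = field(default_factory=list)  # observed failure modes


def summarize(reviews):
    """Aggregate mean score per criterion and failure-mode counts per prompt."""
    summary = {}
    for prompt_id in sorted({r.prompt_id for r in reviews}):
        subset = [r for r in reviews if r.prompt_id == prompt_id]
        summary[prompt_id] = {
            "mean_scores": {c: mean(r.scores[c] for r in subset) for c in CRITERIA},
            "errors": Counter(tag for r in subset for tag in r.error_tags),
            "cases": len({r.case_id for r in subset}),
        }
    return summary


# Placeholder reviews: both prompts are judged on the same representative cases.
reviews = [
    Review("prompt_a", "refund_request",
           {"accuracy": 4, "helpfulness": 5, "safety": 5}),
    Review("prompt_a", "password_reset",
           {"accuracy": 2, "helpfulness": 3, "safety": 5}, ["wrong_steps"]),
    Review("prompt_b", "refund_request",
           {"accuracy": 5, "helpfulness": 4, "safety": 5}),
    Review("prompt_b", "password_reset",
           {"accuracy": 4, "helpfulness": 4, "safety": 5}),
]

if __name__ == "__main__":
    for prompt_id, stats in summarize(reviews).items():
        print(prompt_id, stats["mean_scores"], dict(stats["errors"]))
```

Keeping scores per criterion, rather than collapsing them into one number, reflects the point that prompt quality is not a single metric. The error tags give the team a record of failure modes to track across prompt revisions.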