Question 37
Domain 4: Guidelines for Responsible AI
Which metrics are commonly used to evaluate generative AI model performance?
Correct answer: C
Explanation
ROUGE, BLEU, BERTScore, and LLM-as-a-judge are all listed as technical evaluation metrics for generative outputs, so “All of the above” fits. The source says “ROUGE,” “BLEU,” and “BERTScore” are used for output quality, and the table also includes “LLM-as-a-judge” for generative task scoring.
Why each option is right or wrong
A. BLEU score for translation quality
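BLEU measures machine-translation quality as clipped n-gram precision against reference translations, with a brevity penalty for outputs that are too short. It is a valid metric, but it is not the only common one.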
B. ROUGE score for summarization quality
ROUGE measures how much of a reference summary's n-gram content the generated summary recovers (recall). It is a valid metric, but it is not the only common one.
C. All of the above
BLEU and ROUGE are both commonly used evaluation metrics, so the option that covers all of them is the most complete answer.
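To make the precision versus recall distinction between BLEU and ROUGE concrete, here is a minimal sketch in Python. It is a simplification rather than a reference implementation: it assumes a single reference, whitespace tokenization, and only unigrams and bigrams for the BLEU-style score (BLEU is usually reported with n-grams up to 4 and aggregated over a corpus). The function names `bleu_sketch` and `rouge_n_recall` are illustrative, not taken from any library.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, reference, n):
    """Share of candidate n-grams found in the reference, with counts clipped."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


def brevity_penalty(candidate, reference):
    """Penalize candidates that are shorter than the reference."""
    c, r = len(candidate), len(reference)
    if c == 0:
        return 0.0
    return 1.0 if c > r else math.exp(1 - r / c)


def bleu_sketch(candidate, reference, max_n=2):
    """BLEU-style score: geometric mean of clipped precisions times the brevity penalty."""
    precisions = [clipped_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, reference) * math.exp(log_mean)


def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N recall: share of reference n-grams recovered by the candidate."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


reference = "the cat sat on the mat".split()
candidate = "the cat sat".split()

print(round(bleu_sketch(candidate, reference), 4))  # 0.3679
print(rouge_n_recall(candidate, reference))         # 0.5
```

In this example every n-gram in the short candidate appears in the reference, so the precisions are perfect and only the brevity penalty lowers the BLEU-style score, while ROUGE-1 recall is 0.5 because half of the reference unigrams are missing from the candidate.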