Question 36
Domain 2: Core Machine Learning, AI, and Transformer Foundations
In multi-head attention, why are multiple attention heads used?
Correct answer: B
Explanation
Multiple heads let the model attend to different parts of the input at the same time, and each head can learn a distinct pattern or relationship. Option B, "to focus on different types of relationships simultaneously," states exactly this purpose, and it is what improves the model’s ability to capture varied dependencies.
Why each option is right or wrong
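A. To reduce computational cost
Incorrect. In the standard formulation each head works in a reduced dimension (the model dimension divided by the number of heads), so the total cost is similar to that of single-head attention with full dimensionality. The heads are not a cost-saving device.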
B. To focus on different types of relationships simultaneously
Multi-head attention splits the model’s representation into several parallel attention operations, each with its own learned projection matrices for queries, keys, and values. In the standard Transformer formulation (Vaswani et al., 2017), this lets different heads attend to different positions and dependency patterns at the same time, rather than forcing a single attention map to capture all relationships with one set of weights. The result is a richer representation because one head may track local syntax while another captures longer-range or semantic links.
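C. To handle longer sequences
Incorrect. The number of heads does not change how sequence length is handled: every head still computes attention over all pairs of positions, so the cost still grows quadratically with sequence length.
D. To improve memory efficiency
Incorrect. Each head computes and stores its own attention map, so adding heads does not reduce memory use.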
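The mechanism described under option B can be illustrated with a minimal NumPy sketch of multi-head self-attention. It is not taken from any particular library: the function names, the random weights, and the sizes (5 tokens, model dimension 8, 2 heads) are assumptions chosen only for illustration. It shows that each head gets its own slice of the projected queries, keys, and values and therefore produces its own attention map over the same input.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Illustrative self-attention. x: (seq_len, d_model); weights: (d_model, d_model)."""
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q = split_heads(x @ w_q)
    k = split_heads(x @ w_k)
    v = split_heads(x @ w_v)

    # One (seq_len, seq_len) attention map per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)

    # Weighted sum of values per head, then concatenate heads and project.
    heads = weights @ v                                   # (num_heads, seq_len, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 8, 2
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, attn = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape)   # (5, 8)
print(attn.shape)  # (2, 5, 5): a separate attention map for each head
```

Because every head still builds a full seq_len by seq_len map, the sketch also makes it visible why adding heads does not by itself shorten the computation over long sequences or save memory.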