Question 37
Domain 2: Core Machine Learning, AI, and Transformer Foundations
In multi-head attention, why does the transformer use multiple attention heads instead of a single attention mechanism?
Correct answer: C
Explanation
Multi-head attention lets the model "attend to information from different representation subspaces at different positions," so each head can focus on different relationships in the input. This gives the transformer richer context than a single attention mechanism, which would compress all attention into one view.
Why each option is right or wrong
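A. To reduce the total number of parameters in the model by sharing weight matrices across positions and decomposing large attention operations
Multi-head attention is not a parameter-reduction technique. With the per-head dimension set to d_model / h, Vaswani et al. note that the total computational cost is similar to that of single-head attention with full dimensionality.
B. To speed up inference time by distributing the attention computation across multiple GPUs, enabling hardware-level parallelism for each head
The heads can be computed in parallel, but that is an implementation property rather than the reason for the design, and it does not require a separate GPU per head.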
C. To allow the model to attend to information from different representation subspaces at different positions
Vaswani et al., *Attention Is All You Need* (2017), §3.2.2 defines multi-head attention as linearly projecting the queries, keys, and values h times with different learned projections, applying attention to each projection in parallel, then concatenating the results and projecting once more. A single head produces only one set of attention weights per position, and the paper notes that this averaging inhibits attending to several subspaces at once. With separate heads, each using its own projection matrices, the model can capture distinct positional and semantic relationships simultaneously.
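D. To prevent gradient vanishing during backpropagation by creating multiple independent gradient pathways through the attention layers
Multi-head attention is not designed to address vanishing gradients. In the transformer, the residual connections around each sub-layer, together with layer normalization, are what help gradients flow through deep stacks.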
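
The mechanism described under option C can be illustrated with a minimal NumPy sketch of multi-head self-attention. It is not taken from the paper's code. The function names (`softmax`, `multi_head_attention`), the toy sizes, and the random weights are assumptions chosen for illustration, and the per-head projection matrices are packed into single `d_model x d_model` matrices whose column blocks play the role of the individual heads' projections.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Self-attention over x of shape (seq_len, d_model).

    Illustrative sketch: no masking, no bias terms, no dropout.
    """
    seq_len, d_model = x.shape
    d_head = d_model // num_heads  # each head works in a smaller subspace

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q = split_heads(x @ w_q)
    k = split_heads(x @ w_k)
    v = split_heads(x @ w_v)

    # Scaled dot-product attention, computed independently per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                   # (heads, seq, d_head)

    # Concatenate the heads and apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 16, 4  # toy sizes, assumed for the example
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, weights = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape)      # (5, 16)
print(weights.shape)  # (4, 5, 5): one attention map per head
```

Each slice `weights[i]` is a separate attention map over the same five positions, which is the "different representation subspaces" idea in concrete form. The four weight matrices have the same shapes whether `num_heads` is 1 or 4, so in this sketch splitting into heads does not change the parameter count.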