NCA-GENL Practice Q37

A. To reduce the total number of parameters in the model by sharing weight matrices across positions and decomposing large attention operations

B. To speed up inference time by distributing the attention computation across multiple GPUs, enabling hardware-level parallelism for each head

C. To allow the model to attend to information from different representation subspaces at different positions

Vaswani et al., *Attention Is All You Need* (2017), §3.2.2 defines multi-head attention as linearly projecting the queries, keys, and values several times with different learned projections and running attention on each projection in parallel, so every head has its own parameter matrices. A single head gives each position only one set of attention weights, so different relationships get averaged into one weighted view. Multiple heads let the model capture distinct positional and semantic relationships at the same time, each in its own representation subspace; the head outputs are then concatenated and projected once more.

D. To prevent gradient vanishing during backpropagation by creating multiple independent gradient pathways through the attention layers
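
The mechanism cited for option C (Vaswani et al., §3.2.2) can be illustrated with a minimal sketch. It is not a reference implementation: it uses only the Python standard library, random rather than trained weights, and no masking, batching, or bias terms. The sizes (`seq_len = 4`, `d_model = 8`, `num_heads = 2`) and all helper names (`matmul`, `attention`, `multi_head_attention`, `random_matrix`) are assumptions chosen for illustration.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention: softmax(q k^T / sqrt(d_k)) v."""
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


def multi_head_attention(x, heads, w_o):
    """Run each head with its own projections, concatenate, then reproject."""
    outputs, maps = [], []
    for w_q, w_k, w_v in heads:
        out, weights = attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))
        outputs.append(out)
        maps.append(weights)
    # Concatenate the per-head outputs along the feature dimension.
    concat = [sum((out[i] for out in outputs), []) for i in range(len(x))]
    return matmul(concat, w_o), maps


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned weights: Gaussian random values."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Illustrative sizes (assumptions, not taken from the question).
rng = random.Random(0)
seq_len, d_model, num_heads = 4, 8, 2
d_k = d_model // num_heads  # per-head subspace width

x = random_matrix(seq_len, d_model, rng)
heads = [
    tuple(random_matrix(d_model, d_k, rng) for _ in range(3))  # W_Q, W_K, W_V
    for _ in range(num_heads)
]
w_o = random_matrix(num_heads * d_k, d_model, rng)

y, maps = multi_head_attention(x, heads, w_o)
```

Here `maps` holds one `seq_len × seq_len` attention map per head. Because each head projects the same input through different matrices, the maps generally differ, which is the "different representation subspaces" behavior that option C describes. The heads run on the same input regardless of hardware, so nothing in the sketch depends on distributing work across GPUs.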

Question 37

Explanation

Why each option is right or wrong