NCA-GENL Practice Q36

A. To reduce computational cost

B. To focus on different types of relationships simultaneously

Multi-head attention runs several attention operations in parallel, each with its own learned projection matrices for queries, keys, and values. In the standard Transformer formulation (Vaswani et al., 2017), each head works in a lower-dimensional subspace of the model’s representation, so different heads can attend to different positions and dependency patterns at the same time instead of forcing a single attention map to capture every relationship with one set of weights. The result is a richer representation: one head may track local syntax while another captures longer-range or semantic links.

C. To handle longer sequences

D. To improve memory efficiency

Question 36

Explanation

Why each option is right or wrong
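
The explanation given under option B can be made concrete with a small sketch. The code below is a minimal illustration and not a reference implementation: it uses only the Python standard library, the projection matrices are random stand-ins for learned weights, and the names multi_head_attention, random_matrix, and the 4-token, 8-dimensional input are all hypothetical choices made for this example.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Stand-in for a learned projection matrix (random here, trained in practice)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def multi_head_attention(x, num_heads, rng):
    """Return (output, attention_maps) for self-attention over x.

    x is a list of seq_len token vectors, each of length d_model.
    """
    d_model = len(x[0])
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads

    head_outputs, attention_maps = [], []
    for _ in range(num_heads):
        # Each head has its own query, key, and value projections.
        w_q = random_matrix(d_model, d_head, rng)
        w_k = random_matrix(d_model, d_head, rng)
        w_v = random_matrix(d_model, d_head, rng)
        q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)

        # Scaled dot-product attention: softmax(Q K^T / sqrt(d_head)) V.
        k_t = [list(col) for col in zip(*k)]
        scores = matmul(q, k_t)
        weights = [softmax([s / math.sqrt(d_head) for s in row]) for row in scores]
        attention_maps.append(weights)
        head_outputs.append(matmul(weights, v))

    # Concatenate the heads along the feature axis, then mix with an output projection.
    concat = [sum((head[i] for head in head_outputs), []) for i in range(len(x))]
    w_o = random_matrix(d_model, d_model, rng)
    return matmul(concat, w_o), attention_maps


rng = random.Random(0)
tokens = random_matrix(4, 8, rng)  # 4 hypothetical tokens, d_model = 8
output, maps = multi_head_attention(tokens, num_heads=2, rng=rng)
```

Here maps[0] and maps[1] are two separate 4 x 4 attention maps computed over the same tokens, which is the "different relationships simultaneously" idea in option B. Each head projects to d_model / num_heads dimensions, and Vaswani et al. (2017) note that this keeps the total computational cost similar to single-head attention at full dimensionality, which suggests the benefit is representational rather than a reduction in compute.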