NCA-GENL Practice Q35

A. Scaling by sequence length

B. Scaling by square root of key dimension

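In the original Transformer formulation, the attention score is computed as \(QK^T / \sqrt{d_k}\), where \(d_k\) is the dimensionality of the key vectors; this normalization is the specific meaning of “scaled” in the name "scaled dot-product attention". The divisor is \(\sqrt{d_k}\), not an arbitrary constant. If the components of a query and a key are independent with zero mean and unit variance, their dot product has variance \(d_k\), so raw scores grow in magnitude as the vector dimension increases. Dividing by \(\sqrt{d_k}\) brings the variance back to roughly 1 and keeps the softmax from saturating, where it would place almost all weight on one key and yield very small gradients.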

C. Scaling by number of heads

D. Scaling by batch size
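
The effect of the \(\sqrt{d_k}\) divisor described under option B can be checked numerically. The following is a minimal sketch in pure Python, not a reference implementation: it assumes query and key components drawn independently from a standard normal distribution, and the function names (`softmax`, `attention_weights`, `score_variance`) are illustrative rather than taken from any library.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys, scaled=True):
    """Softmax over the dot products q . k_i, optionally divided by sqrt(d_k)."""
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    if scaled:
        scores = [s / math.sqrt(d_k) for s in scores]
    return softmax(scores)


def score_variance(d_k, scaled, trials=2000, seed=0):
    """Empirical variance of q . k for unit-variance Gaussian q and k (assumed setup)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        dot = sum(a * b for a, b in zip(q, k))
        samples.append(dot / math.sqrt(d_k) if scaled else dot)
    mean = sum(samples) / trials
    return sum((s - mean) ** 2 for s in samples) / trials


if __name__ == "__main__":
    for d_k in (4, 64, 256):
        print(d_k, score_variance(d_k, scaled=False), score_variance(d_k, scaled=True))
```

Under these assumptions the unscaled variance should track \(d_k\), while the scaled variance should stay near 1 regardless of dimension. Calling `attention_weights` with `scaled=False` on high-dimensional vectors typically produces a much more concentrated weight distribution than with `scaled=True`. None of the other options (sequence length, number of heads, batch size) appears in the divisor.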

Question 35

Explanation

Why each option is right or wrong