Question 35
Domain 2: Core Machine Learning, AI, and Transformer FoundationsIn the context of attention mechanisms, what does "scaled" refer to in scaled dot-product attention?
Correct answer: B
Explanation
“Scaled” refers to dividing the dot product by the square root of the key dimension, written as \(\sqrt{d_k}\). This keeps the attention scores from becoming too large as the key vectors grow, which stabilizes the softmax calculation in scaled dot-product attention.
Why each option is right or wrong
A. Scaling by sequence length
B. Scaling by square root of key dimension
In the original Transformer formulation, the attention score is computed as \(QK^T / \sqrt{d_k}\), where \(d_k\) is the dimensionality of the key vectors; this normalization is the specific meaning of “scaled” in the name. The divisor is \(\sqrt{d_k}\), not an arbitrary constant, and it is used to prevent the raw dot products from growing too large as vector dimension increases, which would otherwise make the softmax overly peaky.
C. Scaling by number of heads
D. Scaling by batch size