Question 1
Domain 2: Core Machine Learning, AI, and Transformer Foundations
**Architecture component question:** What is the main purpose of layer normalization in transformer networks?
Correct answer: B
Explanation
Layer normalization rescales the activations within each layer so that their mean and variance stay controlled. This is often described as reducing internal covariate shift, and in practice it keeps activation scales and gradients stable during training. In transformer networks, this makes optimization more reliable and lets the model train effectively at greater depth.
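The rescaling described above can be illustrated with a minimal sketch in plain Python. The function name `layer_norm`, the parameters `gamma`, `beta`, and `eps`, and the sample `hidden` values are all assumptions chosen for illustration, not part of the question or of any particular library.

```python
import math


def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """Normalize one hidden-state vector using only its own statistics.

    gamma and beta stand in for the learned gain and bias; here they
    default to 1 and 0 (hypothetical values for illustration).
    """
    n = len(x)
    mean = sum(x) / n
    # Variance over the feature dimension of this single vector.
    var = sum((v - mean) ** 2 for v in x) / n
    inv_std = 1.0 / math.sqrt(var + eps)  # eps guards against division by zero
    gamma = gamma if gamma is not None else [1.0] * n
    beta = beta if beta is not None else [0.0] * n
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]


# Hypothetical hidden states for two tokens with very different scales.
hidden = [[1.0, 2.0, 3.0, 4.0], [100.0, 200.0, 300.0, 400.0]]
normalized = [layer_norm(h) for h in hidden]
```

Each vector is normalized independently, so both rows end up with roughly zero mean and unit variance even though their input scales differ by a factor of 100. No batch statistics are involved.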
Why each option is right or wrong
A. To reduce overfitting during training
B. To stabilize training and enable deeper networks
Layer normalization is applied to each token's hidden-state vector within a transformer block, using the statistics of that single vector rather than of a batch. The activations are normalized to zero mean and unit variance across the feature dimension and then rescaled by a learned gain and bias. The original transformer applies it after each residual addition, while many later variants apply it before each sublayer. In either placement it keeps gradients and activation scales well-behaved across many stacked layers, which is why it supports stable optimization and makes very deep networks trainable.
C. To compress model representations
D. To speed up inference computation