NCA-GENL Practice Q6

A. To reduce overfitting

B. To stabilize training and enable deeper networks

Layer normalization is applied to the hidden states in each transformer block. It normalizes each token's activations across the feature dimension, so the values stay numerically well-behaved in both the forward and backward passes. This keeps activations and gradients from growing or shrinking erratically as layers are stacked, which makes optimization more reliable and lets deeper models train without diverging.

C. To compress model size

D. To speed up inference
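
The per-token normalization described under option B can be illustrated with a minimal sketch in plain Python. It is not taken from any particular framework: the function name `layer_norm`, the `eps` value, and the sample `hidden_states` are assumptions chosen for illustration, and `gamma` and `beta` stand in for the learnable scale and shift that real implementations usually include.

```python
import math

def layer_norm(token, gamma=None, beta=None, eps=1e-5):
    """Normalize one token's feature vector across its feature dimension."""
    d = len(token)
    # Hypothetical defaults: identity scale and zero shift.
    gamma = gamma if gamma is not None else [1.0] * d
    beta = beta if beta is not None else [0.0] * d

    mean = sum(token) / d
    var = sum((x - mean) ** 2 for x in token) / d  # biased (population) variance
    inv_std = 1.0 / math.sqrt(var + eps)           # eps guards against division by zero

    return [g * (x - mean) * inv_std + b for x, g, b in zip(token, gamma, beta)]

# Two illustrative tokens with very different activation scales.
hidden_states = [
    [0.5, -1.2, 3.0, 0.1],
    [250.0, -400.0, 900.0, 10.0],
]

# Each token is normalized independently of the others.
normalized = [layer_norm(tok) for tok in hidden_states]
```

Both tokens come out with roughly zero mean and unit variance despite their very different input scales, which is the sense in which layer normalization keeps activations in a stable range from one layer to the next.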

Question 6

Explanation

Why each option is right or wrong