NCA-GENL Practice Q1

A. To reduce overfitting during training

B. To stabilize training and enable deeper networks

Layer normalization is applied to each hidden-state vector within a transformer block. It computes the mean and variance over the feature dimension of that single vector rather than across a batch, rescales the activations to zero mean and unit variance, and then applies a learned scale and shift. Used alongside the residual connections in the standard transformer architecture, this keeps activation scales and gradients well-behaved across many stacked layers, which is why it stabilizes optimization and makes very deep networks trainable.

C. To compress model representations

D. To speed up inference computation
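
To make the explanation under option B concrete, here is a minimal sketch of layer normalization on one hidden-state vector, written in plain Python. The function name `layer_norm`, the parameters `gamma`, `beta`, and `eps`, and the sample values are illustrative assumptions, not taken from any particular framework or from the exam material.

```python
import math

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """Normalize one hidden-state vector using only its own statistics."""
    n = len(x)
    mean = sum(x) / n
    # Population variance over the feature dimension (no batch involved).
    var = sum((v - mean) ** 2 for v in x) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    # Learned scale and shift; identity values are assumed here for illustration.
    gamma = gamma if gamma is not None else [1.0] * n
    beta = beta if beta is not None else [0.0] * n
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]

# Hypothetical hidden state, and the same vector scaled up 100x.
hidden = [2.0, 4.0, 6.0, 8.0]
normed = layer_norm(hidden)
scaled = layer_norm([v * 100 for v in hidden])
```

In this sketch `normed` and `scaled` come out nearly identical even though the inputs differ in magnitude by a factor of 100. That insensitivity to activation scale is the property that helps keep deep stacks of layers stable during training.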

Question 1

Explanation

Why each option is right or wrong