Question 3
Domain 2: Core Machine Learning, AI, and Transformer Foundations
Why is Xavier (Glorot) initialization commonly used for initializing weights in deep neural networks for NLP tasks?
Correct answer: C
Explanation
Xavier (Glorot) initialization draws each layer's weights at random, with a scale set by that layer's number of input and output connections (fan-in and fan-out). This helps keep activations and backpropagated gradients from shrinking or exploding across layers. Keeping the variance roughly stable through a deep network makes NLP models easier to train.
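As a rough illustration of the stability claim, the sketch below pushes random inputs through a stack of tanh layers and reports how many last-layer units end up saturated (|activation| > 0.99), where tanh gradients are close to zero. The layer width, depth, batch size, Gaussian inputs, the 0.99 threshold, and the helper names `init_layer` and `saturated_fraction` are all assumptions made for this toy example, not part of the question.

```python
import math
import random

def init_layer(n_in, n_out, scheme, rng):
    """Return an n_out x n_in weight matrix (list of rows) for the given scheme."""
    if scheme == "xavier":
        limit = math.sqrt(6.0 / (n_in + n_out))  # Glorot uniform bound
    else:
        limit = 1.0  # size-independent U(-1, 1), used here as a baseline
    return [[rng.uniform(-limit, limit) for _ in range(n_in)] for _ in range(n_out)]

def saturated_fraction(scheme, width=64, depth=8, batch=16, seed=0):
    """Fraction of last-layer tanh outputs with |a| > 0.99."""
    rng = random.Random(seed)
    # Unit-variance Gaussian inputs (an assumption for this toy example).
    acts = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(batch)]
    for _ in range(depth):
        w = init_layer(width, width, scheme, rng)
        acts = [
            [math.tanh(sum(wi * xi for wi, xi in zip(row, x))) for row in w]
            for x in acts
        ]
    flat = [a for x in acts for a in x]
    return sum(abs(a) > 0.99 for a in flat) / len(flat)

sat_xavier = saturated_fraction("xavier")
sat_baseline = saturated_fraction("uniform_minus1_1")
print(f"saturated units, Xavier:   {sat_xavier:.3f}")
print(f"saturated units, U(-1, 1): {sat_baseline:.3f}")
```

In this setup the size-independent baseline should leave a large share of units saturated, while the Xavier-scaled layers should leave very few. The exact fractions depend on the seed and the sizes chosen.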
Why each option is right or wrong
A. It initializes all weights to zero across every layer to ensure perfectly symmetric learning signals across neurons, guaranteeing that each unit receives identical gradient updates during early training phases
B. It sets all weights to small constant values like 0.01 across the entire network to prevent gradient explosion, ensuring that activations remain bounded throughout the forward propagation process
C. It sets weights proportional to the number of input and output connections, helping maintain consistent variance of activations and gradients across layers
Glorot and Bengio defined this initialization rule in their 2010 paper *Understanding the difficulty of training deep feedforward neural networks*: for a layer with \(fan\_in\) inputs and \(fan\_out\) outputs, weights are drawn from a zero-mean distribution with variance about \(2/(fan\_in+fan\_out)\) (equivalently, from a uniform distribution with bounds \(\pm\sqrt{6/(fan\_in+fan\_out)}\)). This scaling keeps the variance of forward signals and backpropagated errors approximately constant from layer to layer, avoiding the vanishing or exploding behavior that is especially damaging in deep NLP models.
D. It initializes weights randomly from a standard uniform distribution between -1 and 1 regardless of network dimensions, providing sufficient randomness to break symmetry between neurons in each layer
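The scaling rule cited under option C can be sketched in a few lines. This is a minimal, standard-library illustration of the two formulas (uniform bound and normal variance), not any particular framework's implementation. The function names, the example layer sizes, and the seed are assumptions chosen for the example.

```python
import math
import random

def xavier_uniform(fan_in, fan_out, rng):
    """Draw a fan_out x fan_in matrix from U(-limit, limit), limit = sqrt(6 / (fan_in + fan_out))."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_in)] for _ in range(fan_out)]

def xavier_normal(fan_in, fan_out, rng):
    """Draw a fan_out x fan_in matrix from N(0, std^2), std^2 = 2 / (fan_in + fan_out)."""
    std = math.sqrt(2.0 / (fan_in + fan_out))
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

def empirical_variance(matrix):
    """Population variance of all entries in a list-of-rows matrix."""
    flat = [w for row in matrix for w in row]
    mean = sum(flat) / len(flat)
    return sum((w - mean) ** 2 for w in flat) / len(flat)

rng = random.Random(0)
fan_in, fan_out = 256, 128  # hypothetical layer sizes
target_var = 2.0 / (fan_in + fan_out)
limit = math.sqrt(6.0 / (fan_in + fan_out))

w_uniform = xavier_uniform(fan_in, fan_out, rng)
w_normal = xavier_normal(fan_in, fan_out, rng)
var_uniform = empirical_variance(w_uniform)
var_normal = empirical_variance(w_normal)

print(f"target variance:  {target_var:.6f}")
print(f"uniform variant:  {var_uniform:.6f}")
print(f"normal variant:   {var_normal:.6f}")
```

The two variants agree because a uniform distribution on \([-a, a]\) has variance \(a^2/3\), so \(a=\sqrt{6/(fan\_in+fan\_out)}\) gives exactly \(2/(fan\_in+fan\_out)\). Both empirical variances should land close to the target, with small sampling error.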