NCA-GENL Practice Q3

A. It initializes all weights to zero across every layer to ensure perfectly symmetric learning signals across neurons, guaranteeing that each unit receives identical gradient updates during early training phases

B. It sets all weights to small constant values like 0.01 across the entire network to prevent gradient explosion, ensuring that activations remain bounded throughout the forward propagation process

C. It sets weights proportional to the number of input and output connections, helping maintain consistent variance of activations and gradients across layers

Glorot and Bengio’s initialization rule is defined in the 2010 paper *Understanding the difficulty of training deep feedforward neural networks*: for a layer with \(\mathrm{fan\_in}\) input connections and \(\mathrm{fan\_out}\) output connections, weights are drawn with zero mean and variance \(2/(\mathrm{fan\_in}+\mathrm{fan\_out})\), or equivalently from a uniform distribution on \(\pm\sqrt{6/(\mathrm{fan\_in}+\mathrm{fan\_out})}\). The two forms agree because a uniform distribution on \([-a, a]\) has variance \(a^2/3\). This scaling keeps the variance of forward activations and backpropagated gradients approximately constant from layer to layer, which avoids the vanishing or exploding signals that make deep networks, including deep NLP models, hard to train.

D. It initializes weights randomly from a standard uniform distribution between -1 and 1 regardless of network dimensions, providing sufficient randomness to break symmetry between neurons in each layer
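The fan-based scaling described in the explanation for option C can be sketched in a few lines of plain Python. This is a minimal illustration, not any framework's implementation: the names `glorot_uniform` and `sample_variance` and the layer sizes 256 and 128 are assumptions made for the example.

```python
import math
import random


def glorot_uniform(fan_in, fan_out, seed=None):
    """Return a fan_in x fan_out weight matrix drawn from U(-limit, limit).

    limit = sqrt(6 / (fan_in + fan_out)), so each weight has variance
    limit**2 / 3 = 2 / (fan_in + fan_out).
    """
    rng = random.Random(seed)
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [
        [rng.uniform(-limit, limit) for _ in range(fan_out)]
        for _ in range(fan_in)
    ]


def sample_variance(matrix):
    """Population variance of all entries in a nested-list matrix."""
    values = [w for row in matrix for w in row]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


# Hypothetical layer sizes chosen only for illustration.
fan_in, fan_out = 256, 128
weights = glorot_uniform(fan_in, fan_out, seed=0)

limit = math.sqrt(6.0 / (fan_in + fan_out))
target_variance = 2.0 / (fan_in + fan_out)

# The empirical variance should land close to the target variance.
print("target variance:", target_variance)
print("sample variance:", sample_variance(weights))
```

Unlike options A, B, and D, the bound here shrinks as the layer gets wider, which is what ties the weight scale to the network's dimensions.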

Question 3

Explanation

Why each option is right or wrong