Question 40
Domain 2: Core Machine Learning, AI, and Transformer Foundations
**Reported in multiple exam experiences:** What is the purpose of masked attention in transformer decoders?
Correct answer: B
Explanation
Masked attention in transformer decoders blocks each position from attending to later positions, so the model cannot use future tokens while predicting the next one. This preserves the autoregressive property: during training, each prediction depends only on the current and earlier tokens, which matches what is available at generation time.
Why each option is right or wrong
A. To reduce computational complexity
B. To prevent attention to future tokens during training
In a transformer decoder, the causal mask is applied so each position can only attend to positions at or before itself, which enforces the autoregressive training objective. This prevents information leakage from later tokens in the sequence; without it, the model would be able to condition on future words that would not be available at generation time.
C. To compress attention weights
D. To handle variable sequence lengths
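The causal mask described under option B can be illustrated with a small sketch. This is a minimal pure-Python illustration, not how any particular framework implements it; the function names `causal_mask` and `masked_softmax` are hypothetical, and the all-zero scores are placeholder values chosen so the effect of the mask is easy to see.

```python
import math

def causal_mask(n):
    """Return an n x n matrix: True where query i may attend to key j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax, with disallowed positions set to -inf beforehand."""
    weights = []
    for row, allowed in zip(scores, mask):
        masked = [s if ok else float("-inf") for s, ok in zip(row, allowed)]
        peak = max(masked)  # always finite: each position may attend to itself
        exps = [math.exp(s - peak) for s in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Placeholder attention scores for a 4-token sequence (all equal).
scores = [[0.0] * 4 for _ in range(4)]
weights = masked_softmax(scores, causal_mask(4))
```

With equal scores, row `i` of `weights` spreads its attention evenly over positions `0..i` and assigns exactly zero weight to every later position, which is the "no attention to future tokens" behavior the correct answer describes.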