Exit 22 / 40

Question 22

Domain 2: Core Machine Learning, AI, and Transformer Foundations

Why do decoder-only architectures like GPT use causal (masked) self-attention?