Random matrix analysis reveals capacity bottlenecks in transformer multi-head attention
Assistant Professor Brandon Reagen and Ph.D. candidate Nandan Kumar Jha use random matrix theory to analyze capacity bottlenecks in transformer multi-head attention mechanisms. Their study reveals that a compression technique called multi-head latent attention (MLA) can either preserve or limit learning capacity depending on how it is implemented. One variant, MLA-Decoupled, sustains over 60% normalized rank across all layers, while the rank of standard attention collapses after five layers. The research demonstrates that sharing key representations across attention heads avoids this bottleneck and offers a path toward more efficient AI models.
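To illustrate the normalized-rank metric referenced above, here is a minimal sketch of one way such a measurement might be computed from a layer's attention output matrix. The `normalized_rank` helper, the tolerance, and the tensor shapes are illustrative assumptions, not the authors' implementation.

```python
import torch

def normalized_rank(matrix: torch.Tensor, tol: float = 1e-6) -> float:
    """Numerical rank of `matrix` divided by its maximum possible rank."""
    # Singular values of the activation matrix.
    singular_values = torch.linalg.svdvals(matrix)
    # Count singular values above a tolerance relative to the largest one.
    rank = (singular_values > tol * singular_values.max()).sum().item()
    return rank / min(matrix.shape)

# Hypothetical usage: stand-in for one layer's attention output
# of shape (sequence_length, model_dimension).
attn_out = torch.randn(512, 768)
print(f"normalized rank: {normalized_rank(attn_out):.2f}")
```

A value near 1.0 would indicate that the layer's outputs span close to the full available dimensionality, while a value that drops sharply with depth would correspond to the rank collapse described in the study.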