Why are Sensitive Functions Hard for Transformers?
ACL 2024
Abstract
Empirical studies have identified a range of learnability biases and
limitations of transformers, such as a persistent difficulty in learning to
compute simple formal languages like PARITY, and a bias towards low-degree
functions. However, theoretical understanding remains limited, with existing
expressiveness theory either overpredicting or underpredicting realistic
learning abilities. We prove that, under the transformer architecture, the loss
landscape is constrained by the input-space sensitivity: Transformers whose
output is sensitive to many parts of the input string inhabit isolated points
in parameter space, leading to a low-sensitivity bias in generalization. We
show theoretically and empirically that this theory unifies a broad array of
empirical observations about the learning abilities and biases of transformers,
such as their generalization bias towards low sensitivity and low degree, and
difficulty in length generalization for PARITY. This shows that understanding
transformers' inductive biases requires studying not just their in-principle
expressivity, but also their loss landscape.
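For reference, the notions of PARITY and input-space sensitivity invoked above have standard definitions in the analysis of Boolean functions; the sketch below states the usual formulation for inputs $x \in \{0,1\}^n$ (the exact normalization used in the body of the paper may differ).

% PARITY of an n-bit string: 1 iff the number of ones is odd.
\[
\mathrm{PARITY}(x) \;=\; x_1 \oplus x_2 \oplus \cdots \oplus x_n,
\qquad x \in \{0,1\}^n .
\]
% Sensitivity of a Boolean function f at input x: the number of bit
% positions i whose flip changes the output (x^{\oplus i} denotes x with bit i flipped).
\[
s(f, x) \;=\; \bigl|\{\, i \in [n] \;:\; f(x) \neq f(x^{\oplus i}) \,\}\bigr| .
\]
% Average sensitivity: the expectation of s(f, x) over a uniformly random input x.
\[
\mathrm{as}(f) \;=\; \mathbb{E}_{x \sim \{0,1\}^n}\bigl[\, s(f, x) \,\bigr] .
\]

Under these definitions PARITY is maximally sensitive, with $\mathrm{as}(\mathrm{PARITY}) = n$, since flipping any single input bit flips the output; this is what makes it the canonical hard case for the low-sensitivity bias described in the abstract.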