Bayesian Model Selection, the Marginal Likelihood, and Generalization
ICML 2022
Abstract
How do we compare between hypotheses that are entirely consistent with
observations? The marginal likelihood (aka Bayesian evidence), which represents
the probability of generating our observations from a prior, provides a
distinctive approach to this foundational question, automatically encoding
Occam's razor. Although it has been observed that the marginal likelihood can
overfit and is sensitive to prior assumptions, its limitations for
hyperparameter learning and discrete model comparison have not been thoroughly
investigated. We first revisit the appealing properties of the marginal
likelihood for learning constraints and hypothesis testing. We then highlight
the conceptual and practical issues in using the marginal likelihood as a proxy
for generalization. Namely, we show how the marginal likelihood can be negatively
correlated with generalization, with implications for neural architecture
search, and can lead to both underfitting and overfitting in hyperparameter
learning. We also re-examine the connection between the marginal likelihood and
PAC-Bayes bounds and use this connection to further elucidate the shortcomings
of the marginal likelihood for model selection. We provide a partial remedy
through a conditional marginal likelihood, which we show is more aligned with
generalization, and practically valuable for large-scale hyperparameter
learning, such as in deep kernel learning.
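
For reference, the display below is a minimal sketch of the two central quantities named in the abstract: the marginal likelihood (Bayesian evidence) and the conditional marginal likelihood offered as a partial remedy. The notation here (a dataset D of n points, parameters theta, model M, and a conditioning split of size m) is ours and only illustrates the construction; it is not a verbatim restatement of the paper's definitions.

% Marginal likelihood of model M: the probability of the data under the prior,
% equivalently a product of sequential predictions starting from the prior.
\[
  p(\mathcal{D} \mid \mathcal{M})
  = \int p(\mathcal{D} \mid \theta, \mathcal{M})\, p(\theta \mid \mathcal{M})\, \mathrm{d}\theta
  = \prod_{i=1}^{n} p(\mathcal{D}_i \mid \mathcal{D}_{1:i-1}, \mathcal{M}).
\]

% Conditional marginal likelihood (illustrative form): condition on the first m
% points, then evaluate the evidence for the remaining data under the resulting
% posterior, i.e. a ratio of full to partial marginal likelihoods.
\[
  p(\mathcal{D}_{m+1:n} \mid \mathcal{D}_{1:m}, \mathcal{M})
  = \frac{p(\mathcal{D} \mid \mathcal{M})}{p(\mathcal{D}_{1:m} \mid \mathcal{M})}
  = \int p(\mathcal{D}_{m+1:n} \mid \theta, \mathcal{M})\, p(\theta \mid \mathcal{D}_{1:m}, \mathcal{M})\, \mathrm{d}\theta.
\]

Intuitively, the first expression scores how well the prior alone predicts all of the data (encoding Occam's razor), while the second scores predictions made after updating on part of the data, which is why it can track generalization more closely.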