How Much Reading Does Reading Comprehension Require? A Critical Investigation of Popular Benchmarks
EMNLP 2018
Abstract
Many recent papers address reading comprehension, where examples consist of
(question, passage, answer) tuples. Presumably, a model must combine
information from both questions and passages to predict corresponding answers.
However, despite intense interest in the topic, with hundreds of published
papers vying for leaderboard dominance, basic questions about the difficulty of
many popular benchmarks remain unanswered. In this paper, we establish sensible
baselines for the bAbI, SQuAD, CBT, CNN, and Who-did-What datasets, finding
that question- and passage-only models often perform surprisingly well. On
14 out of 20 bAbI tasks, passage-only models achieve greater than 50%
accuracy, sometimes matching the full model. Interestingly, while CBT provides
20-sentence stories, only the last is needed for comparably accurate
prediction.
prediction. By comparison, SQuAD and CNN appear better-constructed.