Scientific Credibility of Machine Translation Research: A Meta-Evaluation of 769 Papers
ACL · Jun 29, 2021 · Outstanding Paper
This paper presents the first large-scale meta-evaluation of machine
translation (MT). We annotated MT evaluations conducted in 769 research papers
published from 2010 to 2020. Our study shows that practices for automatic MT
evaluation have dramatically changed during the past decade and follow
concerning trends. An increasing number of MT evaluations exclusively rely on
differences between BLEU scores to draw conclusions, without performing any
kind of statistical significance testing or human evaluation, while at least
108 metrics claiming to be better than BLEU have been proposed. MT evaluations
in recent papers tend to copy and compare automatic metric scores from previous
work to claim the superiority of a method or an algorithm, without confirming
that exactly the same training, validation, and test data have been used or
that the metric scores are comparable. Furthermore, tools for reporting
standardized metric scores are still far from being widely adopted by the MT
community. After showing how the accumulation of these pitfalls leads to
dubious evaluation, we propose a guideline to encourage better automatic MT
evaluation along with a simple meta-evaluation scoring method to assess its
credibility.
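
As a concrete illustration of the significance-testing gap described above, the following is a minimal sketch (not taken from the paper) of paired bootstrap resampling over corpus-level BLEU, one common way to test whether a BLEU difference between two systems is statistically meaningful. It assumes the sacreBLEU Python package is installed; the toy data, function name, and sample count are illustrative placeholders.

```python
"""Illustrative sketch: paired bootstrap resampling over BLEU differences.

Assumes the `sacrebleu` package; the toy sentences and parameters below
are hypothetical and only serve to demonstrate the procedure.
"""
import random

import sacrebleu


def paired_bootstrap(sys_a, sys_b, refs, n_samples=1000, seed=42):
    """Return an approximate p-value for 'system A is not better than B'."""
    rng = random.Random(seed)
    n = len(refs)
    wins_a = 0
    for _ in range(n_samples):
        # Resample segment indices with replacement, paired across systems.
        idx = [rng.randrange(n) for _ in range(n)]
        a = [sys_a[i] for i in idx]
        b = [sys_b[i] for i in idx]
        r = [refs[i] for i in idx]
        # Corpus-level BLEU on the resampled test set for each system.
        bleu_a = sacrebleu.corpus_bleu(a, [r]).score
        bleu_b = sacrebleu.corpus_bleu(b, [r]).score
        if bleu_a > bleu_b:
            wins_a += 1
    # Fraction of resamples where A does not beat B approximates the p-value.
    return 1.0 - wins_a / n_samples


if __name__ == "__main__":
    # Hypothetical toy data: one hypothesis per system, one reference per segment.
    refs = ["the cat sat on the mat", "he reads a book"]
    sys_a = ["the cat sat on the mat", "he reads the book"]
    sys_b = ["a cat sits on a mat", "he is reading books"]
    print(f"p-value (A > B): {paired_bootstrap(sys_a, sys_b, refs):.3f}")
```

In practice, recent versions of sacreBLEU also ship built-in paired significance tests and standardized score signatures; the hand-rolled loop above only makes the resampling logic explicit for readers unfamiliar with the procedure.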