BLEU Metric Discussion

I’ve been doing some research on the BLEU metric and wanted to hear everybody else’s thoughts on it. For anybody interested who hasn’t read about BLEU yet, here is the paper that proposes it as a way to judge machine translation: http://aclweb.org/anthology/P/P02/P02-1040.pdf. A few issues that I have with the BLEU metric:

  1. It requires a corpus to already be translated by humans.
  2. It doesn’t work well on a single sentence; it is really built to be computed across a large corpus.
  3. It assumes that the closer a sentence is to a human translation, the better it is.

Things I like about BLEU:

  1. It is a nice paper
  2. It was first to market, and many other papers use it as a metric.
  3. It does a good job of taking the judging of machine translation models out of human hands, which appears to be how evaluation was done before this metric was introduced.
  4. It is not computationally intensive and seems pretty easy to calculate (still figuring out exactly how, but it doesn’t seem that difficult; see the sketch after this list).
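
Since I mentioned the calculation above, here is a minimal sketch of how BLEU comes together, based on my reading of the paper (my own illustrative Python, not the official scoring script): clipped “modified” n-gram precisions for n = 1..4, combined by a geometric mean and multiplied by a brevity penalty. The test sentences are candidate 1 and the three references from the example in the paper.

```python
# Minimal BLEU sketch (illustrative only): geometric mean of clipped
# "modified" n-gram precisions for n = 1..4, times a brevity penalty.
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    cand_counts = Counter(ngrams(candidate, n))
    # Clip each candidate n-gram count by its maximum count in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / max(sum(cand_counts.values()), 1)

def bleu(candidate, references, max_n=4):
    precisions = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # geometric mean is zero if any n-gram precision is zero
    # Brevity penalty: penalize candidates shorter than the closest reference length.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

# Candidate 1 and the three references from the example in the paper.
candidate = ("it is a guide to action which ensures that the military "
             "always obeys the commands of the party").split()
references = [
    ("it is a guide to action that ensures that the military "
     "will forever heed party commands").split(),
    ("it is the guiding principle which guarantees the military forces "
     "always being under the command of the party").split(),
    ("it is the practical guide for the army always to heed "
     "the directions of the party").split(),
]
print(bleu(candidate, references))
```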

Questions about BLEU:

  1. What other alternatives are there to BLEU today?
  2. Are there any fields that still require a person to validate whether a model’s results are good or not?

Hopefully this is an interesting section to somebody. I definitely think this was a game-changer of a metric; however, it also seems to be running out of steam, since it requires a corpus that already has human translations, and the new papers that just utilize a shared latent space, without direct translations, make this metric impossible to calculate accurately. I’m also wondering if the assumption that a human translation is the best reference will make this metric hard to apply to lesser-known languages, and possibly even to languages that don’t have any translators left. Anything from this list would probably cause issues with the metric: http://www.endangeredlanguages.com/.


I came across this paper which used ROUGE-L.

The paper below was released on 5 March 2018. A lot of work still needs to be done in NLP.
ROUGE 2.0: Updated and Improved Measures for Evaluation of Summarization Tasks
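
For anyone curious what ROUGE-L actually measures, here is a small sketch (my own illustrative Python, not the official ROUGE toolkit): it scores an F-measure over the longest common subsequence (LCS) of candidate and reference tokens, so word order matters but the matched words do not have to be contiguous. The beta weight and the example sentences are just placeholders for illustration.

```python
# Minimal ROUGE-L sketch: F-measure over the longest common subsequence (LCS).
def lcs_length(a, b):
    # Classic dynamic-programming LCS table.
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            table[i][j] = table[i - 1][j - 1] + 1 if x == y else max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]

def rouge_l(candidate, reference, beta=1.2):
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    # beta > 1 weights recall more heavily than precision (value here is arbitrary).
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)

reference = "the cat sat on the mat".split()
candidate = "the cat was sitting on the mat".split()
print(rouge_l(candidate, reference))
```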


I have read BLEU and tried to use it in evaluating the output of three MT systems between English and Arabic. The results show that BLEU performs poorly, since it looks only for exact word matching, ignores the linguistic variation of the sentence, and is unable to recognize synonyms. Therefore, I do not recommend using it.
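
To make the synonym point concrete, here is a tiny example using NLTK’s sentence_bleu (the sentences are made up for illustration): a hypothesis that only swaps in synonyms shares almost no higher-order n-grams with the reference, so the unsmoothed score collapses to essentially zero even though the translation is acceptable (NLTK may also warn about the missing 3-gram and 4-gram overlaps).

```python
from nltk.translate.bleu_score import sentence_bleu

reference = ["the economy grew rapidly last year".split()]
hypothesis = "the economy expanded quickly last year".split()

# Only exact n-gram matches count, so synonym substitutions are punished:
# the 3-gram and 4-gram precisions here are zero, and without smoothing
# the overall score collapses to essentially zero.
print(sentence_bleu(reference, hypothesis))
```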