Secondary abstract: |
An important task in developing machine translation (MT) systems is evaluating their performance. Automatic measures are most commonly used for this task, as manual evaluation is time-consuming and costly. However, performing an objective evaluation is not trivial. Automatic measures such as BLEU, TER, NIST and METEOR have their own weaknesses, while manual evaluations are also problematic, since they are always subjective to some extent.
In this paper we test the influence of the test set on the results of automatic MT evaluation in the subtitling domain. Translating subtitles is a rather specific task for MT, since subtitles are a kind of summarization of spoken text rather than a direct translation of (written) text. An additional problem when translating a language pair that does not include English, in our case Slovene-Serbian, is that the translations are commonly done from English to Serbian and from English to Slovene rather than directly, since most TV productions are originally filmed in English.
All this poses additional challenges for MT and, consequently, for MT evaluation. Automatic evaluation is based on a reference translation, which is usually taken from an existing parallel corpus and designated as the test set. In our experiments, we compare the evaluation results for the same MT system output using three types of test set. In the first round, the test set consists of 4000 subtitles from the SUMAT parallel corpus of subtitles. These subtitles are not direct translations from Serbian to Slovene or vice versa, but are based on an English original. In the second round, the test set consists of 1000 subtitles randomly extracted from the first test set and translated anew, from Serbian to Slovene, based solely on the Serbian written subtitles. In the third round, the test set consists of the same 1000 subtitles, but this time the Slovene translations were obtained by manually correcting the Slovene MT output so that it constitutes a correct translation of the Serbian subtitles.
The results of MT evaluation were calculated for the NIST, BLEU and TER metrics. They were strikingly diverse, even though the system output was always the same: when calculated on the original translations from the parallel corpus, BLEU was 19.47%, TER 65.27% and NIST 5.05; when calculated on the subtitles translated directly from Serbian to Slovene, BLEU was 43.10%, TER 32.91% and NIST 7.78; when calculated on the manually corrected MT output, BLEU (in this case also called hBLEU) was 71.6%, (h)TER 14.1% and (h)NIST 10.62.
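The evaluation setup described above can be illustrated with standard scoring tools. The following is a minimal sketch, not taken from the paper, of how the same MT output could be scored against three different reference sets; it assumes the sacrebleu and nltk packages, and the file names (mt_output.sl, refs_corpus.sl, refs_retranslated.sl, refs_postedited.sl) are hypothetical placeholders.

```python
import sacrebleu
from nltk.translate.nist_score import corpus_nist

def evaluate(hypotheses, references):
    """Score a list of hypothesis lines against a single reference list."""
    bleu = sacrebleu.corpus_bleu(hypotheses, [references])  # corpus-level BLEU (percent)
    ter = sacrebleu.corpus_ter(hypotheses, [references])    # corpus-level TER (percent)
    # NLTK's corpus_nist expects tokenised input: a list of reference token lists
    # per hypothesis, and a list of hypothesis token lists.
    nist = corpus_nist([[r.split()] for r in references],
                       [h.split() for h in hypotheses], n=5)
    return bleu.score, ter.score, nist

# The same MT output is scored three times, once per reference test set.
hypotheses = open("mt_output.sl", encoding="utf-8").read().splitlines()
for ref_file in ("refs_corpus.sl", "refs_retranslated.sl", "refs_postedited.sl"):
    references = open(ref_file, encoding="utf-8").read().splitlines()
    b, t, n = evaluate(hypotheses, references)
    print(f"{ref_file}: BLEU={b:.2f} TER={t:.2f} NIST={n:.2f}")
```

Because only the reference file changes between rounds, any variation in the reported scores reflects the choice of test set rather than the MT system itself, which is the effect the paper measures.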