Databases for Statistical Machine Translation

 

English-Spanish

  • EPPS Word Alignment trial and test data

    The bilingual texts have been extracted from the Final Edition of the European Parliament Proceedings, available from the European Parliament's website.
    For our reference corpus, 500 sentences of at most 100 words were selected at random from the English-Spanish training corpus used for the March 2004 TC-STAR evaluation (data from July 1999 to September 2004). This collection contains 14691 English words and 15458 Spanish words. To facilitate comparisons between partners, the data set has been split into a development corpus of 100 sentence pairs and a test corpus of 400 sentence pairs.
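
    The selection and split described above can be sketched as follows. This is only an illustration, not the procedure actually used to build the corpus: the function name, the fixed seed, whitespace tokenisation, applying the length limit to both languages, and taking the first 100 sampled pairs as the development set are all assumptions.

```python
import random


def select_and_split(pairs, sample_size=500, max_words=100, dev_size=100, seed=0):
    """Sample sentence pairs and split them into development and test sets.

    pairs: list of (english, spanish) sentence strings.
    All parameter names and defaults are illustrative assumptions.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible

    # Assumption: the length limit applies to both sides of a pair,
    # counting whitespace-separated tokens as words.
    eligible = [
        (en, es)
        for en, es in pairs
        if len(en.split()) <= max_words and len(es.split()) <= max_words
    ]
    if len(eligible) < sample_size:
        raise ValueError("not enough sentence pairs within the length limit")

    # Sample without replacement, then cut the sample in two.
    sample = rng.sample(eligible, sample_size)
    return sample[:dev_size], sample[dev_size:]
```

    With the default parameters, the function returns 100 development pairs and 400 test pairs that do not overlap.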