I just updated two corpus files (the second and third), as they had invalid XML encoding information: they declared utf8 instead of UTF-8.
Unfortunately, XMLStarlet, which I used to pretty-print the XML, accepted utf8 as a valid value and wrote it straight into the file. This did not affect everybody, but at least one person was stumped enough to email, so I felt it should be fixed.
If you would rather not redownload the file and can open large XML files (e.g. in Vim), you can fix the problem yourself: the encoding value sits right at the end of the first line.
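
For anyone who would rather not open the file in an editor at all, here is a minimal sketch of the same fix in Python. It assumes the XML declaration sits entirely on the first line, as it does in these files. The function name and the file names in the usage line are hypothetical, chosen only for illustration. It rewrites the first line and streams the rest, so the file is never loaded into memory as a whole.

```python
import re
import shutil

# Matches encoding="utf8" or encoding='utf8' (any letter case) in the declaration.
BAD_ENCODING = re.compile(rb"""encoding=(["'])utf8\1""", re.IGNORECASE)


def fix_xml_encoding(src_path, dst_path):
    """Copy src_path to dst_path, changing utf8 to UTF-8 in the first line only.

    Assumption: the XML declaration is on the first line of the file.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        first_line = src.readline()
        # Rewrite only the first match, keeping the original quote character.
        dst.write(BAD_ENCODING.sub(rb"encoding=\1UTF-8\1", first_line, count=1))
        # Stream the remainder of the file unchanged.
        shutil.copyfileobj(src, dst)
```

Usage would look like fix_xml_encoding("corpus.xml", "corpus-fixed.xml"), with both names being placeholders. Writing to a new file rather than editing in place keeps the original intact in case something goes wrong.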

It would be really useful to say somewhere which six languages these are. I’m trying to find out if Japanese is one of them, and I’m not getting very far.
The languages are the official languages of the United Nations: Arabic, Chinese, English, French, Russian, Spanish.
Sorry, no Japanese.
Hi
This is probably another stupid question, but is the source language for all of the paragraphs actually English or are there different source languages involved?
It is a good question.
For the United Nations corpus, English is the original source language. For the European Union corpus, it is different.