Two corpus files reposted

I just updated two corpus files (the second and third) because their XML declarations carried an invalid encoding name: utf8 instead of UTF-8.

Unfortunately, XMLStarlet, which I used to pretty-print the XML, accepted utf8 as a valid value and wrote it straight into the files. This did not affect everybody, but at least [...]
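For anyone who wants to check their own copies, here is a minimal sketch of how the declaration could be inspected and repaired. It is not the script used to fix the corpus files; the function names `declared_encoding` and `fix_declaration` are made up for illustration, and the sketch assumes the XML declaration sits at the very start of the text with no byte order mark in front of it.

```python
import re

# Matches the encoding pseudo-attribute inside a leading XML declaration.
# Assumption: the declaration is the first thing in the text (no BOM).
_DECL_ENCODING = re.compile(
    r'(<\?xml\s[^>]*?\bencoding\s*=\s*)(["\'])([^"\']*)\2'
)


def declared_encoding(head):
    """Return the encoding name from the XML declaration, or None."""
    match = _DECL_ENCODING.match(head)
    return match.group(3) if match else None


def fix_declaration(head):
    """Rewrite a 'utf8' encoding label to 'UTF-8'; leave anything else alone."""
    def repl(match):
        name = match.group(3)
        if name.lower() == "utf8":
            name = "UTF-8"
        return f"{match.group(1)}{match.group(2)}{name}{match.group(2)}"

    return _DECL_ENCODING.sub(repl, head, count=1)
```

Only the first line or so of each file needs to be passed in, so the check stays cheap even for large corpus files.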

New corpus version, updated tool and a copy of the paper are now available

The main website has been updated with a number of items, just in time for the MT Summit XII poster session:

The code repository has been updated: specific languages can now be extracted, vote segments and footnotes removed, and in-paragraph annotations flattened. A new version of the corpus has been uploaded, processed by the [...]

Hello UN Corpora

The UN Corpora website is now live with the files, a basic description and links to tools and this blog.

A quick start:

1. Download and unpack the corpus.
2. Download and install XMLStarlet.
3. Run the following command (C:\install_path\xml.exe for Windows, xmlstarlet for Unix): xmlstarlet sel -e utf8 -t -m "//tu[.//hi/@type='lead']" -v "@tuid" -m "tuv" -o [...]
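If XMLStarlet is not at hand, the visible part of that query can be approximated with Python's standard library. This is only a rough sketch: the element and attribute names (`tu`, `tuv`, `hi`, `type`, `tuid`) are taken from the command above, the helper name `lead_units` is hypothetical, the files are assumed not to use XML namespaces, and since the rest of the command is cut off, collecting the plain text of each `tuv` is a guess at what it prints.

```python
import xml.etree.ElementTree as ET


def lead_units(root):
    """Yield (tuid, [tuv texts]) for every <tu> that contains <hi type='lead'>."""
    for tu in root.iter("tu"):
        # Same test as the XPath predicate [.//hi/@type='lead'].
        if tu.find(".//hi[@type='lead']") is None:
            continue
        # Assumption: the flattened text of each <tuv> child is what is wanted.
        texts = ["".join(tuv.itertext()).strip() for tuv in tu.findall("tuv")]
        yield tu.get("tuid"), texts
```

To run it against a file, parse it with `ET.parse(path).getroot()` (where `path` is wherever you unpacked the corpus) and loop over `lead_units(root)`.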