The UN Corpora website is now live with the files, a basic description and links to tools and this blog.
A quick start:
- Download and unpack the corpus
- Download and install XMLStarlet
- Run the following command (C:\install_path\xml.exe for Windows, xmlstarlet for Unix):
xmlstarlet sel -e utf8 -t -m "//tu[.//hi/@type='lead']" -v "@tuid" -m "tuv" -o "	" -m ".//hi[@type='lead']" -v "." -b -b -n uncorpora_20090710.tmx > phrases.txt
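For readers who would rather not install XMLStarlet, the same extraction can be sketched in Python with the standard library. This is only a rough equivalent of the command above, not part of the corpus tooling: the function name lead_phrase_lines and the two-unit SAMPLE document are made up for illustration, and the sketch assumes the tu, tuv and hi elements carry no XML namespace.

```python
import io
import xml.etree.ElementTree as ET


def lead_phrase_lines(source):
    """Yield one tab-separated line per <tu> that contains a hi[@type='lead'].

    Each line holds the tuid, then one field per <tuv> with the text of its
    lead phrases. `source` is a file name or a binary file object.
    """
    # iterparse streams the document, which matters for a large TMX file.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        if elem.find(".//hi[@type='lead']") is not None:
            fields = [elem.get("tuid", "")]
            for tuv in elem.findall("tuv"):
                leads = tuv.findall(".//hi[@type='lead']")
                fields.append("".join("".join(hi.itertext()) for hi in leads))
            yield "\t".join(fields)
        elem.clear()  # drop the finished unit's children to limit memory use


# Invented sample data, only to show the shape of the output.
SAMPLE = b"""<tmx version="1.4"><body>
<tu tuid="1">
  <tuv xml:lang="EN"><seg><hi type="lead">Recalling</hi> its resolution</seg></tuv>
  <tuv xml:lang="FR"><seg><hi type="lead">Rappelant</hi> sa resolution</seg></tuv>
</tu>
<tu tuid="2"><tuv xml:lang="EN"><seg>No lead phrase here</seg></tuv></tu>
</body></tmx>"""

demo_lines = list(lead_phrase_lines(io.BytesIO(SAMPLE)))
for line in demo_lines:
    print(line)
```

To run it on the real corpus, pass the file name instead of the sample, for example lead_phrase_lines("uncorpora_20090710.tmx"), and write each yielded line to phrases.txt opened with encoding="utf-8".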
