Hello UN Corpora

The UN Corpora website is now live, with the corpus files, a basic description, and links to tools and this blog.

A quick start:

  1. Download and unpack the corpus
  2. Download and install XMLStarlet
  3. Run the following command (invoke the tool as C:\install_path\xml.exe on Windows or as xmlstarlet on Unix):
    xmlstarlet sel -e utf8 -t -m "//tu[.//hi/@type='lead']" -v "@tuid" -m "tuv" -o "	" -m ".//hi[@type='lead']" -v "." -b -b -n uncorpora_20090710.tmx > phrases.txt
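
If XMLStarlet is not an option, roughly the same extraction can be sketched in Python with only the standard library. This is not part of the UN Corpora tooling. It assumes the TMX file uses no XML namespace on its tu, tuv and hi elements (as the XPath in the command above implies), and the function name lead_phrases and the script name lead_phrases.py are made up for illustration. Like the command above, it writes one line per translation unit that contains a lead phrase: the tuid, followed by one tab-separated field per tuv.

```python
import sys
import xml.etree.ElementTree as ET


def lead_phrases(tmx_source):
    """Yield one tab-separated line per <tu> that contains a lead <hi>.

    tmx_source may be a file name or a binary file object.
    Assumes un-namespaced <tu>, <tuv> and <hi> elements.
    """
    # iterparse streams the file, so the whole corpus is not parsed up front.
    for _event, elem in ET.iterparse(tmx_source, events=("end",)):
        if elem.tag != "tu":
            continue
        if elem.find(".//hi[@type='lead']") is not None:
            fields = [elem.get("tuid", "")]
            for tuv in elem.findall("tuv"):
                leads = tuv.findall(".//hi[@type='lead']")
                # Concatenate the text of every lead <hi> in this <tuv>.
                fields.append("".join("".join(hi.itertext()) for hi in leads))
            yield "\t".join(fields)
        # Drop the processed unit's children and text to limit memory use.
        elem.clear()


if __name__ == "__main__":
    # Hypothetical usage:
    #   python lead_phrases.py uncorpora_20090710.tmx phrases.txt
    with open(sys.argv[2], "w", encoding="utf-8") as out:
        for line in lead_phrases(sys.argv[1]):
            out.write(line + "\n")
```

Writing the output file explicitly as UTF-8 sidesteps console encoding problems on Windows.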