Hello UN Corpora

The UN Corpora website is now live with the files, a basic description and links to tools and this blog.

A quick start:

  1. Download and unpack the corpus
  2. Download and install XMLStarlet
  3. Run the following command (C:\install_path\xml.exe for Windows, xmlstarlet for Unix):
    xmlstarlet sel -E utf8 -t -m "//tu[.//hi/@type='lead']" -v "@tuid" -m "tuv" -o "	" -m ".//hi[@type='lead']" -v "." -b -b -n uncorpora_20090710.tmx > phrases.txt
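
If XMLStarlet is not available, roughly the same extraction can be sketched in Python with only the standard library. This is a minimal sketch, not part of the corpus tools: the function name `lead_phrases` and the inline sample are made up for illustration, and the TMX layout (un-namespaced `tu` elements with a `tuid` attribute, `tuv` children, and `hi type="lead"` descendants) is assumed from the XPath expressions in the command above.

```python
import io
import xml.etree.ElementTree as ET


def lead_phrases(source):
    """Yield one tab-separated line per <tu> that contains a lead phrase.

    Each line holds the tuid followed by one field per <tuv>, where the
    field is the concatenated text of that tuv's <hi type="lead"> elements.
    `source` is a file name or a binary file object.
    """
    # iterparse streams the file, so the whole corpus is never built in memory.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        if elem.find(".//hi[@type='lead']") is not None:
            fields = [elem.get("tuid", "")]
            for tuv in elem.findall("tuv"):
                fields.append("".join(
                    "".join(hi.itertext())
                    for hi in tuv.findall(".//hi[@type='lead']")
                ))
            yield "\t".join(fields)
        # Drop the unit's children once processed to limit memory use.
        elem.clear()


# Hypothetical two-unit sample; only the first unit has a lead phrase.
SAMPLE = """<tmx><body>
<tu tuid="1">
  <tuv xml:lang="EN"><seg><hi type="lead">Recalling</hi> its resolution</seg></tuv>
  <tuv xml:lang="FR"><seg><hi type="lead">Rappelant</hi> sa resolution</seg></tuv>
</tu>
<tu tuid="2"><tuv xml:lang="EN"><seg>No lead phrase here</seg></tuv></tu>
</body></tmx>"""

if __name__ == "__main__":
    for line in lead_phrases(io.BytesIO(SAMPLE.encode("utf-8"))):
        print(line)
```

To run it on the real corpus, pass the file name instead of the sample, for example `lead_phrases("uncorpora_20090710.tmx")`, and write the yielded lines to `phrases.txt` opened with `encoding="utf-8"`.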
