Corpora of the United Nations for research purposes

This website hosts document collections from the United Nations, assembled for research purposes.

United Nations General Assembly Resolutions: A Six-Language Parallel Corpus

The first available corpus is a paragraph-aligned six-language collection of resolutions of the General Assembly from Volume I of GA regular sessions 55-62. The corpus is described in an academic paper that was presented (as a poster) at Machine Translation Summit XII on August 28th, 2009.

The corpus is available in three versions, all as zipped TMX (XML) files:

  1. Machine-friendly version (49.4/176.2 MBytes). This version contains no newlines or insignificant whitespace and is the version described in the paper.
  2. Human-friendly version (50/187 MBytes). This slightly larger version has been pretty-printed to make it easier to review its content or to process it with non-XML tools such as grep. I recommend downloading this version.
  3. Plain TM version (40.9/155.6 MBytes). In this version, voting segments and footnotes are removed completely, and symbols and lead markers are stripped while their content is kept. This version is suitable for import into commercial TM tools, which may not implement the full TMX specification.
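
As a rough illustration of how the TMX files can be read programmatically, here is a minimal Python sketch that streams translation units and yields aligned segment pairs for two languages. It relies only on the standard TMX layout (tu, tuv and seg elements). The function name iter_pairs, the language codes "en" and "es", and the tiny sample document are my assumptions for illustration; check the actual files for the language codes they use.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def iter_pairs(source, src_lang, tgt_lang):
    """Yield (source_text, target_text) for each <tu> that has both languages.

    iter_pairs is a hypothetical helper, not part of the corpus tooling.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        texts = {}
        for tuv in elem.iter("tuv"):
            # TMX 1.4 uses xml:lang; older files may use a plain lang attribute.
            lang = tuv.get(XML_LANG) or tuv.get("lang") or ""
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also picks up text nested inside inline markup.
                texts[lang.lower()] = "".join(seg.itertext()).strip()
        if src_lang in texts and tgt_lang in texts:
            yield texts[src_lang], texts[tgt_lang]
        elem.clear()  # drop processed content; the corpus files are large


# Made-up sample for demonstration only; it is not taken from the corpus.
SAMPLE = b"""<tmx version="1.4">
  <body>
    <tu>
      <tuv xml:lang="en"><seg>The General Assembly</seg></tuv>
      <tuv xml:lang="es"><seg>La Asamblea General</seg></tuv>
    </tu>
  </body>
</tmx>"""

pairs = list(iter_pairs(io.BytesIO(SAMPLE), "en", "es"))
print(pairs)
```

To run this against a downloaded file, pass the path of the unzipped TMX file instead of the BytesIO object. Streaming with iterparse avoids loading the whole document into memory at once.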

If you use this corpus for research purposes, please cite: Alexandre Rafalovitch and Robert Dale. 2009. United Nations General Assembly Resolutions: A Six-Language Parallel Corpus. In Proceedings of the MT Summit XII, pages 292-299, Ottawa, Canada, August. A CiteULike record for the paper is also available.

I recommend XMLStarlet as a tool for quick ad-hoc queries over the data. It is a good find/grep/pretty-printer equivalent for XML. An introduction to XMLStarlet is available.
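
If XMLStarlet is not at hand, a similar ad-hoc query can be approximated in a few lines of Python. The sketch below counts how many tuv elements exist per language, which is a quick sanity check on alignment coverage. The function name count_by_language and the sample document are assumptions of mine, not part of the corpus or of XMLStarlet.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def count_by_language(source):
    """Count <tuv> elements per language code (hypothetical helper)."""
    counts = Counter()
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "tuv":
            # Fall back to a plain lang attribute, then to "?" if neither is set.
            counts[elem.get(XML_LANG) or elem.get("lang") or "?"] += 1
        elif elem.tag == "tu":
            elem.clear()  # free the finished translation unit
    return counts


# Made-up sample for demonstration only; it is not taken from the corpus.
SAMPLE = b"""<tmx version="1.4">
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Recalling its resolution</seg></tuv>
      <tuv xml:lang="es"><seg>Recordando su resolucion</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Decides</seg></tuv>
    </tu>
  </body>
</tmx>"""

counts = count_by_language(io.BytesIO(SAMPLE))
print(dict(counts))
```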

The ongoing conversation about this corpus will take place on the blog. Please subscribe to the feeds and join the discussion. For personal questions or comments, you can contact me at gmail.com with username arafalov.

Non-trivial utilities for manipulating the corpus will appear over time in the Google Code project repository for the site.


P.S. This site will get nicer looking later. Some time later...

Page last updated: 4th of October 2009