This website contains document collections from the United Nations, collected together for research purposes.
The first available corpus is a paragraph-aligned six-language collection of resolutions of the General Assembly from Volume I of GA regular sessions 55-62. The corpus is described in an academic paper that was presented (as a poster) at Machine Translation Summit XII on August 28th, 2009.
The corpus is available in three versions, all as zipped TMX (XML) files:
If you use this corpus for research purposes, please cite: Alexandre Rafalovitch, Robert Dale. 2009. United Nations General Assembly Resolutions: A Six-Language Parallel Corpus. In Proceedings of the MT Summit XII, pages 292-299, Ottawa, Canada, August. A CiteULike record for the paper is also available.
I recommend XMLStarlet as a tool for quick ad-hoc queries over the data. It is a good find/grep/pretty-printer equivalent for XML. An introduction to XMLStarlet is available.
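If you would rather script a query than use XMLStarlet, the same kind of ad-hoc lookup can be sketched in Python with the standard library. This is a minimal sketch, not part of the corpus tooling: the sample document is hand-written for illustration, the function name aligned_pairs is hypothetical, and the lowercase language codes are an assumption (check the xml:lang values in the actual files). It relies only on the standard TMX layout of tu, tuv and seg elements.

```python
import xml.etree.ElementTree as ET

# The xml:lang attribute lives in the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Tiny hand-written TMX sample (illustrative only, NOT taken from the corpus).
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1" segtype="paragraph"
          o-tmf="none" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Human rights</seg></tuv>
      <tuv xml:lang="fr"><seg>Droits de l'homme</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Adopts the following declaration</seg></tuv>
      <tuv xml:lang="es"><seg>Aprueba la siguiente declaracion</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def aligned_pairs(tmx_text, src, tgt):
    """Yield (src_text, tgt_text) for each translation unit holding both languages."""
    root = ET.fromstring(tmx_text)
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.findall("tuv"):
            # Language codes are lowercased because their casing in the
            # real files is an assumption here.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline markup.
                segs[lang] = "".join(seg.itertext()).strip()
        if src in segs and tgt in segs:
            yield segs[src], segs[tgt]


if __name__ == "__main__":
    for en, fr in aligned_pairs(SAMPLE_TMX, "en", "fr"):
        print(en, "|", fr)
```

For the real corpus files, which are much larger than this sample, parsing incrementally with ET.iterparse instead of loading the whole document would probably be the safer choice.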
The ongoing conversation about this corpus will be at the blog. Please subscribe to the feeds and join the discussion. For personal questions or comments, you can contact me at gmail.com with username arafalov.
Non-trivial utilities for manipulating the corpus will appear over time in the Google Code project repository for the site.
P.S. This site will get nicer looking later. Some time later...
Page last updated: 4th of October 2009