Overview

This website is a supplementary resource to the academic paper:
Alexandre Rafalovitch, Robert Dale. 2009. United Nations General Assembly Resolutions: A Six-Language Parallel Corpus. In Proceedings of the MT Summit XII, pages 292-299, Ottawa, Canada, August.

In the paper we describe a fully parallel public-domain corpus consisting of 2100 United Nations General Assembly Resolutions with translations in the six official languages of the United Nations (Arabic, Chinese, English, French, Russian, and Spanish). The documents came from Volume I of the General Assembly regular sessions 55-62. There are on average around 3 million tokens per language.

We describe the background to the corpus and its content, the process of its construction, and some of its interesting properties.

The corpus is available in a pre-processed, formatting-normalized TMX format with paragraphs aligned across multiple languages.

The paper was presented (as a poster) at the Machine Translation Summit XII on August 28th, 2009.

Corpus download

The corpus is available in three versions, all as zipped TMX (XML) files:

Human-friendly version (53 MBytes zipped/196 MBytes uncompressed)

A pretty-printed version that makes it easier to review the content or to process it with non-XML tools such as grep. I recommend downloading this version.

Sample showing the layout, highlights, and footnotes:

<tu tuid="55_100:14">
  <prop type="lead">true</prop>
  <tuv xml:lang="EN">
    <seg><hi type="lead">Recalling</hi> the provisions of the Universal Declaration of Human Rights,<sub type="fnote">Resolution 217 A (III).</sub> as well as article 12 of the International Covenant on Civil and Political Rights,<sub type="fnote">See resolution 2200 A (XXI), annex.</sub></seg>
  </tuv>
  <tuv xml:lang="AR">
    <seg><hi type="lead">وإذ تشير</hi> إلى أحكام الإعلان العالمي لحقوق الإنسان(<sub type="fnote">[1]) القرار 217 ألف (د - 3).</sub>) وإلى المادة 12 من العهد الدولي الخاص بالحقوق المدنية والسياسية(<sub type="fnote">[1]) انظر القرار 2200 ألف (د - 21)، المرفق.</sub>)،</seg>
  </tuv>
  <tuv xml:lang="ZH">
    <seg><hi type="lead">回顾</hi>《世界人权宣言》<sub type="fnote">第217 A(III)号决议。</sub> 的各项规定,以及《公民及政治权利国际盟约》第12条,<sub type="fnote">见第2200 A(XXI)号
决议,附件。</sub></seg>
  </tuv>
  <tuv xml:lang="FR">
    <seg><hi type="lead">Rappelant</hi> les dispositions de la Déclaration universelle des droits de l'homme<sub type="fnote">Résolution 217 A (III).</sub>, ainsi que l'article 12 du Pacte international relatif aux droits civils et politiques<sub type="fnote">Voir résolution 2200 A (XXI), annexe.</sub>,</seg>
  </tuv>
  <tuv xml:lang="RU">
    <seg><hi type="lead">ссылаясь</hi> на положения Всеобщей декларации прав человека<sub type="fnote">Резолюция 217 A (III).</sub>, а также на статью 12 Международного пакта о гражданских и политических правах<sub type="fnote">См. резолюцию 2200 А (XXI), приложение.</sub>,</seg>
  </tuv>
  <tuv xml:lang="ES">
    <seg><hi type="lead">Recordando</hi> las disposiciones de la Declaración Universal de Derechos Humanos<sub type="fnote">Resolución 217 A (III).</sub>, y el artículo 12 del Pacto Internacional de Derechos Civiles y Políticos<sub type="fnote">Véase resolución 2200 A (XXI), anexo.</sub>,</seg>
  </tuv>
</tu>
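
As a rough illustration of how a unit like the one above can be read, here is a minimal Python sketch using only the standard library. The function name `read_unit` and the shortened two-language sample are assumptions made for this example; they are not part of the corpus or of any tooling described in the paper.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def read_unit(tu):
    """Return (tuid, {language: segment text}) for one <tu> element."""
    segments = {}
    for tuv in tu.iter("tuv"):
        seg = tuv.find("seg")
        if seg is not None:
            # itertext() flattens inline markup such as <hi> and <sub>,
            # so footnote text stays embedded in the result.
            segments[tuv.get(XML_LANG)] = "".join(seg.itertext())
    return tu.get("tuid"), segments

# Abbreviated stand-in for the sample unit shown above (two languages only).
sample = """<tu tuid="55_100:14">
  <prop type="lead">true</prop>
  <tuv xml:lang="EN">
    <seg><hi type="lead">Recalling</hi> the provisions,<sub type="fnote">Resolution 217 A (III).</sub></seg>
  </tuv>
  <tuv xml:lang="FR">
    <seg><hi type="lead">Rappelant</hi> les dispositions<sub type="fnote">Résolution 217 A (III).</sub>,</seg>
  </tuv>
</tu>"""

tuid, segments = read_unit(ET.fromstring(sample))
print(tuid, sorted(segments))
```

Because the footnotes are inline `<sub>` elements, a plain text flattening like this one runs the footnote text into the sentence; whether that is acceptable depends on the task.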
    

Machine-friendly version (52/185 MBytes)

This version contains no newlines or insignificant whitespace and is the version described in the paper.
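
Since this version has no line breaks, line-oriented tools are of little use and loading a file of this size into memory at once may be impractical. The sketch below shows one possible approach: streaming the file unit by unit with the standard library and reporting units that lack a language. The names `iter_units` and `incomplete_units`, the tiny inline document, and its `tuid` values are hypothetical; the sketch also assumes the usual TMX layout of `<tu>` elements inside `<body>`.

```python
import io
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
# Language codes as they appear in the sample unit on this page.
CORPUS_LANGS = frozenset({"AR", "ZH", "EN", "FR", "RU", "ES"})

def iter_units(source):
    """Yield (tuid, set of language codes) for each <tu>, one unit at a time.

    `source` may be a file path or a binary file object.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "tu":
            yield elem.get("tuid"), {tuv.get(XML_LANG) for tuv in elem.iter("tuv")}
            elem.clear()  # drop the finished unit's children before moving on

def incomplete_units(source, expected=CORPUS_LANGS):
    """Return [(tuid, missing languages)] for units lacking an expected language."""
    return [(tuid, expected - langs)
            for tuid, langs in iter_units(source)
            if expected - langs]

# Hypothetical miniature document standing in for the real corpus file.
tiny = (b'<tmx version="1.4"><body>'
        b'<tu tuid="a:1"><tuv xml:lang="EN"><seg>One</seg></tuv>'
        b'<tuv xml:lang="FR"><seg>Un</seg></tuv></tu>'
        b'<tu tuid="a:2"><tuv xml:lang="EN"><seg>Two</seg></tuv></tu>'
        b'</body></tmx>')

report = incomplete_units(io.BytesIO(tiny), expected=frozenset({"EN", "FR"}))
print(report)
```

For the real download, the path to the unzipped TMX file would be passed in place of the `io.BytesIO` object, with the default six-language set.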

Plain TM version (43/163 MBytes)

In this version, voting segments and footnotes are removed completely, and symbol and lead markers are stripped while the text they mark is kept. This version is suitable for import into commercial TM tools, which may not implement the full TMX specification.

Sample showing the simplified layout:

<tu tuid="55_100:14">
  <prop type="lead">true</prop>
  <tuv xml:lang="EN">
    <seg>Recalling the provisions of the Universal Declaration of Human Rights, as well as article 12 of the International Covenant on Civil and Political Rights,</seg>
  </tuv>
  <tuv xml:lang="AR">
    <seg>وإذ تشير إلى أحكام الإعلان العالمي لحقوق الإنسان() وإلى المادة 12 من العهد الدولي الخاص بالحقوق المدنية والسياسية()،</seg>
  </tuv>
  <tuv xml:lang="ZH">
    <seg>回顾《世界人权宣言》 的各项规定,以及《公民及政治权利国际盟约》第12条,</seg>
  </tuv>
  <tuv xml:lang="FR">
    <seg>Rappelant les dispositions de la Déclaration universelle des droits de l'homme, ainsi que l'article 12 du Pacte international relatif aux droits civils et politiques,</seg>
  </tuv>
  <tuv xml:lang="RU">
    <seg>ссылаясь на положения Всеобщей декларации прав человека, а также на статью 12 Международного пакта о гражданских и политических правах,</seg>
  </tuv>
  <tuv xml:lang="ES">
    <seg>Recordando las disposiciones de la Declaración Universal de Derechos Humanos, y el artículo 12 del Pacto Internacional de Derechos Civiles y Políticos,</seg>
  </tuv>
</tu>
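
The relationship between the two samples can be approximated in a few lines of Python. This is only a sketch of the idea, not the process actually used to build the plain version: the helper `plain_text` is hypothetical, it drops `<sub type="fnote">` content and keeps the text of any other inline element, and it does not deal with voting segments.

```python
import xml.etree.ElementTree as ET

def plain_text(elem):
    """Flatten an element: skip footnote content, keep all other text."""
    parts = [elem.text or ""]
    for child in elem:
        is_footnote = child.tag == "sub" and child.get("type") == "fnote"
        if not is_footnote:
            # Inline wrappers such as <hi> are unwrapped: markup goes, text stays.
            parts.append(plain_text(child))
        # Text following the child belongs to the sentence either way.
        parts.append(child.tail or "")
    return "".join(parts)

# English segment copied from the human-friendly sample on this page.
rich = ET.fromstring(
    '<seg><hi type="lead">Recalling</hi> the provisions of the Universal '
    'Declaration of Human Rights,<sub type="fnote">Resolution 217 A (III).</sub>'
    ' as well as article 12 of the International Covenant on Civil and '
    'Political Rights,<sub type="fnote">See resolution 2200 A (XXI), annex.</sub>'
    '</seg>'
)

print(plain_text(rich))
```

Applied to the English segment, this yields the same text as the simplified sample. Punctuation that sits outside the footnote element is left in place, which is consistent with the empty parentheses visible in the Arabic segment.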
    

Public usage

Similar resources

United Nations Parallel Corpus
A more recent and much larger sentence-aligned corpus in a TEI-encoded format.
Described in the paper by Ziemski, Junczys-Dowmunt, and Pouliquen from LREC'16.
WIPO's Corpus of Parallel Patent Applications (COPPA)
A parallel English-French corpus of WIPO's PCT applications (title and abstract) published between 1990 and 2010, in TMX format.
Described in the paper by Pouliquen and Mazenc from the MT Summit XIII.

Support utilities

Current academic citations

If using the corpus, please cite:
Alexandre Rafalovitch, Robert Dale. 2009. United Nations General Assembly Resolutions: A Six-Language Parallel Corpus. In Proceedings of the MT Summit XII, pages 292-299, Ottawa, Canada, August.

For personal questions or comments, you can contact me at gmail.com with the username arafalov.

Page last updated: July 2016