Another United Nations parallel corpus: WIPO Corpus Of Parallel Patent Applications

WIPO (World Intellectual Property Organization) has released a free-for-research parallel corpus of patent applications in English and French (COPPA). It is significantly larger than my corpus, but it covers only two of the six official languages.

It should be useful for machine learning techniques and – I suspect – would be quite interesting and challenging for Named Entity Recognition tasks.

They also use the TMX file format to store the aligned content. I started writing some tools to manipulate TMX for my corpus, but have stopped for now. However, if more non-translation projects start using TMX, it might be worth revisiting the tools.
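For readers who have not handled TMX outside a translation tool, here is a minimal sketch of pulling aligned sentence pairs out of a TMX document with the Python standard library. The sample document, the function name read_pairs and the language codes are all made up for illustration; neither corpus is guaranteed to use exactly this layout, and very large files would need streaming rather than parsing in one go.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Tiny invented sample, only to keep the sketch self-contained.
SAMPLE_TMX = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>The meeting was called to order.</seg></tuv>
      <tuv xml:lang="fr"><seg>La séance est ouverte.</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def read_pairs(tmx_text, src="en", tgt="fr"):
    """Return (source, target) text pairs for every translation unit
    that has a segment in both requested languages."""
    root = ET.fromstring(tmx_text.encode("utf-8"))
    pairs = []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.findall("tuv"):
            # Older TMX files use a plain "lang" attribute instead of xml:lang.
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang and seg is not None:
                # itertext() also collects text nested inside inline markup.
                segs[lang.lower()] = "".join(seg.itertext()).strip()
        if src in segs and tgt in segs:
            pairs.append((segs[src], segs[tgt]))
    return pairs


if __name__ == "__main__":
    for en, fr in read_pairs(SAMPLE_TMX):
        print(en, "|", fr)
```

Joining itertext() flattens any inline markup inside a segment, which is usually what a machine learning pipeline wants but loses information a translation tool would keep.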

The research paper describing the corpus, by Bruno Pouliquen and Christophe Mazenc, is now also available from MT Summit XIII (the same conference where I presented my poster a couple of years ago).

Being useful to translators

The UN Corpora site contains a version of the corpus in TMX form, ready for importing into a Computer Assisted Translation tool such as Trados.

So, Jost Zetzsche mentioned the corpus in the 190th issue of The Tool Kit – the technology newsletter for translators – and also in his tweet (as Jeromobot) as part of a collection of corpora useful for translators.

Thank you. And thank you for the Tool Kit; it is always an interesting read.

(Mis) Understanding United Nations documents

Understanding UN documents is no small undertaking. After looking at a couple of examples, a person starts to get the feeling that there is a clear pattern of symbols, editorial conventions and other ways of naming, citing and using United Nations documents. Unfortunately, this usually means that the person just did not look at a wide enough sample.

Let’s just take a look at the most basic document symbols. I am not going to go into full research mode on it. David N. Griffiths has written quite an article covering the UN Classification Scheme with full librarian’s intensity. I just want to look at one example here.

The basic symbol format for General Assembly documents is currently A/<sessionNum>/<docNum>, where sessionNum is the number of the GA session (currently the 64th) and docNum is a sequence number within that session.

With that in mind, what would the symbol be for the Rules of Procedure of the General Assembly, a document last produced in 2008? Would it be A/62/docNum, maybe even with docNum always being the same, similar to the Report of the Secretary-General on the work of the organization (A/62/1, A/63/1, A/64/1 and – in a month or so – A/65/1)?

How about A/520/Rev.17? Does this look like a recent document of the General Assembly (GA)? Not really: there is either a missing session number or a session number from far in the future, but no document number. And what’s Rev.17?

It turns out that A/<docNum> is actually a valid GA symbol format, just not one from the last 30 years or so. In the early years of the General Assembly, all documents were simply numbered sequentially, but after the 30th session the numbering was changed to include the session number (we can thank Mr. Jean Gazarian for that). The Rules document was first written down in 1947 and, using the numbering scheme from back then, was assigned the symbol A/520.

Since then, the Rules have stood outside of time and the normal symbol naming rules. Instead, the document gets a revision every couple of years, with the revision number added to the base A/520 symbol. One assumes the reasoning is that the Rules apply to more than one session of the General Assembly and that’s why they are not tied to any. Maybe it even makes sense. But it does raise the question: how many other documents are outside of the normal rules right now, and how would one go about discovering them?
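To make the two shapes concrete, here is a rough sketch of a parser that accepts both A/<sessionNum>/<docNum> and the older A/<docNum>, with an optional Rev.N suffix. The function name and the returned field names are my own invention, and the pattern covers only the forms discussed above; real symbols carry other prefixes and suffixes that this deliberately ignores.

```python
import re

# Covers only the shapes discussed above (an assumption, not the full scheme):
#   A/<sessionNum>/<docNum>   e.g. A/64/1
#   A/<docNum>                e.g. A/520
#   either one followed by /Rev.<n>
SYMBOL_RE = re.compile(
    r"^A/(?:(?P<session>\d+)/)?(?P<doc>\d+)(?:/Rev\.(?P<rev>\d+))?$"
)


def parse_ga_symbol(symbol):
    """Split a General Assembly symbol into session, document and revision.

    Returns None when the symbol does not fit the simplified pattern.
    """
    m = SYMBOL_RE.match(symbol)
    if m is None:
        return None
    return {
        "session": int(m["session"]) if m["session"] else None,
        "doc": int(m["doc"]),
        "revision": int(m["rev"]) if m["rev"] else None,
    }


if __name__ == "__main__":
    for s in ("A/64/1", "A/520/Rev.17", "A/520"):
        print(s, parse_ga_symbol(s))
```

A symbol that comes back with no session is either an old sequentially numbered document or one of the outliers like the Rules, which is at least a starting point for finding the others.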

Two corpus files reposted

I just updated two corpus files (second and third), as they had invalid XML encoding information. They used utf8 instead of UTF-8.

Unfortunately, XMLStarlet – which I used to pretty-print the XML – accepted utf8 as a valid value and put it right into the file. This did not affect everybody, but at least one person was stumped enough to email, so I felt it should be fixed.

Those who do not want to redownload the files and can open large XML files (e.g. in Vim) can fix the problem themselves, as it is extremely obvious right at the end of the first line.
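For those who would rather script the fix than open the file in an editor, here is a small sketch. It assumes the declaration sits on the first line, as described above, and rewrites only that line while copying the rest of the file untouched. The function names and the idea of writing to a separate output path are mine, not part of the corpus tooling.

```python
import re
import shutil

# Matches encoding="utf8" (either quote style, any letter case)
# inside the XML declaration.
DECL_RE = re.compile(rb"""(encoding\s*=\s*["'])utf8(["'])""", re.IGNORECASE)


def fix_declaration(first_line):
    """Return the first line with utf8 replaced by UTF-8 in the declaration."""
    return DECL_RE.sub(rb"\g<1>UTF-8\g<2>", first_line, count=1)


def fix_file(src_path, dst_path):
    """Copy src_path to dst_path, correcting only the first line.

    The rest of the file is streamed in chunks, so large corpus files
    never have to fit in memory.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        dst.write(fix_declaration(src.readline()))
        shutil.copyfileobj(src, dst)
```

Working in bytes avoids decoding the file at all, which matters because the encoding label is exactly the part that is wrong.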

New corpus version, updated tool and a copy of the paper are now available

The main website has been updated with a number of items, just in time for the MT Summit XII poster session:

  1. The code repository has been updated. Now specific languages can be extracted, vote segments removed, footnotes removed and in-paragraph annotations flattened.
  2. A new version of the corpus has been uploaded, processed by the tool above to remove footnotes and flatten the in-paragraph annotations. Otherwise, the content is the same. This version is more suitable for direct import into commercial tools, which may not implement the fancier bits of TMX.
  3. The paper describing the corpus is now linked to.
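As an illustration of what extracting specific languages from a TMX file can look like, here is a short sketch using the Python standard library. This is not the code from the repository; the function name keep_languages and the attribute handling are assumptions about a generic TMX layout.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def keep_languages(root, keep):
    """Drop every <tuv> whose language is not in `keep`, editing the tree in place."""
    keep = {code.lower() for code in keep}
    # Materialise the list first so removing children does not disturb iteration.
    for tu in list(root.iter("tu")):
        for tuv in list(tu.findall("tuv")):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            if lang not in keep:
                tu.remove(tuv)
    return root
```

Calling keep_languages(root, ["en", "fr"]) on a parsed six-language file would leave only the English and French variants in each translation unit; writing the tree back out is left to the caller.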