WIPO (World Intellectual Property Organization) has released COPPA, a free-for-research parallel corpus of patent applications in English and French. It is significantly larger than my corpus, but obviously covers only two of the six official languages.
It should be useful for machine learning and, I suspect, would be quite interesting and challenging for Named Entity Recognition tasks.
They are also using the TMX file format to store the aligned content. I started writing some tools to manipulate TMX for my corpus, but have stopped for now. However, if more non-translation projects start using TMX, it might be worth revisiting the tools.
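For a sense of what working with TMX involves, here is a minimal Python sketch that pulls aligned English/French segment pairs out of a TMX document using only the standard library. The `tu`, `tuv` and `seg` element names and the `xml:lang` attribute come from the TMX format itself; the sample sentences and the `read_pairs` helper are made up for illustration and are not taken from COPPA or from my own tools.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Tiny invented TMX document (illustrative only, not real corpus data).
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>A method for producing paper.</seg></tuv>
      <tuv xml:lang="fr"><seg>Un procédé de fabrication de papier.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_pairs(tmx_text, src="en", tgt="fr"):
    """Return a list of (source, target) segment pairs from TMX text.

    Hypothetical helper: languages are matched on their two-letter prefix,
    so "en-US" and "EN" both count as "en".
    """
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            # Older TMX files may use a plain "lang" attribute instead.
            lang = tuv.get(XML_LANG) or tuv.get("lang") or ""
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline markup.
                segments[lang.lower()[:2]] = "".join(seg.itertext()).strip()
        if src in segments and tgt in segments:
            pairs.append((segments[src], segments[tgt]))
    return pairs


if __name__ == "__main__":
    for en, fr in read_pairs(SAMPLE_TMX):
        print(en, "|", fr)
```

For a corpus-sized file, loading everything at once like this would probably be too memory-hungry; something incremental such as `ET.iterparse` would be the more likely choice.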
The research paper describing the corpus, by Bruno Pouliquen and Christophe Mazenc, is now also available from MT Summit XIII (the same conference I had my poster at a couple of years ago).
