New corpus version, updated tool, and a copy of the paper are now available

The main website has been updated with a number of items, just in time for the MT Summit XII poster session:

  1. The code repository has been updated. The tool can now extract specific languages, remove vote segments, remove footnotes, and flatten in-paragraph annotations.
  2. A new version of the corpus has been uploaded, processed with the tool above to remove footnotes and flatten the in-paragraph annotations. Otherwise, the content is the same. This version is more suitable for direct import into commercial tools, which may not implement the fancier bits of TMX.
  3. The paper describing the corpus is now linked from the website.
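
For readers curious what language extraction and annotation flattening amount to on a TMX file, here is a minimal Python sketch. It is not the code from the repository: the function name `filter_and_flatten` is hypothetical, and it assumes standard TMX structure (`<tu>`, `<tuv xml:lang="...">`, `<seg>`) with in-paragraph annotations expressed as inline elements such as `<hi>`. Footnote and vote-segment removal are left out, since how the corpus marks them is not described here.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def filter_and_flatten(tmx_text, keep_langs):
    """Keep only the <tuv> variants whose language is in keep_langs and
    replace inline markup inside each <seg> with its plain text.

    Illustrative sketch only, not the project's tool.
    """
    root = ET.fromstring(tmx_text)
    wanted = {lang.lower() for lang in keep_langs}

    # Materialise the list first so the tree can be modified safely.
    for tu in list(root.iter("tu")):
        for tuv in list(tu.findall("tuv")):
            # TMX 1.4 uses xml:lang; older files may use a plain lang attribute.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            if lang not in wanted:
                tu.remove(tuv)
                continue
            for seg in tuv.findall("seg"):
                # itertext() yields the segment text plus the text and tails
                # of all nested inline elements, in document order.
                flat = "".join(seg.itertext())
                for child in list(seg):
                    seg.remove(child)
                seg.text = flat

    return ET.tostring(root, encoding="unicode")


sample = (
    '<tmx version="1.4"><header/><body><tu>'
    '<tuv xml:lang="en"><seg>Hello <hi type="x">big</hi> world</seg></tuv>'
    '<tuv xml:lang="fr"><seg>Bonjour</seg></tuv>'
    '</tu></body></tmx>'
)
result = filter_and_flatten(sample, ["en"])
```

Note that this keeps all text found inside inline elements. That suits highlighting elements like `<hi>`, but elements that carry native formatting codes (for example `<bpt>` or `<ph>`) would need to be dropped rather than flattened.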
