<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>UN Corpora Blog</title>
	<atom:link href="http://www.uncorpora.org/blog/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.uncorpora.org/blog</link>
	<description>Researching United Nations Documents</description>
	<lastBuildDate>Mon, 07 Nov 2011 21:10:31 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.2</generator>
		<item>
		<title>Another United Nations parallel corpus: WIPO Corpus Of Parallel Patent Applications</title>
		<link>http://www.uncorpora.org/blog/2011/11/another-united-nations-parallel-corpus-wipo/</link>
		<comments>http://www.uncorpora.org/blog/2011/11/another-united-nations-parallel-corpus-wipo/#comments</comments>
		<pubDate>Mon, 07 Nov 2011 21:10:31 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Publicity]]></category>

		<guid isPermaLink="false">http://www.uncorpora.org/blog/?p=30</guid>
		<description><![CDATA[<p>WIPO (World Intellectual Property Organization) has released a Free for Research parallel corpus of patent applications in English and French (COPPA). It is significantly larger than my corpus, but covers only two of the six official UN languages.</p> <p>It should be useful for machine learning techniques and &#8211; I suspect &#8211; would [...]]]></description>
			<content:encoded><![CDATA[<p>WIPO (World Intellectual Property Organization) has released <a href="http://www.wipo.int/patentscope/en/data/products.html#coppa">a Free for Research parallel corpus of patent applications in English and French (COPPA)</a>. It is significantly larger than my corpus, but covers only two of the six official UN languages.</p>
<p>It should be useful for machine learning techniques and &#8211; I suspect &#8211; would be quite interesting and challenging for Named Entity Recognition tasks.</p>
<p>They also use the TMX file format to store the aligned content. I started writing some tools to manipulate TMX for my corpus, but have stopped for now. However, if more non-translation projects start using TMX, it might be worth revisiting the tools.</p>
<p><a title="Paper describing the corpus" href="http://www.mt-archive.info/MTS-2011-Pouliquen.pdf">The research paper describing the corpus by Bruno Pouliquen and Christophe Mazenc</a> is also now available from the <a title="MT Summit XIII page" href="http://www.mt-archive.info/MTS-2011-TOC.htm">MT Summit XIII</a> page (the same conference series where I had my poster a couple of years ago).</p>
]]></content:encoded>
			<wfw:commentRss>http://www.uncorpora.org/blog/2011/11/another-united-nations-parallel-corpus-wipo/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Being useful to translators</title>
		<link>http://www.uncorpora.org/blog/2011/05/being-useful-to-translators/</link>
		<comments>http://www.uncorpora.org/blog/2011/05/being-useful-to-translators/#comments</comments>
		<pubDate>Fri, 06 May 2011 02:25:28 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Publicity]]></category>

		<guid isPermaLink="false">http://www.uncorpora.org/blog/?p=23</guid>
		<description><![CDATA[<p>The UN Corpora site contains a version of the corpus in TMX form, ready for importing into a Computer Assisted Translation tool such as Trados.</p> <p>So, Jost Zetzsche mentioned the corpus in the 190th issue of The Tool Kit &#8211; the technology newsletter for translators, and also in his tweet (as Jeromobot) as part of a collection of [...]]]></description>
			<content:encoded><![CDATA[<p>The <a href="http://www.uncorpora.org/">UN Corpora site</a> contains a version of the corpus in TMX form, ready for importing into a Computer Assisted Translation tool such as Trados.</p>
<p>So, Jost Zetzsche mentioned the corpus in the 190th issue of <a href="http://www.internationalwriters.com/toolkit/"><em>The Tool Kit</em></a> &#8211; the technology newsletter for translators, and also in <a title="Jeromobot's tweet about this corpus" href="http://twitter.com/#!/Jeromobot/status/60399816395587584">his tweet</a> (as Jeromobot) as part of a collection of corpora useful for translators.</p>
<p>Thank you. And thank you for the Tool Kit; it is always an interesting read.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.uncorpora.org/blog/2011/05/being-useful-to-translators/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>(Mis) Understanding United Nations documents</title>
		<link>http://www.uncorpora.org/blog/2010/08/mis-understanding-united-nations-documents/</link>
		<comments>http://www.uncorpora.org/blog/2010/08/mis-understanding-united-nations-documents/#comments</comments>
		<pubDate>Sat, 14 Aug 2010 20:22:40 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Understanding United Nations]]></category>

		<guid isPermaLink="false">http://www.uncorpora.org/blog/?p=20</guid>
		<description><![CDATA[<p>Understanding UN documents is not a small undertaking. Looking at a couple of examples, a person starts to get the feeling that there is a clear pattern of symbols, editorial conventions and other ways of naming, citing and using United Nations documents. Unfortunately, this usually means that the person just did not look at [...]]]></description>
			<content:encoded><![CDATA[<p>Understanding UN documents is not a small undertaking. Looking at a couple of examples, a person starts to get the feeling that there is a clear pattern of symbols, editorial conventions and other ways of naming, citing and using United Nations documents. Unfortunately, this usually means that the person just did not look at a wide enough sample set.</p>
<p>Let&#8217;s just take a look at the most basic document symbols. I am not going to go into full research mode on it. David N. Griffiths has written quite <a title="David N. Griffiths: 'The United Nations Classification Scheme: A Critique and Recommendations'" href="http://www.citeulike.org/article/4207459">an article covering the UN Classification Scheme</a> with full librarian&#8217;s intensity. I just want to look at one example here.</p>
<p>The basic symbol format for General Assembly documents is currently <em>A/&lt;sessionNum&gt;/&lt;docNum&gt;</em>, where <em>sessionNum</em> is the number of the GA session (currently the 64th) and <em>docNum</em> is a sequence number within that session.</p>
<p>With that in mind, what would the symbol be for the <a href="http://www.un.org/ga/ropga.shtml">Rules of Procedure of the General Assembly</a>, a document last produced in 2008? Would it be <em>A/62/docNum</em>, maybe even with <em>docNum</em> always being the same, similar to the <em>Report of the Secretary-General on the work of the organization</em> (<a href="http://doc.un.org/DocBox/docbox.nsf/GetAll?OpenAgent&amp;DS=A/62/1%28SUPP%29">A/62/1</a>, <a href="http://doc.un.org/DocBox/docbox.nsf/GetAll?OpenAgent&amp;DS=A/63/1%28SUPP%29">A/63/1</a>, <a href="http://doc.un.org/DocBox/docbox.nsf/GetAll?OpenAgent&amp;DS=A/64/1%28SUPP%29">A/64/1</a> and &#8211; in a month or so &#8211; A/65/1)?</p>
<p>How about <a href="http://doc.un.org/DocBox/docbox.nsf/GetAll?OpenAgent&amp;DS=A/520/Rev.17">A/520/Rev.17</a>? Does this look like a recent document of the General Assembly (GA)? Not really: either the session number is missing, or 520 is a session number from the far future and the document number is missing. And what&#8217;s Rev.17?</p>
<p>It turns out that <em>A/&lt;docNum&gt;</em> is a valid GA symbol format, just not one used in the last 30 years or so. In the early years of the General Assembly, all documents were simply numbered sequentially, but after the 30th session the numbering was changed to include the session number (we can thank <a href="http://www.unitar.org/fr/node/338">Mr. Jean Gazarian</a> for that). <em>The Rules</em> document was first written down in 1947 and, under the numbering scheme of the time, was assigned the symbol <a href="http://doc.un.org/DocBox/docbox.nsf/GetAll?OpenAgent&amp;DS=A/520">A/520</a>.</p>
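<p>To make the two formats concrete, here is a minimal Python sketch that tells them apart. The regular expression and the <em>classify_symbol</em> helper are hypothetical names for illustration, inferred only from the examples in this post (<em>A/64/1</em>, <em>A/520</em>, <em>A/520/Rev.17</em>); real UN symbols come in more variants than this handles.</p>

```python
import re

# Assumed pattern, inferred only from the examples in this post:
#   A/<sessionNum>/<docNum>   e.g. A/64/1        (session-based numbering)
#   A/<docNum>[/Rev.<n>]      e.g. A/520/Rev.17  (older sequential numbering)
SYMBOL_RE = re.compile(r"^A/(\d+)(?:/(\d+)|/Rev\.(\d+))?$")


def classify_symbol(symbol):
    """Describe a General Assembly symbol, or return None if it is not recognized."""
    match = SYMBOL_RE.match(symbol)
    if match is None:
        return None
    first, doc_num, revision = match.groups()
    if doc_num is not None:
        # Two numeric parts: the first is the session, the second the document.
        return {"scheme": "session", "session": int(first), "doc": int(doc_num)}
    # One numeric part: a sequential document number, optionally revised.
    result = {"scheme": "sequential", "doc": int(first)}
    if revision is not None:
        result["revision"] = int(revision)
    return result
```

<p>With this, <em>A/64/1</em> is classified as session-based (session 64, document 1), while <em>A/520/Rev.17</em> comes out as sequential document 520, revision 17.</p>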
<p>Since then, <em>the Rules</em> have stood outside of time and the normal symbol naming rules. Instead, they are revised every couple of years, with each revision adding a suffix to the base <em>A/520</em> symbol. One assumes the reasoning is that <em>the Rules</em> apply to more than one session of the General Assembly, which is why they are not tied to any. Maybe it even makes sense. But it does raise the question &#8211; how many other documents are outside the normal rules right now, and how would one go about discovering them?</p>
]]></content:encoded>
			<wfw:commentRss>http://www.uncorpora.org/blog/2010/08/mis-understanding-united-nations-documents/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Two corpus files reposted</title>
		<link>http://www.uncorpora.org/blog/2009/08/two-corpus-files-reposted/</link>
		<comments>http://www.uncorpora.org/blog/2009/08/two-corpus-files-reposted/#comments</comments>
		<pubDate>Tue, 01 Sep 2009 02:59:43 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Tools and Tips]]></category>

		<guid isPermaLink="false">http://www.uncorpora.org/blog/?p=14</guid>
		<description><![CDATA[<p>I just updated two corpus files (second and third), as they had invalid XML encoding information. They used utf8 instead of UTF-8.</p> <p>Unfortunately, XMLStarlet &#8211; which I used to pretty-print the XML &#8211; accepted utf8 as a valid value and put it right into the file. This did not affect everybody, but at least [...]]]></description>
			<content:encoded><![CDATA[<p>I just updated two corpus files (second and third), as they had invalid XML encoding information. They used <em>utf8</em> instead of <em>UTF-8</em>.</p>
<p>Unfortunately, XMLStarlet &#8211; which I used to pretty-print the XML &#8211; accepted <em>utf8</em> as a valid value and put it right into the file. This did not affect everybody, but at least one person was stumped enough to email, so I felt it should be fixed.</p>
<p>Those who do not want to redownload the files and can open large XML files (e.g. in Vim) can fix the problem themselves: it is in plain sight right at the end of the first line.</p>
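<p>For those who would rather script the fix than open the file in an editor, here is a minimal Python sketch. It assumes the XML declaration sits on the first line and contains <em>encoding="utf8"</em> (as described above); the function name <em>fix_encoding_declaration</em> and the file paths are hypothetical. It writes a corrected copy rather than editing in place, and streams the rest of the file so that a large corpus never has to fit in memory.</p>

```python
import re
import shutil


def fix_encoding_declaration(src_path, dst_path):
    """Copy src_path to dst_path, rewriting encoding="utf8" on the first line.

    Returns True if the declaration was changed, False otherwise.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        first_line = src.readline()
        # Assumption: the declaration is on the first line and uses utf8
        # in single or double quotes.
        fixed = re.sub(
            rb"""encoding=(["'])utf8\1""",
            rb"encoding=\g<1>UTF-8\g<1>",
            first_line,
            count=1,
        )
        dst.write(fixed)
        # Stream the remainder unchanged, chunk by chunk.
        shutil.copyfileobj(src, dst)
    return fixed != first_line
```

<p>Calling it with the downloaded file and a new output path leaves the original untouched, so the result can be checked before replacing anything.</p>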
]]></content:encoded>
			<wfw:commentRss>http://www.uncorpora.org/blog/2009/08/two-corpus-files-reposted/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>New corpus version, updated tool, copy of the paper are now available</title>
		<link>http://www.uncorpora.org/blog/2009/08/new-corpus-version-updated-tool-copy-of-the-paper-are-now-available/</link>
		<comments>http://www.uncorpora.org/blog/2009/08/new-corpus-version-updated-tool-copy-of-the-paper-are-now-available/#comments</comments>
		<pubDate>Sun, 30 Aug 2009 04:52:48 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Tools and Tips]]></category>

		<guid isPermaLink="false">http://www.uncorpora.org/blog/?p=10</guid>
		<description><![CDATA[<p>The main website has been updated with a number of items, just in time for the MT Summit XII poster session:</p> The code repository has been updated. Now specific languages can be extracted, vote segments removed, footnotes removed and in-paragraph annotations flattened. A new version of the corpus has been uploaded processed by the [...]]]></description>
			<content:encoded><![CDATA[<p>The main website has been updated with a number of items, just in time for the MT Summit XII poster session:</p>
<ol>
<li>The code repository has been updated. Now specific languages can be extracted, vote segments removed, footnotes removed and in-paragraph annotations flattened.</li>
<li>A new version of the corpus has been uploaded, processed by the tool above to remove footnotes and flatten the in-paragraph annotations. Otherwise, the content is the same. This version is more suitable for direct import into commercial tools, which may not implement the fancier bits of TMX.</li>
<li>The paper describing the corpus is now linked to.</li>
</ol>
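<p>As an illustration of what extracting specific languages from a TMX file can look like, here is a minimal Python sketch. It is not the tool from the code repository: the function name <em>extract_languages</em> is hypothetical, and it assumes the standard TMX layout of <em>tu</em> elements containing <em>tuv</em> elements with an <em>xml:lang</em> attribute and a <em>seg</em> child. Inline markup inside a segment is flattened to plain text.</p>

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def extract_languages(tmx_source, wanted):
    """Yield one dict per translation unit, mapping language code to text.

    tmx_source is a file name or binary file object; wanted is a set of
    language codes exactly as they are written in the file (an assumption).
    """
    for _event, elem in ET.iterparse(tmx_source, events=("end",)):
        if elem.tag != "tu":
            continue
        unit = {}
        for tuv in elem.findall("tuv"):
            lang = tuv.get(XML_LANG)
            seg = tuv.find("seg")
            if lang in wanted and seg is not None:
                # itertext() drops inline tags and keeps only their text.
                unit[lang] = "".join(seg.itertext())
        if unit:
            yield unit
        # Release the processed unit so large files stay manageable.
        elem.clear()
```

<p>Because it parses incrementally, the sketch can walk a large corpus one translation unit at a time instead of loading the whole file.</p>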
]]></content:encoded>
			<wfw:commentRss>http://www.uncorpora.org/blog/2009/08/new-corpus-version-updated-tool-copy-of-the-paper-are-now-available/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

