<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>&#964;&#949;&#967;&#957;&#959;&#963;&#959;&#966;&#953;&#945; &#187; Metadata Evaluation Toolkit</title>
	<atom:link href="http://lackoftalent.org/michael/blog/category/projects/metadata-evaluation-toolkit/feed/" rel="self" type="application/rss+xml" />
	<link>http://lackoftalent.org/michael/blog</link>
	<description>The occasional rambling of a digital library artisan</description>
	<lastBuildDate>Thu, 20 May 2010 00:21:51 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0</generator>
		<item>
		<title>WDL metadata mapping, and, parsing TEI in Python</title>
		<link>http://lackoftalent.org/michael/blog/2009/07/13/wdl-metadata-mapping-and-parsing-tei-in-python/</link>
		<comments>http://lackoftalent.org/michael/blog/2009/07/13/wdl-metadata-mapping-and-parsing-tei-in-python/#comments</comments>
		<pubDate>Mon, 13 Jul 2009 22:27:46 +0000</pubDate>
		<dc:creator>Michael Giarlo</dc:creator>
				<category><![CDATA[Cataloging and Metadata]]></category>
		<category><![CDATA[Metadata Evaluation Toolkit]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[World Digital Library]]></category>

		<guid isPermaLink="false">http://lackoftalent.org/michael/blog/?p=430</guid>
		<description><![CDATA[Context Early on in the effort to develop the first public version of the World Digital Library web application, we developed a (non-public) Django-based cataloging application where Library of Congress catalogers could manage metadata for WDL items. Management in this sense includes creation of records, editing of records, versioning of edits, mapping of source records, [...]]]></description>
			<content:encoded><![CDATA[<abbr class="unapi-id" title="oai:lackoftalent.org:technosophia:430"><!-- &nbsp; --></abbr>
<h2>Context</h2>
<p>Early on in the effort to develop the first public version of the World Digital Library <a href="http://www.wdl.org/">web application</a>, we developed a (non-public) Django-based cataloging application where Library of Congress catalogers could manage metadata for WDL items.  Management in this sense includes creation of records, editing of records, versioning of edits, mapping of source records, and some light workflow for assignment of records to individual catalogers and for hooking into translation processes[1].  </p>
<p>I worked primarily on the source record mapping tools.  They take a number of formats as input and are called by the cataloging application to map metadata from these formats into the WDL domain model.  Several of these formats, though not all, are XML-based and thus easily dealt with in Python via the <a href="http://codespeak.net/lxml/api.html">etree module in the lxml package</a>.  </p>
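<p>To make the idea of format-specific mapping rules concrete, here is a minimal sketch, not the actual WDL code.  It uses the standard library&#039;s <code>xml.etree.ElementTree</code> as a stand-in for <code>lxml.etree</code> (the two share most of the ElementTree API), and the <code>RULES</code> table, the target field names, and the sample record are all hypothetical:</p>

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping rules: source element name -> target field name.
# Neither the element names nor the field names come from the WDL tools.
DC = "{http://purl.org/dc/elements/1.1/}"
RULES = {
    DC + "title": "title",
    DC + "creator": "creator",
    DC + "date": "date_created",
}


def map_record(xml_text, rules=RULES):
    """Map one XML source record into a flat dict of lists."""
    root = ET.fromstring(xml_text)
    record = {}
    for elem in root.iter():
        field = rules.get(elem.tag)
        # Skip elements without a rule and elements with no text.
        if field and elem.text and elem.text.strip():
            record.setdefault(field, []).append(elem.text.strip())
    return record


# Made-up source record, for illustration only.
SAMPLE = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Example map</dc:title>
  <dc:creator>Unknown</dc:creator>
  <dc:date>1900</dc:date>
  <dc:format>ignored</dc:format>
</record>"""

print(map_record(SAMPLE))
```

<p>Keeping the rules in a plain dict is one way to let the parsing code stay the same while the target model changes; whether the real tools are organized like this is an assumption.</p>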
<p><a href="http://onebiglibrary.net/">Dan</a> recently kicked off a new R&#038;D project for evaluating any metadata against any number of metadata profiles and mapping it into a generic data dictionary.  The goal is to determine how feasible it would be to develop a toolset that aids remediation of metadata across any number of digital collections.  I have been working on this project with Dan, and got started by seeing how generalizable the WDL metadata mapping tools are.  Turns out they&#039;re fairly generalizable once you tweak the various format-specific mapping rules to map into the generic data dictionary model rather than the WDL model (around 15 elements, and somewhere between Dublin Core and MODS in terms of specificity but flatly structured like DC).</p>
<p>Some of the test data I am working with now, which has nothing to do with WDL, is SGML-based <a href="http://quod.lib.umich.edu/t/tei/">TEI 2</a> markup.  The closest thing I worked with on WDL was <a href="http://www.tei-c.org/release/doc/tei-p5-doc/html/MS.html">TEI P5 for manuscript description</a>, which is serialized in XML.  Turns out my TEI mapping rules from before blew up on this TEI 2 stuff, as lxml.etree (naturally) wasn&#039;t digging the non-XML input.  I googled around a bit for how best to parse TEI (or any SGML) in Python and then discovered it&#039;s actually as simple as pie.</p>
<h2>Code</h2>
<p>If you&#039;ve got the <a href="http://www.crummy.com/software/BeautifulSoup/">BeautifulSoup</a> module installed[2]:</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #66cc66;">&gt;&gt;&gt;</span> <span style="color: #ff7700;font-weight:bold;">from</span> BeautifulSoup <span style="color: #ff7700;font-weight:bold;">import</span> BeautifulSoup
<span style="color: #66cc66;">&gt;&gt;&gt;</span> tei = <span style="color: #008000;">open</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'foo.sgm'</span><span style="color: black;">&#41;</span>.<span style="color: black;">read</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
<span style="color: #66cc66;">&gt;&gt;&gt;</span> BeautifulSoup<span style="color: black;">&#40;</span>tei<span style="color: black;">&#41;</span>.<span style="color: black;">findAll</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'title'</span><span style="color: black;">&#41;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span>.<span style="color: #dc143c;">string</span>
u<span style="color: #483d8b;">'[Memorandum to Dr. Botkin]: a machine readable transcription.'</span></pre></div></div>

<p>If not, the <a href="http://codespeak.net/lxml/lxmlhtml.html">lxml.html</a> module works too:</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #66cc66;">&gt;&gt;&gt;</span> <span style="color: #ff7700;font-weight:bold;">from</span> lxml <span style="color: #ff7700;font-weight:bold;">import</span> html
<span style="color: #66cc66;">&gt;&gt;&gt;</span> h = html.<span style="color: black;">parse</span><span style="color: black;">&#40;</span><span style="color: #008000;">open</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'foo.sgm'</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
<span style="color: #66cc66;">&gt;&gt;&gt;</span> h.<span style="color: black;">xpath</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'//title'</span><span style="color: black;">&#41;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span>.<span style="color: black;">text</span>
<span style="color: #483d8b;">'[Memorandum to Dr. Botkin]: a machine readable transcription.'</span></pre></div></div>
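<p>If neither third-party package is at hand, a similar trick can be sketched with the standard library alone.  The example below first shows a strict XML parser rejecting SGML-style input (the lowercase <code>doctype</code> declaration is enough to trip it), then pulls the title out with the forgiving <code>html.parser</code> module.  The <code>FirstText</code> class and <code>first_text</code> helper are hypothetical names, and the markup is a trimmed-down stand-in for the sample record, not the full file:</p>

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Trimmed-down stand-in for a TEI 2 record; the doctype identifier is made up.
SGML = '''<!doctype tei2 public "-//Example//DTD example//EN">
<tei2><teiheader><filedesc><titlestmt>
<title>[Memorandum to Dr. Botkin]: a machine readable transcription.</title>
</titlestmt></filedesc></teiheader></tei2>'''


class FirstText(HTMLParser):
    """Collect the text of the first element with a given tag name."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.inside = False
        self.done = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag and not self.done:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == self.tag and self.inside:
            self.inside = False
            self.done = True

    def handle_data(self, data):
        if self.inside:
            self.chunks.append(data)


def first_text(markup, tag):
    """Return the text of the first <tag> element found in markup."""
    parser = FirstText(tag)
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


# A strict XML parser refuses the SGML-style input outright.
try:
    ET.fromstring(SGML)
    strict_ok = True
except ET.ParseError:
    strict_ok = False

# The lenient HTML parser does not care that this is not HTML.
title = first_text(SGML, "title")
print(strict_ok, title)
```

<p>This only grabs flat text from one element, so it is no substitute for BeautifulSoup or lxml.html when you need a real tree to navigate.</p>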

<h2>Data</h2>
<p>And here&#039;s what the sample data looks like:</p>

<div class="wp_syntax"><div class="code"><pre class="xml" style="font-family:monospace;"><span style="color: #009900;">&lt;!doctype tei2 public <span style="color: #ff0000;">&quot;-//Library of Congress - Historical Collections (American Memory)//DTD ammem.dtd//EN&quot;</span> </span>
<span style="color: #009900;"><span style="color: #66cc66;">&#91;</span></span>
<span style="color: #009900;">&lt;!entity % images system <span style="color: #ff0000;">&quot;07010101.ent&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span> %images;
]&gt;
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;tei2<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;teiheader</span> <span style="color: #000066;">type</span>=<span style="color: #ff0000;">&quot;text&quot;</span> <span style="color: #000066;">date.created</span>=<span style="color: #ff0000;">&quot;1994/03/15&quot;</span> <span style="color: #000066;">date.updated</span>=<span style="color: #ff0000;">&quot;2002/04/05&quot;</span> <span style="color: #000066;">status</span>=<span style="color: #ff0000;">&quot;updated&quot;</span> <span style="color: #000066;">creator</span>=<span style="color: #ff0000;">&quot;National Digital Library Program</span>
<span style="color: #009900;">, Library of Congress&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span>
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;filedesc<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;titlestmt<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;amid</span> <span style="color: #000066;">type</span>=<span style="color: #ff0000;">&quot;aggitemid&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span>wpa0-07010101<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/amid<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;title<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>[Memorandum to Dr. Botkin]: a machine readable transcription.<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/title<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;amcol<span style="color: #000000; font-weight: bold;">&gt;</span></span><span style="color: #000000; font-weight: bold;">&lt;amcolname<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>Life Histories from the Folklore Project, WPA Federal Writers<span style="color: #ddbb00;">&amp;apos;</span> Project, 1936-1940; American Memory, Library of Congress.<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/amcolname<span style="color: #000000; font-weight: bold;">&gt;</span></span><span style="color: #000000; font-weight: bold;">&lt;amcolid</span> <span style="color: #000066;">type</span>=<span style="color: #ff0000;">&quot;aggid&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span><span style="color: #000000; font-weight: bold;">&lt;/amcolid<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/amcol<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;respstmt<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;resp<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>Selected and converted.<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/resp<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;name<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>American Memory, Library of Congress.<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/name<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/respstmt<span style="color: #000000; font-weight: bold;">&gt;</span></span><span style="color: #000000; font-weight: bold;">&lt;/titlestmt<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;publicationstmt<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;p<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>Washington, DC, 1994.<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/p<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;p<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>Preceding element provides place and date of transcription only.<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/p<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;p<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>For more information about this text and this American Memory collection, refer to accompanying matter.<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/p<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/publicationstmt<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;sourcedesc<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;lccn<span style="color: #000000; font-weight: bold;">&gt;</span></span><span style="color: #000000; font-weight: bold;">&lt;/lccn<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;sourcecol<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>U.S. Work Projects Administration, Federal Writers<span style="color: #ddbb00;">&amp;apos;</span> Project (Folklore Project, Life Histories, 1936-39); Manuscript Division, Library of Congress.<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/sourcecol<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;copyright<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>Copyright status not determined; refer to accompanying matter.<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/copyright<span style="color: #000000; font-weight: bold;">&gt;</span></span><span style="color: #000000; font-weight: bold;">&lt;/sourcedesc<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/filedesc<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;encodingdesc<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;projectdesc<span style="color: #000000; font-weight: bold;">&gt;</span></span><span style="color: #000000; font-weight: bold;">&lt;p<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>The National Digital Library Program at the Library of Congress makes digitized historical materials available for education and scholarship.<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/p<span style="color: #000000; font-weight: bold;">&gt;</span></span><span style="color: #000000; font-weight: bold;">&lt;/projectdesc<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;editorialdecl<span style="color: #000000; font-weight: bold;">&gt;</span></span><span style="color: #000000; font-weight: bold;">&lt;p<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>This transcription is intended to have an accuracy of 99.95 percent or greater and is not intended to reproduce the appearance of the original work.  The accompanying images provide a facsimile of this work and represent the appearance of the original.<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/p<span style="color: #000000; font-weight: bold;">&gt;</span></span><span style="color: #000000; font-weight: bold;">&lt;/editorialdecl<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;encodingdate<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>1994/03/15<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/encodingdate<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;revdate<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>2002/04/05<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/revdate<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/encodingdesc<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/teiheader<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;text</span> <span style="color: #000066;">type</span>=<span style="color: #ff0000;">&quot;manuscript&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span>
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;body<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;div<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;pageinfo<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;controlpgno</span> <span style="color: #000066;">entity</span>=<span style="color: #ff0000;">&quot;I07010101&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span>0001<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/controlpgno<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;printpgno<span style="color: #000000; font-weight: bold;">&gt;</span></span><span style="color: #000000; font-weight: bold;">&lt;/printpgno<span style="color: #000000; font-weight: bold;">&gt;</span></span><span style="color: #000000; font-weight: bold;">&lt;/pageinfo<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;p<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>Memorandum to Dr. Botkin from G. B. Roberts, May 26, 1941<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/p<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;p<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>Subject:  Alabama Material<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/p<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;p<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>This material has not yet been accessioned and has only 
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;del</span> <span style="color: #000066;">rend</span>=<span style="color: #ff0000;">&quot;overstrike&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span>beeen<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/del<span style="color: #000000; font-weight: bold;">&gt;</span></span></span> been roughly classified as life histories, folklore, and miscellaneous data and copy save in the case of the 2 ex-slave items and the essay on Jesse Owens, each of which was recommended.<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/p<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;p<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>Total no. of items recommended:  3 (14 pp.) 
<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;handwritten<span style="color: #000000; font-weight: bold;">&gt;</span></span></span>In progress<span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/handwritten<span style="color: #000000; font-weight: bold;">&gt;</span></span><span style="color: #000000; font-weight: bold;">&lt;/p<span style="color: #000000; font-weight: bold;">&gt;</span></span><span style="color: #000000; font-weight: bold;">&lt;/div<span style="color: #000000; font-weight: bold;">&gt;</span></span><span style="color: #000000; font-weight: bold;">&lt;/body<span style="color: #000000; font-weight: bold;">&gt;</span></span><span style="color: #000000; font-weight: bold;">&lt;/text<span style="color: #000000; font-weight: bold;">&gt;</span></span><span style="color: #000000; font-weight: bold;">&lt;/tei2<span style="color: #000000; font-weight: bold;">&gt;</span></span></span></pre></div></div>

<h5>Notes</h5><ol class="footnotes"><li id="footnote_0_430" class="footnote">Catalogers cataloged stuff in the English language, but every metadata record needed to be translated into the six other WDL languages: Spanish, Russian, French, Arabic, Chinese, and Portuguese (that is, the other five official U.N. languages plus Portuguese).</li><li id="footnote_1_430" class="footnote">And you are but one <code>sudo easy_install BeautifulSoup</code> away from that.</li></ol><br/>
<hr/>]]></content:encoded>
			<wfw:commentRss>http://lackoftalent.org/michael/blog/2009/07/13/wdl-metadata-mapping-and-parsing-tei-in-python/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
