Linking World Digital Library Data

Posted by Michael Giarlo on August 10, 2009

As I mentioned earlier, I've been learning about linked data in the context of dropping it into the World Digital Library project. I am hopeful we'll be able to deploy the RDF views[1] before too long. In advance of that, I thought it might be helpful to share a sample of what our RDF would look like. The RDF below represents the WDL item for the U.S. Constitution. I appreciate constructive criticism.

A few things to note:

  • Mmm, Unicode.
  • Item types are from the Bibliographic Ontology.
  • Most of the properties come from the Dublin Core Metadata Element Set, used mainly where the objects are literals rather than resources identified by URI.
  • Where I could dig up URIs for the objects, I used the Dublin Core Metadata Terms ontology instead.
  • An item is modeled as an aggregation of its constituent files, as defined in OAI-ORE. The notion here is that an ORE aggregation of an item, as expressed in a resource map which is discoverable via a link header in each item detail page, is a "whole" item, including all of its files[2], metadata, and translations.
  • I'm also making light use of the NEPOMUK File Ontology to express that constituent files are files, and to be explicit about file sizes so that folks know how large the files are before retrieving them.
  • Links out to DDC (Dewey Decimal Classification), Lingvoj, DBpedia, and Library of Congress Authorities & Vocabularies (e.g., LC Subject Headings) are included where possible.[3] I'd be especially stoked to hear of other vocabs I might link to. The more linked the data, the better.
  • The output below is Turtle for readability, but the application will offer up RDF/XML.
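
As a rough illustration of the modeling choices listed above (not the actual WDL output or code), the sketch below builds a small Turtle description of an item as an ORE aggregation of its files, with NEPOMUK file sizes. Only the vocabulary namespaces are real; the item, file, and language URIs, the choice of bibo:Document as the item type, and the function name are all assumptions made for the example.

```python
# Vocabulary namespaces named in the list above.
PREFIXES = {
    "bibo": "http://purl.org/ontology/bibo/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
    "nfo": "http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#",
    "ore": "http://www.openarchives.org/ore/terms/",
}


def turtle_literal(text):
    """Quote a plain string as a Turtle literal, escaping backslashes and quotes."""
    return '"%s"' % text.replace("\\", "\\\\").replace('"', '\\"')


def item_to_turtle(item_uri, title, language_uri, files):
    """Describe one item as an ORE aggregation of (file_uri, size_in_bytes) pairs."""
    if not files:
        raise ValueError("an aggregation needs at least one file")
    lines = ["@prefix %s: <%s> ." % (p, ns) for p, ns in sorted(PREFIXES.items())]
    lines.append("")
    # bibo:Document is an assumed item type, chosen only for illustration.
    lines.append("<%s> a ore:Aggregation, bibo:Document ;" % item_uri)
    # Literal object: Dublin Core element set.
    lines.append("    dc:title %s ;" % turtle_literal(title))
    # Resource object: Dublin Core terms.
    lines.append("    dcterms:language <%s> ;" % language_uri)
    aggregated = " ,\n        ".join("<%s>" % uri for uri, _ in files)
    lines.append("    ore:aggregates %s ." % aggregated)
    for uri, size in files:
        lines.append("")
        lines.append("<%s> a nfo:FileDataObject ;" % uri)
        lines.append("    nfo:fileSize %d ." % size)
    return "\n".join(lines)


if __name__ == "__main__":
    # All URIs here are hypothetical stand-ins.
    print(item_to_turtle(
        "http://example.org/item/1",
        "An example item",
        "http://example.org/lang/en",  # stand-in for a Lingvoj language URI
        [("http://example.org/item/1/page1.tif", 2048),
         ("http://example.org/item/1/item.pdf", 1024)],
    ))
```

Stating the file size on each aggregated file is what lets a client decide whether to fetch it before making the request.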

The data after the jump:
Continue reading…

Notes
  1. Sadly, the URIs are uglyish due to some constraints from our caching configuration. I figure we can redirect uglyish URIs to cool ones and make use of owl:sameAs if those constraints go away.
  2. sans certain low-quality derivatives such as small thumbnails and tiles for the zoom interface
  3. I was poking through the DBpedia output for Geonames URIs as well, but my method was way too slow and clunky, so that's disabled for the time being. Clients can always follow their noses from the DBpedia output.


Is MARC a data model?

Posted by Michael Giarlo on August 10, 2009

I posted a status update to Twitter, identi.ca, and Facebook late last night hoping to suss out two questions:

  1. Is MARC a data model?
  2. But really: what qualifies something as a data model?

I'd poked around looking for clues to the latter and was left cold by the long Wikipedia entry. Maybe I've been doing the micro-blog thing for too long and my ability to parse information that comes in greater-than-140-character chunks has been damaged. Plus I like learning from examples, and what better example for the library geek than MARC?

The feedback I received was pretty impressive, and not all of it consistent with the rest. I found it an interesting example of crowdsourcing, so to speak. As each response came in, I would read it, check it for accuracy against, e.g., Wikipedia articles, and revise my own answers to the above questions. I'm homing in on an answer to the former question. The latter question is still a bit murky.

I thought I'd share the responses, too. Responses from Twitter are included in full w/ links to the original. Responses from quasi-public Facebook have been anonymized. You can see my replies interspersed as well and watch the evolution of the (admittedly short) discussion. After the jump:
Continue reading…

WDL metadata mapping, and, parsing TEI in Python

Posted by Michael Giarlo on July 13, 2009

Context

Early on in the effort to develop the first public version of the World Digital Library web application, we developed a (non-public) Django-based cataloging application where Library of Congress catalogers could manage metadata for WDL items. Management in this sense includes creation of records, editing of records, versioning of edits, mapping of source records, and some light workflow for assignment of records to individual catalogers and for hooking into translation processes[1].

I worked primarily on the source record mapping tools. They take a number of formats as input and are called by the cataloging application to map metadata from these formats into the WDL domain model. Several of these formats, though not all, are XML-based and thus easily dealt with in Python via the etree module in the lxml package.

Dan recently kicked off a new R&D project for evaluating (any) metadata against any number of metadata profiles, mapping into a generic data dictionary, the goal being to determine how feasible it would be to develop a toolset for aiding remediation of metadata across any number of digital collections. I have been working on this project with Dan, and got started by seeing how generalizable the WDL metadata mapping tools are. Turns out they're fairly generalizable once you tweak the various format-specific mapping rules to map into the generic data dictionary model rather than the WDL model (around 15 elements, and somewhere between Dublin Core and MODS in terms of specificity but flatly structured like DC).
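
To make the idea of format-specific mapping rules concrete, here is a minimal sketch, not the WDL tools themselves. It assumes each rule is just a path expression keyed by a target element in a flat data dictionary, and it uses the standard library's xml.etree.ElementTree as a stand-in for lxml.etree. The rule table, the function name, and the sample record are all made up for illustration.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree

DC_NS = "{http://purl.org/dc/elements/1.1/}"

# Hypothetical rules for one source format: target element -> path in the source.
DC_RULES = {
    "title": ".//%stitle" % DC_NS,
    "creator": ".//%screator" % DC_NS,
    "date": ".//%sdate" % DC_NS,
}


def map_record(xml_text, rules):
    """Apply one format's rules to a record, returning a flat dict of value lists."""
    root = ET.fromstring(xml_text)
    record = {}
    for target, path in rules.items():
        values = [el.text.strip() for el in root.findall(path)
                  if el.text and el.text.strip()]
        if values:  # leave out target elements with no source values
            record[target] = values
    return record


# Made-up sample record, not WDL data.
SAMPLE = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An example title</dc:title>
  <dc:creator>First Creator</dc:creator>
  <dc:creator>Second Creator</dc:creator>
  <dc:date> </dc:date>
</record>"""

if __name__ == "__main__":
    print(map_record(SAMPLE, DC_RULES))
```

Under this design, retargeting from one model to another means swapping the rule table rather than touching the mapping function, which is roughly why tools like these generalize.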

Some of the test data I am working with now, which has nothing to do with WDL, is SGML-based TEI 2 markup. The closest thing I worked with on WDL was TEI P5 for manuscript description, which is serialized in XML. Turns out my TEI mapping rules from before blew up on this TEI 2 stuff, as lxml.etree (naturally) wasn't digging the non-XML input. I googled around a bit for how best to parse TEI (or any SGML) in Python and then discovered it's actually easy as pie.

Code

If you've got the BeautifulSoup module installed[2]:

>>> from BeautifulSoup import BeautifulSoup
>>> tei = open('foo.sgm').read()
>>> BeautifulSoup(tei).findAll('title')[0].string
u'[Memorandum to Dr. Botkin]: a machine readable transcription.'

If not, the lxml.html module works too:

>>> from lxml import html
>>> h = html.parse(open('foo.sgm'))
>>> h.xpath('//title')[0].text
'[Memorandum to Dr. Botkin]: a machine readable transcription.'
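
If neither package is available, a similar trick can be sketched with nothing but the standard library. The example below assumes Python 3 and leans on html.parser, which is a forgiving HTML tokenizer rather than a real SGML parser, so it will not resolve entity declarations or SGML tag minimization. The class and function names are made up for illustration, and the sample string is a trimmed-down stand-in for the TEI header.

```python
from html.parser import HTMLParser  # Python 3 standard library


class FirstElementText(HTMLParser):
    """Collect the text of the first element with the given (lowercase) name."""

    def __init__(self, name):
        super().__init__(convert_charrefs=True)
        self.name = name
        self.chunks = []
        self.inside = False
        self.done = False

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; ignore matches after the first one.
        if tag == self.name and not self.done:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == self.name and self.inside:
            self.inside = False
            self.done = True

    def handle_data(self, data):
        if self.inside:
            self.chunks.append(data)


def first_text(markup, name):
    """Return the text of the first <name> element found in the markup."""
    parser = FirstElementText(name)
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


# Trimmed-down stand-in for the TEI 2 header.
sample = ("<tei2><teiheader><filedesc><titlestmt>"
          "<title>[Memorandum to Dr. Botkin]: a machine readable transcription.</title>"
          "</titlestmt></filedesc></teiheader></tei2>")

if __name__ == "__main__":
    print(first_text(sample, "title"))
```

This trades convenience for zero dependencies: there is no XPath, so every lookup has to be written as parser callbacks.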

Data

And here's what the sample data looks like:

<!doctype tei2 public "-//Library of Congress - Historical Collections (American Memory)//DTD ammem.dtd//EN" 
[
<!entity % images system "07010101.ent"> %images;
]>
<tei2>
<teiheader type="text" date.created="1994/03/15" date.updated="2002/04/05" status="updated" creator="National Digital Library Program
, Library of Congress">
<filedesc>
<titlestmt>
<amid type="aggitemid">wpa0-07010101</amid>
<title>[Memorandum to Dr. Botkin]: a machine readable transcription.</title>
<amcol><amcolname>Life Histories from the Folklore Project, WPA Federal Writers&apos; Project, 1936-1940; American Memory, Library of Congress.</amcolname><amcolid type="aggid"></amcolid>
</amcol>
<respstmt>
<resp>Selected and converted.</resp>
<name>American Memory, Library of Congress.</name>
</respstmt></titlestmt>
<publicationstmt>
<p>Washington, DC, 1994.</p>
<p>Preceding element provides place and date of transcription only.</p>
<p>For more information about this text and this American Memory collection, refer to accompanying matter.</p>
</publicationstmt>
<sourcedesc>
<lccn></lccn>
<sourcecol>U.S. Work Projects Administration, Federal Writers&apos; Project (Folklore Project, Life Histories, 1936-39); Manuscript Division, Library of Congress.</sourcecol>
<copyright>Copyright status not determined; refer to accompanying matter.</copyright></sourcedesc>
</filedesc>
<encodingdesc>
<projectdesc><p>The National Digital Library Program at the Library of Congress makes digitized historical materials available for education and scholarship.</p></projectdesc>
<editorialdecl><p>This transcription is intended to have an accuracy of 99.95 percent or greater and is not intended to reproduce the appearance of the original work.  The accompanying images provide a facsimile of this work and represent the appearance of the original.</p></editorialdecl>
<encodingdate>1994/03/15</encodingdate>
<revdate>2002/04/05</revdate>
</encodingdesc>
</teiheader>
<text type="manuscript">
<body>
<div>
<pageinfo>
<controlpgno entity="I07010101">0001</controlpgno>
<printpgno></printpgno></pageinfo>
<p>Memorandum to Dr. Botkin from G. B. Roberts, May 26, 1941</p>
<p>Subject:  Alabama Material</p>
<p>This material has not yet been accessioned and has only 
<del rend="overstrike">beeen</del> been roughly classified as life histories, folklore, and miscellaneous data and copy save in the case of the 2 ex-slave items and the essay on Jesse Owens, each of which was recommended.</p>
<p>Total no. of items recommended:  3 (14 pp.) 
<handwritten>In progress</handwritten></p></div></body></text></tei2>

Notes
  1. Catalogers cataloged stuff in the English language, but every metadata record needed to be translated into the other six WDL languages: Spanish, Russian, French, Arabic, Chinese, and Portuguese.
  2. And you are but one sudo easy_install BeautifulSoup away from that.


Cataloging and institutional repositories

Posted by Michael Giarlo on February 09, 2009

While doing some reading for a little talk my colleague, Ed Summers, and I are giving at code4lib 2009, I came across a paragraph that sparked a crazy thought. So crazy that it's not crazy at all. So not crazy that I am sure other people have thought of it. But nonetheless, here I am writing about it just in case.

From Sarah Currier's paper on SWORD (emphasis mine):

One of the most frequently cited barriers to academics depositing their teaching materials into repositories is the keystroke-count involved in logging into a repository, uploading the resource, creating metadata, perhaps selecting a licence, and publishing the resource. It was a quick win, therefore, to create a drag-and-drop desktop tool to allow a single keystroke deposit of resources, including multiple resources in one action. For a repository that supports automatic metadata generation, administrative metadata can be created at the point of entry to the repository without the user needing to create any.

And I wondered how many repositories supported automatic metadata generation. I wondered how many repositories supported automatic generation of rich metadata. And lastly I wondered, might this be a more or less natural role for catalogers: augmenting stub metadata records or doing original cataloging for institutional repository deposits? Especially at a time when many of them are being reclassified as acquisitions specialists or digital projects managers?

Potential issues and questions:

  • Author ignorance: Maybe catalogers are already doing this and I'm a moron?
  • Scale: Is it realistic to expect to be able to "keep up" with repository deposits?
  • Granularity: Does cataloging at the level of articles, and perhaps at even finer granularities, introduce challenges?
  • Duplication: If pre-prints are cataloged in the IR, for instance, will they need to be cataloged again later?
  • … there are others I thought of on my commute this morning but have since forgotten. Feel free to add comments.

I will admit here that I've been somewhat out of the (academic) institutional repository space a while, and cataloging is something I don't share thoughts about very often because my exposure is limited to having taken one course a couple years ago.

I assume there's a body of research about this out there somewhere but I figured I'd post this anyway.