Exploring curation micro-services

Posted by Michael Giarlo on September 27, 2009

As far as I'm concerned, the most exciting developments this year in repositories and digital curation have come out of the California Digital Library. It has been impossible not to notice their papers and presentations. Put simply, their idea is that digital curation is enabled by "micro-services" built upon well-known abstractions such as the filesystem. The benefits are obvious: filesystem tools are ubiquitous and cross-platform, and there are strong market forces to ensure the filesystem persists. The idea is radically simple and straightforward, though many questions remain about such a paradigm. I'll return to those later.

If you have not yet taken a look at CDL's curation micro-service specifications, most of which may be printed on as few as one or two sheets of paper, see the Digital Library Building Blocks.

My co-workers in the LC Repository Development Center have been chatting about these specs on and off throughout the year. After months of procrastinating, I finally read all of the specs on Thursday; it's wonderful that you can do so in the course of one reading session, I might add. Yesterday a bunch of us RDCers got together to chat (informally) about the specs: what they're for, how they work, and how they interact with one another. I learn by doing, by examples, so I combed through each of the specs in advance of our meeting and tried to construct a minimal repository[1] based on micro-services.
Continue reading…

Notes
  1. Perhaps it's more in line with the specs to refer to this space as "a managed filesystem that drives repository and curation services," given the CDL philosophy that preservation is not a place/repository. But it's easier to say "repository," so there you go.


I2: Survey results

Posted by Michael Giarlo on September 15, 2009

I wrote in June that the I2 subgroup surveyed "repository managers to determine the current practices and needs of the repository community regarding institutional identifiers. Results from the survey will inform a set of use cases that will be shared with the community, and that are expected to drive the development of a new standard for institutional identifiers."

The survey closed in July, and the subgroup spent August writing a report on the survey results. That report is now final and it's available to the public. Feedback may be sent to our (woefully underutilized) public i2info mailing list, left as a comment on this post, or e-mailed to me privately, in which case I can forward it to our internal list.

The next step is to build upon the report to draw yet more conclusions from the data — there's an awful lot there — and flesh out some repository use cases for institutional identifiers. The I2 core group is moving quickly towards finalizing identifier metadata elements so that a standard may be drafted, and I think having some use cases documented will help drive the standard in a direction the community can get behind.

Onward and upward.

Linking World Digital Library Data

Posted by Michael Giarlo on August 10, 2009

As I mentioned earlier, I've been learning about linked data in the context of dropping it into the World Digital Library project. I am hopeful we'll be able to deploy the RDF views[1] before too long. In advance of that, I thought it might be helpful to share a sample of what our RDF would look like. The RDF below represents the WDL item for the U.S. Constitution. I appreciate constructive criticism.

A few things to note:

  • Mmm, Unicode.
  • Item types are from the Bibliographic Ontology.
  • Most of the properties are from the Dublin Core Metadata Element Set ontology, used especially where the objects are literals rather than resources identified by URI.
  • Where I could dig up URIs for the objects, I used the Dublin Core Metadata Terms ontology.
  • An item is modeled as an aggregation of its constituent files, as defined in OAI-ORE. The notion here is that an ORE aggregation of an item, as expressed in a resource map which is discoverable via a link header in each item detail page, is a "whole" item, including all of its files[2], metadata, and translations.
  • I'm also making light use of the NEPOMUK File Ontology to express that constituent files are files, and to be explicit about file sizes so that folks know how large a file is before retrieving it.
  • Links out to DDC (Decimalised Database of Concepts), Lingvoj, DBpedia, and Library of Congress Authorities & Vocabularies (e.g., LC Subject Headings) are included where possible. [3] I'd be especially stoked to hear of other vocabs I might link to. The more linked the data, the better.
  • The output below is Turtle for readability, but the application will offer up RDF/XML.
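
As a rough illustration of the modelling described in the list above, here is a minimal standard-library sketch that writes out Turtle for an item as an ORE aggregation of its files. It is not the WDL application's code: the function name `item_to_turtle`, the example.org URIs, and the choice of `bibo:Image` as the item type are assumptions made for the example, and the namespace URIs are the ones published by the respective vocabularies as far as I know. Only the terms themselves (`ore:Aggregation`, `ore:aggregates`, `nfo:FileDataObject`, `nfo:fileSize`, `dcterms:format`) come from the ontologies named above.

```python
# Hypothetical sketch: emit Turtle for an item modeled as an ORE aggregation.
PREFIXES = {
    "bibo": "http://purl.org/ontology/bibo/",
    "dcterms": "http://purl.org/dc/terms/",
    "nfo": "http://www.semanticdesktop.org/ontologies/2007/03/22/nfo#",
    "ore": "http://www.openarchives.org/ore/terms/",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
}


def item_to_turtle(item_uri, files):
    """Return Turtle describing item_uri as an aggregation of files.

    files is a non-empty list of (uri, mime_type, size_in_bytes) tuples.
    """
    lines = ["@prefix %s: <%s> ." % (p, ns) for p, ns in sorted(PREFIXES.items())]
    lines.append("")
    # The item itself: a bibliographic thing and an aggregation of its files.
    aggregated = ",\n        ".join("<%s>" % uri for uri, _, _ in files)
    lines.append("<%s> a bibo:Image, ore:Aggregation ;" % item_uri)
    lines.append("    ore:aggregates %s ." % aggregated)
    # Each constituent file: typed as a file, with format and size stated
    # so a client knows what it is about to download.
    for uri, mime_type, size in files:
        lines.append("")
        lines.append("<%s> a nfo:FileDataObject ;" % uri)
        lines.append('    dcterms:format "%s" ;' % mime_type)
        lines.append('    nfo:fileSize "%d"^^xsd:long .' % size)
    return "\n".join(lines)


if __name__ == "__main__":
    print(item_to_turtle(
        "http://example.org/item/1/id",
        [("http://example.org/files/1/thumb.gif", "image/gif", 34531),
         ("http://example.org/files/1/service.tif", "image/tiff", 1301614)],
    ))
```

String formatting keeps the sketch dependency-free; a real application would more likely build a graph with an RDF library and let it handle escaping and serialization.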

The data after the jump:
Continue reading…

Notes
  1. Sadly, the URIs are uglyish due to some constraints from our caching configuration. I figure we can redirect uglyish URIs to cool ones and make use of owl:sameAs if those constraints go away.
  2. sans certain low-quality derivatives such as small thumbnails and tiles for the zoom interface
  3. I was poking through the DBpedia output for Geonames URIs as well, but my method was way too slow and clunky, so that's disabled for the time being. Clients can always follow their noses from the DBpedia output.


Validating ORE from the Command-line

Posted by Michael Giarlo on July 31, 2009

I've been periodically poking at getting Linked Data/RDF views hooked into the World Digital Library web application, following Ed Summers' lead from his work on Chronicling America. The RDF views also use the OAI-ORE vocabulary to express aggregations — in WDL, an item is an aggregation of its constituent files. The goal is to provide a semantically rich and holistic representation of a WDL item (identifier, constituent files, metadata, translations, and so on).

The ORE format is a new one for me, so it's hard to say whether the output of my dev branch is valid ORE or not. Plus I'm a sucker for validators. Turns out Rob Sanderson has developed a Python library for validating ORE, and this little snippet is what I've been using to run it. I didn't put much effort into making it readable; I was more concerned with banging out something functional so I could meet deadlines, so mea culpa and all that. But without further hemming and hawing, the code:

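# validate.py
# Usage: python validate.py {URL}
import sys
from foresite import *

# Fetch and parse the resource map at the URL given on the command line.
rem = RdfLibParser().parse(ReMDocument(sys.argv[1]))
aggr = rem.aggregation

# Re-serialize the aggregation as N3 and print the result.
n3 = RdfLibSerializer('n3')
rem2 = aggr.register_serialization(n3)
print(rem2.get_serialization(n3).data)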

Most of this code is naively copied and pasted from Rob's excellent Foresite documentation.

I invoke it thusly: python validate.py {URL}

And the output:

@prefix _27: <http://www.semanticdesktop.org/ontologies/nfo#>.
@prefix _28: <http://localhost/en/item/1/id#>.
@prefix _29: <http://localhost/en/item/1/>.
@prefix bibo: <http://purl.org/ontology/bibo/>.
@prefix dc: <http://purl.org/dc/elements/1.1/>.
@prefix dcterms: <http://purl.org/dc/terms/>.
@prefix ore: <http://www.openarchives.org/ore/terms/>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix rdfs1: <http://www.w3.org/2001/01/rdf-schema#>.
 
 _28:ResourceMap a ore:ResourceMap;
     dc:format "text/rdf+n3";
     dcterms:created "2009-07-31T14:23:31Z";
     dcterms:modified "2009-07-31T14:23:31Z";
     ore:describes _29:id. 
 
 _29:id a bibo:Image,
         ore:Aggregation;
     dcterms:DDC "973";
     dcterms:alternative "Antietam, Maryland. Allan Pinkerton, President Lincoln, and Major General John A. McClernand"@en;
     dcterms:created "1862年10月3日"@zh,
         "3 de octubre de 1862"@es,
         "3 de outubro de 1862"@pt,
         "3 octobre 1862"@fr,
         "3 октября 1862 года"@ru,
         "October 3, 1862"@en,
         " ٣ آكتوبر، ١٨٦٢"@ar;
     dcterms:creator "Gardner, Alexander"@en,
         "Gardner, Alexander"@es,
         "Gardner, Alexander"@fr,
         "Gardner, Alexander"@pt,
         "Гарднер, Александр"@ru,
         "جاردنر, أليكسندر"@ar,
         "加德纳, 亚历山大"@zh;
... (and so on and so forth)
     dcterms:title "Antietam, Maryland. Allan Pinkerton, President Lincoln, and Major General John A. McClernand: Another View"@en,
         "Antietam, Maryland. Allan Pinkerton, el Presidente Lincoln y el General Principal John A. McClernand: Otra visión"@es,
         "Antietam, Maryland. Allan Pinkerton, le président Lincoln et le général-major John A. McClernand: Autre vue"@fr,
         "Antietam, Maryland. Allan Pinkerton,  Presidente Lincoln e Major-General John A. McClernand: Outra Vista"@pt,
         "Антитэм, штат Мэриленд. Аллан Пинкертон, президент Линкольн и генерал-майор Джон А. Макклернанд: Другой снимок"@ru,
         "أنتينام، ميريلاند ألان بينكرتون، الرئيس لينكولن، واللواء جون أ. ماكليرناند: منظر آخر"@ar,
         "安蒂特姆,马里兰州 艾伦·平克顿、林肯总统和少将约翰·A ·马克克拉南: 另一个视角"@zh;
     ore:aggregates <http://localhost/static/c/1/reference/04326u_thumb_item.gif>,
         <http://localhost/static/c/1/service/04326u.tif>;
     ore:isDescribedBy <http://localhost/en/item/1/item.rdf>;
     rdfs:seeAlso <http://hdl.loc.gov/loc.wdl/dlc.1>. 
 
 <http://localhost/static/c/1/reference/04326u_thumb_item.gif> a _27:FileDataObject;
     dcterms:format "image/gif";
     _27:fileSize "34531"^^<http://www.w3.org/2001/XMLSchema#long>. 
 
 <http://localhost/static/c/1/service/04326u.tif> a _27:FileDataObject;
     dcterms:format "image/tiff";
     _27:fileSize "1301614"^^<http://www.w3.org/2001/XMLSchema#long>. 
 
 ore:Aggregation rdfs1:isDefinedBy <http://www.openarchives.org/ore/terms/>;
     rdfs1:label "Aggregation". 
 
 ore:ResourceMap rdfs1:isDefinedBy <http://www.openarchives.org/ore/terms/>;
     rdfs1:label "ResourceMap".

You might pick up on some warts I have yet to fix, but there you go.
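
One such wart can be spotted mechanically: the output declares two prefixes for RDF Schema, and the `rdfs1` one points at a `2001/01` namespace rather than the `2000/01` namespace that RDFS actually uses. As a rough sketch, and not part of the validation workflow described above, a few lines of standard-library Python can flag prefix declarations that look like near-misses of well-known namespaces. The helper name `suspicious_prefixes` and the small `EXPECTED` table are hypothetical, and the regular expression only handles simple one-line `@prefix` declarations like those in the output.

```python
import re

# Namespaces this sketch treats as canonical, keyed by the last path segment
# (an assumption made for the example; extend as needed).
EXPECTED = {
    "rdf-schema#": "http://www.w3.org/2000/01/rdf-schema#",
    "22-rdf-syntax-ns#": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}

# Matches lines such as: @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
PREFIX_RE = re.compile(r"^\s*@prefix\s+(\S*):\s*<([^>]*)>\s*\.\s*$")


def suspicious_prefixes(turtle_text):
    """Return (prefix, uri, expected_uri) tuples for near-miss namespaces."""
    findings = []
    for line in turtle_text.splitlines():
        match = PREFIX_RE.match(line)
        if not match:
            continue
        prefix, uri = match.groups()
        # Compare on the last path segment, e.g. "rdf-schema#".
        tail = uri.rsplit("/", 1)[-1]
        expected = EXPECTED.get(tail)
        if expected is not None and uri != expected:
            findings.append((prefix, uri, expected))
    return findings


sample = """\
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix rdfs1: <http://www.w3.org/2001/01/rdf-schema#>.
"""
for prefix, uri, expected in suspicious_prefixes(sample):
    print("%s: <%s> (expected <%s>)" % (prefix, uri, expected))
```

Run against the two RDFS declarations from the output, this reports only `rdfs1`.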

A Digital Object Defined

Posted by Michael Giarlo on July 15, 2009

What happens to a digital object defined?[1]

Does its identifier dry up
like a raisin in the sun?
Or its relationships fester like a sore–
And then run?
Do its bits rot like meat?
Or become overwritten–
like some throw-away sheet?

Maybe its metadata just sags
like a heavy load.

Or does it fade into code?

Notes
  1. Inspired by Langston Hughes's "A Dream Deferred" and a spirited conversation in the office today.