I2: Resource Description
I can hardly believe it's been eight months since I last wrote about the NISO I2 project. A lot has changed since then[1]. I continue to work on I2, however; they won't get rid of me that easily.
In the last post, I wrote:
The next step is to build upon the report to draw yet more conclusions from the data — there's an awful lot there — and flesh out some repository use cases for institutional identifiers. The I2 core group is moving quickly towards finalizing identifier metadata elements so that a standard may be drafted, and I think having some use cases documented will help drive the standard in a direction the community can get behind.
Since that time, the three scenario groups — Electronic Resources; Institutional Repositories and Learning Management Systems; and Library Resource Management — have concluded their work. The work of the scenario groups included surveys of over 300 people working in these fields. The survey results have been analyzed and reports were posted on the NISO website. These reports have been used to flesh out use cases for an institutional identifier. Upon completion of this work, the scenario groups were disbanded and work continued in a broader I2 working group.
The I2 working group has concentrated its work on analysis of similar standards and, as I alluded to earlier, significant effort has gone into defining core metadata to identify institutions, such as institution name, institution type, location information, variant identifiers, domain name(s), URL(s), and (optionally-typed) relationships to other institutions. During these discussions it was difficult for me to hear the issues and needs around I2's metadata and identifiers without linked data springing to mind.
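As a rough illustration of the element list above, the core metadata might be sketched as a simple record type. This is only a sketch: the class names, field names, and sample values are assumptions made up for the example and are not taken from any I2 draft.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Relationship:
    """A link to another institution; the relationship type is optional."""
    target_id: str
    rel_type: Optional[str] = None


@dataclass
class InstitutionRecord:
    """Hypothetical container for the core elements listed in the text."""
    identifier: str
    name: str
    institution_type: str
    location: str
    variant_identifiers: List[str] = field(default_factory=list)
    domain_names: List[str] = field(default_factory=list)
    urls: List[str] = field(default_factory=list)
    relationships: List[Relationship] = field(default_factory=list)


# Placeholder values, invented purely for illustration.
record = InstitutionRecord(
    identifier="http://example.org/institution/1",
    name="Example University Library",
    institution_type="library",
    location="Springfield",
    domain_names=["library.example.org"],
    urls=["http://library.example.org/"],
    relationships=[
        Relationship("http://example.org/institution/2", "isPartOf"),
        Relationship("http://example.org/institution/3"),  # untyped link
    ],
)
```

Keeping the relationship type optional mirrors the "(optionally-typed)" wording above; everything else about the shape is a guess.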
While we are designing a standard and not a system or a service per se, it seems useful to include in the standard an informative section about implementation and architecture[2]; I find that reading standards is much easier on the brain when you get not only the standard itself but some examples of implementation, and that will be true as well, one hopes, of I2 standard implementers. To that end, the group will be producing an XML schema of the I2 metadata elements and also an RDF schema.
I have been working on the RDF for I2 on and off for the past month or two. Below are my impressions, as someone who is new to modeling in RDF, and the procedures I used to produce the draft RDF schema.
Continue reading…
Notes
- I've moved and changed jobs, in fact.
- This practice seems more or less common in my (admittedly limited) experience, cf. the unAPI specification.
Linking World Digital Library Data
As I mentioned earlier, I've been learning about linked data in the context of dropping it into the World Digital Library project. I am hopeful we'll be able to deploy the RDF views[1] before too long. In advance of that, I thought it might be helpful to share a sample of what our RDF would look like. The RDF below represents the WDL item for the U.S. Constitution. I appreciate constructive criticism.
A few things to note:
- Mmm, Unicode.
- Item types are from the Bibliographic Ontology.
- Most of the properties are from the Dublin Core Metadata Element Set ontology, used especially where the objects are literals rather than resources identified by URI.
- Where possible, I dug up URIs and used the Dublin Core Metadata Terms ontology instead.
- An item is modeled as an aggregation of its constituent files, as defined in OAI-ORE. The notion here is that an ORE aggregation of an item, as expressed in a resource map which is discoverable via a link header in each item detail page, is a "whole" item, including all of its files[2], metadata, and translations.
- I'm also making light use of the NEPOMUK File Ontology to express that constituent files are files, and to be explicit about file sizes so that folks know how large the files are before retrieving them.
- Links out to DDC (Decimalised Database of Concepts), Lingvoj, DBpedia, and Library of Congress Authorities & Vocabularies (e.g., LC Subject Headings) are included where possible. [3] I'd be especially stoked to hear of other vocabs I might link to. The more linked the data, the better.
- The output below is Turtle for readability, but the application will offer up RDF/XML.
The data after the jump:
Continue reading…
Notes
- Sadly, the URIs are uglyish due to some constraints from our caching configuration. I figure we can redirect uglyish URIs to cool ones and make use of owl:sameAs if those constraints go away.
- sans certain low-quality derivatives such as small thumbnails and tiles for the zoom interface
- I was poking through the DBpedia output for Geonames URIs as well, but my method was way too slow and clunky, so that's disabled for the time being. Clients can always follow their noses from the DBpedia output.
Validating ORE from the Command-line
I've been periodically poking at getting Linked Data/RDF views hooked into the World Digital Library web application, following Ed Summers' lead from his work on Chronicling America. The RDF views also use the OAI-ORE vocabulary to express aggregations — in WDL, an item is an aggregation of its constituent files. The goal is to provide a semantically rich and holistic representation of a WDL item (identifier, constituent files, metadata, translations, and so on).
The ORE format is a new one for me so it's hard to say whether the output of my dev branch is valid ORE or not. Plus I'm a sucker for validators. Turns out Rob Sanderson has developed a Python library for validating ORE, and this little snippet is what I've been using to validate the ORE. I didn't put much effort into making it readable, so much as banging something functional out so I can meet deadlines, so mea culpa and all that. But without further hemming and hawing, the code:
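# validate.py
import sys
from foresite import *

# Parse the resource map found at the URL given on the command line.
rem = RdfLibParser().parse(ReMDocument(sys.argv[1]))
aggr = rem.aggregation

# Re-serialize the aggregation as N3 and print the result.
n3 = RdfLibSerializer('n3')
rem2 = aggr.register_serialization(n3)
print rem2.get_serialization(n3).data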
Most of this code is naively copied and pasted from Rob's excellent Foresite documentation.
I invoke it thusly: python validate.py {URL}
And the output:
@prefix _27: <http://www.semanticdesktop.org/ontologies/nfo#>.
@prefix _28: <http://localhost/en/item/1/id#>.
@prefix _29: <http://localhost/en/item/1/>.
@prefix bibo: <http://purl.org/ontology/bibo/>.
@prefix dc: <http://purl.org/dc/elements/1.1/>.
@prefix dcterms: <http://purl.org/dc/terms/>.
@prefix ore: <http://www.openarchives.org/ore/terms/>.
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.
@prefix rdfs1: <http://www.w3.org/2001/01/rdf-schema#>.
_28:ResourceMap a ore:ResourceMap;
dc:format "text/rdf+n3";
dcterms:created "2009-07-31T14:23:31Z";
dcterms:modified "2009-07-31T14:23:31Z";
ore:describes _29:id.
_29:id a bibo:Image,
ore:Aggregation;
dcterms:DDC "973";
dcterms:alternative "Antietam, Maryland. Allan Pinkerton, President Lincoln, and Major General John A. McClernand"@en;
dcterms:created "1862年10月3日"@zh,
"3 de octubre de 1862"@es,
"3 de outubro de 1862"@pt,
"3 octobre 1862"@fr,
"3 октября 1862 года"@ru,
"October 3, 1862"@en,
" ٣ آكتوبر، ١٨٦٢"@ar;
dcterms:creator "Gardner, Alexander"@en,
"Gardner, Alexander"@es,
"Gardner, Alexander"@fr,
"Gardner, Alexander"@pt,
"Гарднер, Александр"@ru,
"جاردنر, أليكسندر"@ar,
"加德纳, 亚历山大"@zh;
... (and so on and so forth)
dcterms:title "Antietam, Maryland. Allan Pinkerton, President Lincoln, and Major General John A. McClernand: Another View"@en,
"Antietam, Maryland. Allan Pinkerton, el Presidente Lincoln y el General Principal John A. McClernand: Otra visión"@es,
"Antietam, Maryland. Allan Pinkerton, le président Lincoln et le général-major John A. McClernand: Autre vue"@fr,
"Antietam, Maryland. Allan Pinkerton, Presidente Lincoln e Major-General John A. McClernand: Outra Vista"@pt,
"Антитэм, штат Мэриленд. Аллан Пинкертон, президент Линкольн и генерал-майор Джон А. Макклернанд: Другой снимок"@ru,
"أنتينام، ميريلاند ألان بينكرتون، الرئيس لينكولن، واللواء جون أ. ماكليرناند: منظر آخر"@ar,
"安蒂特姆,马里兰州 艾伦·平克顿、林肯总统和少将约翰·A ·马克克拉南: 另一个视角"@zh;
ore:aggregates <http://localhost/static/c/1/reference/04326u_thumb_item.gif>,
<http://localhost/static/c/1/service/04326u.tif>;
ore:isDescribedBy <http://localhost/en/item/1/item.rdf>;
rdfs:seeAlso <http://hdl.loc.gov/loc.wdl/dlc.1>.
<http://localhost/static/c/1/reference/04326u_thumb_item.gif> a _27:FileDataObject;
dcterms:format "image/gif";
_27:fileSize "34531"^^<http://www.w3.org/2001/XMLSchema#long>.
<http://localhost/static/c/1/service/04326u.tif> a _27:FileDataObject;
dcterms:format "image/tiff";
_27:fileSize "1301614"^^<http://www.w3.org/2001/XMLSchema#long>.
ore:Aggregation rdfs1:isDefinedBy <http://www.openarchives.org/ore/terms/>;
rdfs1:label "Aggregation".
ore:ResourceMap rdfs1:isDefinedBy <http://www.openarchives.org/ore/terms/>;
rdfs1:label "ResourceMap".
You might pick up on some warts I have yet to fix, but there you go.
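For a quick sanity check that doesn't depend on Foresite, here is a minimal sketch in plain Python. It checks only two structural expectations, as I understand them: a resource map describes exactly one aggregation, and that aggregation aggregates at least one resource. Both rules are assumptions to verify against the ORE specification, and the function name and the tuple representation of triples are invented for this example. The sample triples reuse the localhost URIs from the output above.

```python
ORE = "http://www.openarchives.org/ore/terms/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def check_resource_map(triples):
    """Return a list of problems; an empty list means the checks passed.

    `triples` is a list of (subject, predicate, object) string tuples.
    The rules checked here are assumptions, not the full ORE data model.
    """
    problems = []
    rems = {s for s, p, o in triples
            if p == RDF_TYPE and o == ORE + "ResourceMap"}
    aggs = {s for s, p, o in triples
            if p == RDF_TYPE and o == ORE + "Aggregation"}
    if not rems:
        problems.append("no ore:ResourceMap found")
    for rem in sorted(rems):
        described = {o for s, p, o in triples
                     if s == rem and p == ORE + "describes"}
        if len(described) != 1:
            problems.append("%s describes %d resources, expected 1"
                            % (rem, len(described)))
        elif not described <= aggs:
            problems.append("%s describes a resource that is not typed "
                            "ore:Aggregation" % rem)
    for agg in sorted(aggs):
        if not any(s == agg and p == ORE + "aggregates"
                   for s, p, o in triples):
            problems.append("%s aggregates nothing" % agg)
    return problems


# Sample data: a hand-copied subset of the N3 output shown above.
REM = "http://localhost/en/item/1/id#ResourceMap"
AGG = "http://localhost/en/item/1/id"
sample = [
    (REM, RDF_TYPE, ORE + "ResourceMap"),
    (REM, ORE + "describes", AGG),
    (AGG, RDF_TYPE, ORE + "Aggregation"),
    (AGG, ORE + "aggregates",
     "http://localhost/static/c/1/reference/04326u_thumb_item.gif"),
    (AGG, ORE + "aggregates",
     "http://localhost/static/c/1/service/04326u.tif"),
]

problems = check_resource_map(sample)
```

This is no substitute for a real validator; it just catches the most obvious structural slips before handing the document to Foresite.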
I2: Strawman
In the prior I2 post, I wrote about the requirements the repositories subgroup has come up with for an institutional identifier standard (with the hope that our findings re: repositories could be generalized to other scenarios).

My strawman proposal of sorts is to explore how well linked data patterns fit this problem space. Linked data, briefly, is a way to expose and link data on the web in a more semantically meaningful way, and is often summarized using the four principles put forward by Tim Berners-Lee:
- Use URIs as names for things.
- Use HTTP URIs so that people can look up those names.
- When someone looks up a URI, provide useful information.
- Include links to other URIs, so that they can discover more things.
That's the crux of it. Linked data takes well-known patterns on the web (linking, dereferencing, etc.) and applies them to data, which in this case could be metadata for identifying institutions.
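To make "looking up a URI" a little more concrete: one common way for a client to ask for data rather than a web page is HTTP content negotiation, where the request carries an Accept header naming RDF media types. The sketch below only builds such a request using Python's standard library; the function name and the example URI are hypothetical, and nothing is actually fetched.

```python
from urllib.request import Request


def build_rdf_request(uri):
    """Build a GET request that asks for an RDF representation of `uri`."""
    # Prefer RDF/XML, and accept Turtle as a second choice.
    accept = "application/rdf+xml, text/turtle;q=0.9"
    return Request(uri, headers={"Accept": accept})


# Hypothetical identifier, used only to show the shape of the request.
req = build_rdf_request("http://example.org/institution/1")
```

Passing the request to `urllib.request.urlopen` would perform the actual dereference, assuming a server exists at that URI and honors the Accept header.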
Let's examine each of the requirements and the applicability of linked data thereto.
- Should be agnostic to type of institution, e.g., libraries, museums, personal collections, historical societies: The web is already agnostic to type of institution. HTTP URIs do not favor one type of institution over another.
- Should handle varying institutional granularity, e.g., institution-level, campus-level, division-level, unit-level: HTTP URIs are flexible in this regard. Hierarchy, should one wish it to be surfaced in the identifier, may be encoded in either a DNS hostname or the path appended to the DNS name. One can imagine a URI like "http://department.division.institution.tld/unit/subunit" or "http://institution.tld/campus/office/individual".
  Hierarchy needn't be surfaced in the identifier if one favors opacity, in which case "http://registry.tld/xnjsdasd" would suffice as an identifier; the hierarchy may instead be reflected entirely in the (RDF) representation returned by dereferencing the URI.
- Should handle linking among institutions and subordinate units: Linked data handles linking via well-known HTTP mechanisms, referenced in the fourth principle of linked data. Unlike the HTTP link, which has limited semantics, linked data links are semantically rich and extensible.
- Should express different sorts of relationships among these institutions and units: The "useful information" in the third principle of linked data is typically provided by an RDF representation, which is itself a list of assertions. These assertions, or triples, consist of subjects, predicates, and objects. The ability to express the relationships in this requirement is limited only by the availability of vocabularies that contain sets of predicates and classes for subjects and objects. Think of the predicates as elements defined within a metadata standard, e.g., Dublin Core "creator", MODS "relatedItem", and so forth. Vocabularies that contain these predicates and classes are growing and evolving daily, and should there not be a vocabulary that contains the relationship one wishes to express, it is fairly easy to create a custom vocabulary.
  The ability to mix and match vocabularies provides an expressiveness that is often not found in document-based metadata formats, and the flexibility to express radically different relationships on a per-industry or per-institution basis. This latter point is important as the I2 group has identified both core metadata elements for identifying institutions of different types and additional elements for specific types of institutions. Why invent a new metadata format or schema when all one needs to express may already be contained in others?
- Should relate to existing relevant identifiers and registries: Same as requirement #4. Linked data is all about expressing relationships between things, e.g., institutions, identifiers, registries, etc.
- Should be globally unique: HTTP URIs are guaranteed to be globally unique by virtue of the distributed DNS system and hierarchical naming within each HTTP service.
- Should be actionable: HTTP URIs provide dereferenceability/actionability via the well-known HTTP protocol.
- Should enable retrieval of metadata sufficient to identify the institution, which may vary widely by institution: HTTP URIs are actionable per requirement #7 and the metadata returned is flexible per requirement #4.
- Should accommodate changes as institutions come and go and re-organize and be able to relate defunct institutions to new ones: Linked data patterns provide for redirecting from defunct representations (institutional identifiers) to new ones via HTTP redirects. One may also add assertions to institutional metadata such as owl:sameAs, for instance, which says that the institution identified by the given URI is the same as another institution identified by another URI.
This seems like a compelling path to follow for the I2 standard.
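A small sketch may help show how the requirements on relationships and on defunct institutions could play out as triples. Everything here is hypothetical: the paths under registry.tld, the "example.org/vocab#" predicates, and the helper names are invented for illustration. Note too that owl:sameAs is a symmetric statement of identity; following it in one direction, as below, is a simplification.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
EX = "http://example.org/vocab#"  # hypothetical vocabulary

# Each assertion is a (subject, predicate, object) triple.
triples = [
    ("http://registry.tld/univ", EX + "name", "Example University"),
    ("http://registry.tld/univ/library", EX + "isPartOf",
     "http://registry.tld/univ"),
    # A defunct institution, asserted to be the same as its successor.
    ("http://registry.tld/old-college", OWL_SAME_AS,
     "http://registry.tld/univ"),
]


def related(triples, subject, predicate):
    """Return all objects asserted for a subject/predicate pair."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def resolve_current(triples, uri):
    """Follow owl:sameAs links from `uri` until none remain.

    Stops if a URI is revisited, so a cycle cannot loop forever.
    """
    seen = set()
    while uri not in seen:
        seen.add(uri)
        targets = related(triples, uri, OWL_SAME_AS)
        if not targets:
            break
        uri = targets[0]
    return uri


current = resolve_current(triples, "http://registry.tld/old-college")
parents = related(triples, "http://registry.tld/univ/library", EX + "isPartOf")
```

In a real deployment the triples would come from dereferencing the identifiers rather than from a hard-coded list, and an HTTP redirect could do the same job as the owl:sameAs assertion.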
The I2 repositories subgroup will be sending out its survey on identifier use cases in the coming week. It will be interesting to see if the requirements we have thus far identified still obtain in light of the data we collect from the survey. If so, I would like to explore the idea of linked data for institutional identifiers a bit more.
