I2: Strawman

Posted by Michael Giarlo on June 13, 2009

In the prior I2 post, I wrote about the requirements the repositories subgroup has come up with for an institutional identifier standard (with the hope that our findings re: repositories could be generalized to other scenarios).

My strawman proposal of sorts is to explore how well linked data patterns fit this problem space. Linked data, briefly, is a way to expose and link data on the web in a more semantically meaningful way, and is often summarized using the four principles put forward by Tim Berners-Lee:

  1. Use URIs as names for things.
  2. Use HTTP URIs so that people can look up those names.
  3. When someone looks up a URI, provide useful information.
  4. Include links to other URIs, so that they can discover more things.

That's the crux of it.  Linked data takes well-known patterns on the web (linking, dereferencing, etc.) and applies them to data, which in this case could be metadata for identifying institutions.
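
As a rough illustration of the second and third principles, the Python sketch below builds (but does not send) an HTTP request asking for an RDF representation of an institutional identifier. The registry URI, the helper name `build_lookup_request`, and the preference order of the media types are all assumptions made for illustration; nothing here has been specified by the I2 group.

```python
from urllib.request import Request

# Media types commonly used for RDF serializations; the preference order is an assumption.
RDF_ACCEPT = "text/turtle, application/rdf+xml;q=0.9"


def build_lookup_request(identifier_uri: str) -> Request:
    """Build (but do not send) a GET request asking for RDF about the identifier."""
    return Request(identifier_uri, headers={"Accept": RDF_ACCEPT}, method="GET")


# Hypothetical opaque identifier minted by a hypothetical registry.
request = build_lookup_request("http://registry.example.org/xnjsdasd")
print(request.get_method(), request.full_url)
print("Accept:", request.get_header("Accept"))
```

A server following linked data patterns could answer such a request with assertions about the institution, while a browser asking for HTML at the same URI could be given a human-readable page.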

Let's examine each of the requirements and the applicability of linked data thereto.

  1. Should be agnostic to type of institution, e.g., libraries, museums, personal collections, historical societies: The web is already agnostic to type of institution.  HTTP URIs do not favor one type of institution over another.
  2. Should handle varying institutional granularity, e.g., institution-level, campus-level, division-level, unit-level: HTTP URIs are flexible in this regard.  Hierarchy, should one wish it to be surfaced in the identifier, may be encoded in either a DNS hostname or the path appended to the DNS name.  One can imagine a URI like "http://department.division.institution.tld/unit/subunit" or "http://institution.tld/campus/office/individual".

    Hierarchy needn't be surfaced in the identifier if one favors opacity, in which case "http://registry.tld/xnjsdasd" would suffice as an identifier; the hierarchy may instead be reflected entirely in the (RDF) representation returned by dereferencing the URI.
  3. Should handle linking among institutions and subordinate units: Linked data handles linking via well-known HTTP mechanisms, referenced in the fourth principle of linked data.  Unlike the plain HTML link, which has limited semantics, linked data links are semantically rich and extensible.
  4. Should express different sorts of relationships among these institutions and units: The "useful information" in the third principle of linked data is typically provided by an RDF representation, which is itself a list of assertions.  These assertions, or triples, consist of subjects, predicates, and objects.  The ability to express the relationships in this requirement is limited only by the availability of vocabularies that contain sets of predicates and classes for subjects and objects.  Think of the predicates as elements defined within a metadata standard, e.g., Dublin Core "creator", MODS "relatedItem", and so forth.  Vocabularies that contain these predicates and classes are growing and evolving daily, and should there not be a vocabulary that contains the relationship one wishes to express, it is fairly easy to create a custom vocabulary.

    The ability to mix and match vocabularies provides an expressiveness that is often not found in document-based metadata formats and the flexibility to express radically different relationships on a per-industry or per-institution basis.  This latter point is important as the I2 group has identified both core metadata elements for identifying institutions of different types and additional elements for specific types of institutions.  Why re-invent a new metadata format or schema when all one needs to express may already be contained in others?
  5. Should relate to existing relevant identifiers and registries: Same as requirement #4.  Linked data is all about expressing relationships between things, e.g., institutions, identifiers, registries, etc.
  6. Should be globally unique: HTTP URIs are guaranteed to be globally unique by virtue of the distributed DNS system and hierarchical naming within each HTTP service.
  7. Should be actionable: HTTP URIs provide dereferenceability/actionability via the well-known HTTP protocol.
  8. Should enable retrieval of metadata sufficient to identify the institution, which may vary widely by institution: HTTP URIs are actionable per requirement #7 and the metadata returned is flexible per requirement #4.
  9. Should accommodate changes as institutions come and go and re-organize and be able to relate defunct institutions to new ones: Linked data patterns provide for redirecting from defunct representations (institutional identifiers) to new ones via HTTP redirects.  One may also add assertions to institutional metadata such as owl:sameAs, for instance, which says that the institution identified by the given URI is the same as another institution identified by another URI.
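
The assertions described under requirements 2, 4, and 9 above can be sketched as plain subject-predicate-object tuples in Python. This is only an illustration under stated assumptions: every institutional URI below is hypothetical, and the choice of Dublin Core "isPartOf" and "isReplacedBy" alongside owl:sameAs is mine, not a vocabulary the I2 group has selected. The helper names are likewise invented for this example.

```python
# Predicates borrowed from existing vocabularies (OWL and Dublin Core terms).
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
DCT_IS_PART_OF = "http://purl.org/dc/terms/isPartOf"
DCT_IS_REPLACED_BY = "http://purl.org/dc/terms/isReplacedBy"

# Each assertion is a (subject, predicate, object) triple; all URIs are hypothetical.
TRIPLES = {
    ("http://institution.example.org/campus/library", DCT_IS_PART_OF,
     "http://institution.example.org/campus"),
    ("http://institution.example.org/campus", DCT_IS_PART_OF,
     "http://institution.example.org/"),
    ("http://registry.example.org/xnjsdasd", OWL_SAME_AS,
     "http://institution.example.org/"),
    ("http://old-college.example.org/", DCT_IS_REPLACED_BY,
     "http://institution.example.org/"),
}


def objects(subject, predicate, triples=TRIPLES):
    """Return the sorted objects asserted for a given subject and predicate."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


def ancestors(unit, triples=TRIPLES):
    """Follow isPartOf links upward and return enclosing units, nearest first."""
    chain, current, seen = [], unit, {unit}
    while True:
        parents = objects(current, DCT_IS_PART_OF, triples)
        if not parents or parents[0] in seen:  # stop at the top or on a cycle
            return chain
        current = parents[0]
        seen.add(current)
        chain.append(current)


def current_identifier(uri, triples=TRIPLES):
    """Follow isReplacedBy links from a defunct institution to its successor."""
    seen = {uri}
    while True:
        successors = objects(uri, DCT_IS_REPLACED_BY, triples)
        if not successors or successors[0] in seen:
            return uri
        uri = successors[0]
        seen.add(uri)


print(ancestors("http://institution.example.org/campus/library"))
print(current_identifier("http://old-college.example.org/"))
```

In this sketch the hierarchy lives in the assertions rather than in the identifier strings, so an opaque identifier works just as well as a hierarchical one.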

This seems like a compelling path to follow for the I2 standard.

The I2 repositories subgroup will be sending out its survey on identifier use cases in the coming week.  It will be interesting to see if the requirements we have thus far identified still obtain in light of the data we collect from the survey.  If so, I would like to explore the idea of linked data for institutional identifiers a bit more.

Comments

  1. Ross Sat, 13 Jun 2009 22:25:21 UTC

    +1

    You know I've got your back on this one.

    Assuming my scenario ever meets about anything.

  2. robert Sun, 14 Jun 2009 04:25:46 UTC

    maybe i have been brainwashed by the linkeddata crowd already – but what else than an HTTP URL and linkeddata would be an option?

  3. robert Sun, 14 Jun 2009 04:29:09 UTC

    i think HTTP URLs as institutional identifiers may also have the beneficial effect to force institutions to think more about their URL space (and domain names in particular). i'd say some sort of conscious management of domain names within an institution is necessary anyway – it's just overlooked often.

  4. Jonathan Rochkind Sun, 14 Jun 2009 09:35:04 UTC

    We need to see an example of what this means. Okay, HTTP URIs. How will it be decided and subsequently discovered which HTTP URI represents which institution? Can you give an example in the context of an actual use case?

    I forget if the task group has some use cases identified in addition to requirements.

  5. Michael Giarlo Sun, 14 Jun 2009 10:41:08 UTC

    @Robert I'm almost scared to entertain that question. Perhaps a new metadata format stuck behind an OAI-PMH interface? I'm sure folks can think up much less webby options than that, even.

    @Jonathan I'm getting there, slowly. Rather than creating one mega-post about linked data and I2, I'm cutting them up into chunks and mulling them over while the I2 group's work evolves. I don't have an example yet but one is starting to form in my mind. Cool that you're interested enough to want more, though. :)

    At this point I want to be sure I'm not totally off the rails suggesting that linked data might be applied to this problem in light of the requirements thus far identified. I am sure as more use cases are found and as I flesh this idea out, the real issues will come out and the real problems will be revealed. It's one of the questions I wanted to explore in the next post, though I may devote the next one to examples and the final one to questions about, say, how to manage a global, decentralized inst. identifier infrastructure.

    But, to address your questions directly: The IR subgroup has designed a survey to elicit use cases for inst. identifiers, so I can't give a use case just yet, but we are administering the survey to repo managers around the world this week, so I hope to have more to say w/in the next month as survey results roll in.

  6. Jonathan Rochkind Sun, 14 Jun 2009 14:17:05 UTC

    Cool. To me, saying "http URIs" isn't an answer at all. I mean, we could assume there would be SOME string representing an institution that came out of this — that's the point, right? So, if that string is just random chars, or an http URI, no big deal. The hard part is figuring out how they get assigned, discovered, have metadata attached to them which is also discovered, etc. Saying "http URI" might be suggestive of certain directions, but it doesn't answer the actual hard questions. I'm not sure if the answers to the actual hard questions are dependent on whether the actual token is an http URI or not. Maybe?

  7. Michael Giarlo Sun, 14 Jun 2009 16:56:50 UTC

    @Jonathan: Right. And you'll notice I didn't just say HTTP URIs because suggesting an identifier scheme is the easy part; I'm interested in linked data, which builds upon HTTP URIs but is more than just assigning identifiers. Linked data patterns already include ways to disseminate metadata.

    The 800-lb. gorilla is, of course, how such a decentralized (or even a centralized) system is managed. Who gets access? Are records mutable? How do you handle changes in organizations? How much metadata do you need to return about an institution? Who manages that metadata, and how? And so on.

    The big questions have yet to be answered, and I sure don't mean to give the impression that I've got all the answers. However, I would like to see NISO fully explore a solution to this problem that is "of the web." It may work and it may not work, but I'd like to put this idea through the paces before discarding it.
