Exploring curation micro-services

Posted by Michael Giarlo on September 27, 2009

As far as I'm concerned, the most exciting developments this year in repositories and digital curation have come out of the California Digital Library. It has been impossible not to notice their papers and presentations. Put simply, their idea is that digital curation is enabled by "micro-services" built upon well-known abstractions such as the filesystem. The benefits are obvious: filesystem tools are ubiquitous and cross-platform, and there are strong market forces to ensure the filesystem persists. The idea is radically simple and straightforward, though many questions remain about such a paradigm. I'll return to those later.

If you have not yet taken a look at CDL's curation micro-service specifications, most of which may be printed on as few as one or two sheets of paper, see the Digital Library Building Blocks.

My co-workers in the LC Repository Development Center have been chatting about these specs on and off throughout the year. After months of procrastinating, I finally read all of the specs on Thursday; it's wonderful that you can do so in the course of one reading session, I might add. Yesterday a bunch of us RDCers got together to chat (informally) about the specs: what they're for, how they work, and how they interact with one another. I learn by doing, by examples, so I combed through each of the specs in advance of our meeting and tried to construct a minimal repository[1] based on micro-services.

Notes
  1. Perhaps it's more in line with the specs to refer to this space as "a managed filesystem that drives repository and curation services," given the CDL philosophy that preservation is not a place/repository. But it's easier to say "repository," so there you go.


I2: Survey results

Posted by Michael Giarlo on September 15, 2009

I wrote in June that the I2 subgroup surveyed "repository managers to determine the current practices and needs of the repository community regarding institutional identifiers. Results from the survey will inform a set of use cases that will be shared with the community, and that are expected to drive the development of a new standard for institutional identifiers."

The survey closed in July, and the subgroup spent August writing a report on the survey results. That report is now final and available to the public. Feedback may be sent to our (woefully underutilized) public i2info mailing list, left as a comment on this post, or e-mailed to me privately, in which case I can forward it to our internal list.

The next step is to build upon the report to draw yet more conclusions from the data — there's an awful lot there — and flesh out some repository use cases for institutional identifiers. The I2 core group is moving quickly towards finalizing identifier metadata elements so that a standard may be drafted, and I think having some use cases documented will help drive the standard in a direction the community can get behind.

Onward and upward.

I2: Survey

Posted by Michael Giarlo on June 20, 2009

[Series]

Near the end of my strawman post, I wrote:

The I2 repositories subgroup will be sending out its survey on identifier use cases in the coming week. It will be interesting to see if the requirements we have thus far identified still obtain in light of the data we collect from the survey.

We completed the survey late last week and began distributing it. Here's what we sent out:

The NISO I2 Working Group is surveying repository managers to determine the current practices and needs of the repository community regarding institutional identifiers. We value your time and your input in the process to create a standard for a new institutional identifier. We hope that you will complete the survey which should take less than 15 minutes. The survey will remain open through Monday, July 6th.

Here is a link to the survey:
http://www.surveymonkey.com/s.aspx?sm=RGQgZ3090DVrb3kFzr3P3Q_3d_3d

Please feel free to share this message with other interested parties.

First, we used Survey Monkey to send the survey link to approximately one hundred repository managers whom the subgroup had identified. Our process for identifying repository managers involved pulling together a list of prominent repositories from subgroup members, and then gathering more from OpenDOAR, "an authoritative directory of academic open access repositories." Then subgroup members were encouraged to share the survey link with colleagues, and post it far and wide via blogs, listservs, and tweets. The listservs we targeted were: JISC-REPOSITORIES, metadataLibrarians, digital-curation, SPARC-IR, ir-net, REPOMAN-L, PALINET-IR-L, dspace-general, fedora-commons-users, DC-IDENTIFIERS, and code4lib.

I've already received a few responses and have gotten useful feedback. Two of the hardest questions to answer so far have been: "What is an institutional identifier?" and "What is a repository?"

Institutional identifier

An institutional identifier is defined as a symbol or code that uniquely identifies an institution. Domain-specific examples of existing identifiers include SAN, IPEDS, GLN, MARC Org Code, and ISIL. Another example might be a Handle prefix or ARK name authority assigning number.

Repository

Institutional repositories and subject repositories like arxiv.org are clearly 'repositories', but beyond that it is a somewhat ill-defined term. One might look to the Kahn-Wilensky architecture, or the OAIS reference model (PDF), or even Wikipedia for definitions, but it's not clear that even the authorities agree on what constitutes a repository.

It's a system. It's network-accessible and typically has a web interface of some sort. Files and groups of files, sometimes known as objects, tend to be deposited in them, perhaps for some combination of management, access, or preservation. Many run Fedora, DSpace, or EPrints, and factor heavily in scholarly communication. Some are document-centric. Some will accept anything. To some, a learning management system may be a repo. To others, a content management system may fit.

My background is in academia so my own definition is somewhat based in that context, but I wouldn't say the term is necessarily limited to that context. There are other NISO I2 scenarios for library workflows and electronic resources, so it's safe to assume that repository does not mean ILS or OPAC or ERP system. My hope is that folks have their own working definitions of the term and can decide for themselves what it means.

We've given folks a little over two weeks to respond to the survey, so the constant I2 drum-beating will quiet down for a while around here. I am very interested in what sorts of responses we get from the survey. Fun times!

Oh, and perhaps it goes without saying, but if you're a repository owner, manager, expert, developer, or stakeholder with an interest in identifiers, please feel free to take the survey!

I2: Strawman

Posted by Michael Giarlo on June 13, 2009

[Series]

In the prior I2 post, I wrote about the requirements the repositories subgroup has come up with for an institutional identifier standard (with the hope that our findings re: repositories could be generalized to other scenarios).

PhotonQ-Tim Berners Lee on Linked Data at TED
Image by PhOtOnQuAnTiQuE via Flickr

My strawman proposal of sorts is to explore how well linked data patterns fit this problem space. Linked data, briefly, is a way to expose and link data on the web in a more semantically meaningful way, and is often summarized using the four principles put forward by Tim Berners-Lee:

  1. Use URIs as names for things.
  2. Use HTTP URIs so that people can look up those names.
  3. When someone looks up a URI, provide useful information.
  4. Include links to other URIs so that they can discover more things.

That's the crux of it.  Linked data takes well-known patterns on the web (linking, dereferencing, etc.) and applies them to data, which in this case could be metadata for identifying institutions.
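
To make the second and third principles a bit more concrete, here is a minimal Python sketch of what "looking up" an institutional identifier might involve: an ordinary HTTP GET that asks for an RDF representation via the Accept header. The registry hostname, the opaque identifier, and the function name build_lookup_request are all assumptions made up for illustration; nothing here is part of any I2 proposal. The request is only built, not sent, because the registry does not exist.

```python
from urllib.request import Request

# Hypothetical opaque identifier; the hostname is invented for this example.
INSTITUTION_URI = "http://registry.example/xnjsdasd"


def build_lookup_request(uri, media_type="text/turtle"):
    """Build (but do not send) a GET request asking for an RDF representation.

    The URI is the name of the institution (principles 1 and 2); the Accept
    header asks the server for "useful information" in RDF (principle 3).
    """
    return Request(uri, headers={"Accept": media_type}, method="GET")


request = build_lookup_request(INSTITUTION_URI)
print(request.get_method(), request.full_url)
print("Accept:", request.get_header("Accept"))
```

Passing the request to urllib.request.urlopen would perform the actual lookup against a real registry. Asking for "application/rdf+xml" instead of Turtle is just a matter of changing the media_type argument.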

Let's examine each of the requirements and the applicability of linked data thereto.

  1. Should be agnostic to type of institution, e.g., libraries, museums, personal collections, historical societies: The web is already agnostic to type of institution.  HTTP URIs do not favor one type of institution over another.
  2. Should handle varying institutional granularity, e.g., institution-level, campus-level, division-level, unit-level: HTTP URIs are flexible in this regard.  Hierarchy, should one wish it to be surfaced in the identifier, may be encoded in either a DNS hostname or the path appended to the DNS name.  One can imagine a URI like "http://department.division.institution.tld/unit/subunit" or "http://institution.tld/campus/office/individual".

    Hierarchy needn't be surfaced in the identifier if one favors opacity, in which case "http://registry.tld/xnjsdasd" would suffice as an identifier, and may instead be entirely reflected in the (RDF) representation returned by dereferencing the URI.
  3. Should handle linking among institutions and subordinate units: Linked data handles linking via well-known HTTP mechanisms, referenced in the fourth principle of linked data.  Unlike the plain HTML link, which has limited semantics, linked data links are semantically rich and extensible.
  4. Should express different sorts of relationships among these institutions and units: The "useful information" in the third principle of linked data is typically provided by an RDF representation, which is itself a list of assertions.  These assertions, or triples, consist of subjects, predicates, and objects.  The ability to express the relationships in this requirement is limited only by the availability of vocabularies that contain sets of predicates and classes for subjects and objects.  Think of the predicates as elements defined within a metadata standard, e.g., Dublin Core "creator", MODS "relatedItem", and so forth.  Vocabularies that contain these predicates and classes are growing and evolving daily, and should there not be a vocabulary that contains the relationship one wishes to express, it is fairly easy to create a custom vocabulary.

    The ability to mix and match vocabularies provides an expressiveness that is often not found in document-based metadata formats and the flexibility to express radically different relationships on a per-industry or per-institution basis.  This latter point is important as the I2 group has identified both core metadata elements for identifying institutions of different types and additional elements for specific types of institutions.  Why re-invent a new metadata format or schema when all one needs to express may already be contained in others?
  5. Should relate to existing relevant identifiers and registries: Same as requirement #4.  Linked data is all about expressing relationships between things, e.g., institutions, identifiers, registries, etc.
  6. Should be globally unique: HTTP URIs are globally unique by virtue of the distributed DNS and hierarchical naming within each HTTP service.
  7. Should be actionable: HTTP URIs provide dereferenceability/actionability via the well-known HTTP.
  8. Should enable retrieval of metadata sufficient to identify the institution, which may vary widely by institution: HTTP URIs are actionable per requirement #7 and the metadata returned is flexible per requirement #4.
  9. Should accommodate changes as institutions come and go and re-organize and be able to relate defunct institutions to new ones: Linked data patterns provide for redirecting from defunct representations (institutional identifiers) to new ones via HTTP redirects.  One may also add assertions to institutional metadata such as owl:sameAs, for instance, which says that the institution identified by the given URI is the same as another institution identified by another URI.
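
Requirements 3, 4, and 9 can be sketched in a few lines of Python by treating the RDF representation as a plain list of (subject, predicate, object) triples. This is only a rough illustration, not an RDF library and not a proposed I2 data model: the predicate URIs are real terms from OWL, Dublin Core, and FOAF, but every institution URI, the registry hostnames, and the helper names objects and current_identifier are assumptions invented for the example.

```python
# Real vocabulary terms (predicates).
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
DCT_IS_PART_OF = "http://purl.org/dc/terms/isPartOf"
DCT_IS_REPLACED_BY = "http://purl.org/dc/terms/isReplacedBy"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"

# Hypothetical institution identifiers; these registries do not exist.
UNIVERSITY = "http://registry.example/id/univ"
LIBRARY = "http://registry.example/id/univ-library"
OLD_COLLEGE = "http://registry.example/id/old-college"
OTHER_REGISTRY_ENTRY = "http://other-registry.example/org/42"

# Each assertion is a (subject, predicate, object) triple.
TRIPLES = [
    (UNIVERSITY, FOAF_NAME, "Example University"),
    (LIBRARY, FOAF_NAME, "Example University Library"),
    (LIBRARY, DCT_IS_PART_OF, UNIVERSITY),          # unit-to-institution link
    (OLD_COLLEGE, DCT_IS_REPLACED_BY, UNIVERSITY),  # defunct institution
    (UNIVERSITY, OWL_SAME_AS, OTHER_REGISTRY_ENTRY),  # link to another registry
]


def objects(triples, subject, predicate):
    """Return every object asserted for the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def current_identifier(triples, uri):
    """Follow isReplacedBy assertions from a defunct institution to its successor."""
    seen = set()
    while uri not in seen:
        seen.add(uri)
        successors = objects(triples, uri, DCT_IS_REPLACED_BY)
        if not successors:
            return uri
        uri = successors[0]
    return uri  # a cycle was found; stop rather than loop forever


print(objects(TRIPLES, LIBRARY, DCT_IS_PART_OF))
print(current_identifier(TRIPLES, OLD_COLLEGE))
print(objects(TRIPLES, UNIVERSITY, OWL_SAME_AS))
```

Adding a new kind of relationship means adding a triple with a different predicate, which is the mix-and-match flexibility described under requirement 4. A real implementation would more likely lean on an RDF library and a triple store than on a Python list.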

This seems like a compelling path to follow for the I2 standard.

The I2 repositories subgroup will be sending out its survey on identifier use cases in the coming week.  It will be interesting to see if the requirements we have thus far identified still obtain in light of the data we collect from the survey.  If so, I would like to explore the idea of linked data for institutional identifiers a bit more.

I2: Requirements

Posted by Michael Giarlo on June 07, 2009

[Series]

The I2 IR scenario subgroup approached the issue of institutional identifiers in repositories by first brainstorming about the various issues, problems, and sticking points that make identifiers in this space (and elsewhere) such a complex topic. Folks on the subgroup are repository managers or are otherwise involved with or knowledgeable about the repository space, so the brainstorming exercise yielded a good number of concerns.

The purpose of the exercise was to enumerate concerns and issues that could inform a draft survey to be administered to repository managers and experts around the globe in different organizational contexts: libraries, subject disciplines, archives, historical societies, etc. The purpose of the survey is to get an idea of the use cases and constraints around institutional identifiers in these different repository contexts, the assumption being that we ought to have requirements grounded in real world usage before we go off building a standard.

I will note here that the subgroup has worked up a draft survey that has just recently been reviewed by a small group of folks who know about survey design, and we hope to administer the survey to the aforementioned Reporati this week[1]. Which is to say that I don't yet have a strong grasp of the use cases out there in the wild, and this series should be construed as my own premature cognitive fumblings. But let's assume for now that what we learn from the survey results matches our initial brainstorming exercise.

Here is a slightly modified and boiled down version of the concerns and issues the subgroup came up with for a potential institutional identifier standard, which resembles a set of minimum requirements:

  1. Should be agnostic to type of institution, e.g., libraries, museums, personal collections, historical societies
  2. Should handle varying institutional granularity, e.g., institution-level, campus-level, division-level, unit-level
  3. Should handle linking among institutions and subordinate units
  4. Should express different sorts of relationships among these institutions and units
  5. Should relate to existing relevant identifiers and registries
  6. Should be globally unique
  7. Should be actionable
  8. Should enable retrieval of metadata sufficient to identify the institution, which may vary widely by institution
  9. Should accommodate changes as institutions come and go and re-organize and be able to relate defunct institutions to new ones

I doubt the list is exhaustive; I am almost certain we will uncover all sorts of tangly and esoteric use cases that add requirements. I expect it. Why else would we be gathering to discuss the need for an institutional identifier if it were a solved problem or a simple one? [2]

Nevertheless, looking at the above list, the task we've taken on starts to feel less onerous. And thinking about identifier systems constrained by the list of concerns, the mind starts to cook up all sorts of possible solutions. I'll share one in the next post in this series, a strawman proposal of sorts, and how it addresses each of these requirements.

Notes
  1. We will also x-post to repo-related mailing lists, and some of us may blog or tweet about it. My inclination is to cast as wide a net as possible so as not to miss important use cases. We can always scope things out later on, but it's useful to be inclusive at this point lest our own assumptions carry the group forward.
  2. The cynical among you might have interesting answers to this question.