Exploring curation micro-services
As far as I'm concerned, the most exciting developments this year in repositories and digital curation have come out of the California Digital Library. It has been impossible not to notice their papers and presentations. Put simply, their idea is that digital curation is enabled by "micro-services" built upon well-known abstractions such as the filesystem. The benefits are obvious: filesystem tools are ubiquitous and cross-platform, and there are strong market forces to ensure the filesystem persists. The idea is radically simple and straightforward, though many questions remain about such a paradigm. I'll return to those later.
If you have not yet taken a look at CDL's curation micro-service specifications, most of which may be printed on as few as one or two sheets of paper, see the Digital Library Building Blocks.
My co-workers in the LC Repository Development Center have been chatting about these specs on and off throughout the year. After months of procrastinating, I finally read all of the specs on Thursday; it's wonderful that you can do so in the course of one reading session, I might add. Yesterday a bunch of us RDCers got together to chat (informally) about the specs: what they're for, how they work, and how they interact with one another. I learn by doing, by examples, so I combed through each of the specs in advance of our meeting and tried to construct a minimal repository[1] based on micro-services.
Here is a tree visualization of the final product, inevitable warts and all:
The services I used were Namaste, Content Access Node (CAN), Pairtree, Dflat, Reverse Directory Deltas (ReDD), Class-based System for Managing Object Properties (CLOP), and BagIt (co-developed by LC and CDL).
As I mentioned in our Friday meeting, recounting my experience exploring the specs: the bad thing is that I spent an hour building a repository with rudimentary tools such as mkdir, touch, cp, ln, and emacs; but the good thing is that I built a repository in one hour using common, rudimentary tools. It's a very compelling paradigm. Ed's already built a tool implementing some of Dflat, further demonstrating how lightweight these micro-services are. (UPDATE: Ed notes that this code is a work in progress and is "barely functional.") (UPDATE 2: The dflat library has come a long way. Check it out if you're interested. Also, I just committed a pretty basic Namaste library: http://github.com/mjgiarlo/namaste. Only took about an hour, which is a testament to the power of lightweight specs.)
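To give a sense of how little code a Namaste-style tag needs, here is a minimal Python sketch. It is not the library linked above, and it rests on my reading of the spec: a tag is a file named with a digit, an equals sign, and a value (0 for type, 1 for who, 2 for what, 3 for when, 4 for where), and the file body holds the full value. The function names, the filename clean-up rule, the 40-character cut-off, and the "example_0.1" value are all hypothetical choices made for illustration, not anything the spec mandates.

```python
import re
import tempfile
from pathlib import Path

# Tag numbers as I understand the Namaste spec; treat this mapping as an assumption.
TAG_NUMBERS = {"type": 0, "who": 1, "what": 2, "when": 3, "where": 4}


def namaste_filename(kind: str, value: str, max_len: int = 40) -> str:
    """Build an 'N=value' filename.

    The clean-up rule (runs of unusual characters become '_') and the
    truncation length are simplifications of mine, not the spec's rules.
    """
    cleaned = re.sub(r"[^A-Za-z0-9._,-]+", "_", value.strip())
    return f"{TAG_NUMBERS[kind]}={cleaned[:max_len]}"


def write_tag(directory: Path, kind: str, value: str) -> Path:
    """Create the tag file; its body keeps the full, untransformed value."""
    directory.mkdir(parents=True, exist_ok=True)
    tag = directory / namaste_filename(kind, value)
    tag.write_text(value + "\n", encoding="utf-8")
    return tag


def read_tags(directory: Path) -> dict:
    """Collect {tag number: full value} for every 'N=...' file in a directory."""
    tags = {}
    for entry in sorted(directory.iterdir()):
        match = re.match(r"^(\d)=", entry.name)
        if entry.is_file() and match:
            tags[int(match.group(1))] = entry.read_text(encoding="utf-8").strip()
    return tags


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp) / "repo"
        write_tag(repo, "type", "example_0.1")      # hypothetical type value
        write_tag(repo, "who", "Example Library")   # hypothetical owner
        print(sorted(p.name for p in repo.iterdir()))
        print(read_tags(repo))
```

The appeal is that a plain directory listing already tells you what you are looking at, and the same information is recoverable with nothing fancier than ls and cat.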
I am certain this will be a running thread at work as the specifications evolve and our understanding of them grows. Some questions and comments that occurred to me while exploring the micro-service specs and building the minimal repo:
- CAN was a bit puzzling. The spec is simple enough, but I found some of the conventions confusing, and I was left wondering what CAN provides other than a container. What I would like to see is a simple use case and perhaps more examples. Thus, the CAN stuff in my sample repo doesn't feel very useful only because I had a hard time working with the spec.
- CLOP feels like the least mature of the specifications. It seems generally useful to be able to put digital objects, however you define that, into classes and define properties on those classes. The spec did not clearly convey to me just how it accomplishes that aim. A few examples would go a very long way. I've got some CLOP stuff in the sample repo but I have no idea how closely my implementation matches the spec.
- Is Dflat dependent on ReDD? One would assume not since there's an optional property in the dflat-info.txt file for specifying a delta scheme. But, say, could you stub out the v001 directory (reserved to hold the initial version of a digital object) and use a system such as git or bazaar?
One might argue that these established delta schemes, if you want to call them that, have many more developers and users than a system such as ReDD and thus should persist longer and have more tools built around them. I imagine the micro-service viewpoint would acknowledge that point, but counter that the spirit of these specs is to avoid dependencies from outside the filesystem?
- Is the ReDD specification meaningful outside of a Dflat given that any one ReDD directory knows nothing of its successors and predecessors, or is it dependent upon Dflat?
- Could a BagIt bag live inside of the ReDD reserved "full" directory? That is, could the "full" directory be marked up appropriately to be a BagIt bag?
- How many tools exist for these specs? I notice there's code in CPAN for Pairtree and Namaste, which is a fabulous start. Tools are the difference between YAMF (Yet Another Messy Filesystem) and reliably managed curation services. Granted, tools such as cp and emacs already exist and are part of the appeal of these micro-services, but there's also tremendous room for error if operations are all done "by hand."
- To what extent has CDL transitioned to using these specs/tools?
- Are other institutions using these specs/tools? I have heard tell that digital library folks from the University of Michigan and the University of North Texas may be involved.
I hope I don't sound overly critical. I'm really glad our colleagues at the California Digital Library have written these specifications and applied their deep experience to what could be a transformative paradigm[2] in the digital curation world. Kudos to them!
Notes
[1] Perhaps it's more in line with the specs to refer to this space as "a managed filesystem that drives repository and curation services," given the CDL philosophy that preservation is not a place/repository. But it's easier to say "repository," so there you go.
[2] Please excuse the fanboyishness; this filesystem fetishism is exciting stuff!

Michael,
I've been looking at this too.
I think most people know that your assets are going to outlive (hopefully) your software. I've been running Zope, DSpace, TKL (IndexData), and Omeka sites among others, and while they are all good, I keep thinking "what's going to happen when you reach the end of the road"? To me this is why the CDL things are so interesting.
I'm just small time and the folks at CDL, along with yourself and the co-workers at LOC, have a whole lot more experience than me, so I don't always understand what's going on ;-)
Some of the questions I'm looking at right now are:
* do you serve out of your DFlat? Why not?
* If you think of the OAIS reference model, is this the AIP?
* do you build every app this way?
I'm thinking you have/build lots of "feeder apps" (think BibApp) that have their own models but support harvesting (OAI, ATOM, etc.) and then on the harvest you store in this model. On top of this file system model you can build other structures (RDB, XML, Fedora, whatever suits your fancy, etc.) that support access to this store.
I'm having trouble with pairtrees and identifiers (ARK). If (big if) I serve out of this structure I want reasonable URLs (wcsu/library/archives/MS044/whatever.xml) how do I use these numbers (010203945065)? OK, unless I make those numbers "intelligent" (01=college, 02=division,03=format, you get the idea) then the storage on the file system is just as opaque as if you used a database. I understand the need for unique identifiers but do we have to access "only" by identifier (alternative is to use a resolution service I guess but I don't know).
It seems to me that the CDL folks don't like XML (probably for good reason) but I'm leaning toward an XML AIP, store assets on the filesystem (accessible by humans), AND have the ability to serve from that store.
We'll see….
–Brian
@Brian: You ask some good questions. I'd be eager to see the CDL folks' response. I understand they're working on a response, but are otherwise busy with a few local conferences for the next week or so.
Also, I don't think Pairtree makes any assumptions about your identifiers, so using semantic ids (or "reasonable URLs" as you put it :) ) should be possible. You shouldn't need to use ARKs or opaque ids.
The filesystem is slightly less opaque than a database with opaque identifiers because you can still visually scan your store of objects and see your identifiers surfaced in the directory listing, the theory being that filesystem tools will likely outlive particular DB servers or schemas.
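To make the identifier point concrete, here is a minimal Python sketch of the identifier-to-path mapping as I read the Pairtree spec: a handful of awkward characters are hex-encoded with a caret, then "/", ":" and "." are swapped for "=", "+" and ",", and the result is chopped into two-character directory names under pairtree_root. The function names are hypothetical and this is not Ben's pairtree.py; it only illustrates that a "reasonable" identifier like the one in Brian's comment maps just as well as an ARK.

```python
import re
from pathlib import PurePosixPath

# Characters hex-encoded as ^hh before the single-character swaps (my reading of the spec).
_HEX_ENCODE = set('"*+,<=>?\\^|')
_SWAP = str.maketrans({"/": "=", ":": "+", ".": ","})
_UNSWAP = str.maketrans({"=": "/", "+": ":", ",": "."})


def clean_id(identifier: str) -> str:
    """Make an identifier filesystem-safe: hex-encode, then swap / : . characters."""
    out = []
    for byte in identifier.encode("utf-8"):
        ch = chr(byte)
        if byte < 0x21 or byte > 0x7E or ch in _HEX_ENCODE:
            out.append(f"^{byte:02x}")
        else:
            out.append(ch)
    return "".join(out).translate(_SWAP)


def id_to_ppath(identifier: str, root: str = "pairtree_root") -> PurePosixPath:
    """Split the cleaned identifier into two-character directory names under root."""
    cleaned = clean_id(identifier)
    segments = [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]
    return PurePosixPath(root, *segments)


def ppath_to_id(ppath, root: str = "pairtree_root") -> str:
    """Reverse the mapping: join the segments, undo the swaps, decode ^hh."""
    parts = PurePosixPath(ppath).relative_to(root).parts
    cleaned = "".join(parts).translate(_UNSWAP)
    raw = re.sub(r"\^([0-9a-f]{2})", lambda m: chr(int(m.group(1), 16)), cleaned)
    return raw.encode("latin-1").decode("utf-8")


if __name__ == "__main__":
    # The example identifier used in the Pairtree spec.
    print(id_to_ppath("ark:/13030/xt12t3"))  # pairtree_root/ar/k+/=1/30/30/=x/t1/2t/3
    # A semantic identifier in the style Brian asked about.
    semantic = "wcsu/library/archives/MS044/whatever.xml"
    path = id_to_ppath(semantic)
    print(path)
    print(ppath_to_id(path) == semantic)
```

The directory names are not pretty, but the mapping is reversible, so the pretty URL can live in whatever access layer sits on top of the store.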
There are other specs for surfacing other information in the directory listing, such as Namaste, so that your object store is less opaque, but I'm less convinced of the value of some of these other specs. Still, it's a compelling pattern, and I will be poking at all the specs to see how it all plays out.
I don't think you have to access only by identifier. The great thing about the micro-services is that they are, at least in theory, decoupled. So you can choose the ones that work for you and ignore the ones that don't. What's neat about the specs is that there's some overlap, and you have multiple options for storing a "digital object" on the filesystem, so if you don't like identifier-based access and you do want versioning, you can choose to model objects in DFlat w/ ReDDs rather than Pairtree.
Mix, match, rinse, lather, repeat!
Are there any areas in the CDL specs where you see an opportunity to make good use of XML?
Thanks for writing, Brian.
This is a follow-up to my above post.
I believe in the concepts that the CDL specs represent; I just didn't know how I could make them work. Well, I think I'm a little closer. I've been looking at Ed Summers's "dflat.py" that you mention in this post and also Ben O'Steen's pairtree.py, and they have helped me get a better understanding of how these specs function.
My big realization was that, if I can construct a really good metadata record made up from all the parts of the dflat, I can store objects in the repository, collect that metadata in the tool of my choice, and provide access to those objects from an arbitrary set of URLs.
Case in Point: I'm using IndexData's Zebra for several projects. Zebra is lightweight, stores the complete xml, but lets you retrieve the data in any format you can write an xslt stylesheet for, and provides an SRU interface. I use it as an XML database with a SRU interface.
So what I'm thinking is, I'll store my assets in a pairtree (I need to build the ingest using Ed and Ben's tools), and then point Zebra to the tree and let it index it (only looking for xml metadata files). I'll then create a public site (I'm using Zope with xslt-methods) that contains a "splash page" for the collection, a query page (SRU) [or combine the two], and then write xslt to handle the results of the query. The public site, while serving the assets from the pairtree repository, is essentially independent of the asset store and vice versa (just what I want).
I think I'm going to create a pairtree site for each collection (good for humans) but harvest the metadata into a central Zebra index. This way each collection can have its own look and feel but the collections can be federated for search.
The critical thing here is getting everything you need to present (but nothing more) into that metadata record. I've been looking at UNT's model but I think you just have to do what's right for you and then convert to the standard models (DC, OAI-DC, LOM, etc.). Is METS still the standard, or do we use OAI-ORE?
While I don't want to add too many smarts to a pairtree, one thing that might be nice is if you could ask the pairtree for all its metadata files in an Atom or OAI format. That way, like the current Omeka software that can display content from a feed, you can present this data as needed.
FWIW (for what it's worth),
–Brian