Harry Potter and the Responsible Version Control of Digital Surrogates


In Harry Potter and the Deathly Hallows (sorry), Helga Hufflepuff’s goblet is stored in a vault at Gringotts that’s been cursed so that every time you touch one of the objects in it, dozens of copies are created. On the cover of the original U.K. edition of the book, Harry, Ron and Hermione are pictured scrambling atop a wave of facsimile’d treasure. I’ve started thinking about special collections digitization like this. Digitization doesn’t streamline or simplify library collections; rather, it multiplies them, as every interaction creates additional objects for curation and preservation.

Suppose you digitize a manuscript. Until now, you have been responsible for preserving and providing access to the manuscript itself. Now, you have to keep doing that, but you also have to preserve and provide access to the images, in any number of delivery and archival formats, and to the structural and descriptive metadata that allow the user to make sense of the digitized object. You might also want to preserve checksums and validation logs, in order to monitor the health of the facsimile. Suppose you create a website for the digitized manuscript, with a full-text transcription and articles about the manuscript’s provenance and a little Flash animation of the manuscript in 3D. Well done; now you have to preserve and provide access to all of that, too.
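The checksum side of this is the easy part; a minimal sketch in Python (the directory layout and manifest format here are my own invention, not any particular repository standard):

```python
import hashlib
from pathlib import Path

def checksum(path: Path, algo: str = "sha256") -> str:
    """Compute a hex digest of a file, reading in chunks to cope with large TIFFs."""
    h = hashlib.new(algo)
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

def write_manifest(image_dir: Path, manifest: Path) -> None:
    """Record one digest per archival image, so a later audit can detect bit rot
    by recomputing digests and comparing against this manifest."""
    with manifest.open("w") as out:
        for img in sorted(image_dir.glob("*.tif")):
            out.write(f"{checksum(img)}  {img.name}\n")
```

The hard part, of course, is running the audit on a schedule and keeping the logs for decades.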

Then suppose, twenty years down the road, your library gets a grant to redigitize the manuscript using the latest and best equipment. 100-megapixel camera, £30,000 lens, hyperspectral imaging, the works. The new images are indisputably better. The old ones served their purpose, and they’ve been reproduced in countless scholarly and commercial publications, but they were lower-resolution, slightly oversaturated, and they didn’t include scales or colour charts. On top of that, the old website has given rise to new research about the manuscript, and the catalogue entry—and the derivative descriptive metadata for the digital surrogate—need to be updated.

What do you do now? Do you replace the old images outright? Do you preserve them privately (your conservation department might find them useful, if nobody else), but share them with researchers only on request? Do you put the new facsimile online alongside the old one, as a separate digital object, and cross your fingers that everyone will be able to identify the one they need? Do you try to merge the image sets into a single facsimile, with alternate images and metadata for each page of the original object, so people can cite and link to both? OK, that’s a nice idea, good choice, but how are you going to do that, with the resources you have?

A real-world example: in 2009, the Bodleian collaborated with Artstor to digitize its manuscript illumination slide library. The slide library itself was a bizarre thing, a legacy of already-mysterious choices. In the 1950s and ’60s, under William Hassall, a group of untrained volunteers leafed through the Bodleian’s medieval western manuscripts, cataloguing each illuminated page. The result was a massive iconographical card index. Researchers began ordering images of items in the index, and the Bodleian kept a copy of each image. Under Hassall’s guidance, the Bodleian studio also began creating themed filmstrips containing reproductions of, for example, cats, or windmills, or miniatures by 15th-century Flemish masters. The one-off images and the filmstrips—collections of about 50,000 images in total, amassed over decades—were catalogued for purchase, so that academics, picture researchers and interested members of the public could walk into the Bodleian shop—or write to the imaging studio—and buy a roll of images of children playing games in medieval books of hours.

When the slides were shipped off for digitization in 2009, Artstor hired an art historian to correct and expand Hassall’s volunteers’ image captions, and a magician—presumably—to clean up the slides themselves, which had tended to go purple over time. The slide captions were also enriched with dates and places of origin from Pächt and Alexander’s catalogue of illuminated manuscripts. The images were published online using Luna, a then-new platform that coped well with extensive image-level metadata. Only about half the images were actually digitized; perhaps the others weren’t in good enough shape, or were of items of less interest to Artstor. Nevertheless, the result, a collection of 25,000 images, so well catalogued that you could search for “unicorn” and get dozens of images of unicorns, was an enormously rich resource, the latest in a series of initiatives—iconographical index, manuscript catalogue, slide library—that brought the Bodleian’s illuminated manuscripts into the reach of people who wouldn’t otherwise have known they existed.

It didn’t take long for the digitized slide collection to become the bane of the Bodleian digital team’s collective existence. We didn’t have an ongoing support contract with Luna, and we hosted the data on an already-old machine that, by the time our Digital Preservation Fellows arrived in 2016, was held together mostly by fear. (These are things we would do differently now, of course, but I imagine we would have done them differently then too if we’d been able to afford it.) There were a number of problems with the slide metadata, from incorrect date statements to entire slide rolls where the wrong description was attached to each image, but we couldn’t get into the database to change anything, so when helpful readers or curators came to us with errors, all we could do was add them to a list.

Then the Digital Preservation Fellows arrived, hacked into our Luna database, extracted all the metadata, copied the slide images off a hard drive they’d found in a filing cabinet, and put everything on my software engineer colleague’s desk. It took about a year, working on-and-off, to clean up and migrate the data. This included hours of normalizing date and language fields, extensive conversations about how to shoehorn information about previous surrogates into a MODS record, a visit to the imaging studio to see the original slides (now in stacks of black crates), and a 20-line regular expression to gather all the different filmstrips’ images of a given manuscript together and put them into folio order.
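The folio-ordering step might have looked roughly like this; the filename pattern below is entirely hypothetical (the real filmstrip names were far messier, hence the 20-line expression), but it sketches the grouping-and-sorting idea:

```python
import re

# Hypothetical filename shape: roll prefix, shelfmark, folio number, recto/verso,
# e.g. "roll208a-12_ms-bodl-264_fol-023r.jpg". The real names were not this tidy.
PATTERN = re.compile(r"(?P<shelfmark>ms-[a-z0-9-]+)_fol-(?P<num>\d+)(?P<side>[rv])")

def folio_order(filenames):
    """Group image filenames by shelfmark and sort each group into folio order,
    with the recto of a leaf sorting before its verso."""
    groups = {}
    for name in filenames:
        m = PATTERN.search(name)
        if not m:
            continue  # skip anything we can't parse; real life needs a report here
        sort_key = (int(m.group("num")), m.group("side") == "v")
        groups.setdefault(m.group("shelfmark"), []).append((sort_key, name))
    return {k: [n for _, n in sorted(v)] for k, v in groups.items()}
```

Normalizing the date and language fields is the part no regular expression saves you from.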

In our early conversations about migrating the slide collection, we’d considered matching the Luna metadata with newer images where we had them. Several of the books and manuscripts featured in the slide collection had been photographed properly in the last ten years, meaning it might be possible to create a sort of super-facsimile of a kind hitherto unknown to us, containing excellent images and excellent metadata. We discarded this idea partly for practical reasons (it would be difficult to match images and metadata, and even more difficult to consolidate metadata in cases where there were multiple Luna records matching a single newer image) and partly because we decided that we needed to hold onto the slide images. The thing is that they are bad images. Artstor’s cleanup efforts notwithstanding, they are scans of 40-year-old 35mm film reproductions, and they are generally grainy, dim, and at least slightly the wrong color. (We don’t need to go to Google Images for our Yellow Milkmaid Syndrome.) For scholars working with manuscript facsimiles, citing a particular reproduction is a way of showing one’s work, and it would be tremendously misleading, after decades of presenting a certain set of images as official facsimiles, to suddenly swap them out for something radically different, even if radically different meant radically better. Confoundingly, the slide images must be kept because of their badness; they’re preservation-quality because they aren’t preservation-quality at all.

When it comes to preserving badnesses, where do you stop? Those slide rolls with the wrong descriptions attached to each image—the ones where a dragon is described as Christ walking on water, or one medieval children’s game is almost undetectably swapped for another—these errors might also have been cited in scholarship. The Luna interface itself, which made sense as a way of flicking through slide images, but never allowed you a wide-angle view of a whole digitized manuscript—this no doubt influenced interpretation of the images, so might it not also require preservation? In this particular case, we quietly fixed the metadata errors, although we still maintain a Git repository of all the Luna data, so we can locate previous versions of records if we need to. And our preservation fellows did take a screen capture of our Luna interface, although I’m not sure when or how we would make the capture available to the public. But once that was done, we turned on our link redirect service, which maps most-but-not-all of the old Luna URLs to the images’ new Digital Bodleian permalinks, and we shut the old server off.
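A redirect service like that can be, at its heart, just a lookup table built during migration. A toy sketch, with made-up identifiers and URLs standing in for the real ones:

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical mapping from old Luna record IDs to new permalinks; in practice
# this table comes out of the migration itself, and it is most-but-not-all of
# the old URLs because some records never got a new home.
REDIRECTS = {
    "BODLuna~8~8~12345": "https://digital.bodleian.ox.ac.uk/objects/abc-123/",
}

def redirect_target(old_url):
    """Return the new permalink for an old Luna URL, or None if unmapped
    (unmapped requests are where you'd serve a helpful 404 instead of a 301)."""
    qs = parse_qs(urlparse(old_url).query)
    record_id = (qs.get("id") or [None])[0]
    return REDIRECTS.get(record_id)
```

The table is dull; the decision to keep serving it forever is the preservation commitment.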

There isn’t necessarily an expectation that libraries preserve records of interactions with their collections, of course. We don’t preserve stack request slips, and when the slips are replaced by an electronic circulation system, I doubt we’ll make an archival recording of the very last slip being filled out. But our conservation team does keep records of interventions and condition reports, and we know that the ancient procedural artifacts that simply happen to have been preserved—the medieval book lists, the accounting tables in the Cairo Genizah, the names doodled on the flyleaves of priceless books—are some of the most important things in our collections for helping us understand how people thought and acted. We also know—here, again, our own collections bear the evidence—how wrong we can be about how to treat old things, whether it’s selling our First Folio as soon as the Third Folio comes out or painting Hebrew manuscripts with a serum to treat iron-gall ink burn that eventually turns the whole page black. Knowing these things, and knowing that we can preserve at least the images and metadata behind every previous version of a digital surrogate, how hard should we try to do it? How much should we prioritize doing it? Given the cost of high-quality manuscript photography, I don’t imagine that we’re soon going to be in a situation where we have more facsimile images than we can afford to store, but we might someday, and what then?

The proliferation of digital surrogates etc. etc. means that right now there are five separate Digital Bodleian objects corresponding to MS. Bodl. 264. One is a full set of images captured in 2015 for a print facsimile; one is another full set of images (boards excluded) captured in 2008; two are the Luna slides from various rolls (divided into two objects corresponding to the two sections of the manuscript); and one is a pair of images of the boards. If you’re looking for a specific part of the manuscript, your best bet is to use the slide version to find the folio number for the text or decoration you want and then look for the same folio number in the 2015 version. But not everybody knows that, and even if they did, they might not ever actually do it. If you search Digital Bodleian for “unicorn”—which you can now do reasonably effectively—you probably aren’t going to think, Ah, but I wonder if there’s a higher-quality facsimile of that same unicorn image, hidden under a different item with the same shelfmark for some reason.

I know how I want to fix this. Like I mentioned earlier, I want to merge all the images into a single sequence, so that for each page of the original object, there’s a default image—the newest or highest-quality one—and then a number of alternate images that the user can click between. IIIF already supports something like this, although the only user interface for it so far is rather fiddly, and if I think for ten seconds about implementing it I arrive at horrible questions (Where do we get the image names from? How do we divide up the metadata for the different versions? Do we give each image a unique identifier? We’d have to, right, if we wanted to make them citable?).
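In IIIF Presentation 3, the relevant construct is a Choice body: one canvas per folio, with a painting annotation carrying several image bodies the viewer can toggle between. A minimal sketch of the JSON structure, built in Python with every identifier made up:

```python
def canvas_with_choice(canvas_id, images):
    """Build a minimal IIIF Presentation 3 canvas whose painting annotation
    offers a Choice between several images of the same folio.
    `images` is a list of (label, image_url) pairs, default/newest first."""
    choice = {
        "type": "Choice",
        "items": [
            {
                "id": url,  # giving each version its own URI is what makes it citable
                "type": "Image",
                "format": "image/jpeg",
                "label": {"en": [label]},
            }
            for label, url in images
        ],
    }
    return {
        "id": canvas_id,
        "type": "Canvas",
        "items": [
            {
                "id": f"{canvas_id}/page",
                "type": "AnnotationPage",
                "items": [
                    {
                        "id": f"{canvas_id}/page/anno",
                        "type": "Annotation",
                        "motivation": "painting",
                        "body": choice,
                        "target": canvas_id,
                    }
                ],
            }
        ],
    }
```

The spec answers the identifier question; it says nothing about where the labels and per-version metadata come from, which is exactly the hard part.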

While we figure out version control, I think we could start by providing context. Until we started migrating the Luna slide collection, I’d had no idea where the metadata came from or how the images had been chosen, or even how old the slides were. Most digital surrogates are like this, presented without context as the sole authoritative digital presence for that object. Providing three or four different digital surrogates for the same object is miserably confusing from a UX point of view, but it does at least help to break down that illusion of authority. Imagine if, alongside each different digital surrogate, we provided context: why this object was chosen for digitization, and how and when the images were captured, and how they were first delivered to the public, and what’s happened to them since then. It would be a long story in some cases. But it would be useful—as a record-keeping exercise for us, if nothing else—and in comparison to 200MB images, text is cheap.