Leveraging the Digital Library for Publishing the Past and Future

Introduction

There is a powerful and perhaps controversial thesis that provides the underpinnings for my talk to you today. It is a thesis that has been implied in the work our library and several other libraries have done for the past few years:

Large research libraries have a responsibility to build and support the "infrastructure" necessary to enable republishing the past; and, to the extent that the past is always a part of the present and the future, that same infrastructure will prove invaluable in present and future publishing at our institutions.

What I will present to you is an overview of the digital library enterprise at one leading institution, but the activities and capabilities of the University of Michigan Library are in no way unique, nor are its responsibilities. As the landscape of our world of digital information takes shape, it will be the digital library work of our leading research libraries that provides us with mechanisms for communicating many of our thoughts and much of our work to each other, and that work should also significantly influence the economics of that evolving world.

My discussion is structured as follows:

  1. I will first describe for you the basic text digitization and publishing systems at Michigan, trying to give you a sense of the scale and the cost-effectiveness. Because time is short and this scope is ambitious, I will not be discussing our image digitization and publishing systems, but similarly large investments are made in support of those methods of capture and delivery.
  2. I will turn then to a discussion of the way that these systems are being made available to immediate members of our community, and the way that these offerings have the potential to influence the economics of publishing our past and future.

Context

Many of you will be familiar with resources developed by my organization, Michigan's Digital Library Production Service. The Michigan Making of America collection now offers nearly 3 million pages of 19th century books (perhaps as much as 3% of US 19th century book publishing in English), and several hundred thousand pages of 19th century journals[1]. The Middle English Compendium is centered on the electronic Middle English Dictionary, a conversion project that took into account 75 years' worth of work at Michigan and resulted in approximately 20 million words of carefully transcribed lexicography[2]. These collections and many others were not built from disparate project-oriented staffing and a "soft money" organization, but grew out of a large cooperative enterprise that is firmly established within the University of Michigan Library. The Library's Preservation unit, for example, redirected significant numbers of staff as part of the most recent round of Making of America. The Library's Cataloging, Serials, Collection Development, and Acquisitions units have all played pivotal roles in these major digital activities, and continue to do so as appropriate. Moreover, the Digital Library Production Service itself is firmly a Library unit, a part of one of three divisions within the Library, and consists of approximately twenty full-time staff, most of whom are on base funding (i.e., rather than grant funding or revenue). [illus: Org chart] Staff are divided into two primary areas of work. The Digitization group is composed of specialists and support staff who are responsible for formats or methods, and who manage all aspects of that work including workflow and selection of methods. The Information Retrieval and Architecture group is responsible for building and maintaining systems for putting collections online, with tasks that range from system administration to search heuristics to system design and specification.

In all areas of our work, we are likely to confront a "buy or build" decision: should we develop the system from scratch (or digitize the resource using in-house staff or equipment), or should we buy the service or system from another company or organization? Our mission is to facilitate the effective creation and maintenance of digital libraries, and so the answer to the question is never a foregone conclusion, and must be reached within a given context. As a result, each year we contract for significant services and systems, at the same time that we locally process millions of pages and build some of the most significant digital library systems available on the market today.

Text Digitization

Our digitization processes are shaped by formally defined "standards" and community-based "best practices." For example, in our retrospective conversion activities that result in materials like those in Making of America, we strive to replace Preservation-based processes that would, in the past, have created microfilm. To have any credibility, especially in the library community, those digital processes must be defensible on grounds of permanence and fidelity, as well as improved functionality. Consequently, and as a result of over a decade of work and discussions, we would define the following as an ideal toward which we might strive:

  1. Typical printed pages must be captured in all cases, with a resolution of no less than 600 dpi, bitonal bit depth, and using a formally recognized standard format with lossless compression (ITU Group 4 TIFF). [illus: http://www.hti.umich.edu/cgi/t/text/pageviewer-idx?idno=ael2934.0001.001;c=moa;seq=00000263;view=pdf]
  2. Illustrations with tonality or color must be captured with appropriate methods, including grayscale at similar resolutions or, for color, resolutions and bit depths rivaling traditional photography; e.g., color fold-outs in MOA are photographed using a large format (i.e., 4x5) camera and the film positives are digitized at roughly 600 dpi using effective scale and color management. [illustrate]
  3. The text itself should be captured using methods--keyboarding or corrected OCR--that would result in levels of accuracy comparable to high quality publications, typically at 99.995% accuracy or greater (i.e., error rates that do not exceed one in 20,000 characters).
  4. Encoding of text must be full XML or SGML, ought to conform to the TEI Guidelines, and must represent all major and minor structures (e.g., chapters and lists), as well as typographic variation (e.g., shifts in fonts). [illus: http://www-personal.umich.edu/~pfs/eebo/dox/samples.html, specifically http://www-personal.umich.edu/~pfs/eebo/samples/samp7.html]
  5. A copy of all converted material must be stored on gold CD-ROMs using ISO9660 naming conventions. Preliminary NISO-associated testing conducted by Kodak has shown that gold CD-ROMs should have at least a 200 year life before they begin to degrade.[3]
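
To make the arithmetic behind the accuracy benchmark concrete, here is a minimal sketch that converts an accuracy percentage into characters per error and into expected errors for a given amount of text. The function names and the 2,000-character page are assumptions made for illustration; they are not figures from our workflow.

```python
from fractions import Fraction


def chars_per_error(accuracy_percent: str) -> Fraction:
    """Average number of characters per error at the given accuracy."""
    # Parsing the percentage from a string keeps the arithmetic exact.
    error_rate = 1 - Fraction(accuracy_percent) / 100
    return 1 / error_rate


def expected_errors(accuracy_percent: str, characters: int) -> Fraction:
    """Expected number of wrong characters in a text of the given length."""
    error_rate = 1 - Fraction(accuracy_percent) / 100
    return error_rate * characters


# 99.995% accuracy corresponds to one error in 20,000 characters.
print(chars_per_error("99.995"))        # 20000
# Assumed page of 2,000 characters: one error every ten pages on average.
print(expected_errors("99.995", 2000))  # 1/10
```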

These are, of course, ideals, but there are very few cases where Michigan does not meet or exceed these benchmarks. We aim for a balance of cost and functionality without compromising the long-term viability of the conversion. So, for example, while these five steps were performed in exactly this form for the Corpus of Middle English materials, in Making of America we have chosen to use high quality uncorrected OCR rather than striving for 99.995% accuracy. Even taking into account economies achieved by scale, the difference in cost for uncorrected OCR and text corrected to 99.995% accuracy is phenomenal--nearly 100 times more expensive for the very accurate text. To this list of methods, then, we would add uncorrected OCR as an alternative for the third component, describing the approach taken at Michigan as:

3a. Cost-effective uncorrected OCR should be employed where possible, using methods that are as accurate as possible. Research in the area of OCR consistently finds that "voting" systems that employ multiple engines are far more effective than single OCR engine systems.[4] At Michigan, we use PrimeRecognition's PrimeOCR[5] with six commercial OCR engines, and in tests on MOA materials we have found that accuracy rates typically exceed 99.8%.[6] [illus: http://www.hti.umich.edu/cgi/t/text/pageviewer-idx?idno=ael2934.0001.001;c=moa;seq=00000263;view=text]
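
The general idea of "voting" can be illustrated with a toy sketch. This is not how PrimeOCR works internally; commercial systems align the engines' output and weigh confidence scores. The sketch assumes, for simplicity, that every engine returns a reading of the same length, so that characters can be compared position by position. The function name and sample readings are invented.

```python
from collections import Counter


def vote(readings: list[str]) -> str:
    """Character-level majority vote over pre-aligned, equal-length readings."""
    if not readings:
        return ""
    length = len(readings[0])
    if any(len(reading) != length for reading in readings):
        raise ValueError("this toy sketch assumes readings of equal length")
    winners = []
    for candidates in zip(*readings):
        # Keep the character that most engines agree on at this position.
        winners.append(Counter(candidates).most_common(1)[0][0])
    return "".join(winners)


# Three hypothetical engines, each making a different single-character error.
readings = ["the old house", "the o1d house", "thc old house"]
print(vote(readings))  # the old house
```

Each engine errs in a different place, so the majority reading is correct at every position even though only one of the three engines got the whole line right.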

Where is our capacity: In-house or Outsourced?

The volume of work (rather than the method) typically determines whether we perform an operation in-house or outsource it. There are of course exceptions, but in the last year we outsourced over three million pages of scanning, and nearly all of our 99.995% accurate text was produced by paying for "keyboarding," i.e., having the material typed at a cost of approximately $1 per thousand characters. By contrast, we scanned only thousands of pages in-house, and we proofed OCR in-house only when the original OCR was so accurate that the need for human intervention was minimal. On the other hand, we perform all of the millions of pages of uncorrected OCR each year ourselves, and could conceivably "ramp up" to handle, cost-effectively, several times what we are now doing. All of our photographic imaging is performed in-house, though this is typically because the materials (e.g., papyrus) cannot be sent out.

The scale of Michigan's work in all of these areas is remarkable not for the absolute numbers (which could not rival a commercial operation), but for the numbers relative to the breadth of these activities--few commercial firms operate in all of these areas--and the fact that we do this work in direct association with our mission. In 2000, we

Retrospective conversion and Modern Publishing

How do these "ideal" methods for retrospective conversion differ from what one would recommend for a modern publishing process? In fact, the only notable area of difference is the capture of page images, which would typically be unnecessary for a modern publication. In all other cases, however, the standards and methods we use are seen as ideals that many publishing operations strive to reach. Differences are small and relatively insignificant: for example, where a publisher might use the Open e-Book DTD or a variant of the ISO 12083 publishing DTD, DLPS uses a compact subset of the TEI; color values in images for publishing are typically registered in a CMYK color space, while (because of our digital orientation) we use an RGB color space. Because of the long-term responsibility we have for the materials, our approaches in general tend to be more rigorous and painstaking; our methods are, nevertheless, "compatible" with the best approaches in the publishing world.

"Electronic Publishing"

What is publishing and how does the work of putting online those materials we create differ from what a publisher does? A conventional publisher clearly performs more tasks than does an operation like ours. For example, we do not actively market or promote the materials we put online. Nevertheless, we do perform most other operations, including text creation, negotiation with "authors," copy-editing, and even the digital analogue of printing. I would, then, like to discuss our efforts to make the resources available electronically as "publishing."

What is DLXS?

DLPS develops and distributes a significant system for access to text and image collections. The systems are distributed through a program we call the Digital Library eXtension Service, or DLXS. DLXS consists of a commercial search engine, whose rights we have secured so that we can extend its functionality and contain costs, and a rich collection of software that provides an interface to the collections and the search engine. We are not permitted to "give away" the search engine, but the real work of integrating the materials is in software we develop, and is thus owned in its entirety by the University of Michigan; consequently we distribute this "middleware" through free Open Source mechanisms. We have only been distributing the software in this way for a short period of time, but already other institutions are beginning to make contributions to our development process. The search engine, XPAT, costs $15,000 one-time for a single server license ($5,000 per year for updates and support), a hefty price for many, but a tiny fraction of its price prior to our acquisition, and a relatively small amount compared to other large enterprise search engines (typically more than $100,000).

Text Class

The Text retrieval system, which we call Text Class, has been designed to support a broad range of textual resources. We have a working hypothesis, always tested, that one system can support a continuum of textual materials, and that the materials that make up Making of America sit at one end of that continuum, while modern books and journals sit at the other end. Two of the shaping characteristics that we consider are:

  1. Structure: Textual materials have structure, but whether historical artifacts converted as part of a preservation process or modern publications, that structure may be minimal or considerable. For example, the volumes in Making of America are typically constituted only by OCR and page images, with references to the page images embedded in the OCR. The structure in the book is, minimally, the body of the book and the content of the individual pages [illus: http://www.hti.umich.edu/cgi/t/text/text-idx?c=moa;idno=ABB3816.0001.001;view=toc]. Of course a historically converted text like those in the Corpus of Middle English may have considerable structure encoded [illus: http://www.hti.umich.edu/cgi/c/cme/cme-idx?type=header;idno=LoveMirrour]. A modern publication such as those published in the ACLS historical monographs project may have no more internal structure than the Making of America volumes, or may indeed have a complex structure. A journal, whether "born digital" or converted as part of Making of America, will typically have a volume, issue, and article structure, and the articles themselves may or may not be internally structured [illus: http://www.hti.umich.edu/m/moajrnl/browse.journals/sout.1836.html]. The fact that this surfeit or paucity of structure may exist in either a modern publication or a historical artifact causes us, as system builders, to need to accommodate a continuum of structure for any collection of publications, regardless of its date of publication.
  2. Layers of representation: Making of America materials are displayed, by default, as page images, with the OCR suppressed. This sleight of hand (searching one form and displaying another) might seem unique to materials we convert, but it has been an irony of our work that we have encountered this need in "born digital" materials as well. In fact, Making of America materials may be displayed as GIF images, PDF files, or OCR. The University of Michigan's Scholarly Publishing Office now works with modern editors to create publications that contain searchable XML-encoded text but display only PDF and page image files [illus: http://www.philosophersimprint.org], and the ACLS monograph project may in fact make all of these formats available to readers. What is searched may always be encoded text, but what is displayed should always be configurable and may consist of multiple layers.
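
The "search one form, display another" idea described in the two points above can be sketched in a few lines. This is not DLXS or Text Class code; the markup (a volume element holding page elements with seq and image attributes), the sample text, and the function name are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal structure: a volume made only of pages, each carrying
# its OCR text and a reference to the corresponding page image.
SAMPLE = """
<volume id="example.0001.001">
  <page seq="1" image="00000001.tif">A history of the railroad in Michigan</page>
  <page seq="2" image="00000002.tif">The canal and the steamboat came first</page>
  <page seq="3" image="00000003.tif">Railroad construction resumed after the war</page>
</volume>
"""


def search(volume_xml: str, term: str, view: str = "image") -> list[tuple[int, str]]:
    """Search the OCR layer; return (page sequence, layer to display) per hit."""
    if view not in ("image", "text"):
        raise ValueError("view must be 'image' or 'text'")
    hits = []
    for page in ET.fromstring(volume_xml.strip()).iter("page"):
        ocr = page.text or ""
        if term.lower() in ocr.lower():
            # The match is always found in the text, but the configured
            # view decides which layer is handed back for display.
            shown = page.get("image") if view == "image" else ocr
            hits.append((int(page.get("seq")), shown))
    return hits


print(search(SAMPLE, "railroad"))
# [(1, '00000001.tif'), (3, '00000003.tif')]
print(search(SAMPLE, "canal", view="text"))
# [(2, 'The canal and the steamboat came first')]
```

The same loop would serve a richly structured text equally well; only the element that counts as the unit of retrieval would change.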

 

The system we have developed does indeed support the use of a continuum of publications, loosely or highly structured, modern or historical, with any number of layers, and does so in a variety of ways. My examples have shown you how the materials can be displayed, but "display" can only take place once successful "discovery" has occurred. We believe that searching is the heart of a successful system, and thus devote significant portions of development resources to creating effective searching. As the body of materials grows larger, the need to be able to do the following becomes necessary:

Facilitating reading and printing of digital publications is powerful, but integrating new methods of searching and browsing can transform the way that we think about publishing and publications.

The Economics of Scale and Public Goods

Accomplishing all of this is not inexpensive, but by capitalizing on large-scale conversion and shared development, and by turning what is essentially a public good into a foundation for analogous activities in other organizations or institutions, we are able to make high quality, permanent digital resources available at a low cost. To help facilitate the process of writing grants and calculating costs within the University of Michigan Library, we have developed UM-approved recharge rates. These are intended to reflect the internal cost of doing business and cannot be offered externally; moreover, the rates change with changing scale, changing staff costs, and changing technologies. Nevertheless, they are helpful to consider in this discussion.

Scanning (for brittle, hand-fed materials)[7]    $0.13 per page
Uncorrected OCR[8]                               $0.04 per page
Film positives                                   $8.05 per shot, incl. services and materials
Scanning film positives                          $9.91 per image
Writing CD-ROMs                                  $9.47 per disc

Those of you who have paid for conversion will appreciate the extremely low costs I am reporting here; they are typically far less than commercial rates, or (in the case of scanning) are among the lowest rates that can be secured from commercial operations.
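
As a rough worked example, the per-page rates above can be combined into an estimate for a single volume. The 300-page volume, the function name, and the decision to leave out film and CD-ROM costs by default are assumptions for illustration only; as noted, the rates themselves are internal and change over time.

```python
from decimal import Decimal

# Rates taken from the list above (US dollars).
SCANNING_PER_PAGE = Decimal("0.13")
OCR_PER_PAGE = Decimal("0.04")
CD_PER_DISC = Decimal("9.47")


def conversion_cost(pages: int, discs: int = 0) -> Decimal:
    """Estimated cost of scanning and uncorrected OCR, plus optional discs."""
    per_page = SCANNING_PER_PAGE + OCR_PER_PAGE
    return pages * per_page + discs * CD_PER_DISC


# Hypothetical 300-page volume, page images and uncorrected OCR only.
print(conversion_cost(300))           # 51.00
# The same volume with one archival disc written.
print(conversion_cost(300, discs=1))  # 60.47
```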

Commercially available systems for delivering digital library content are increasing in number. Single purpose-built systems, for example for dealing with high resolution color images, often cost $100,000 per year or more. Publishing systems with similar functionality are expensive to buy or to build, and the cost to the University of Michigan of developing DLXS is not low. With two full-time programmers, two collection specialists, and most of an interface specialist's time devoted to the enterprise, one can readily see that the cost of developing DLXS rivals the purchase of a commercial system. On the other hand, the cost to the University of Michigan is defrayed through the sale and support of the XPAT search engine. Moreover, by using free Open Source distribution methods, other institutions benefit from our development efforts at a fraction of the cost (sometimes at no cost), and increasingly contribute new development to the system.

Many of the resources that come out of the digitization and system building efforts are made available to the world at no cost. The Making of America is a remarkable example, where Michigan is effectively tearing down the walls of the library, making available the contents of its library to individuals and institutions around the world at no cost. Many of the new publications we put online in collaboration with scholars or institutions have also been made available at no cost, in large part because we are typically able to make our host service available to them for free when their resources are offered for free. This has happened with new electronic journals and with scholarly projects, as well as with retrospective conversion projects. Hardware and software costs are already attributed to other Library projects, so that only a small fraction need be borne by the publisher's product. If system changes are negligible, simply loading and providing access to large amounts of data may cost less than a few hundred dollars per year.

Not everything can be "free," however, and cost recovery is often a necessary part of an initiative. For example, the Middle English Dictionary, a publication of the University of Michigan Press, is made available through subscription mechanisms to pay for the ongoing staff costs, and the Bibliography of Asian Studies (from the Association for Asian Studies) requires ongoing funding for its operations, which we facilitate through subscription services. In these cases, we charge for the services we provide, but our charges are remarkably low. We are able to do this by calculating the cost of our services based on their marginal costs. If, for example, we are able to use existing hardware and software, the only cost to the publisher is the cost of loading data, tailoring the interface to the publisher's requirements, and managing the subscription service. Our costs are low (less than one-third the amount typically charged to academic projects), the marginal cost of assisting a new academic project is low, and this helps to make the subscription price very low as well.

The future of libraries is in digital collections, both the collections of their past and those of their future. By doing their work well and effectively, research libraries such as the University of Michigan's can build an infrastructure that is equally well-suited to recapturing the past electronically and creating a digital future. By forming partnerships with appropriate academic partners, we can share the benefits of these large-scale, cost-effective efforts. In doing so, we will not only shape the information landscape, ensuring the wide availability of high quality publications, but also shape the economics of that landscape, with much of the material made available at no cost, and the remainder at low cost.

While that effectively ends the formal part of my presentation, I would like to add a note regarding the establishment and work of Michigan's new Scholarly Publishing Office. In part because of changes consistent with what I have described, the University of Michigan Library has created a new organization that reports to the head of the division of which DLPS is a part (i.e., reports to my boss). This new organization, the Scholarly Publishing Office, is responsible for cultivating academic publishing efforts like many of those I have described, including at least two of the examples I touched on. The infrastructure used by the Scholarly Publishing Office is, overwhelmingly, that created by the Digital Library Production Service. Their work is both a vindication of the model I have discussed and, I believe, a harbinger of things to come.

John Price Wilkin

April 5, 2001

Notes

[1] http://moa.umdl.umich.edu/

[2] http://ets.umdl.umich.edu/m/mec/

[3] http://www.kodak.com/global/en/professional/products/storage/pcd/techInfo/permanence.shtml

[4] Wilkin, John. "Enhancing Access to Digital Image Collections: System Building and Image Processing," in Moving Theory into Practice: Digital Imaging for Libraries and Archives, edited by Anne R. Kenney and Oya Y. Rieger.

[5] http://www.primerecognition.com/

[6] http://moa.umdl.umich.edu/moaocr.html

[7] This is not an established recharge rate; however, it is a typical cost we pay through contracted work on a large scale.

[8] Our currently established recharge rate is $0.08 per page, but the volume of work performed in FY 2000 should drop that figure to $0.04 in the coming fiscal year.