
Revisiting OCR

by Matt Phillips on 6 February 2025

Like many GLAM (Galleries, Libraries, Archives and Museums) institutions, the University of Southampton Library has accumulated and generated a number of digitised collections. In building them over the last two decades, we have often used the cutting-edge technology of the day to put our treasures online. Of course, the technology available to us in the early 2000s has come a long way, such that the capabilities of today’s LLMs (large language models), AI assistants and computer vision tools seem miraculous by comparison. We have more computing power, more bandwidth, more intelligent tools and more electronic resources to draw on, but our data (the full text of digitised volumes) was generated with OCR (Optical Character Recognition) software eight years ago. In fact, the digitised British Official Publications (Ford Collection), which we’re known for and were at one time named for, were largely processed for OCR in the early 2000s and then again in 2016, in an attempt to improve and broaden access. So, is it time to reprocess our digitised collections again? Has OCR technology really improved that much? Given that reprocessing approximately a million pages of text is a significant undertaking, we were keen to be guided by data.

[Image: Official publication with faint type and poor OCR output. From: General abstracts showing the acreage under crops and the number and description of live stock in each county and province of Ireland, for 1900-01, 1902.]

Focusing on printed material, we sampled our digitised collections, reprocessed a selection of items with our current OCR server (ABBYY FineReader Server 14), and compared error rates. We identified where characters were incorrect or undetected, where the order of characters was wrong, and where blemishes on the page were incorrectly identified as text. We saw reduced error frequency across the board, with the reprocessed sample containing around 30% fewer errors than the existing full text. We also noticed improvements in layout recognition, particularly where text was arranged in columns. Tabular data still proved difficult for the up-to-date OCR server, and all items showed some degree of error. Nevertheless, the potential improvements to accessibility and computational access clearly indicated a need to reprocess our images.
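One common way to make comparisons like this concrete is a character error rate (CER): the edit distance between the OCR output and a hand-checked transcription, divided by the length of the transcription. The sketch below is illustrative only and is not our production tooling; the file names and the assumption of a small set of checked transcriptions are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ocr_text: str, reference: str) -> float:
    """CER = edit distance / length of the reference transcription."""
    return edit_distance(ocr_text, reference) / max(len(reference), 1)


# Hypothetical files: one page's old full text, its reprocessed output
# and a manually checked transcription to score both against.
old = open("page_0042_old_ocr.txt", encoding="utf-8").read()
new = open("page_0042_new_ocr.txt", encoding="utf-8").read()
truth = open("page_0042_transcription.txt", encoding="utf-8").read()

print(f"Existing full text CER: {character_error_rate(old, truth):.3f}")
print(f"Reprocessed CER:        {character_error_rate(new, truth):.3f}")
```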

[Image: Sample of ALTO (Analysed Layout and Text Object) XML generated through OCR processing.]
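Each ALTO file pairs the recognised text with its position on the page, which is what lets words be highlighted on a page image or extracted for indexing. As a rough illustration rather than our production code, the snippet below reads the word content and coordinates from an ALTO file using only the Python standard library; the file name is hypothetical, and the namespace handling is deliberately loose because ALTO files can use different schema versions.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip the XML namespace so the code works across ALTO schema versions."""
    return tag.rsplit("}", 1)[-1]


tree = ET.parse("page_0042.alto.xml")  # hypothetical file name

for elem in tree.iter():
    if local_name(elem.tag) == "String":
        word = elem.get("CONTENT")
        x, y = elem.get("HPOS"), elem.get("VPOS")
        w, h = elem.get("WIDTH"), elem.get("HEIGHT")
        print(f"{word!r} at ({x}, {y}) size {w}x{h}")
```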

While recognising the value of reprocessing, we needed to ensure that it didn’t adversely affect business-as-usual OCR by blocking access to the server. We divided the OCR jobs into batches that could be processed overnight and at weekends, using a Python script running as a cron job on our automation server. The script downloads images from our Digital Library platform via an API and sends them to our OCR server. The OCR server returns ALTO XML files containing the text and positional coordinates for each image, which the script then packages ready for re-ingest into the Digital Library at a future date. At this pace we expect the work to take a couple of years to complete. As well as improving the accuracy of the OCR, we’re taking the opportunity to detect non-text elements such as illustrations and diagrams. With these identified, we should be able to use AI to describe the images and hopefully open new avenues into our collections.
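The outline below sketches what such a batch script could look like. It is illustrative only: the Digital Library API endpoint, the hot-folder style hand-off to the OCR server, the directory paths and the batch size are all assumptions, not a description of our actual integration.

```python
"""Nightly batch job: fetch page images, hand them to the OCR server, collect ALTO output.

Intended to run from cron outside working hours, e.g.:
    0 20 * * *  /usr/bin/python3 /opt/ocr/reprocess_batch.py
"""
import zipfile
from pathlib import Path

import requests

DIGITAL_LIBRARY_API = "https://example.org/api/items"  # hypothetical endpoint
OCR_INBOX = Path("/mnt/ocr-server/inbox")    # assumed hot folder watched by the OCR server
OCR_OUTBOX = Path("/mnt/ocr-server/outbox")  # assumed folder where ALTO XML appears
PACKAGE_DIR = Path("/data/reingest-packages")
BATCH_SIZE = 500  # pages per overnight run (illustrative)


def fetch_batch(session: requests.Session) -> list[dict]:
    """Ask the Digital Library for the next set of images still awaiting reprocessing."""
    resp = session.get(DIGITAL_LIBRARY_API, params={"status": "needs-ocr", "limit": BATCH_SIZE})
    resp.raise_for_status()
    return resp.json()["items"]


def download_images(session: requests.Session, items: list[dict]) -> None:
    """Download each page image into the OCR server's watched inbox."""
    for item in items:
        target = OCR_INBOX / f"{item['id']}.tif"
        with session.get(item["image_url"], stream=True) as resp:
            resp.raise_for_status()
            with open(target, "wb") as fh:
                for chunk in resp.iter_content(chunk_size=8192):
                    fh.write(chunk)


def package_alto(batch_name: str) -> Path:
    """Zip up whatever ALTO XML the OCR server has produced, ready for later re-ingest."""
    archive = PACKAGE_DIR / f"{batch_name}.zip"
    with zipfile.ZipFile(archive, "w") as zf:
        for alto_file in sorted(OCR_OUTBOX.glob("*.xml")):
            zf.write(alto_file, arcname=alto_file.name)
    return archive


if __name__ == "__main__":
    with requests.Session() as session:
        items = fetch_batch(session)
        download_images(session, items)
    # The OCR server picks files up from its inbox on its own schedule;
    # a later run packages the resulting ALTO files for re-ingest.
    package_alto("overnight-batch")
```

How the hand-off to ABBYY FineReader Server actually happens depends on how the server is configured (hot folders, its own API, and so on), so treat the folder-based approach here as one possible arrangement.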

 
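On the image-description side, one way to experiment is with an off-the-shelf captioning model. The sketch below uses the publicly available BLIP model via the Hugging Face transformers library purely as an example; it is not a tool we have committed to, and the file name is hypothetical.

```python
from PIL import Image
from transformers import BlipForConditionalGeneration, BlipProcessor

# Publicly available captioning model, used here only as an example.
MODEL_ID = "Salesforce/blip-image-captioning-base"

processor = BlipProcessor.from_pretrained(MODEL_ID)
model = BlipForConditionalGeneration.from_pretrained(MODEL_ID)

# Hypothetical crop of an illustration detected during OCR layout analysis.
illustration = Image.open("item_1234_page_007_figure_01.png").convert("RGB")

inputs = processor(images=illustration, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=30)
caption = processor.decode(output_ids[0], skip_special_tokens=True)

print(caption)  # a short description that could seed alt text for human review
```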
