Literature mark-up

David Morse, David King & Alistair Willis (OU), Guido Sautter (KIT)

Key resources
Penev, Lyubomir, Christopher Lyal, Anna Weitzman, David Morse, David King, Guido Sautter, Teodor Georgiev, Robert Morris, Terry Catapano, and Donat Agosti. "XML schemas and mark-up practices of taxonomic literature." ZooKeys 150 (2011): 89-116.
Willis, Alistair, Dave Roberts, David King, David Morse, Anton Dil, and Chris Lyal. "From XML to XML: The Why and How of Making the Biodiversity Literature Accessible to Researchers." In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), edited by Nicoletta Calzolari (Conference Chair), Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis and Mike Rosner. Valletta, Malta: European Language Resources Association (ELRA), 2010.
Penev, Lyubomir, Donat Agosti, Teodor Georgiev, Terry Catapano, Jeremy Miller, Vladimir Blagoderov, David Roberts, Vincent S. Smith, Irina Brake, Simon Rycroft et al. "Semantic tagging of and semantic enhancements to systematics papers: ZooKeys working examples." ZooKeys 50 (2010): 1-16.
TaxonX XML Schema
TaxPub XML Schema

Traditional publication has left a vast quantity of valuable data effectively trapped in paper documents. Recent efforts to transfer these documents to digital media, particularly as PDF files placed on the web, have dramatically increased overall access to publications but have not significantly improved access to, or re-purposing of, the data they contain. A simple search of one or more documents may find the search terms in context, but that context may not be what the user was looking for. Even when the search succeeds, the information sought (e.g. taxon treatments, specimen data) is not retrieved in a format suitable for repurposing, such as analysis of specimen data. To allow more precise searching for prioritised components of publications, and retrieval of data in a repurposable format, taxonomic papers are being marked up in XML (Extensible Markup Language) and query interfaces are being developed.

A mark-up language is a set of words and symbols for describing the identity of pieces of a document, e.g. 'this is a paragraph', 'this is a list'. However, mark-up is not restricted to document attributes; it can cover data attributes too, e.g. 'this word is a taxonomic name', 'these words are the name of the collector of the species previously described'. Liberating such data from the document lets researchers use it far more easily and effectively in their research.
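To make the distinction between document attributes and data attributes concrete, the following minimal Python sketch shows how data-level mark-up can be queried once it is in place. The element names (paragraph, taxonName, collector) and the sample sentence are invented for illustration; they are assumptions, not taken from TaxonX, TaxPub or any other published schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names and sample text, chosen for illustration only;
# they do not come from TaxonX, TaxPub, or any other published schema.
MARKED_UP = (
    "<paragraph>Specimens of <taxonName>Aus bus</taxonName> were collected by "
    "<collector>J. Smith</collector> near the river.</paragraph>"
)


def extract(xml_text, tag):
    """Return the text of every element named `tag`, in document order."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter(tag)]


def plain_text(xml_text):
    """Recover the running text by dropping all tags."""
    return "".join(ET.fromstring(xml_text).itertext())


taxon_names = extract(MARKED_UP, "taxonName")
collectors = extract(MARKED_UP, "collector")
original = plain_text(MARKED_UP)

print(taxon_names)
print(collectors)
print(original)
```

The same document yields both the readable text and the individual data items, which a plain-text search over a PDF cannot separate.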

The GoldenGATE editor has been developed to assist in the mark-up of OCR-ed (Optical Character Recognition) biosystematics documents into XML content.

GoldenGATE editor

GoldenGATE editor: basic layout of opening window. 1. general task bar. 2. indicator of open files. 3. editing functions. 4. customised functions following the mark-up process. 5. text window. 6. used tags. 7. position of editing indicator.

Figure: Taxonomic description, with the same portion of text shown in: A. original publication; B. XML-tagged format; C. PDF format.

One issue with the original GoldenGATE is that it is a monolithic desktop application that imposes a particular workflow on its users. A significant development in ViBRANT has been to decompose GoldenGATE into discrete web services. Users can now access GoldenGATE through their web browser and select the features appropriate to their workflow. The GoldenGATE web services are accessible both interactively online and programmatically through an API. Because identifying and extracting significant data from a text document can take a long time, the GoldenGATE web services are made accessible through OBOE.
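The reason long processing times favour a job-based service can be sketched as a submit-then-collect pattern: the caller hands over a document, receives a job identifier immediately, and fetches the result later. The Python sketch below runs entirely in-process and is an assumption made for illustration. The names JobQueue and slow_markup_service are hypothetical, and nothing here reflects the actual OBOE or GoldenGATE APIs.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_markup_service(text):
    """Hypothetical stand-in for a long-running mark-up service."""
    time.sleep(0.1)  # simulate processing time
    return f"<document>{text}</document>"


class JobQueue:
    """Illustrative in-process job queue; not the OBOE API."""

    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=2)
        self._jobs = {}
        self._next_id = 0

    def submit(self, service, text):
        """Start the service in the background and return a job id at once."""
        job_id = self._next_id
        self._next_id += 1
        self._jobs[job_id] = self._pool.submit(service, text)
        return job_id

    def status(self, job_id):
        """Report whether the job has finished."""
        return "done" if self._jobs[job_id].done() else "running"

    def result(self, job_id, timeout=None):
        """Block until the job finishes (or the timeout expires)."""
        return self._jobs[job_id].result(timeout=timeout)


queue = JobQueue()
job = queue.submit(slow_markup_service, "Aus bus sp. nov.")
# The caller is free to do other work here while the job runs.
output = queue.result(job, timeout=5)
final_status = queue.status(job)
print(final_status, output)
```

Separating submission from retrieval means a browser session or script does not have to hold a connection open for the whole processing time.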

OBOE running DateTagger service

Screenshot showing the GoldenGATE DateTaggerNormalizing web service, which marks up dates in a document, being run in OBOE.
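As an illustration of what a date-tagging and normalising step does, the following Python sketch wraps dates of the form "3 May 1912" in a tag that carries a normalised ISO 8601 value. It is a simplified, regex-based assumption, not the algorithm used by the GoldenGATE service; the element name date and the attribute name value are hypothetical.

```python
import re

# English month names mapped to month numbers.
MONTHS = {
    name: number
    for number, name in enumerate(
        ["January", "February", "March", "April", "May", "June", "July",
         "August", "September", "October", "November", "December"],
        start=1,
    )
}

# Matches "day Month year", e.g. "3 May 1912". Other date styles are ignored.
DATE_RE = re.compile(r"\b(\d{1,2}) (" + "|".join(MONTHS) + r") (\d{4})\b")


def tag_dates(text):
    """Wrap each recognised date in a <date> tag with a normalised value.

    The tag and attribute names are hypothetical. The day is not checked
    against the length of the month.
    """
    def wrap(match):
        day = int(match.group(1))
        month = MONTHS[match.group(2)]
        year = int(match.group(3))
        iso = f"{year:04d}-{month:02d}-{day:02d}"
        return f'<date value="{iso}">{match.group(0)}</date>'

    return DATE_RE.sub(wrap, text)


tagged = tag_dates("Collected on 3 May 1912 by the author.")
print(tagged)
```

The original wording stays in the text while the normalised value sits in the attribute, so the document remains readable and the date becomes machine-comparable.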

We continue to integrate GoldenGATE web services into OBOE, and through OBOE into Scratchpads, to ensure that not only do we meet current use cases for the web services, but that the web services are sufficiently flexible to be easily incorporated into new use cases and their associated workflows.

As part of the continuing enhancement of GoldenGATE's flexibility we are investigating mark-up output options. As originally developed, GoldenGATE produces output in TaxonX, an XML schema focused on taxon treatments and suited to one style of taxonomist's workflow. However, as noted above, there are other XML schemas, each with its own strengths and weaknesses, as explored by Penev et al. in "XML schemas and mark-up practices of taxonomic literature". The architecture of GoldenGATE is such that we can produce output using other schemas, such as taXMLit, which is especially useful for archiving documents, or TaxPub, which is especially useful for incorporation into digital document production workflows.

An extension of our work with mark up options is to explore the use of linked open data (LOD). Whereas XML mark up is incorporated into the original document, LOD records are new artefacts capturing the additional data embedded in the original document. Developments with LOD since the inception of ViBRANT mean that biodiversity records can be more easily shared and queried than XML mark up can, and allow us to mobilise the extracted data in an effective manner. To achieve this we will collaborate with WP4 and exploit their work with ontologies to provide a consistent structure to the mark up.
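The difference between in-document XML mark-up and separate LOD records can be sketched as follows. This Python example reads a small marked-up fragment and emits subject-predicate-object statements as new artefacts, serialised in an N-Triples-like form. The element names, the example.org namespace and the predicate URIs are all placeholders assumed for illustration; a real deployment would take its vocabulary from agreed ontologies.

```python
import xml.etree.ElementTree as ET

BASE = "http://example.org/"  # placeholder namespace, not a real vocabulary

# Hypothetical mark-up; element names are illustrative only.
MARKED_UP = (
    '<treatment id="t1"><taxonName>Aus bus</taxonName> collected by '
    "<collector>J. Smith</collector></treatment>"
)


def to_triples(xml_text):
    """Turn each child element of the root into one (s, p, o) statement."""
    root = ET.fromstring(xml_text)
    subject = f"<{BASE}treatment/{root.get('id')}>"
    triples = []
    for element in root:
        predicate = f"<{BASE}vocab/{element.tag}>"
        # Literal escaping is omitted; fine for this sample text only.
        triples.append((subject, predicate, f'"{element.text}"'))
    return triples


def to_ntriples(triples):
    """Serialise the statements one per line, each ending with ' .'."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


triples = to_triples(MARKED_UP)
print(to_ntriples(triples))
```

Each statement stands apart from the source document, which is what allows such records to be shared and queried independently of the XML they were derived from.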