Integrating Digital Papyrology (IDP) is a series of projects funded by the Andrew W. Mellon Foundation. It represents the integration of three longstanding digital papyrology efforts, the Duke Databank of Documentary Papyri (DDbDP), the Heidelberger Gesamtverzeichnis der griechischen Papyrusurkunden Ägyptens (HGV), and the Advanced Papyrological Information System (APIS), and their migration to a single format, the EpiDoc recommendations for the application of TEI XML to ancient documentary texts (see (link); see also Cayless (2009)).
This panel aims first to present an historical overview of the transformation of the DDbDP from a digital index of print scholarship to a community-managed resource for peer-reviewed scholarly control of core disciplinary assets, and then to suggest some ways in which we hope that the new suite of tools that we have been building around the data may help to transform scholarship in this important domain.
The DDbDP began in 1983 as a collaboration between Duke University and David R. Packard. Greek and Latin were encoded in Beta Code and searchable on CD using custom software. In 1996/7 the DDbDP migrated its authoritative version to the web-based Perseus Project, encoded in Beta Code and TEI SGML. In 2004/5 the DDbDP and HGV began mapping their theretofore discrete but complementary sets of text and descriptive metadata, while Duke obtained a planning grant from the Mellon Foundation. Papyrologists, IT specialists, librarians and university administrators came together to map out a sustainable future for the DDbDP. The way forward was clear: open source, standards-based development, greater collaboration, increased vesting of data-control in the user community, and greater interoperability with other projects.
With clear objectives and generous support from the Mellon Foundation, Duke launched IDP1 (2007/8). Its goals were to migrate the DDbDP from SGML to EpiDoc XML, and from Beta Code Greek to Unicode Greek; to merge DDbDP texts and HGV metadata and translations into a single stream; to map these texts to corresponding APIS records, including images; and to enhance the Papyrological Navigator (PN; see http://papyri.info) to enable search of the newly merged data. The fruits of these efforts were released under open access provisions in October 2008 (all content under CC BY and software under GNU GPL). In October 2008 the team began work on IDP2, again with support from Mellon. At the same time, APIS, under a grant from the National Endowment for the Humanities, began work on enhancing the PN, and the two projects joined forces. The results of IDP2 are (1) improved usability of the PN search interface on the merged and mapped data from the DDbDP, HGV and APIS, (2) facilitated third-party use of the data and tools, and (3) a version-controlled, transparent and fully audited, multi-author, web-based, real-time, tagless editing environment (SoSOL), which, in tandem with a new editorial infrastructure, will allow the entire community of papyrologists to take control of the process of populating these communal assets. The discipline is now on the cusp of having its entire life-cycle represented online in a transparent, open, peer-reviewed and community-driven environment.
Cayless, Hugh 2009 “Epigraphy in 2017,” Digital Humanities Quarterly, 3.1 (link)
When the DDbDP began in 1983 at Duke University (see Oates (1993)), Greek and Latin were encoded in Beta Code (an ASCII representation of text in different scripts and of various sigla and structural features that combines language encoding and some semantic markup features, also used by the TLG; see Nicholas (1999)) and searchable on CD-ROM via a dedicated platform provided by the Packard Humanities Institute. These texts were entered manually by students at Duke, following a data entry manual, and using published editions of papyrological texts as the basis. When the DDbDP migrated from the CD-ROMs to the Web-based Perseus Project, the texts were machine-translated to a format mixing Beta Code and TEI P3 SGML, and the resultant hybrid encoding formed the basis of an updated data entry manual (see Sosin (2010)).
The first stage of the Mellon-funded Integrating Digital Papyrology project (2007-2008) involved the conversion of the DDbDP, largely regular in format though not always consistent, and highly varied in content, into Unicode and EpiDoc XML. At this point the hybrid encoding scheme was a mixture of Beta Code (varying slightly over thirteen years of evolution and multiple generations of graduate students) and TEI SGML (machine-tagged in the Perseus conversion and hand-entered since). This basic consistency, compromised but not destroyed by the technical history of the project, is further complicated by the wide variety of papyrological material, including editions published over a century by editors with differing standards of editorial detail, and even occasionally inconsistent uses of conventions. These variations were somewhat, but not entirely, flattened by data entry practice.
Due to the size of the dataset, some 55,000 records, the conversion of these texts into both Unicode character encoding and EpiDoc-conformant TEI XML structural and semantic features had to be almost entirely automated, with only a small amount of human intervention for difficult or unique cases. This first round of conversion work was carried out by a team at the Centre for Computing in the Humanities at King’s College London. The tools to convert from the complex, legacy formats to the more sustainable, open standard TEI XML were open source and based on ongoing work in the EpiDoc community.
The CCH team built the data conversion as a four-step pipeline: (1) the SGML was turned into validating XML, and entities resolved, using a tool based on OSX (part of the OpenJade distribution); (2) the Beta Code representation of Greek and Coptic characters and certain symbols was converted to Unicode using Transcoder (by Hugh Cayless); (3) a new set of regular expressions was added to CHETC (the Chapel Hill Electronic Text Converter, by Hugh Cayless), a regular expression-based conversion tool, to convert much of the structural Beta Code to XML; (4) XSLT was applied to transform the XML produced by steps 1 and 3 into validating EpiDoc. So that these steps could be written simultaneously and the tools improved iteratively, the process was pipelined: the master copy of the texts remained the Beta Code/SGML hybrid, which continued to be manually improved. Only at the end of the process, when the output validated against the EpiDoc DTD, would the transformation be complete and hand-fixes begin to be performed on the new version of the DDbDP.
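To make step 2 of the pipeline more concrete, the following is a minimal sketch of Beta Code to Unicode conversion for Greek. It is not the Transcoder tool itself: the function name `beta_to_unicode`, the mapping tables, and the restriction to base letters, capitals, and a handful of diacritics are all assumptions made for illustration.

```python
import re
import unicodedata

# Base letters of Beta Code mapped to lowercase Greek (illustrative subset).
BETA_LETTERS = {
    "a": "α", "b": "β", "g": "γ", "d": "δ", "e": "ε", "z": "ζ",
    "h": "η", "q": "θ", "i": "ι", "k": "κ", "l": "λ", "m": "μ",
    "n": "ν", "c": "ξ", "o": "ο", "p": "π", "r": "ρ", "s": "σ",
    "t": "τ", "u": "υ", "f": "φ", "x": "χ", "y": "ψ", "w": "ω",
}

# Beta Code diacritic signs mapped to Unicode combining characters.
BETA_DIACRITICS = {
    ")": "\u0313",   # smooth breathing
    "(": "\u0314",   # rough breathing
    "/": "\u0301",   # acute accent
    "\\": "\u0300",  # grave accent
    "=": "\u0342",   # circumflex
    "|": "\u0345",   # iota subscript
    "+": "\u0308",   # diaeresis
}


def beta_to_unicode(text: str) -> str:
    """Convert a Beta Code string to precomposed Unicode Greek (sketch only)."""
    out = []
    capitalize = False   # set by "*", which marks the next letter as a capital
    pending = []         # diacritics written between "*" and the capital letter
    for ch in text.lower():
        if ch == "*":
            capitalize = True
        elif ch in BETA_LETTERS:
            letter = BETA_LETTERS[ch]
            if capitalize:
                letter = letter.upper()
                capitalize = False
            out.append(letter + "".join(pending))
            pending = []
        elif ch in BETA_DIACRITICS:
            mark = BETA_DIACRITICS[ch]
            # After "*", diacritics precede the letter they belong to.
            (pending if capitalize else out).append(mark)
        else:
            out.append(ch)  # spaces, digits, punctuation pass through unchanged
    result = "".join(out)
    # A sigma that is not followed by another letter becomes a final sigma.
    result = re.sub(r"σ(?![^\W\d_])", "ς", result)
    return unicodedata.normalize("NFC", result)


print(beta_to_unicode("lo/gos"))
```

A production converter also has to deal with sigla, Coptic, and the editorial symbols mentioned above, which is one reason a dedicated tool was used rather than a lookup table of this kind.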
In the second phase of the project, work on the DDbDP focussed on ongoing hand-fixes to the now canonical XML to improve the consistency of the text. This became ever more essential as the SoSOL tool (see Baumann, below) requires very consistent, valid EpiDoc. The EpiDoc recommendations were also updated to TEI P5 in this phase, and both Duke texts and HGV metadata encoded in this new format. The third phase (awaiting funding decision) will work to make the metadata translation into EpiDoc stable, rather than a pipelined crosswalk as it is currently.
Among the important lessons we learned from the DDbDP translation from the hybrid format to standardized TEI XML are the following technical points (all of value not only to our ongoing work with HGV metadata, but potentially to others contemplating large-scale up-coding from legacy formats):
Nicholas, Nick et al. 1999-2000 The TLG Beta Code Manual, (link)
Oates, John 1993 “The Duke Databank of Documentary Papyri,” Accessing Antiquity: The Computerization of Classical Studies, ed. Jon Solomon, Arizona, 62-72
Sosin, Joshua 2010 “Digital Papyrology,” Congress of the International Association of Papyrologists, Geneva, 19 August 2010 (link)
One of the major difficulties encountered during IDP1 was the problem of aggregating related content. The main purpose of the Papyrological Navigator (PN) is to present a merged, searchable view of the DDbDP, HGV, and APIS datasets. IDP1 attempted to produce an aggregated dataset, where related DDbDP and HGV records were merged into a single EpiDoc XML document. The mapping processes whereby this was accomplished were flawed, however, and resulted in both spurious and missing relationships. In addition, combining documents from projects with different update schedules and sometimes different interpretations of what constitutes a document was problematic. For IDP2, it was clear that the mapping process would have to be re-imagined. The new process, described in this paper, uses a combination of semantic web technology and standard tools to produce a merged collection.
IDP draws on datasets from the DDbDP, HGV, APIS, and Trismegistos (see (link)). The varied origins of these projects meant that different decisions were made in each about how to record information about source documents. For example, if several pieces of papyrus contain text that is part of the same “document,” then DDbDP will tend to transcribe the whole, marking the separate sheets of papyrus with <div> elements. HGV, on the other hand, tends to treat each piece as an entity, each with its own record. APIS, since it is essentially a unified export of the catalogs of museums and libraries with papyrus holdings, treats the document the way the host collection does. While APIS is organized according to the source collection, DDbDP and HGV use the publication history of the papyrus as an organizing principle. HGV has the idea of a “principal edition,” i.e. the best published version of the papyrus. DDbDP is similarly based on a single published edition, though HGV and DDbDP may not agree on what is the principal edition. When a better publication comes out for an existing document, DDbDP replaces the existing document with the new version, leaves a “stub” record with a pointer to the new version in place of the old, and the new edition points back at the old one. HGV, similarly, sometimes retires identifiers. But these updates are not always efficiently propagated across projects, so references may become stale over time.
DDbDP identifiers are constructed from short-form citations of the document edition on which the transcription was based, so, for example, bgu;1;2 is the identifier for the 2nd item in the first volume in the BGU (Aegyptische Urkunden aus den Königlichen (later Staatlichen) Museen zu Berlin, Griechische Urkunden) series. Trismegistos assigns numeric identifiers to individual documents (e.g. 1234), and HGV bases its identifiers on these. When HGV data is finer-grained than TM, it will append alphabetic characters to the TM id (e.g. 1234a, 1234b, etc.). APIS bases its identifiers on the collection name plus a number, e.g. berkeley.apis.15.
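The three identifier schemes just described can be taken apart mechanically. The sketch below is illustrative only: the function and class names are hypothetical, and the patterns are inferred from the examples in the text (bgu;1;2, 1234a, berkeley.apis.15) rather than from any formal specification of the schemes.

```python
import re
from typing import NamedTuple


class DdbdpId(NamedTuple):
    series: str
    volume: str
    item: str


class HgvId(NamedTuple):
    tm_number: int   # the Trismegistos number the HGV id is based on
    suffix: str      # alphabetic suffix, empty when HGV and TM coincide


class ApisId(NamedTuple):
    collection: str
    number: str


def parse_ddbdp(identifier: str) -> DdbdpId:
    """Split a DDbDP id such as 'bgu;1;2' into series, volume, and item."""
    series, volume, item = identifier.split(";")
    return DdbdpId(series, volume, item)


def parse_hgv(identifier: str) -> HgvId:
    """Split an HGV id such as '1234a' into TM number and optional suffix."""
    match = re.fullmatch(r"(\d+)([a-z]*)", identifier)
    if match is None:
        raise ValueError(f"not an HGV-style identifier: {identifier!r}")
    return HgvId(int(match.group(1)), match.group(2))


def parse_apis(identifier: str) -> ApisId:
    """Split an APIS id such as 'berkeley.apis.15' into collection and number."""
    collection, marker, number = identifier.partition(".apis.")
    if not marker:
        raise ValueError(f"not an APIS-style identifier: {identifier!r}")
    return ApisId(collection, number)


print(parse_ddbdp("bgu;1;2"))
print(parse_hgv("1234a"))
print(parse_apis("berkeley.apis.15"))
```

Only the DDbDP and APIS identifiers carry meaning in their parts; for HGV the only recoverable structure is the underlying TM number and the suffix.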
DDbDP and HGV began the process of reconciling their collections in 2004, so where there are correspondences, the documents include the id numbers of related records in their document headers. For example, http://papyri.info/ddbdp/p.oxy;4;744/source encodes identifiers both for itself and for the related HGV record in its publicationStmt:
Likewise, (link), the related HGV record, contains the DDbDP identifier in its header:
Thus http://papyri.info/ddbdp/p.oxy;4;744/source has a “relation” (a term drawn from the Dublin Core Metadata Initiative's DCMI Metadata Terms list) to http://papyri.info/hgv/20442/source, and the latter has a reciprocal relationship to the former.
The question of aggregation becomes much more complicated when APIS enters the equation. Because of its origin as a union catalog (the data for which was contributed by a variety of institutions), the quality of APIS’s information is more uneven. The method used by the first version of the PN to establish relationships with DDbDP, namely looking for matching citations, produced results that were sometimes inaccurate. For IDP2, we looked for pieces of information that could be extracted (like the <idno>s in DDbDP and HGV) and represented as RDF triples. HGV contains URLs for images of papyri and ostraka when they are available, and when these are hosted at Columbia (the original home of APIS), they contain the same APIS id as the APIS documents do. In addition, APIS records sometimes include references to the relevant DDbDP document in their bibliography: <bibl type="ddbdp">P.Oxy.:4:744</bibl>. This does not use the DDbDP identifier scheme, but it can be converted to the DDbDP identifier, p.oxy;4;744, easily enough. APIS contains some spurious references of this type (to collections that don’t exist, for example), but all that is needed to check their validity is to check for the existence of a corresponding DDbDP file.
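The conversion and validity check described above might look roughly like the following. This is a sketch under stated assumptions: the function names are hypothetical, the conversion rule is inferred from the single example in the text, and a set of known identifiers stands in for the check against the DDbDP files themselves.

```python
from typing import Iterable, List, Set


def apis_citation_to_ddbdp(citation: str) -> str:
    """Convert an APIS-style citation like 'P.Oxy.:4:744' to 'p.oxy;4;744'."""
    series, volume, item = citation.strip().split(":")
    # Lowercase the series and drop its trailing full stop.
    series = series.lower().rstrip(".")
    return ";".join((series, volume, item))


def valid_references(citations: Iterable[str], known_ddbdp_ids: Set[str]) -> List[str]:
    """Keep only the citations that resolve to an existing DDbDP identifier.

    `known_ddbdp_ids` is a stand-in for testing whether the corresponding
    DDbDP file exists in the repository.
    """
    converted = (apis_citation_to_ddbdp(c) for c in citations)
    return [ddbdp_id for ddbdp_id in converted if ddbdp_id in known_ddbdp_ids]


known = {"p.oxy;4;744", "bgu;1;2"}
# 'P.Nowhere.:1:1' is a made-up citation playing the role of a spurious reference.
print(valid_references(["P.Oxy.:4:744", "P.Nowhere.:1:1"], known))
```

Filtering against what actually exists means spurious references are simply dropped rather than turned into relations that point nowhere.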
In addition to the collections mentioned above, there are also HGV translations, which use the same numbering scheme as the HGV metadata, and APIS images, which are represented by a static RDF file mapping image names to APIS ids. A representation of the incomplete graph that can be constructed by extracting these relations from the source documents looks something like:
So there are locations in an IDP EpiDoc document (and in associated artifacts) where we can check for reference information, and from these we can build a (partial) graph of relationships between documents in the three collections. The database import method is simple. A program crawls each directory in our repository containing EpiDoc documents, runs each through an XSLT transformation which converts it to RDF/XML, and inserts that RDF into a Mulgara triple store. Once the relations have been converted to RDF, we can do some simple inferencing to “fill out” the rest of the incomplete relation graph. In practice, this means running SPARQL queries that produce results like “A is related to B, because B is related to A” and “A is related to C, because A is related to B and B is related to C,” and then inserting the results of those queries back into the triple store. The resulting graph is correspondingly more complete.
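The two inference rules quoted above amount to taking the symmetric and transitive closure of the relation. The sketch below shows that logic on plain Python sets rather than with SPARQL against a triple store; the function name and the sample identifiers (including the APIS one) are hypothetical.

```python
from typing import Set, Tuple

Relation = Tuple[str, str]


def infer_relations(relations: Set[Relation]) -> Set[Relation]:
    """Repeatedly apply the symmetric and transitive rules until nothing new appears."""
    related = set(relations)
    while True:
        # "A is related to B, because B is related to A"
        inferred = {(b, a) for (a, b) in related}
        # "A is related to C, because A is related to B and B is related to C"
        inferred |= {
            (a, d)
            for (a, b) in related
            for (c, d) in related
            if b == c and a != d
        }
        if inferred <= related:
            return related
        related |= inferred


extracted = {
    ("ddbdp/p.oxy;4;744", "hgv/20442"),
    ("hgv/20442", "apis/example.apis.1"),  # hypothetical APIS record
}
for pair in sorted(infer_relations(extracted)):
    print(pair)
```

Starting from two extracted relations among three documents, the closure yields all six directed pairs, which is the sense in which the graph gets filled out.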
Besides relation data, we can also extract hierarchical and bibliographical information from the collection. Because DDbDP’s identifiers are hierarchical and carry meaning, they can be decomposed and used to identify not just the items, but their containers, the volumes and series in which they were published. APIS ids can be decomposed to collection and item. HGV and TM ids are opaque, but HGV’s principal edition bibliographic record can be used to construct a hierarchy for that collection. HGV’s bibliography also points to the primary source for its record.
The APIS, DDbDP, and HGV collections are hosted in the Papyrological Navigator, while TM has its own website. This means that we are able to mint URLs using the APIS, DDbDP, and HGV identifiers. So http://papyri.info/ddbdp/bgu;1;2/source represents the EpiDoc XML document that transcribes BGU 1 2; http://papyri.info/ddbdp/bgu;1;2 retrieves an HTML document that aggregates the DDbDP transcription with metadata from HGV #8961 and links out to TM #8961; http://papyri.info/ddbdp/bgu;1;2/rdf is shorthand for an RDF query that pulls together a subgraph of triples with http://papyri.info/ddbdp/bgu;1;2/source as the subject; and so on.
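The URL pattern in these examples is regular enough to express as a small helper. This is only a sketch: `mint_url` and its parameters are hypothetical names, and the pattern is inferred from the three DDbDP URLs quoted above.

```python
BASE_URL = "http://papyri.info"


def mint_url(collection: str, identifier: str, view: str = "") -> str:
    """Build a papyri.info-style URL from a collection name and identifier.

    `view` is an optional trailing segment such as 'source' or 'rdf';
    when empty, the URL is the one that retrieves the aggregated HTML view.
    """
    url = f"{BASE_URL}/{collection}/{identifier}"
    return f"{url}/{view}" if view else url


print(mint_url("ddbdp", "bgu;1;2"))
print(mint_url("ddbdp", "bgu;1;2", "source"))
print(mint_url("ddbdp", "bgu;1;2", "rdf"))
```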
The primary internal uses of the RDF triple store in papyri.info are as the driver for generating and indexing the site, and as a means for SoSOL (see Ryan Baumann’s paper) to reference valid identifiers. The method for generating the site uses the triple store to load, first, all of the identifiers associated with DDbDP using the hierarchical relations, then all of the HGV ids not linked to a DDbDP identifier, then all of the unrelated APIS identifiers. The collections are then transformed to HTML, and the files corresponding to file identifiers (along with any related EpiDoc documents) are transformed both to HTML and to Solr add documents, which are then ingested into the search engine. In this fashion, we create the entire papyri.info site, with correctly aggregated data, using a few basic tools.
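The loading order used for site generation can be sketched as follows. The function name is hypothetical, the relation set is assumed to be already closed under the inference step, and “unrelated” APIS identifiers are read here as those linked to neither a DDbDP nor an HGV identifier, which is an interpretation of the text rather than a confirmed detail.

```python
from typing import List, Sequence, Set, Tuple


def generation_order(
    ddbdp_ids: Sequence[str],
    hgv_ids: Sequence[str],
    apis_ids: Sequence[str],
    related: Set[Tuple[str, str]],
) -> List[str]:
    """Order identifiers as described: DDbDP first, then unlinked HGV, then unlinked APIS."""

    def linked_to_any(identifier: str, targets: Sequence[str]) -> bool:
        return any((identifier, target) in related for target in targets)

    order = list(ddbdp_ids)
    # HGV records already aggregated under a DDbDP identifier are skipped.
    order += [h for h in hgv_ids if not linked_to_any(h, ddbdp_ids)]
    # APIS records are loaded on their own only if nothing else claims them.
    order += [
        a
        for a in apis_ids
        if not linked_to_any(a, ddbdp_ids) and not linked_to_any(a, hgv_ids)
    ]
    return order


relations = {("h1", "d1"), ("a1", "h2")}
print(generation_order(["d1"], ["h1", "h2"], ["a1", "a2"], relations))
```

Each document is thus generated once, under the most specific identifier that aggregates it.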
Developments slated for the near future include using the triple store as a container for relations to external projects. An effort was made during the summer of 2010 to link HGV place names to Pleiades and Trismegistos Places URLs, and once the data has been processed, it will be loaded into our triple store and used as the basis for linking to data at these sites. We hope, for example, to be able to display maps drawn from Pleiades, as well as linking to the site. Another planned use for the triple store is to have it handle the normalization and display of bibliography for papyri.info.
The method outlined above for extracting data from source documents, storing it as RDF triples, and then using inferencing to fill out gaps in the graph has potential applications well beyond papyri.info. For example, it could be used to manage relationships between EAD finding aids and digitized archives, or to merge data from other overlapping collections. Any data management situation where joins between discrete but related objects are desired, and where it is possible to extract a partial relationship graph from the sources, could profit from using semantic web tools in this fashion.
The Son of Suda On Line (SoSOL) is one of the main components of the Integrating Digital Papyrology project (IDP), aiming to provide a repurposable web-based editor for the digital resources in the DDbDP and HGV. (We regard SoSOL as the intellectual heir of the Suda Online project.) SoSOL integrates a number of technologies to provide a truly next-generation online editing environment. Using JRuby with the Ruby on Rails web framework, it is able to take advantage of Rails’s wide support in the web development community, as well as Java’s excellent XML libraries and support. This includes the use of XSugar, which provides a means of converting between equivalent XML and non-XML vocabularies, to define an alternate, lightweight syntax for EpiDoc XML markup, called Leiden+. Because XSugar uses a single grammar to define both syntaxes in a reversible and bidirectional manner, it is ideal for reducing side effects of transforming text in our version-controlled system. SoSOL uses the Git distributed version control system as its versioning backend, allowing it to use the powerful branching and merging strategies Git provides, and enabling fully auditable version control. SoSOL also provides for editorial control of changes to the main data repository, enabling the democracy of allowing anyone to change anything they choose while preserving the academic integrity of canonical published data. This talk will provide a demonstration of these features of SoSOL as implemented for IDP2, as well as a discussion of its repurposable design for applicability to other projects and the ongoing documentation work being done to increase usability and adoption in the wider community.
Many online editing environments, such as MediaWiki, use an SQL database as the sole mechanism for storing revisions. This can lead to a number of problems, such as scaling (most SQL servers are not performance optimized for large text fields) and distribution of data (see for example the database downloads of the English Wikipedia, which have been notoriously problematic for obtaining the full revision history). Most importantly, they typically impose a centralized, linear, single-branch version history. Because Git is a distributed version control system, it does not impose any centralized workflow. As a result, branching and merging have been given high priority in its development, allowing for much more concurrent editing activity while minimizing the difficulty of merging changes. SoSOL’s use of Git is to have one “canonical” Git repository for public, approved data, to which commits are restricted. Users and boards each get their own Git repositories, which act as forks of the canonical repository. This allows them to freely make changes to their repository while preserving the version history as needed when these changes are merged back into the canonical repository. These repositories can also be easily mirrored, downloaded, and worked with offline and outside of SoSOL due to the distributed nature of Git (see Baumann (2011)). This enables a true democracy of data, wherein institutions still retain control and approval of the data which they put their names on, but any individual may easily obtain the full dataset and revision history to edit, contribute to, and republish under the terms of the license.
While XML encoding has many advantages, users inexperienced with its use may find its syntax difficult or verbose. It is still desirable to harness the expertise of these users in other areas and ease their ability to add content to the system, while retaining the semantically explicit nature of XML markup. To do this, we have used XSugar to allow the definition of a “tagless” syntax for EpiDoc XML, which resembles that of the traditional printed Leiden conventions for epigraphic and papyrological texts where possible. Structures which are semantically ambiguous or undefined in Leiden but available in EpiDoc (e.g. markup of numbers and their corresponding value) have been given additional text markup; the whole is referred to as Leiden+. XSugar enables the definition of this syntax in a single, bidirectional grammar file which defines all components of both Leiden+ and EpiDoc XML as correspondences, which can be statically checked for reversibility and validity. This provides much more rigorous guarantees of these properties than alternatives such as using separate XSLT stylesheets for each direction of the transform, as well as encoding the relation between the components of each syntax in a single location.
As an example, we might have the following XML fragment:
As you can see, things such as the Greek letter “κ” on line four being the number “20” are implicit in print, but explicit in both EpiDoc and Leiden+. The user can work on either the Leiden+ or XML representation of the text, and we store only the XML representation in our data repository (that is, the Leiden+ representation is only an intermediate form used for editing and is transformed back to XML when saved). A traditional Leiden “print preview” is possible by applying the standard EpiDoc XSLT stylesheets to this XML. In theory, this particular XSugar grammar could be re-used by other EpiDoc projects wishing to enable the same kind of alternative markup.
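The reversibility property that matters here can be illustrated with a toy round trip for a single construct, a number with an explicit value. This is not XSugar and not the project's grammar: it uses a pair of regular expressions, and the tagless notation `<#κ=20#>` is offered as an illustrative form that should be checked against the Leiden+ documentation before being relied on.

```python
import re

# One construct only: an EpiDoc-style number element with a numeric value.
NUM_XML = re.compile(r'<num value="(\d+)">([^<]+)</num>')
# Illustrative tagless counterpart (assumed notation, see lead-in).
NUM_TAGLESS = re.compile(r"<#([^=#]+)=(\d+)#>")


def xml_to_tagless(xml: str) -> str:
    """Rewrite <num value="20">κ</num> as <#κ=20#>."""
    return NUM_XML.sub(r"<#\2=\1#>", xml)


def tagless_to_xml(text: str) -> str:
    """Rewrite <#κ=20#> back to <num value="20">κ</num>."""
    return NUM_TAGLESS.sub(r'<num value="\2">\1</num>', text)


fragment = '<num value="20">κ</num>'
edited_form = xml_to_tagless(fragment)
print(edited_form)
# Saving must reproduce the stored XML exactly, or version history fills with noise.
print(tagless_to_xml(edited_form) == fragment)
```

With two independently written transforms like these, nothing guarantees that they stay inverse to each other as constructs are added, which is the weakness a single bidirectional grammar is meant to avoid.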
Due to institutional requirements, the DDbDP and HGV datasets needed separate editorial control and publishing mechanisms. In addition, their control over different types of content necessitated different editing mechanisms for each component. These requirements informed the design of how SoSOL interacts with data under its control and how this design is repurposable for use in other projects. The two high-level abstractions of data made by SoSOL are “publications” and “identifiers.” Identifiers are unique strings which can be mapped to a specific file path in the repository, while publications are arbitrary aggregations of identifiers. By defining an identifier superclass which defines common functionality for interacting with the data repository, we can then subclass this to provide functionality specific to a given category of data. The SoSOL implementation for IDP2, for example, provides identifier subclasses for DDbDP transcriptions, HGV metadata, and HGV translations. Editorial boards consequently have editorial control for only certain subclasses of identifiers. Publications in turn allow representation and aggregation of the complex many-to-many relationships these components can have (for example, a document with two sides that may have one transcription and two metadata components). Packaging these related elements together allows the user to switch between them, and allows editorial boards to check related data which they may not have editorial control over but still require to make informed decisions about validity and approval. SoSOL can thus be integrated into other systems by implementing the identifier subclasses necessary for the given dataset as well as coherent means for aggregating these components into publications. One can imagine the simplest implementation as being an identifier whose name is the file path, which just presents the plaintext contents of the file for editing, and which has no relationships with other identifiers so that each publication is a single identifier.
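The “simplest implementation” imagined above can be sketched in a few lines. SoSOL itself is written in Ruby on Rails; the Python below is only an illustration of the two abstractions, and every class and method name in it is hypothetical rather than taken from the SoSOL code base.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import List


class Identifier:
    """Common behaviour: a unique name that maps to a file path in the repository."""

    def __init__(self, name: str) -> None:
        self.name = name

    def to_path(self) -> Path:
        """Subclasses decide how a name maps to a path."""
        raise NotImplementedError

    def content(self, repository_root: Path) -> str:
        """Read the file behind this identifier from a repository checkout."""
        return (repository_root / self.to_path()).read_text(encoding="utf-8")


class PlainFileIdentifier(Identifier):
    """Simplest case: the identifier's name is the file path itself."""

    def to_path(self) -> Path:
        return Path(self.name)


@dataclass
class Publication:
    """An arbitrary aggregation of identifiers that are edited and reviewed together."""

    title: str
    identifiers: List[Identifier] = field(default_factory=list)


# In the simplest case each publication wraps exactly one identifier.
note = PlainFileIdentifier("notes/readme.txt")
publication = Publication(title=note.name, identifiers=[note])
print(publication.title, note.to_path())
```

A dataset with richer structure would add further subclasses, each with its own name-to-path mapping and editing view, and a rule for grouping related identifiers into one publication.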
Though SoSOL is still under development, early reception and feedback at training workshops conducted to introduce users to the system have been good. Thousands of texts have been created, edited, or corrected, and submitted through the editorial boards for voting, peer review, and publication. In addition to being publicly viewable through the Papyrological Navigator, these changes are regularly mirrored to a public copy of the Git data repository available on GitHub, where anyone may view and download the complete revision history (see Baumann (2011)). Under IDP3, the currently active final phase of the IDP project, user experience studies are being conducted and feedback from this process will be incorporated into the system.
In addition to the possible reuse of the system itself, we hope the more general goal of making data available with the full revision history will become a more widely adopted practice in digital humanities. Particularly for collaborative editing projects, this can allow anyone to see how changes have actually been effected, even outside of the specific editing environment employed. We feel that using a distributed version control system such as Git as the core data backend is conducive to this goal, as it enables easy distribution and updating of the data set alongside its complete version history. Thus, even if our particular editing environment is eventually outdated and replaced, or the data needs to be interacted with and edited using other mechanisms, the data backend can still be used for the next generation of tools and scholars.
Baumann, Ryan 2011 “IDP Data available on GitHub,” 3 March 2011 (link)