Guide to working with ancient documents in XML


This is a guide written for my colleague, A. If anyone else can get any use out of it, well, that’s an added bonus :)

A wants to work on her transcription in XML. She is validating her XML against the EpiDoc subset of TEI.

Structure and lines

The line structure of A’s text is quite a bit more complex than anything I have ever worked on before. It took quite some time and a bit of chocolate for us to figure out how best to encode it in TEI.

First off, there are the physical lines. They are named after the fragments on which they occur. The fragments can be, and have been, moved around, so that the text now begins with line 3B.10: 3B is the name of the fragment and 10 is the number of the line. As A works on the manuscript she wants to be able to convert these current physical lines into continuously numbered restored lines, so that 3B.10 becomes simply line 1. However, this may be years into the future. For the current physical lines I have suggested marking them up with the <lb> tag, where @type=physical and @n=3B.10.

A also wants to be able to divide the text into sections and subsections based on her interpretation. Here I suggested using the <ab> tags to surround each section.

A has her own way of breaking lines, which helps her read the text more easily. This also needs to be marked so that she can share the text as HTML with her collaborators. For this I suggested a plain old line-break tag with no types or ids: <lb/>.

A wants to mark where the recto and verso sides of the manuscript begin. For this I suggested the <pb> tag with @n=Recto.

Finally, A also wants to keep track of where the different fragments begin and end. For this I suggested the <milestone> tag with @unit=fragment and @n=3B.10.

I also suggested using the milestone tag with @unit=speaker and @n=Buddha, to note when the speaker changes.
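Put together, the structural suggestions above might look something like the sketch below. This is only an illustration: the transliterated words are made-up placeholder text rather than A's transcription, the line number 3B.11 and the @n on <ab> are my own assumptions, and the wrapping <div type="edition"> is simply the usual EpiDoc container for the text.

```xml
<!-- Sketch only: placeholder text, not A's transcription -->
<div type="edition">
  <!-- start of the recto side -->
  <pb n="Recto"/>
  <!-- start of a fragment -->
  <milestone unit="fragment" n="3B.10"/>
  <!-- one interpretive section; n="1" is an assumed numbering -->
  <ab n="1">
    <!-- change of speaker -->
    <milestone unit="speaker" n="Buddha"/>
    <!-- physical line on the fragment -->
    <lb type="physical" n="3B.10"/>evam me sutam
    <!-- A's own reading line-break, no type and no id -->
    <lb/>ekam samayam
    <!-- next physical line (assumed number) -->
    <lb type="physical" n="3B.11"/>bhagava
  </ab>
</div>
```

Keeping the two kinds of line-break apart by @type should make it possible to show or hide either set when the XML is turned into HTML.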


For the transcription of her text A wishes to tag the following elements:

[Image: EpiDoc conversion, by Eddie's Room]


Uncertain reading

An uncertain reading is usually a reading of characters that are legible but doubtful. Sometimes these characters are damaged, but damage in itself is not enough to make a character uncertain. The project that A is working on has a somewhat different way of transcribing uncertain readings from what I am used to. Within the Leiden Conventions (the first conventions for document mark-up that I ever knew of) the usual way is to under-dot the characters and symbols that are uncertain. In this project, however, uncertain readings are enclosed in square brackets. The tag for this is simply <unclear>. So why is this such a big difference? Well, under-dots are something you add to each character individually, whereas square brackets can surround a string of unconnected characters. This may be a bit of a change for A, as in EpiDoc <unclear> “can only contain CDATA and <g/>” (EpiDoc Mark-up List). So if she would like to indicate uncertainty about other kinds of interpretation, she will have to make use of attributes such as precision="low".
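To make this concrete, here is a small sketch of how a bracketed uncertain reading might be converted. The syllables are made-up placeholder text, and putting precision="low" on a <gap> is my assumption about where such an attribute would typically go, not something A has decided.

```xml
<!-- Project transcription (placeholder text): e[va]m -->
<ab>e<unclear>va</unclear>m</ab>

<!-- Uncertainty about something other than the characters themselves,
     here an approximate count of lost characters (assumed example) -->
<ab>evam <gap reason="lost" quantity="3" unit="character" precision="low"/></ab>
```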


Restoration

Restorations are characters that are not visible on the document but are instead supplied by the editor. In the Leiden Conventions these are displayed in square brackets. For A’s project, however, restorations are displayed in round brackets with a star at the beginning. In EpiDoc TEI this is encoded with the <supplied> tag and the attribute reason="lost".
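A restoration written in the project's notation could be converted roughly as follows. The text is again placeholder material of my own, used only to show where the tag goes.

```xml
<!-- Project transcription (placeholder text): e(*va)m -->
<ab>e<supplied reason="lost">va</supplied>m</ab>
```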


Omission

Omissions are characters that were erroneously omitted by the scribe and later added by the editor. The Leiden Conventions suggest displaying these omissions in angled brackets. A’s project also uses angled brackets but adds a star at the beginning. EpiDoc TEI suggests encoding this with the <supplied> tag and the attribute reason="omitted".
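As a sketch, an omission differs from a restoration only in the value of the reason attribute. The syllables below are placeholder text that I made up for the example.

```xml
<!-- Project transcription (placeholder text): e<*va>m -->
<ab>e<supplied reason="omitted">va</supplied>m</ab>
```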


Addition

Additions are characters added by the scribe (usually above the line), and the Leiden Conventions display them in single quotation marks. In A’s project they are marked with double angled brackets and can potentially be placed both above and below the line. EpiDoc TEI suggests encoding this with the <add> tag and the attribute place="below/above/left/right".
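For example, a scribal addition above or below the line might be encoded like this. The text is placeholder material, and which place value applies is of course something only A can judge from the manuscript.

```xml
<!-- Project transcription (placeholder text): e<<va>>m -->

<!-- addition written above the line -->
<ab>e<add place="above">va</add>m</ab>

<!-- the same addition written below the line -->
<ab>e<add place="below">va</add>m</ab>
```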


Interpolation

Interpolation is described as characters superfluously added by the scribe and later deleted by the editor. Both the Leiden Conventions and A’s project display these within single curly brackets. In EpiDoc TEI this is encoded with the <surplus> tag alone.
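A minimal sketch of an interpolation, using made-up placeholder text in which the scribe is imagined to have written a syllable twice:

```xml
<!-- Project transcription (placeholder text): eva{va}m -->
<ab>eva<surplus>va</surplus>m</ab>
```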


Deletion

Deletions are characters that were deleted by the scribe, some of which have been reconstructed by the editor. A deletion can be combined with added characters (see addition above). The Leiden Conventions display this in double square brackets and A’s project displays it in double curly brackets. EpiDoc TEI suggests encoding this with the <del> tag. In cases where the scribe has added a correction it is possible to use the <subst> tag surrounding a <del> tag and an <add> tag (see the slideshow ‘EpiDoc substitutions, corrections, variants’).
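The two cases might be sketched as follows. The syllables are placeholder text, and place="above" on the correction is only an assumed position.

```xml
<!-- Project transcription (placeholder text): e{{va}}m -->
<ab>e<del>va</del>m</ab>

<!-- The scribe deleted "va" and wrote "ta" as a correction
     (assumed here to sit above the line) -->
<ab>e<subst><del>va</del><add place="above">ta</add></subst>m</ab>
```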

Illegible or lost characters

Illegible characters are characters that are visible but have been left unread, while lost characters are completely gone, often due to damage to the surface. In the Leiden Conventions illegible characters are marked as subscript dots and lost characters are marked as subscript dots inside square brackets. The number of dots denotes the number of characters lost. A’s manuscripts use aksara (Wikipedia) instead of characters. She marks an illegible aksara with a question mark and a lost aksara with a plus sign. She also needs to mark lost parts of an aksara (written as a character). EpiDoc TEI suggests encoding lost or illegible characters with the <gap> tag and reason="lost" or reason="illegible", together with unit="character" and quantity="[however many characters the editor thinks there were]". For A’s project I suggested using the attribute unit="aksara" for whole aksara and unit="character" for parts of an aksara.
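Here is a sketch of how the question marks and plus signs could be converted. The surrounding text and the counts are placeholders of my own, and it is worth checking during validation whether the EpiDoc schema accepts unit="aksara" as a value.

```xml
<!-- Project transcription (placeholder text): evam ? ? sutam + + + -->
<ab>evam
  <!-- two illegible aksara -->
  <gap reason="illegible" quantity="2" unit="aksara"/>
  sutam
  <!-- three lost aksara -->
  <gap reason="lost" quantity="3" unit="aksara"/>
</ab>

<!-- One lost part of an aksara, counted as a character -->
<ab>eva<gap reason="lost" quantity="1" unit="character"/></ab>
```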


Lacuna

A lacuna is a gap in the manuscript (Wikipedia) and is usually indicated with square brackets in the Leiden Conventions. In her project A uses a triple forward slash to indicate where text is lost at the edge of the support. In EpiDoc TEI this is encoded with the <gap> tag and the attributes reason="lost" and extent="unknown". If she wants to, A can add the agent attribute (e.g. agent="broken-off") to indicate the reason behind the lacuna. This may be useful for future analysis of the document and of the amount and nature of the lost text.
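A lacuna at the edge of the support might then look like the sketch below. The text is placeholder material, and the agent value is just the example mentioned above rather than a fixed vocabulary.

```xml
<!-- Project transcription (placeholder text): evam me /// -->
<ab>evam me <gap reason="lost" extent="unknown"/></ab>

<!-- The same lacuna with the cause of the loss recorded -->
<ab>evam me <gap reason="lost" extent="unknown" agent="broken-off"/></ab>
```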

