The TEI files for this site use the <sourceDoc>-based schema from the Text Encoding Initiative, rather than the more common default <text>-based schema. So my code for a single stanza of a poem will look like this:
<surfaceGrp xml:id="f.75" n="folio">
  <surface n="recto">
    <label>Folio 75 Recto</label>
    <graphic url="St_John_56_75r.jpg"/>
    <zone n="EETS.QD.16">
      <line n="l.1"> He myght be called / eleazar the secunde ؛ </line>
      <line n="l.2"> The chamnpyoun / moost myghty and notable ؛ </line>
      <line n="l.3"> That yaff the olyfaunt / hy laste his laste mortall wounde ؛ </line>
      <line n="l.4"> Machabeo<ex>rum</ex> / this story ys no fablee ؛ </line>
      <line n="l.5"> And hercules / in his conquest stable ؛ </line>
      <line n="l.6"> Bar vp the heuenys / in his humanytee ؛ </line>
      <line n="l.7"> Ffor whom ony sorowes / wer maad moost lamentable ؛ </line>
      <line n="l.8"> Whan I be hylde hym / þus nayled to atree ؛ </line>
    </zone>
    […]
  </surface>
</surfaceGrp>
One of my concerns about markup of text is that it can get so busy describing the text so machines can read it that it ceases to be readable by people. So I have tried to use the minimum amount of markup necessary to fit the rules of TEI while recording the structure and features of the poem as it exists in the witness. That's what the @xml:id and @n attributes are doing above. They're letting the machine know that this is folio 75 of the book in question. The <surfaceGrp> is a single page of the book, front and back. The <surface> is the particular side of the page in question—recto or verso. From there, we label the particular surface in a people-friendly way that'll display on the site and provide the link to the image.
This raises the question of why <zone> and <line> are necessary. <line>, I suspect, is relatively self-evident, and <zone> is the actual stanza in question. Unlike some poems that have extensive critical histories, there's no canonical numbering system for many of these poems, so what I've done here is provide a two-part system. The first part, the zone, identifies the stanza based on what's considered the critical edition of the work – in this case The Minor Poems of John Lydgate (Early English Text Society extra series 107). That gets abbreviated to EETS e.s. 107, which is where I get the EETS in the @n attribute under <zone>. The "QD" refers to the initials of the title of the poem as it appears in that book, and the number is the number of the stanza in the poem. So what the @n attribute in <zone> is doing is explaining where the actual stanza is in relation to the poem in the EETS volume. That doesn't mean it's necessarily where it is in the actual book the witness is taken from. That's handled by the folio reference in the <surfaceGrp> element. Also, once I have finalized images for all the texts, I'll include code to give the dimensions of the zone on the image, so it can be highlighted, but that's for the future.
<line> functions in much the same way. There are eight lines per zone, just as there are eight lines per stanza in this poem, so the @n there refers to the particular line of the eight – again using the EETS edition as a signpost (it's "l.x" rather than just "x" because of a limitation of TEI – you can't have solely numeric @n attributes in the <line> element).
The underlying structure mentioned above is what makes the line comparison relatively simple. Because the @n attributes identify both the stanza and the line in question, an XQuery script can be written that will easily grab the appropriate analogues whenever they exist:
let $q := collection('file:/users/matt/Documents/tei/Lydgate/Quis_Dabit?select=*.xml')
for $y in $q
let $s := $y//tei:surface
let $t := $y//tei:titleStmt/@xml:id
let $m := $y//tei:msDesc/@xml:id
let $z := $s/tei:zone[@n="EETS.QD.4"]
let $l := $z/tei:line[@n="l.1"]
let $w := concat($y//tei:msDesc/tei:msIdentifier/tei:settlement/text(), ', ', $y//tei:msDesc/tei:msIdentifier/tei:institution/text(), ' ', $y//tei:msDesc/tei:msIdentifier/tei:idno/text())
let $g := concat($t, "/", $m, "/", substring-before($l/../../tei:graphic/@url, "."), ".html")
let $o := local:remove-elements($l, $remove-list)
where ($z//tei:line/@n = "l.1")
return
  <item> {$w}:
    <ref target="{$g}">{$o}</ref>
  </item>
I realize that to the uninitiated this may appear to be gibberish, but it's actually quite simple:
let $q := collection('file:/users/matt/Documents/tei/Lydgate/Quis_Dabit?select=*.xml')
This is a variable that invokes XQuery's collection function. In this case it points to a folder on my desktop, but in the live version it points to the folder where the XML files for the particular texts are located. The *.xml at the end tells it to grab everything with the filename extension "xml" in that folder.
collection() basically puts all the documents together one after the other, so that what was a series of small separate tree structures now has an overarching root connecting them. I need to be able to walk through that root to grab the individual items.
for $y in $q
lets me do that. The code states that for each of the items ($y) connected in the collection ($q), return some information. That information is identified via the series of let declarations. These mean exactly what they sound like: let whatever variable ($s, $z, etc.) equal whatever is after the := symbol. So in this case $s invokes all the surfaces in question, $w grabs the holding institution and shelfmark of the volume by combining a number of elements in the TEI, $g grabs the URL information from the graphic and generates a hyperlink so that the result can link back to the original item (this time by combining the element with static text), $z is the zone information with the limitation of a particular stanza, and $l is the particular line. $o is the actual text from the particular witnesses, run through another function whose purpose I will explain shortly.
Once all the information is defined via the let statements, the text needs to be filtered from the entire poem to the single line the viewer wishes to compare. This is handled by the where clause, which says that out of all the <line>s in the <zone> (which is already constrained by the @n attribute) we want only the information for line l.1.
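The selection logic – walk every document in the collection, take the zone with the matching @n, then the line with the matching @n inside it – can be sketched outside XQuery. Here is a minimal Python equivalent using the standard library's ElementTree, with two toy witness documents standing in for the TEI files (the TEI namespace is omitted for brevity, and the function name is mine):

```python
import xml.etree.ElementTree as ET

# Two toy "witness" documents standing in for the TEI files in the
# collection; the TEI namespace is omitted to keep the sketch short.
witnesses = [
    """<surface n="recto">
         <zone n="EETS.QD.4">
           <line n="l.1">O alle ye doughtres of Jerusalem</line>
           <line n="l.2">Beholde and se</line>
         </zone>
       </surface>""",
    """<surface n="verso">
         <zone n="EETS.QD.4">
           <line n="l.1">O alle the doughtren of Jerusalem</line>
         </zone>
       </surface>""",
]

def matching_lines(docs, zone_n, line_n):
    """For each document, take the zone with the matching @n, then the
    line with the matching @n inside it -- the analogue of the XQuery
    let/where combination above."""
    results = []
    for doc in docs:
        root = ET.fromstring(doc)
        for zone in root.iterfind(f".//zone[@n='{zone_n}']"):
            for line in zone.iterfind(f"line[@n='{line_n}']"):
                results.append(line.text)
    return results

print(matching_lines(witnesses, "EETS.QD.4", "l.1"))
```

The @n attributes do all the work: as long as every witness uses the same zone and line labels, the same two predicates pull the analogous reading from each file.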
Before getting to what is actually returned by this code, a moment to talk about what the local:remove-elements above means. XQuery is what is called a functional programming language. Without going into depth on what precisely that means, for practical purposes a function is a piece of code written once, encapsulating multiple lines of code into a single reference that can be called again and again. This can be useful for a number of reasons, but in the case of this project the function was written primarily for recursion. The full code of the local:remove-elements function is as follows:
declare function local:remove-elements($input as element(), $remove-names as xs:string*) as element() {
  element { node-name($input) } {
    $input/@*,
    for $child in $input/node()[not(name(.) = $remove-names)]
    return
      if ($child instance of element())
      then local:remove-elements($child, $remove-names)
      else $child
  }
};
The function is first defined through the declare statement. This lets the engine that's running the XQuery know that this is a function rather than a piece of code to be run immediately. The local that prefaces remove-elements references the local namespace, while remove-elements is the name of the function. You'll note that the function takes two parameters (the items in parentheses): an XML element and a sequence of strings. The xs prefacing string is another reference to a namespace – in this case the namespace for XML Schema datatypes.
So far, all of this has simply been defining the function. The actual code begins on the next line. element, here, is what is referred to as a computed constructor. All this means is that it's creating the element to be returned computationally, rather than through a direct declaration of the element. node-name($input) lets the Saxon engine know that we're only interested in the name of the element at this point, rather than the element and its contents. Now that the new element is declared, the block of code in the curly brackets is executed. $input/@* copies the attributes of the old element onto the new one, and the for expression (for $child in $input/node()[not(name(.)=$remove-names)]) goes through each child of the old element and returns it, except for those whose names are in the remove list. The comma simply separates the two expressions, so that both become content of the new element.
After it's gone through all the child nodes in the original element, the return statement indicates that the children are to be added to the newly created element, but there's a condition in place. The if/then statement serves as a check on each child. if ($child instance of element()) tells the Saxon engine to check whether the child is itself an element (rather than, say, a text node). then local:remove-elements($child, $remove-names) tells it to run the function again on that child if it is. This is the primary reason this bit of code needs to be written as a function – so it can recurse through the various children of an element, catch them all, and apply itself to each until it reaches a child element that has no children in turn. Once it's done that, it attaches that child element to the newly created element and the whole package is returned to us. This is useful, for example, if I have multiple note elements attached to a single line element, as the code will go through and remove each of them in turn.
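The same recursion can be sketched in Python with the standard library's ElementTree (the function name and the sample line are mine, not from the project): rebuild the element, keep its attributes and text, and recurse into each child unless its name is on the remove list.

```python
import xml.etree.ElementTree as ET

def remove_elements(elem, remove_names):
    """Python analogue of local:remove-elements: rebuild the element,
    dropping any descendant element whose name is in remove_names."""
    out = ET.Element(elem.tag, elem.attrib)
    out.text = elem.text or ""
    for child in elem:
        if child.tag in remove_names:
            # keep the text that followed the removed element
            if len(out):
                out[-1].tail = (out[-1].tail or "") + (child.tail or "")
            else:
                out.text += child.tail or ""
        else:
            # recurse: the child may itself contain elements to remove
            cleaned = remove_elements(child, remove_names)
            cleaned.tail = child.tail
            out.append(cleaned)
    return out

line = ET.fromstring('<line n="l.1">Beholde<note>editorial</note> and se</line>')
clean = remove_elements(line, {"note"})
print(ET.tostring(clean, encoding="unicode"))
```

The tail-handling is the Python-specific wrinkle: ElementTree hangs the text that follows an element off that element, so a removed child's trailing text has to be reattached by hand, whereas in the XQuery version the text nodes are separate children and survive on their own.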
Much like with the function, the information after return in the larger code sample shows the format of what this code will spit out: a set of TEI-formatted lines, each stating the particular book's location and shelfmark, a hyperlink back to the original page the line can be found on, and the actual text in question. Altogether, running it will look like this:
<item> London, British Library Harley 2251:
  <ref target=".html">O alle ye doughtres · of Jerusalem</ref>
</item>
<item> London, British Library Harley 2255:
  <ref target=".html"><hi rend="blue_pilcrow">¶</hi>O alle ye douħtren of <hi rend="underline">ierusaleem</hi></ref>
</item>
<item> Long Melford, Holy Trinity Church Clopton Chantry Chapel:
  <ref target=".html"><hi>O</hi> alle ye <gap quantity="8" unit="chars" reason="illegible"/>s of ierusaleem</ref>
</item>
<item> Cambridge, Jesus College Q.G.8:
  <ref target=".html"><hi>A</hi>ll the <hi rend="underline">doughtren </hi>of <hi rend="underline"> Ierusalem</hi> . </ref>
</item>
<item> Oxford, Bodleian Library Laud 683:
  <ref target=".html">O alle ẏe douhtren of jerusaleem</ref>
</item>
<item> Oxford, St. John's College 56:
  <ref target=".html">O alle the doughtren / of Jerusalem ؛</ref>
</item>
This output is ready to be styled by XSLT as soon as it's either embedded in an existing page or has the rest of the TEI wrapper built around it.
The actual running of this XQuery has to be done by the Saxon XSLT/XQuery processor. That process needs to be called either on the server or locally on the viewer's machine. Since the viewer may not have the Saxon processor installed, it occurs on the server using a piece of PHP code. Unfortunately, the necessary files to connect that PHP code with the Saxon installation natively are not available due to a corrupt installation file on the Saxon site. A command line version of the program is available, however, and runs with the following command:
java -cp saxon9he.jar net.sf.saxon.Query -t -q:test.xq
This means that a call to the external program from a php page has to be made, requiring this piece of code:
$text = exec("java -cp saxon9he.jar net.sf.saxon.Query -t -q:test.xq line=$line zone=$zone collection=file:$collection");
This works fine, and the results are stored as $text. But they're still formatted as the XML string shown above, not as HTML that can be understood by a browser without a lot of extra work. To make it easily readable on the web, it needs to be styled either with XQuery or with XSLT. Of the two, XSLT makes a whole lot more sense – that's what it's designed for, whereas XQuery is really designed as a query language for using XML files as a database.
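As an aside, the external call above can also be built up programmatically. This is a sketch in Python rather than the project's PHP, and the helper name is mine; building the command as an argument list (for, say, subprocess.run) sidesteps the shell-quoting issues that come with interpolating variables into a single exec() string:

```python
def saxon_query_command(jar, query, params):
    """Build the argv for a command-line Saxon XQuery call; Saxon takes
    external variable bindings as trailing name=value pairs."""
    cmd = ["java", "-cp", jar, "net.sf.saxon.Query", "-t", f"-q:{query}"]
    cmd += [f"{name}={value}" for name, value in params.items()]
    return cmd

cmd = saxon_query_command("saxon9he.jar", "test.xq",
                          {"line": "l.1", "zone": "EETS.QD.4"})
print(cmd)
```

Passed to subprocess.run(cmd, capture_output=True), each argument reaches Saxon intact even if a value contains spaces or shell metacharacters.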
PHP has its own XSLT parser, which can be invoked like this:
$xml = new DOMDocument;
$xml->loadXML($text);
$xsl = new DOMDocument;
$xsl->load('comparison.xsl'); // Configure the transformer
$proc = new XSLTProcessor;
$proc->importStyleSheet($xsl); // attach the xsl rules
echo $proc->transformToXML($xml);
What this does is create a new object within PHP, load it with the XML we just finished creating, load a stylesheet to style said XML, and then attach the stylesheet to the XML. Finally, the result of the transformation is returned to the screen via the echo command. It works really well in most cases. The problem is that my XSL stylesheet has this piece of code in it:
<xsl:variable name="max" select="@quantity"/>
<xsl:for-each select="1 to $max">
  <xsl:text>.</xsl:text>
</xsl:for-each>
That code makes sure that any time there's a gap due to damage, the likely number of characters (taken either from what's left of the letters or from the critical text, EETS e.s. 107) is rendered as a number of dots, indicating that it's not just a blank space. The way I do that is through an XML attribute called @quantity and a for-each loop that prints dots until the system's internal counter matches @quantity. Iterating over a numeric range ("1 to $max") is XSLT 2.0 code – XPath 1.0 has no range expression, since philosophically XSLT generally eschews such loops in favor of its native <xsl:apply-templates> instruction. The native PHP XSLT parser is 1.0. It will not handle this code.
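For what it's worth, the gap-to-dots logic itself is tiny once a language has ordinary loops; a Python sketch of the same idea (the function name is mine):

```python
import xml.etree.ElementTree as ET

def render_gap(gap_xml):
    """Turn a <gap quantity="N" .../> element into a run of N dots,
    mirroring the XSLT for-each above."""
    gap = ET.fromstring(gap_xml)
    return "." * int(gap.get("quantity"))

print(render_gap('<gap quantity="8" unit="chars" reason="illegible"/>'))
```

It is only XSLT 1.0's lack of a range expression that makes the same thing awkward there.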
But!
Good old Saxon will handle XSLT 2.0, but we have no php-native XSLT parser for Saxon. So a second external call to the command line is made:
$transform = exec("java -jar saxon9he.jar -s:$filename -xsl:comparison.xsl -o:$html");
Notice, though, that that command has a $filename variable. The parser won't easily just take the string we had before, so now instead of keeping the result in memory it needs to be written to a file, which is then read by the Saxon XSLT 2.0 parser in the command above. Once it does so, it transforms the XML into HTML, which should be able to be displayed via a PHP echo statement. However, that doesn't work, so it's written to another file named by the variable $html.
Now, to actually display the information, we need to read that HTML file back in:
$test = file_get_contents($html);
and then display the results:
echo $test;
This is what you actually see when you click on the blue dot to the right of the line and the box opens up – the contents of $test. It's not as simple as just clicking on the dot, though. Clicking on that dot calls some JavaScript. JavaScript has to be used because you're changing something locally when you click on the dot, and JavaScript is a client-side scripting language.
The JavaScript passes the line, zone (here represented as id), and collection characteristics to the PHP code via this function:
function compare_toggle_visibility(id, line, collection) {
  var e = document.getElementById(id);
  e.style.display = ((e.style.display != 'none') ? 'none' : 'block');
  $(e).html("Loading Comparison…");
  $.get('/XML/XQuery/test_command_line.php' + '?collection=' + collection + '&zone=' + id + '&line=' + line,
    function(responseTxt){ $(e).html(responseTxt); });
}
What this does is first grab the element in the HTML with the matching id. It then checks whether that element has the style 'display:none' (indicating it should not be displayed) and switches it to 'display:block' if it does. That's what allows the box to "open up" and become visible. Having done that, it then puts some text into the box so that you know work is being done, and finally it loads the PHP page and sends the results of that PHP page to the box. Clicking on the dot again will close the box back up (and at this point re-runs the code pointlessly – that's something I need to fix).
On the PHP side, the three variables are passed to the PHP through $_GET (which grabs the appropriate value from the URI passed to PHP):
$line = htmlspecialchars($_GET["line"]);
$zone = htmlspecialchars($_GET["zone"]);
$collection = htmlspecialchars($_GET["collection"]);
and the code is processed as explained above. However, since there's the possibility of multiple people accessing the same lines at similar times, we can't have filenames that stay the same – they'd be overwritten. Instead, dynamic filenames need to be created. I do this by generating a random number and attaching it to the machine's timestamp, then creating two variables based on that number with the extensions .html and .xml.
$unique = microtime(true) . mt_rand(1, 5000000000);
$filename = $unique . ".xml";
$html = $unique . ".html";
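The same timestamp-plus-random-number scheme can be sketched in Python (the helper name is mine; in Python, the tempfile module would be the more robust way to get collision-free per-request files):

```python
import random
import time

def unique_names():
    """Mimic the PHP microtime(true) . mt_rand(...) scheme: a
    high-resolution timestamp with a random suffix, used to build
    paired per-request .xml and .html filenames."""
    unique = f"{time.time():.6f}{random.randint(1, 5000000000)}"
    return unique + ".xml", unique + ".html"

xml_name, html_name = unique_names()
```

Because both names share the same stem, the XQuery output and the transformed HTML for one request always travel together and never collide with another viewer's.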
After the code has executed, I clean up these files so that randomly named files don't clutter up my machine:
unlink($filename);
unlink($html);
The unfortunate effect of this constant call back and forth to the command line is a lag on the display of the comparison items, but my hope is that the Saxon PHP installer will be repaired and I can streamline it with the more integrated code.