Postgres to TEI XML – Bibliography

Apologies in advance: this post documents my attempts at transforming the current Postgres database contents into TEI XML. I will post notes here on how I get along, the issues I run into along the way and the solutions I hopefully find. If you are interested in the subject then please comment, especially if you have solutions to any of my issues or ideas for neater ways of doing things. The newest content comes first.

Tuesday, 5th February 2013

I now have one version of a transformation stylesheet transforming books from the bibliography table into TEI/EpiDoc XML. Here are some of the issues I have run into.

  1. Earlier in this post I mentioned the sortKey and how at first it didn’t fit the regular expression. I removed all the spaces and at first that worked fine. I figured the spaces were not essential, especially if I replaced them with an underscore instead. However, I have now come across a sortKey (which, by the way, is cleverly created automatically from the surname, forename, date, script_code and first word of the title) in which, in one particular case so far, some Russian script (part of the forename) has snuck in. Apparently Russian script can’t be part of the regular expression for sortKey, so I am going to have to find a way of getting rid of it. The question is: should it even be in the forename field?
  2. Another issue is the [script_code] field and whether it is correct for me to encode it as the attribute rend="script(Latn)". I put this question to the oracle that is the EpiDoc Markup list. Gabriel Bodard was quick to answer that I should put both language and script_code together in the xml:lang attribute. While this looks nice, I feel it will be a pain and will come back to haunt me whenever I need to separate language and script_code again for the publication. It’s possible to do, for sure! But it would be nice if there were another solution.
  3. The language names also need to be changed into the proper subtags. Thanks to Gabby for pointing me towards the list of appropriate tags: the IANA language subtag registry. The XSLT for changing the languages currently looks like this:
<xsl:if test="language">
    <xsl:attribute name="xml:lang">
        <!-- xsl:when is only allowed inside xsl:choose -->
        <xsl:choose>
            <xsl:when test="language='English'"><xsl:text>en</xsl:text></xsl:when>
            <xsl:when test="language='Chinese'"><xsl:text>zh</xsl:text></xsl:when>
            <xsl:when test="language='Czech'"><xsl:text>cs</xsl:text></xsl:when>
            <xsl:when test="language='Danish'"><xsl:text>da</xsl:text></xsl:when>
            <xsl:when test="language='Deutsch'"><xsl:text>de</xsl:text></xsl:when>
            <xsl:when test="language='Dutch'"><xsl:text>nl</xsl:text></xsl:when>
            <xsl:when test="language='French'"><xsl:text>fr</xsl:text></xsl:when>
            <xsl:when test="language='German'"><xsl:text>de</xsl:text></xsl:when>
            <xsl:when test="language='Greek'"><xsl:text>el</xsl:text></xsl:when>
            <xsl:when test="language='Hindi'"><xsl:text>hi</xsl:text></xsl:when>
            <xsl:when test="language='Italian'"><xsl:text>it</xsl:text></xsl:when>
            <xsl:when test="language='Japanese'"><xsl:text>ja</xsl:text></xsl:when>
            <xsl:when test="language='Khotan Saka'"><xsl:text>kho</xsl:text></xsl:when>
            <xsl:when test="language='Korean'"><xsl:text>ko</xsl:text></xsl:when>
            <xsl:when test="language='Pali'"><xsl:text>pi</xsl:text></xsl:when>
            <xsl:when test="language='Prakrit'"><xsl:text>pra</xsl:text></xsl:when>
            <xsl:when test="language='Russian'"><xsl:text>ru</xsl:text></xsl:when>
            <xsl:when test="language='Sanskrit'"><xsl:text>sa</xsl:text></xsl:when>
            <xsl:when test="language='Sinhalese'"><xsl:text>si</xsl:text></xsl:when>
            <xsl:when test="language='Tibetan'"><xsl:text>bo</xsl:text></xsl:when>
        </xsl:choose>
    </xsl:attribute>
</xsl:if>
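On the worry in point 2 about separating language and script again later: here is a minimal XSLT 1.0 sketch of how the two could be joined into one xml:lang value such as en-Latn and pulled apart again. The `row` wrapper element, the `bibl-lang`, `lang-part` and `script-part` template names and the two-language lookup are assumptions for illustration only, not taken from the actual stylesheet. The split also assumes the tag holds at most a language and a script subtag, and the empty <biblStruct> is there only to carry the attribute.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.tei-c.org/ns/1.0">

  <!-- Assumed input: <row><language>English</language><script_code>Latn</script_code></row> -->
  <xsl:template match="row">
    <!-- Shortened lookup; the full list of languages would go here -->
    <xsl:variable name="lang">
      <xsl:choose>
        <xsl:when test="language='English'">en</xsl:when>
        <xsl:when test="language='Russian'">ru</xsl:when>
      </xsl:choose>
    </xsl:variable>
    <biblStruct>
      <!-- Only write xml:lang when the language name was recognised -->
      <xsl:if test="string($lang)">
        <xsl:attribute name="xml:lang">
          <xsl:value-of select="$lang"/>
          <!-- Append the script subtag, e.g. "-Latn", when one is present -->
          <xsl:if test="normalize-space(script_code)">
            <xsl:value-of select="concat('-', normalize-space(script_code))"/>
          </xsl:if>
        </xsl:attribute>
      </xsl:if>
    </biblStruct>
  </xsl:template>

  <!-- Language part of a tag: "en-Latn" gives "en", "en" stays "en" -->
  <xsl:template name="lang-part">
    <xsl:param name="tag"/>
    <xsl:choose>
      <xsl:when test="contains($tag, '-')">
        <xsl:value-of select="substring-before($tag, '-')"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="$tag"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

  <!-- Script part of a tag: "en-Latn" gives "Latn", "en" gives an empty string -->
  <xsl:template name="script-part">
    <xsl:param name="tag"/>
    <xsl:value-of select="substring-after($tag, '-')"/>
  </xsl:template>
</xsl:stylesheet>
```

With two small named templates like these, the separation for the publication would be one xsl:call-template per use rather than repeated string handling.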

Wednesday, 30th January 2013

I am currently stuck on sort -> @sortKey, as a key such as “Strong_john S._0_2004_01latnrelic” won’t validate:

value of attribute "sortKey" is invalid; must be a string matching the regular expression "(\p{L}|\p{N}|\p{P}|\p{S})+"

The regular expression should cover everything, so I couldn’t quite see why it doesn’t cover the … Ah, I just figured it out. What I couldn’t see in Oxygen, because of the red “this does not validate” line, was that there is a space between “john” and “S.”, and whitespace is not covered by the regular expression. I’ll just have to remove it, and I don’t think that will be a big issue as it won’t change the sort key as such. I think! (Edit: it won’t if I put in an underscore instead.)

Sorted with the following code:

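<xsl:if test="sort">
    <xsl:attribute name="sortKey">
        <!-- Replace every space with an underscore so the key matches the schema pattern -->
        <xsl:value-of select="translate(sort, ' ', '_')"/>
    </xsl:attribute>
</xsl:if>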

Thanks to Woopsydoozy for solution 2.
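Should a key ever turn up with leading, trailing or doubled spaces, one possible variant is to run normalize-space() before translate(), so that each run of whitespace ends up as exactly one underscore. This is only a sketch: the `row` wrapper element is an assumption about the exported XML, and the <biblStruct> is left empty apart from the attribute.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.tei-c.org/ns/1.0">

  <!-- Assumed input: <row><sort>Strong_john S._0_2004_01latnrelic</sort></row> -->
  <xsl:template match="row">
    <biblStruct>
      <!-- Skip the attribute entirely when the sort column is empty or blank -->
      <xsl:if test="normalize-space(sort)">
        <xsl:attribute name="sortKey">
          <!-- normalize-space trims the ends and collapses inner whitespace to single spaces -->
          <xsl:value-of select="translate(normalize-space(sort), ' ', '_')"/>
        </xsl:attribute>
      </xsl:if>
    </biblStruct>
  </xsl:template>
</xsl:stylesheet>
```

For the example key above this gives sortKey="Strong_john_S._0_2004_01latnrelic".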

Tuesday, 29th January 2013

These last weeks I have been mapping the fields of the Bibliography table to what I currently think are the appropriate TEI elements. I am using version 8.16 of the EpiDoc schema for TEI P5.

I had previously settled on fitting the bibliography, which is currently in Postgres, into <biblStruct> elements inside a <listBibl>, and I have now settled on a list of elements and attributes for the different columns of the bibliography table:

  • id -> @n
  • sort -> @sortKey
  • script_code -> @rend =”script(   )”
  • language -> @xml:lang
  • pending -> @status=”true/false”
  • author_first -> <author> <persName><forename>
  • author_last -> <author> <persName><surname>
  • add_authors -> <author> – needs to be edited to the above
  • editor -> <monogr><editor>
  • eds -> if the author is the editor
  • translator -> <editor @role=”translator”>
  • publication_author -> <monogr><author> if incollection
  • date -> <imprint><date>
  • type -> use to define whether <monogr> or <analytic> or both and how
  • article_title -> <analytic><title @level=”a” @type=”main” @xml:lang=” ? “
  • article_title_ns -> <analytic><title @level=”a” @type=”alt” @xml:lang=”  “
  • article_title_en -> <analytic><title @level=”a” @type=”alt” @xml:lang=”eng”
  • publication_title -> <monogr><title @level=”m/j” @type=”main” @xml:lang=” ? “
  • publication_title_ns -> <monogr><title @level=”m/j” @type=”alt” @xml:lang=”  “
  • publication_title_en -> <monogr><title @level=”m/j” @type=”alt” @xml:lang=”eng”
  • series_title -> <series><title @level=”s” @type=”main” @xml:lang=” ? “
  • series_title_ns -> <series><title @level=”s” @type=”alt” @xml:lang=”  “
  • series_title_en -> <series><title @level=”s” @type=”alt” @xml:lang=”eng”
  • volume -> <biblScope @unit=”vol”>
  • issue -> <biblScope @unit=”issue”>
  • location -> <pubPlace xml:lang=” ? “>
  • location_ns -> <pubPlace xml:lang=”  “>
  • publisher -> <monogr><imprint><publisher @xml:lang=” ? “>
  • publisher_ns -> <monogr><imprint><publisher @xml:lang=” “>
  • pages -> <biblScope @unit=”pp”>
  • plates -> <biblScope @unit=”plates”>   *
  • comments -> <note>
  • url -> <ptr type=”url” @target=” ? “>
  • review_of -> <ptr type=”review_of” @target=” ? “> *
  • reviewed_by -> <ptr type=”reviewed_by” @target=” ? “> *
  • bib_id -> <idno @type=”bibID”>

*not in TEI guidelines
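To make the mapping a bit more concrete, here is a minimal XSLT 1.0 sketch of how a handful of these columns might be turned into a <biblStruct> for a book. It assumes the table has been exported as a `bibliography` root element with one `row` per record and one child element per column. Those two wrapper names are illustrative assumptions, and only a few of the columns are covered.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.tei-c.org/ns/1.0">
  <xsl:output method="xml" indent="yes"/>

  <!-- Assumed export format: <bibliography><row><id>…</id><author_last>…</author_last>…</row></bibliography> -->
  <xsl:template match="/bibliography">
    <listBibl>
      <xsl:apply-templates select="row"/>
    </listBibl>
  </xsl:template>

  <!-- One biblStruct per row; this covers the simple book case only -->
  <xsl:template match="row">
    <biblStruct n="{id}">
      <monogr>
        <author>
          <persName>
            <forename><xsl:value-of select="author_first"/></forename>
            <surname><xsl:value-of select="author_last"/></surname>
          </persName>
        </author>
        <title level="m" type="main">
          <xsl:value-of select="publication_title"/>
        </title>
        <imprint>
          <pubPlace><xsl:value-of select="location"/></pubPlace>
          <publisher><xsl:value-of select="publisher"/></publisher>
          <date><xsl:value-of select="date"/></date>
        </imprint>
        <!-- Optional columns are only written when they hold something -->
        <xsl:if test="normalize-space(pages)">
          <biblScope unit="pp"><xsl:value-of select="pages"/></biblScope>
        </xsl:if>
      </monogr>
      <xsl:if test="normalize-space(comments)">
        <note><xsl:value-of select="comments"/></note>
      </xsl:if>
    </biblStruct>
  </xsl:template>
</xsl:stylesheet>
```

Articles and chapters would additionally need an <analytic> block, chosen by the type column, which this sketch leaves out.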