Beginning the Doctorate

The Doctorate is going according to plan at the moment, the plan in question being the one I have just finished writing up. According to my thesis proposal I will be spending the next half year building an ontology and looking into all things ontology related.

I have spent the last half year on the Vindolanda Tablets in XML form, bringing the encoding up to the latest EpiDoc standards and incorporating a more detailed granularity. I have been looking into the idea of contextual encoding, where you mark up the actual words in a text, so that, for example, pullu (i.e. chicken) is encoded as a word with the lemma pullus:

<w lemma="pullus">pullu</w>
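
Marked up this way, the lemmas are easy to pull back out of the text. Here is a minimal sketch in Python of what that looks like, assuming a simplified fragment without the TEI namespaces a real EpiDoc file would carry:

    import xml.etree.ElementTree as ET

    # Simplified fragment; real EpiDoc texts sit in the TEI namespace.
    fragment = '<ab><w lemma="pullus">pullu</w></ab>'

    root = ET.fromstring(fragment)
    for w in root.iter("w"):
        # Pair each transcribed form with its dictionary lemma.
        print(w.text, "->", w.get("lemma"))  # prints: pullu -> pullus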

The next step is to turn the indexing and searching system I have developed out of this into a new website, and to build a web service on top of it which I and others can use later.

4 Responses

  1. admin

    I think I understand what you are on about now.
    I agree it would be lovely if we could use a lemmatisation tool which could go through all the un-normalised text and give us the lemma for each word, simple and easy.
    However, in this case we already had the lemma for each word in the index of the publication, and since the aim of the project was not to find a clever way of lemmatising texts, it didn't seem worthwhile for me to look too closely into this.
    I am still not aware of a lemma analyser out there which can handle the Vindolanda Tablets, so if you have one in mind I would love to hear about it.

  2. Ryan

    To clarify (since a lot of terms for search technologies are overloaded), I mean more in the sense of a morphological analyzer which uses a lemmatizing approach (usually based on a lemma dictionary; see e.g. the Perseus Morpheus tool). This would be the component that sits between the extractor (which pulls the strings you want from the XML) and the database, performing language-specific analysis (tokenization, stop word removal or bi-gramming, etc.) before inserting the processed tokens (and where they came from in the document) into the database. In this case the processed tokens would be the lemma forms automatically generated from the un-normalized transcribed words, the idea being that this would easily reflect changes in document content without every document having to contain the explicit lemma for every word.
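
    Roughly the shape of the pipeline I have in mind, as a Python sketch; the lemma dictionary is just a stand-in (a real setup would call out to an analyzer like Morpheus), and the table and document names here are made up:

        import sqlite3
        import xml.etree.ElementTree as ET

        # Stand-in lemma dictionary; a real pipeline would consult a
        # morphological analyzer such as Morpheus instead.
        LEMMA_DICT = {"pullu": "pullus"}

        def index_document(doc_id, xml_text, db):
            root = ET.fromstring(xml_text)
            for pos, w in enumerate(root.iter("w")):
                token = (w.text or "").strip().lower()
                # Lemmatize at index time, so the XML itself stays lemma-free.
                lemma = LEMMA_DICT.get(token, token)
                db.execute("INSERT INTO lemma_index VALUES (?, ?, ?)",
                           (lemma, doc_id, pos))

        db = sqlite3.connect(":memory:")
        db.execute("CREATE TABLE lemma_index (lemma TEXT, doc_id TEXT, pos INTEGER)")
        index_document("doc1", "<ab><w>pullu</w></ab>", db)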

    I may just be being optimistic, as it seems some of the automatic lemmatizers for Greek/Latin are able to get unambiguous lemmata only perhaps half of the time, so the advantage of explicitly tagging the lemmata would be that the “correct” lemma form is used all of the time.

  3. Henriette

    The advantage is that by marking up all the words in the transcription we have a flexible index which can be modified on the fly.
    As an example, I have built an extractor which adds all the lemmas to a database. This database then works as an index through which it is much easier to search the text, and because each word is marked up in the text we can also direct the user straight to that word in the text. I have also written a small piece of JavaScript which lets the user link from the actual word in the text to the occurrence of the lemma in the database.
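
    In outline the extractor does something like the following (a simplified Python sketch: the table and identifier names are illustrative, and the real files need full EpiDoc namespace handling):

        import sqlite3
        import xml.etree.ElementTree as ET

        XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

        db = sqlite3.connect(":memory:")
        db.execute("CREATE TABLE lemmas (lemma TEXT, tablet TEXT, word_id TEXT)")

        def extract(tablet_id, xml_text):
            root = ET.fromstring(xml_text)
            for w in root.iter("w"):
                # The xml:id is what lets the website jump straight to
                # the word in the transcription.
                db.execute("INSERT INTO lemmas VALUES (?, ?, ?)",
                           (w.get("lemma"), tablet_id, w.get(XML_ID)))

        extract("tab1", '<ab><w xml:id="w1" lemma="pullus">pullu</w></ab>')

        # The same index gives us word lists and frequencies for free:
        for lemma, n in db.execute("SELECT lemma, COUNT(*) FROM lemmas GROUP BY lemma"):
            print(lemma, n)
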
    This is just a small example of what you can do with it and I have plans for much more.
    I wouldn't think there would be much more work in maintaining the marked-up words in the XML than in maintaining a lemma dictionary.
    And a further advantage is that we can use this system to generate word lists and word frequencies at a later point without a lot of copy/paste.
    Hope that answered your question.
    Henriette

  4. Ryan

    What are the advantages of this approach over using a single lemma dictionary at index time? It seems to me that this would introduce a lot of repetition in trying to maintain the content and ensure consistent lemma forms.