Yela Schauwecker, Achim Stein
Automatic Morphosyntactic and Dependency Annotation of the Anglo-Norman Text Database
Abstract Non-standardized languages are an immense challenge for automatic annotation. This paper discusses the case of Anglo-Norman (AN), the variety of Old French (OF) spoken and written in medieval England for over 300 years, until well after 1400. In addition to the irregularities in spelling, inflection, and word order that are also characteristic of OF, AN developed particular spelling variants and shows even less consistent case marking, as well as considerable diachronic variation between the earliest (c1112) and the latest (c1440) texts in the Anglo-Norman text database (Rothwell and Trotter 2005; henceforth “ANdb”).
We present the first attempt to provide an automatic grammatical analysis of the ANdb. We applied machine-learning techniques combined with lexicon-driven tools that were trained on OF resources. This paper is organized according to the individual steps in the annotation process. Section 1 gives a succinct overview of the historical context and some relevant linguistic peculiarities of AN. Section 2 deals with the automated graphical “normalisation” of the texts: we generated regularized spellings that temporarily substituted the original graphical forms during the annotation process in order to improve the accuracy of lemmatisation, part-of-speech tagging, and dependency parsing. Section 3 describes how a dependency parser developed for Old French was applied to the normalised version of the AN data, and discusses the usefulness of the parsed output for historical syntactic research.
Keywords Dependency parsing, part-of-speech tagging, automatic spelling normalisation, Anglo-Norman, Old French historical corpora
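The idea of temporary substitution summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the lookup table `TOY_LOOKUP`, the function names, and the stand-in `stub_tagger` are all hypothetical and chosen only for illustration. The sketch assumes a simple token-level lookup, whereas the actual normalisation step may well work differently. The point it shows is that the tagger only ever sees the regularized forms, while the original spellings are kept alongside the resulting annotation.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy lookup from AN spellings to regularized OF-like forms.
# Illustrative only; this is not the rule set used for the ANdb.
TOY_LOOKUP: Dict[str, str] = {"jeo": "je", "ceo": "ce", "pur": "por", "rei": "roi"}


def normalise(token: str, lookup: Dict[str, str]) -> str:
    """Return the regularized form if one is listed, else the token unchanged."""
    return lookup.get(token.lower(), token)


def annotate_with_normalisation(
    tokens: List[str],
    tagger: Callable[[List[str]], List[str]],
    lookup: Dict[str, str] = TOY_LOOKUP,
) -> List[Tuple[str, str, str]]:
    """Tag the normalised forms, then pair each tag with the original spelling."""
    normalised = [normalise(t, lookup) for t in tokens]
    tags = tagger(normalised)  # the tagger never sees the original spellings
    if len(tags) != len(tokens):
        raise ValueError("tagger must return exactly one tag per token")
    # Keep original form, temporary normalised form, and annotation together.
    return list(zip(tokens, normalised, tags))


def stub_tagger(forms: List[str]) -> List[str]:
    """Stand-in for a tagger trained on OF; knows only two regularized forms."""
    known = {"je": "PRON", "roi": "NOUN"}
    return [known.get(form, "X") for form in forms]


if __name__ == "__main__":
    for original, regular, tag in annotate_with_normalisation(["jeo", "rei"], stub_tagger):
        print(original, regular, tag)
```

Because the original token is carried through unchanged, the regularized spelling can be discarded after annotation, which matches the "temporary" role the abstract assigns to it.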