This blog entry is about an attempt to use lemmatization (the software 'Collatinus') to recognize and isolate errors in the OCR of a 16th-century print, and in some cases to reconstitute words separated by incorrectly recognized word borders.

Recently I had the chance to work with a text of the Artis historicae penus, printed Basileae 1579. This is a collection of texts on the theory of historiography in two volumes; a large part of the first volume is occupied by Bodin's Methodus ad facilem historiae cognitionem. The two volumes had been OCR'ed some time ago with poor luck, partly owing to the poor quality of the digitization used as source, partly to the lack of suitable OCR software. Thus, among many mistakes, there was a constant misreading of long 's' as 'f', of the ligature for enclitic '-que' as 'q' or 'q ', and a loss of hyphens or their replacement by '.'. There were also many split words, since the print often does not use hyphens when splitting words at the linebreak. Enclitic -que was easily resolved, since words ending in -que form a closed class in Latin with very few ambiguous cases (most importantly 'quoque', meaning either 'etiam' or 'et quo').

This collection of texts seemed the perfect test case for an idea I have toyed with for some time: to use a lemmatizer as a spell-checker. The software I used was Collatinus, described on its website as a 'Lemmatiser and morphological analyser for Latin texts'. It is maintained by Yves Ouvrard and available under the GNU GPL licence. I have been using it recently for lemmatizing Early Modern Latin texts. Version 11 comes with a built-in server, which can be interrogated from other programs and returns its answer via the clipboard. The text of the first volume, which was my source, has 360,000 tokens.
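The core idea can be sketched in a few lines of Python. This is a minimal illustration, not the actual pipeline: the `lemmatize` function here is a stub standing in for a query to the Collatinus server (whose real interface goes through the clipboard, as described above), and the tiny lexicon, the partial list of closed-class -que words, and all function names are my own illustrative assumptions. A token is flagged as a probable OCR error when it fails to lemmatize; before flagging, the script tries the two repairs mentioned in the post, stripping enclitic -que and rejoining words split at a linebreak.

```python
# Lemmatizer-as-spell-checker, sketched with a stub in place of
# Collatinus. KNOWN and QUE_EXCEPTIONS are tiny demo sets, not
# real data from the project.

KNOWN = {"historia", "methodus", "facilis", "quoque", "scribere"}

def lemmatize(token):
    """Stub: True if the token has a lemma in our demo lexicon.
    In the real setup this would query the Collatinus server."""
    return token.lower() in KNOWN

# Closed class of genuine -que words (partial, for illustration).
QUE_EXCEPTIONS = {"quoque", "atque", "neque", "itaque", "namque"}

def strip_que(token):
    """Split enclitic -que unless the word is a genuine -que word,
    e.g. 'quoque' (= 'etiam' or 'et quo')."""
    if token.lower().endswith("que") and token.lower() not in QUE_EXCEPTIONS:
        return token[:-3]
    return token

def check(tokens):
    """Return tokens the lemmatizer rejects; repair split words by
    joining a failing token with its successor when the join lemmatizes."""
    errors = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if lemmatize(strip_que(tok)):
            i += 1
            continue
        # Possibly a word split at the linebreak: try joining.
        if i + 1 < len(tokens) and lemmatize(tok + tokens[i + 1]):
            tokens[i:i + 2] = [tok + tokens[i + 1]]
            continue
        errors.append(tok)
        i += 1
    return errors

sample = ["historia", "metho", "dus", "facilisque", "fcribere"]
print(check(sample))   # the long-s misreading 'fcribere' is flagged
print(sample)          # 'metho' + 'dus' rejoined to 'methodus'
```

Here 'metho' and 'dus' are rejoined, 'facilisque' passes after the enclitic is stripped, and 'fcribere' (the long-'s'-as-'f' misreading of 'scribere') is the one token flagged for manual correction.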