Non-standardised varieties are a major area of ongoing linguistic research at the Slavic seminar. This work includes the development of annotated linguistic corpora, with a particular focus on dialectal and historical collections and spoken-language corpora. The current selection of corpora is available at https://gitlab.uzh.ch/uzh-slavic-corpora
Macedonian Spoken Corpus
Pre-Standardized Balkan Slavic Literature
The corpus includes various Balkan Slavic texts from the 15th to the 19th century. The annotated section comprises 20 shorter texts with full morphological and syntactic annotation (48k tokens). The raw section contains 14 sources digitized in full, either manually or automatically (ca. 1M tokens).
Contact: Ivan Šimko