Zurich Corpora of Slavic Varieties (ZuCoSlaV)

A major area of ongoing linguistic research at the Slavic seminar is the study of non-standardised varieties. This work also involves the development of annotated linguistic corpora, with a particular focus on dialectal and historical collections and spoken language corpora. The current selection of corpora is available at https://gitlab.uzh.ch/uzh-slavic-corpora

Macedonian Spoken Corpus


The corpus comprises transcriptions of audio recordings collected during a series of field research trips to the Prespa, Bitola and Debar regions in 2012, 2014, 2016 and 2019.

Contact: Anastasia Escher


Pre-Standardized Balkan Slavic Literature


The corpus includes various Balkan Slavic texts from the 15th to 19th centuries. The annotated section includes 20 shorter texts with full morphological and syntactic annotation (48k tokens). The raw section contains 14 sources, each digitized in full either manually or automatically (ca. 1M tokens).

Contact: Ivan Šimko




The corpus contains transcripts of interviews about traditional culture and history with speakers of Torlak from the Timok area. It comprises 500,697 tokens representing 80 h of recording.

Contact: Teodora Vuković



Serbian Forms of Address


The corpus contains transcripts of interviews about forms of address that Serbian speakers use in colloquial and formal settings. It consists of 171,552 tokens, corresponding to about 19 h of recording.

PhD project of Sonja Ulrich

Contact: Dolores Lemmenmeier-Batinić, Sonja Ulrich