Multi-CAST: Multilingual Corpus of Annotated Spoken Texts

Multi-CAST is an online collection of annotated spoken language corpora from a steadily expanding range of typologically diverse languages. It features standardized annotations across multiple levels, targeting morphosyntactic structure and reference. Multi-CAST has been designed as a tool for quantitative, corpus-based typology. It is based on open-source software resources, and all data are fully accessible under a Creative Commons licence.

- 11 corpora from diverse languages
- each corpus comprises at least 1000 clauses, for a total of 20000 clauses (c. 85000 words)
- 10 additional corpora in preparation
- multiple annotation layers for morphosyntax and referent tracking (including zero anaphora) using unified annotation schemes (GRAID, RefIND)
- a companion R package facilitates quantitative cross-corpus analysis

For a comprehensive one-stop overview, see the following document: