Multi-CAST: Multilingual Corpus of Annotated Spoken Texts

Multi-CAST is an online collection of annotated spoken language corpora from a steadily expanding range of typologically diverse languages.

It features standardized annotations across multiple levels, targeting morphosyntactic structure and reference. Multi-CAST is designed as a tool for quantitative, corpus-based typology. It is built on open-source software resources, and all data are fully accessible under a Creative Commons licence.

- 11 corpora from diverse languages
- each corpus comprises at least 1,000 clauses, for a combined total of 20,000 clauses (c. 85,000 words) across all corpora
- 10 additional corpora in preparation
- multiple annotation layers for morphosyntax and referent tracking (including zero anaphora) using unified annotation schemes (GRAID, RefIND)
- a companion R package facilitates quantitative cross-corpus analysis

For a comprehensive overview, see the following document.