Ongoing projects

Multi-CAST: Multilingual Corpus of Annotated Spoken Texts

The official Multi-CAST website is available here.

For a comprehensive one-stop overview, see the following document.

Personal pronouns and person clitics in Tabasaran: Toward a theory of Person

Funded by the Deutsche Forschungsgemeinschaft (DFG)

Researcher: Dr. Natalia Bogomolova

Project Managers: Dr. Natalia Bogomolova, Prof. Dr. Geoffrey Haig

Funding period: September 1st, 2022 to August 31st, 2025 (36 months)

The project has two major goals: an in-depth investigation of person in the grammar of Tabasaran (Nakh-Daghestanian), and the use of new empirical data from this understudied language in theoretical syntactic research in order to advance the theory of Person. The category of person in Tabasaran is manifested via two systems: free personal pronouns and an elaborate system of person clitics, which display a number of interesting properties. First, both subject and non-subject person arguments can be marked on the finite verb by a clitic. Clusters of a subject and a non-subject clitic are also possible. Second, person subjects of transitive and intransitive clauses and non-canonical subjects behave differently with respect to clitics. In root declarative clauses, canonical subjects are always clitic-doubled, while in clauses with non-canonical subjects both subject and non-subject can trigger a clitic on the verb. Third, allowing clitic clusters, Tabasaran demonstrates a phenomenon known as the Person–Case Constraint, reminiscent of what is attested in Romance languages, with some important differences. Fourth, both pronouns and clitics exhibit indexical shift in speech reports, losing their indexical semantics and referring to the arguments of the matrix clause. The proposed project collects and analyzes a substantial body of new empirical data that is challenging for syntactic theory, puts current approaches under scrutiny with regard to their ability to deal with those facts, and modifies them with a view to a better understanding of how information about person is conveyed in human language.

Post-predicate Elements in Iranian: Inheritance, Contact, and Information Structure

Funded by the Alexander-von-Humboldt-Stiftung

Funding period: 01.07.2019-30.06.2022
PIs: Geoffrey Haig (Bamberg); Mohammad Rasekh-Mahand (Hamedan)

Iranian languages are routinely classified as "verb final". While this is true with regard to the position of (non-pronominal) direct objects, which are generally pre-verbal, in several West Iranian languages certain other constituents occur more or less systematically after the verb. The result is a typologically unusual and hitherto largely ignored OVX word order type within West Iranian. Furthermore, OVX word order has been identified in unrelated languages in contact with Iranian, including Turkic and Neo-Aramaic.

This project brings together leading international experts on Iranian and neighbouring languages in order to explore

  • the extent of OVX word order within Iranian, and its genesis within the family
  • the areal spread of OVX word order in neighbouring languages, and the pathways of transmission
  • information-structural correlates of OVX word order
  • typological implications of OVX word order.

For more information click here

Does morphosyntactic alignment shape discourse? Implementing a corpus-based approach to linguistic typology

This project is a proof-of-concept study for corpus-based approaches to typology. We address the question of whether typological differences in the morphosyntax of individual languages are reflected in the organization of spontaneous spoken discourse of those languages, with a special focus on so-called ergative languages. While claims of a co-dependence between grammar and discourse have regularly been made in the literature (Hopper 1983, Du Bois 2003, Durie 2003), the issue has never been systematically investigated on a more representative language sample.

The project builds on an existing language archive architecture (Multi-CAST, The Multilingual Corpus of Annotated Spoken Texts, online here), and implements an expanded version of the syntactic annotation system GRAID (Grammatical Relations and Animacy in Discourse, Haig & Schnell 2014, manual here). The existing language sample in Multi-CAST is being extended by the inclusion of ergative languages from the Nakh-Daghestanian language family and from Australia, and of data from Philippine-type languages. All corpora are subjected to a standardized annotation procedure, and the resulting data feed into quantitative cross-corpus analysis in order to identify significant statistical patterns in connected discourse, for example:

  • the distribution of referential expressions across syntactic functions,
  • the density of zero-anaphora,
  • patterns of new-referent introduction,
  • the division of labour among pronouns and lexical expressions,
  • the impact of animacy on syntactic configurations.

The resulting dataset, the first of its kind worldwide, aids the detection of possible correlations between the alignment of morphosyntax and probabilistic patterning in the way connected spoken language is organized.
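As a rough illustration of the kind of cross-corpus counts described above, the following Python sketch tallies referential forms by syntactic function and computes the share of zero anaphora per corpus. It is a toy example only: the corpus names, the simplified form and function labels (loosely modelled on GRAID-style glosses), the data, and the function names form_by_function and zero_density are all invented for illustration and do not reflect the project's actual data or tooling.

```python
from collections import Counter, defaultdict

# Hypothetical, simplified annotations: (corpus, form, function) triples.
# Labels loosely imitate GRAID-style glosses; this is NOT real Multi-CAST data.
ANNOTATIONS = [
    ("corpus_a", "np", "s"),
    ("corpus_a", "pro", "a"),
    ("corpus_a", "zero", "a"),
    ("corpus_a", "np", "p"),
    ("corpus_a", "zero", "s"),
    ("corpus_b", "np", "a"),
    ("corpus_b", "np", "p"),
    ("corpus_b", "pro", "s"),
    ("corpus_b", "zero", "s"),
]


def form_by_function(annotations):
    """Count referential forms for each (corpus, syntactic function) pair."""
    counts = defaultdict(Counter)
    for corpus, form, function in annotations:
        counts[(corpus, function)][form] += 1
    return counts


def zero_density(annotations, corpus):
    """Share of zero-anaphora tokens among all referential tokens of a corpus."""
    forms = [form for c, form, _ in annotations if c == corpus]
    if not forms:
        return 0.0
    return forms.count("zero") / len(forms)


if __name__ == "__main__":
    table = form_by_function(ANNOTATIONS)
    for (corpus, function), forms in sorted(table.items()):
        print(corpus, function, dict(forms))
    for corpus in ("corpus_a", "corpus_b"):
        print(corpus, "zero density:", zero_density(ANNOTATIONS, corpus))
```

Keeping the counts keyed by corpus and function makes it straightforward to compare, say, the forms found in transitive subject position across languages with different alignment types; any real analysis would of course need a far larger sample and appropriate statistical testing.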

The project is being coordinated by Geoffrey Haig, Stefan Schnell, and Nils Schiborr at the University of Bamberg, and runs in collaboration with researchers from the Centre of Excellence for Dynamics of Language, Canberra and Melbourne (Nick Thieberger), and the University of Jena (Diana Forker).

The project is supported by a DFG grant (project number 323627599), for an initial period of 2017–2020.

Bamberg Lexical Database for Contemporary Iranian Languages (BLDCIL)

Background and aims

The sub-classification of Iranian languages has proven to be a particularly recalcitrant problem in historical linguistics (see Korn 2016 for recent proposals, DOI: 10.1515/if-2016-0021). This project aims to complement and extend existing scholarship by applying a phylogenetic approach, based on lexical comparison, to the problem; see e.g. Heggarty et al. (2010) for the background to this kind of approach.

The project's point of departure is the question of the sub-grouping of Zazaki, a West Iranian language spoken in Central Anatolia (in today's Turkey), but historically not closely related to its current geographical neighbours (Northern Kurdish); see Gippert (2007/2008) for a summary of relevant scholarship.

The aims of the project are thus two-fold: (i) to apply a novel methodology to an old problem in the sub-classification of Iranian languages; (ii) to serve as a proof-of-concept for the efficacy (or otherwise) of phylogenetic models in resolving classic problems of philology. The first phase, beginning in November 2016, involves the compilation of standardized lexical data sets, together with sound files, from a representative set of Iranian languages, focussing initially on the West Iranian languages.


The project is closely linked with two existing initiatives:

  • The CoBL (Cognacy in Basic Lexicon) database at the Max Planck Institute in Jena (Paul Heggarty and Cormac Anderson)
  • The Atlas of the Languages of Iran project (Editor: Erik Anonby)

The Jena/Bamberg Iranian List (JBIL) of meanings

The JBIL-list is a list of meanings that includes the 200 items used in the CoBL project and 80 items used in the Atlas of the Languages of Iran, plus a number of other items deemed of interest for Iranian languages. The items themselves, plus explanations and instructions for investigators, are available as downloads below:

  • The JBIL-list, with explanations and example sentences, and instructions for investigators (PDF, 247.0 KB, 20 pages)
  • The JBIL-list, with Persian translations and Persian example sentences (PDF, 470.0 KB, 25 pages)
  • The Data Entry Form, into which the actual forms for each language may be entered (DOC, 98.5 KB, 21 pages)



Data sets have been compiled, or are in the process of compilation, for the following languages:

  • Behdinî Kurdish
  • Jafi Kurdish





Sample Data

Sample data sets will be made available here shortly.



The BLDCIL project is coordinated by Geoffrey Haig (Bamberg) and Erik Anonby (Carleton/Bamberg). Data collection and handling are undertaken with the assistance of (in alphabetical order): Shirin Adibifar (Bamberg), Raheleh Izadifar (Hamedan), Mina Salehi (Bamberg), Mortaza Taheri-Ardali (Shahr-e Kord University).



The project gratefully acknowledges the financial and technical support of the Max Planck Institute for the Science of Human History (CoBL database), departmental funding from the University of Bamberg, and the Dept. of Linguistics at the University of Hamedan as a cooperation partner in the Islamic Republic of Iran.

Previous projects

Atlas of the Languages of Iran (Chief Editor: Erik Anonby)

Project for the creation of a digital language map of Iran. For more information click here

Documenting Dargi languages in Daghestan - Shiri and Sanzhi

In this project, three linguists (Diana Forker, Rasul Mutalov and Oleg Belyaev) and an ethnographer (Iwona Kaliszewska) will document and analyze Shiri and Sanzhi and the culture of the Shiri and Sanzhi people. Shiri and Sanzhi belong to two different Dargi languages (Nakh-Daghestanian), spoken in the central part of Daghestan in the Caucasus (Russia). The languages are heavily endangered. We estimate that there are only about 200 Shiri families and about 100 Sanzhi families left.

The project aims at a detailed and in-depth documentation of Shiri and Sanzhi through the collection of texts from a wide range of genres. These texts will be made available to the public via the DoBeS archive.

In the linguistic documentation and analysis of Shiri and Sanzhi we will pay special attention to those features that are unusual for the Nakh-Daghestanian language family and of broader typological interest. Two of these features are person agreement, which is based on the person hierarchy and not determined by grammatical roles, and extraordinarily rich TAM and evidentiality paradigms.

In our project we will collaborate with Russian colleagues (e.g. Nina Sumbatova) and colleagues from the University of Jena (Kevin Tuite, Florian Mühlfried). But our main cooperation partners will be Daghestanian researchers, students and the Shiri and Sanzhi communities.

The project is funded by the DoBeS program of the VW foundation. It started in summer 2012 and runs for three years.

For further information please visit our project page:

Chirag Documentation Project


Prof. Dr. Geoffrey Haig, Dr. Dmitry Ganenkov, Dr. Natasha Bogomolova

Project Details:

Major Documentation Project. Duration: 2014-2017. 126.000 EUR

Project Summary:

The project will document Chirag, an endangered language from the Dargwa branch of the East Caucasian (Nakh-Daghestanian) family, spoken in Daghestan, Russia (2100-2400 speakers). The main goal of the project is to collect a rich corpus of audio/video data from both traditional narratives and everyday communication. I propose to record about 110 hours of Chirag (spontaneous speech, lexical and grammatical elicitation), of which at least 25 hours of spontaneous speech will be transcribed, morphologically analyzed and translated to produce an annotated corpus of Chirag available on the internet.

Information can be found here.

Compilation and critical edition of pre-19th century Kurmanji Kurdish


Dr. Ergin Öpengin

Project Details:

Deutsche Forschungsgemeinschaft (DFG). Duration: 10/2014 - 03/2016. 134.767 EUR

Project Summary:

Kurmanji Kurdish is one of the most widely-spoken languages of the Middle East, but research on its history and development is severely hampered due to the lack of written attestation prior to the 15th century. Furthermore, the few samples of Kurdish prose that can reliably be ascribed to the period 15th-19th centuries are largely inaccessible to a wider scholarly audience, and lack reliable critical apparatus. This project will compile a selection of 10 Kurdish texts from prior to 1800, transliterated in a standardized format and supplied with English translations and an authoritative critical apparatus. The texts will also be made fully accessible as a digital corpus, accompanied by a concordance, and the resulting two volumes will be published on the open-access portal of the University of Bamberg. Issues of authorship and localization of the texts will also be assessed in the light of the applicant’s ongoing research on regional variation in Kurdish, which allows a much finer-grained evaluation than has previously been possible. The project will thus lay the foundation for serious academic research on the history of Kurdish by creating an open-source research resource for questions relating to the history of the Kurdish language(s), the position of Kurdish within the West Iranian languages, the linguistic ecology of Kurdistan in the Ottoman period, the timing of contact phenomena and of language change, and literary and religious scribal practices in the period.

Agreement in Discourse

This project explores the function of agreement in natural texts. The concept of agreement has played a key role for various domains of linguistic theory (morphology, syntax, semantics), and there are a number of different approaches to modeling it. However, there is still no generally accepted explanation of its function: Why should languages so often develop agreement in their grammars? In his seminal work on agreement, Corbett (2006: 274-275; see also Lehmann 1988, Levin 2001: 21-27, Kibrik 2011) proposes four possible functions of agreement, among which the most important are:

  1. Agreement provides additional redundant (repeated) information to facilitate understanding for the hearer.
  2. Agreement helps the hearer to keep track of the different referents in a discourse.

Remarkably, the two central claims (agreement is redundant, and agreement is referential) continue to be repeated in the literature, despite the fact that, with very few exceptions, neither has ever been subjected to more rigorous testing (cf. Siewierska 1998, Bickel 2003), and both clearly admit counterexamples.

Thus, the aim of this project is to test the proposed functions of agreement against a sample of 20 languages from all around the world. The results will be of central importance to the language sciences, and a test case for the applicability of text-based, as opposed to grammar-based, typology.

The project is conducted by Diana Forker and financially supported by the Daimler and Benz Foundation.

As part of the project Diana Forker and Geoffrey Haig organize a workshop at the University of Bamberg (1-2 February, 2013).

Documentation of Gorani, an endangered language of West Iran

This project is funded by the VW-foundation’s Programme Dokumentation Bedrohter Sprachen (DoBeS). It was originally granted for three years (2007-2010), but has been extended till 2012. It is a collaborative project, conducted together with Professor Ludwig Paul (Hamburg) and Professor Philip Kreyenbroek (Göttingen). Information on the project can be found here.