Call for Posters

Corpus linguistics is enjoying growing popularity in virtually all branches of linguistics. However, most corpora still represent just a tiny fraction of linguistic reality, as they usually consist of samples of present-day written standard language. This workshop aims to bring together researchers working on the (manual or automatic) annotation of non-standard corpora. These include, for example, historical corpora, corpora of spoken language and co-speech gesture, chat corpora, learner corpora, or corpora of signed languages. In particular, we focus on the peculiarities of syntactic and semantic annotation, with the goal of discussing best-practice models for dealing with issues of normalization and uncertainty in non-standard data.

One fairly obvious challenge is the absence of clear graphical cues for syntactic structures: while sentence boundaries are fairly easy to identify in corpora of present-day standard language on the basis of punctuation, different cues have to be taken into account in modalities other than writing. Even in written non-standard data, explicit cues for sentence boundaries can be absent or used differently. Moreover, some phenomena, such as ellipses, repetitions, and disfluencies, are highly frequent in non-standard data and can be linguistically meaningful. Yet their analysis poses additional challenges, since pre-processing tools that were trained on standard written texts do not fare well with this kind of data, and there is no general consensus on how to represent them properly. In this discussion strand, we will assess the benefits and challenges that lie in (semi-)automatic pre-processing of non-standard data, and we will discuss how much normalization is needed.

As non-standard corpora tend to be fairly small, they are also well-suited for extensive (often manual) semantic annotation. However, their non-standard nature can entail challenges that go beyond the usual difficulties of semantic analyses. For example, coding historical language data for semantic aspects requires advanced knowledge not only of the language of the time period but also of the cultural environment in which the texts were created. Also, non-standard language can be expected to give rise to more ambiguities than standard language. This is especially true for spoken language, which can rely on cues from the situational context to a larger extent, or on shared knowledge between the discourse participants that is not necessarily accessible to the person analyzing a conversation. With the purpose of transparent documentation and re-usability in mind, the question arises whether such uncertainties actually should be resolved in all cases, and what annotation and data representation strategy should be adopted.

The issue of transparency and re-usability also pertains to other aspects that call for a trade-off between conflicting interests. For syntactic annotation, community standards like the tagset developed in the Universal Dependencies project enjoy growing popularity and have successfully been applied to non-standard, especially spoken, data. However, for semantics and discourse annotation, the issue of comparability and generalizability of annotation categories is still pervasive, as annotations tend to be made by individual researchers for very specific projects. Hence, another goal of the workshop will be to discuss best-practice strategies for the development of annotation guidelines that satisfy the needs of a specific research question while keeping potential future users in mind.

Our workshop aims to provide a discussion forum for these and other open questions relating to the annotation of non-standard corpora. In addition, its goal is to establish a network between researchers from different linguistic disciplines facing similar challenges on very different datasets.

In addition to the oral presentations, we invite abstracts for poster presentations. If you are working on a project that is relevant to the subject area of the workshop and that you would like to present as a poster, please send a short abstract (~500 words) to stefan1.hartmann[at]uni-bamberg.de by July 10th, 2019. Notifications of acceptance/rejection can be expected by July 20th.