Paper accepted at NAACL 2024

Paper accepted for the Conference of the North American Chapter of the Association for Computational Linguistics:

"You are an expert annotator": Automatic best–worst-scaling annotations for emotion intensity modeling.

We have a new paper accepted:

Christopher Bagdon, Prathamesh Karmalkar, Harsha Gurulingappa, and Roman Klinger. "You are an expert annotator": Automatic best–worst-scaling annotations for emotion intensity modeling. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Mexico City, Mexico, June 2024. Association for Computational Linguistics.

In this paper, we build on the observation that humans are better at comparing texts than at assigning absolute scores. As an example, if a human is asked to assign an emotion intensity score to the text

She’s quite happy.

different people will likely come up with different scores that hardly agree with each other.

If you instead ask people to compare two texts and decide which one is more intense, the task becomes much easier:

1. She’s quite happy.
2. He is extremely delighted.

One instance of such a comparison task is best–worst scaling (BWS), in which human annotators are asked to pick the most intense and the least intense instance from a set of texts, often a 4-tuple. From such BWS annotations one can also infer continuous scores, and these scores are typically much more reliable than those obtained by asking directly for a rating of an isolated instance.
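To make this concrete, the sketch below shows the common counting procedure for turning BWS annotations into continuous scores, where an item's score is (#times picked as best − #times picked as worst) / #times shown. The function and the example texts are illustrative only, not the implementation used in the paper.

```python
# Minimal sketch: converting best-worst-scaling annotations into continuous
# scores with the counting procedure (best count minus worst count,
# normalized by how often an item was shown). Not the paper's code.
from collections import Counter

def bws_scores(annotations):
    """annotations: list of (tuple_of_texts, best_text, worst_text)."""
    best, worst, shown = Counter(), Counter(), Counter()
    for texts, b, w in annotations:
        shown.update(texts)   # every text in the tuple was shown once
        best[b] += 1
        worst[w] += 1
    # Scores fall into [-1, 1]; higher means more intense.
    return {t: (best[t] - worst[t]) / shown[t] for t in shown}

# Example with two annotated 4-tuples (hypothetical texts and judgments).
annotations = [
    (("She's quite happy.", "He is extremely delighted.",
      "It was okay.", "I am mildly pleased."),
     "He is extremely delighted.", "It was okay."),
    (("She's quite happy.", "It was okay.",
      "I am mildly pleased.", "He is extremely delighted."),
     "He is extremely delighted.", "It was okay."),
]
print(bws_scores(annotations))
```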

In the paper that we will present at NAACL, Christopher Bagdon studied whether this effect also holds when a large language model such as GPT-3 is tasked with the annotation. We find that this is indeed the case. The take-home message therefore is: if you need continuous annotations for text instances, it is better to ask ChatGPT (or a similar model) for comparisons/BWS judgments than for ratings on a scale directly.
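As a rough illustration of what such an automatic BWS query could look like, the sketch below formats a 4-tuple of texts as a prompt for an LLM. Apart from the opening phrase, which echoes the paper's title, the wording, the answer format, and the texts are assumptions for illustration and not the prompts used in the paper.

```python
# Hypothetical sketch of a best-worst-scaling prompt for an LLM annotator.
# The actual prompts, model, and answer parsing in the paper may differ.
def build_bws_prompt(texts):
    """Format a 4-tuple of texts as a best-worst-scaling query."""
    numbered = "\n".join(f"{i + 1}. {t}" for i, t in enumerate(texts))
    return (
        "You are an expert annotator.\n"
        "Given the following texts, decide which one expresses the MOST "
        "intense emotion and which one expresses the LEAST intense emotion. "
        "Answer with two numbers, e.g. 'Best: 2, Worst: 3'.\n\n"
        + numbered
    )

print(build_bws_prompt((
    "She's quite happy.",
    "He is extremely delighted.",
    "It was okay.",
    "I am mildly pleased.",
)))
```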

A preprint of the paper is available at https://www.romanklinger.de/publications/BagdonNAACL2024.pdf.