# Introduction

This zip file contains supplementary material for the paper: 

[Are Humans as Brittle as Large Language Models?](https://arxiv.org/abs/2509.07869)

by Jiahui Li, Sean Papay, Roman Klinger

This zip file contains the survey materials from the human study in the `surveys` directory, the results of the human study in the `human-data` directory, and model predictions for the three datasets listed below in the `llms-predictions` directory.

# Original source
The paper uses source texts from the following datasets:

- [crowd-enVent](https://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/emotionappraisal/): we use the corpus/crowd-enVent_validation.tsv file for the emotion classification task.
- [POPQUORN](https://github.com/Jiaxin-Pei/Potato-Prolific-Dataset): we use the Potato-Prolific-Dataset/dataset/offensiveness/raw_data.csv and Potato-Prolific-Dataset/dataset/politeness_rating/raw_data.csv files for the offensiveness and politeness rating tasks.
- [EPIC](https://huggingface.co/datasets/Multilingual-Perspectivist-NLU/EPIC): we use the EPICorpus for the irony detection task.

# Content

- surveys/offensiveness-rating/: This folder contains the survey outline of our human study for the offensiveness rating task. We provide screenshots of the instruction page, the demographic information collection page, and examples of all evaluated prompt variants shown to the annotators.
- surveys/emotion-classification/: This folder contains the survey outline of our human study for the emotion classification task. We provide screenshots of the instruction page, the demographic information collection page, examples of all evaluated prompt variants shown to the annotators, and the explanation page used when the survey includes typos.
- llms-predictions/: This folder contains the LLM predictions for the evaluated tasks: "{Model Name}_crowd-enVent.csv" for the emotion classification task, "{Model Name}_EPICorpus.csv" for the irony detection task, and "{Model Name}_offensiveness.csv" and "{Model Name}_politeness.csv" for the offensiveness and politeness rating tasks.
- human-data/offensiveness-human.csv: This file contains the human-annotated data for the offensiveness rating task.
- human-data/emotion-human.csv: This file contains the human-annotated data for the emotion classification task.
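
The naming scheme of the prediction files can be used to recover the model name and task from a file name. The following is a minimal sketch, not part of the released material: the helper `parse_prediction_filename` is hypothetical, and it assumes that the dataset suffix is everything after the last underscore in the file name.

```python
from pathlib import Path

# Dataset suffixes as listed in this README, mapped to their tasks.
TASK_BY_SUFFIX = {
    "crowd-enVent": "emotion classification",
    "EPICorpus": "irony detection",
    "offensiveness": "offensiveness rating",
    "politeness": "politeness rating",
}


def parse_prediction_filename(filename):
    """Split '{Model Name}_{suffix}.csv' into (model name, task).

    Assumption: the suffix follows the last underscore, so model names
    may themselves contain underscores.
    """
    stem = Path(filename).stem  # drops the trailing '.csv'
    model_name, _, suffix = stem.rpartition("_")
    if not model_name or suffix not in TASK_BY_SUFFIX:
        raise ValueError(f"Unexpected prediction file name: {filename}")
    return model_name, TASK_BY_SUFFIX[suffix]
```

Calling the helper on every `*.csv` file in `llms-predictions/` groups the files by model or by task without hard-coding model names.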

# Variables

- instance_id: ID of the text instance.
- prolific_id: anonymized Prolific ID of the human annotator.
- label: human-annotated label.
- original_label: the original, unmapped label given by the human annotator under the respective prompt.
- text: text for the task.
- prompt: prompt used in the human study for each instance.
- prompt_type: prompt type, as proposed in the paper, of each prompt.
- predict_direct_{index}_map: the mapped label predicted by the LLM. {index} refers to the index of the prompt used to invoke the LLM. The index follows the order of the prompts shown in Appendix A.3 of the paper.
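
The variables above can be read with Python's standard `csv` module. The sketch below is illustrative only: the helper names are hypothetical, and it assumes that the files are comma-separated with a header row, that the human data files contain the `prompt_type` and `label` columns, and that the prediction files contain the `predict_direct_{index}_map` columns.

```python
import csv
import re
from collections import Counter, defaultdict

# Matches column names of the form predict_direct_{index}_map.
PREDICTION_COLUMN = re.compile(r"^predict_direct_(\d+)_map$")


def read_rows(path):
    """Read a CSV file with a header row into a list of dicts."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def label_counts_by_prompt_type(rows):
    """Count how often each label occurs under each prompt type."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row["prompt_type"]][row["label"]] += 1
    return {prompt_type: dict(c) for prompt_type, c in counts.items()}


def prediction_columns(fieldnames):
    """Return {prompt index: column name}, sorted by prompt index."""
    found = {}
    for name in fieldnames:
        match = PREDICTION_COLUMN.match(name)
        if match:
            found[int(match.group(1))] = name
    return dict(sorted(found.items()))
```

For example, `label_counts_by_prompt_type(read_rows("human-data/emotion-human.csv"))` gives the label distribution per prompt type, and `prediction_columns(rows[0].keys())` lists the per-prompt prediction columns of a prediction file.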

# Citation
```
@inproceedings{li2025humansbrittlelargelanguage,
  title = {Are Humans as Brittle as Large Language Models?},
  author = {Jiahui Li and Sean Papay and Roman Klinger},
  year = {2025},
  eprint = {2509.07869},
  booktitle = {Proceedings of the 14th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)},
  address = {Mumbai, India},
  archiveprefix = {arXiv},
  primaryclass = {cs.CL},
  url = {https://arxiv.org/abs/2509.07869},
}
```