The Integral LET'S GO! Dataset

The integral Let's Go! dataset was originally provided by the Language Technologies Institute of Carnegie Mellon University in Pittsburgh and has now moved to the Natural Language Generation and Dialogue Systems group at the University of Bamberg, Germany.

The dataset was obtained from use of the Let’s Go dialog system and its derivatives. The Let’s Go data and system have been used in over 22 theses and over 250 non-CMU publications. Let’s Go was funded by the National Science Foundation. Arguably the largest publicly available real user dataset at the time of its release, Let’s Go went live to real users on March 5, 2005. The Let’s Go system was connected to the public information phone number for the Port Authority of Allegheny County. During the daytime, human operators manned this number, but from 7pm until 6am the next day, and for longer periods on the weekends, all calls were routed to Let’s Go. The route schedule changed three to four times a year, with some routes being eliminated over time. Let’s Go began with coverage of only the East End neighborhoods of Pittsburgh, but in later years it covered all of the Port Authority’s routes.

There are a total of 171,128 dialogs in this Let’s Go dataset. A total of 104,663 of these are at least three turns long. Three turns is the minimum length needed for the system to get enough information for a backend lookup (and thus possibly have a successful dialog), although a user who repeated information or changed a request might not have filled all of the slots in three turns. A total of 93,690 dialogs in this dataset had a backend lookup. This measure (at least three turns and a backend lookup) is what the Let’s Go team used for the estimated success rate. Although it means that the system found information and gave it to the user, it is only an estimate of success, since the system could have looked up and given the wrong information (due to ASR errors, for example). At the time, however, it was one indicator that allowed the team to compare different versions of the system. For example, during the switch from the system-directed “Where are you leaving from?” to the more general “How can I help you?”, the estimated success rate was used at first to determine whether the system could deal with the general question.
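The estimated success criterion described above (at least three turns and a backend lookup) can be sketched in Python as follows. This is a minimal illustration, not the Let's Go team's actual code: the `Dialog` record and its fields `num_turns` and `had_backend_lookup` are hypothetical names, and the real log files would first need to be parsed into some such summary.

```python
from dataclasses import dataclass
from typing import Iterable

# Minimum dialog length needed for a backend lookup, per the description above.
MIN_TURNS = 3


@dataclass
class Dialog:
    """Hypothetical per-dialog summary; not the real Let's Go log format."""
    num_turns: int
    had_backend_lookup: bool


def is_estimated_success(dialog: Dialog) -> bool:
    """A dialog counts as an estimated success if it has at least
    MIN_TURNS turns and the system performed a backend lookup."""
    return dialog.num_turns >= MIN_TURNS and dialog.had_backend_lookup


def estimated_success_rate(dialogs: Iterable[Dialog]) -> float:
    """Fraction of dialogs that meet the estimated success criterion.
    Returns 0.0 for an empty collection."""
    dialogs = list(dialogs)
    if not dialogs:
        return 0.0
    successes = sum(is_estimated_success(d) for d in dialogs)
    return successes / len(dialogs)


if __name__ == "__main__":
    # Toy data made up for illustration only.
    toy = [Dialog(5, True), Dialog(2, True), Dialog(4, False), Dialog(3, True)]
    print(f"Estimated success rate: {estimated_success_rate(toy):.1%}")
```

For orientation, the counts quoted above give 104,663 / 171,128 (about 61.2%) of dialogs with at least three turns and 93,690 / 171,128 (about 54.7%) with a backend lookup. Whether the team computed its rate over all dialogs or over some other denominator is not stated here, so these fractions are only a reading of the published totals.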

Please note the license for use of this data. Please agree to this license before downloading the data.

Introduction

Let’s Go! is a spoken dialog system that was used by the general public. It gave bus schedule and route information for the Allegheny County Port Authority Transit bus system via a telephone-based interface.

Let’s Go has been integrated with the DialPort project.

Description

The following components of this dataset are available for download:

  • The integral Let's Go dataset: The integral Let's Go dataset has 171,128 dialogs from 08/01/2005 to 03/15/2016. This includes the WAV file, the log file, and labels automatically generated by the ASR (Sphinx, PocketSphinx).
  • Subset - The Spoken Dialog Challenge: The Spoken Dialog Challenge took place in 2010. It compared how different spoken dialog systems perform on the same task. Bus Information was the task. Four teams provided systems that were first tested in controlled conditions with speech researchers as users. The three most stable systems were then deployed to the Let’s Go real callers.
  • Subset - Dialog State Tracking Challenge (DSTC1): The Dialog State Tracking Challenge (DSTC) is an ongoing series of research community challenge tasks. Each DSTC released dialog data labeled with dialog state information, such as the user’s restaurant search query given all of the dialog history up to the current turn. The challenge is to create a “tracker” that can predict the next dialog state. In each challenge, trackers have been evaluated using held-out dialog data. (Williams et al. 2013)
  • Log of Events and System Changes: An Excel logfile describes all significant changes to the system. These include changes to the system architecture; bus schedule changes; changes in the reporting mechanism; and events, such as Challenges.
  • Crowdsourced Annotations from one year of Let's Go data: These annotations are word transcriptions of each dialog from October 2008 to September 2009 (200810 to 200909). They were made by Amazon Mechanical Turk workers. This dataset includes the WAV file id, ASR output with confidence, and crowdsourced transcriptions with confidence. (Parent and Eskenazi 2010)
  • Let's Go Daily Reports 2006 - 2016: The Let's Go Daily Report was emailed to members of the Let’s Go team daily (there is also a weekly summary). The original idea was that the Daily Report would make the team aware of any system malfunctions during the previous evening that the system was not able to directly warn everyone about by email. It covers 2006 – 2016 and gives statistics on the number of dialogs that day, average number of turns per dialog, estimated success rate (via backend lookup), etc. (the links in each file).

Download

You can obtain the integral Let's Go dataset through a separate GitHub repository. It contains shell scripts and instructions to initiate the download.

Notes:

  1. The integral Let's Go dataset is very large (715GB in total). Please make sure you have enough disk space before downloading.
  2. macOS users need to install GNU `date` (with the command `brew install coreutils`) for the script to work properly. After GNU coreutils is installed, change `date` in the script to `gdate`.
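Both notes can be checked before starting the download. The following Python sketch is a hypothetical pre-flight helper and is not part of the download repository: the function names, the target directory, and the reading of 715GB as decimal gigabytes are all assumptions made for illustration.

```python
import platform
import shutil
from typing import Callable, Optional

# 715 GB as stated in the notes; treating it as decimal gigabytes is an assumption.
REQUIRED_BYTES = 715 * 10**9


def has_enough_space(path: str, required_bytes: int = REQUIRED_BYTES) -> bool:
    """Return True if the filesystem holding `path` has enough free space."""
    return shutil.disk_usage(path).free >= required_bytes


def gnu_date_command(
    system: Optional[str] = None,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the command name to use for GNU `date`.

    On macOS ("Darwin") this is `gdate` if it is installed, else None.
    On other systems it returns `date`, assuming that the platform's
    `date` is the GNU version (true on most Linux distributions).
    """
    system = system or platform.system()
    if system != "Darwin":
        return "date"
    return "gdate" if which("gdate") else None


if __name__ == "__main__":
    target = "."  # hypothetical download directory
    if not has_enough_space(target):
        print("Not enough free disk space for the full dataset.")
    command = gnu_date_command()
    if command is None:
        print("GNU date not found; install it with `brew install coreutils`.")
    else:
        print(f"Use `{command}` in the download script.")
```

The `system` and `which` parameters exist only so that the platform check can be exercised without a Mac; in normal use both can be left at their defaults.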

Contact

If you have more questions about the Let's Go dataset, please contact:

Stefan Ultes (University of Bamberg)

References

Antoine Raux, Dan Bohus, Brian Langner, Alan W Black, and Maxine Eskenazi. Doing research on a deployed spoken dialogue system: One year of Let’s Go! experience. In Proc. of Interspeech, 2006.

Antoine Raux, Brian Langner, Alan W Black, and Maxine Eskenazi. LET'S GO: Improving Spoken Dialog Systems for the Elderly and Non-Natives. In Proc. of Eurospeech, 2003.

Antoine Raux, Brian Langner, Dan Bohus, Alan W Black, and Maxine Eskenazi. Let’s Go Public! Taking a Spoken Dialog System to the Real World. In Proc. of Interspeech, 2005.

Alan W Black, Susanne Burger, Brian Langner, Gabriel Parent, and Maxine Eskenazi. Spoken Dialog Challenge 2010. In Proc. of SLT, 2010.

Alan W Black, Susanne Burger, Alistair Conkie, Helen Hastie, Simon Keizer, Oliver Lemon, Nicolas Merigaud, Gabriel Parent, Gabriel Schubiner, Blaise Thomson, Jason D Williams, Kai Yu, Steve Young, and Maxine Eskenazi. Spoken Dialog Challenge 2010: Comparison of live and control test results. In Proc. of SIGDIAL, 2011.

Jason Williams, Antoine Raux, Deepak Ramachandran, and Alan Black. The Dialog State Tracking Challenge. In Proc. of SIGDIAL, 2013.

Gabriel Parent and Maxine Eskenazi. Toward better crowdsourced transcription: Transcription of a year of the Let's Go bus information system data. In Proc. of SLT, 2010.

License

Please download and agree to the License (1.4 KB).

If you download and use the Let's Go data, you agree that you will cite it in all publications resulting from its use.

Acknowledgement

This work was supported by the US National Science Foundation under grant numbers 0208835 and 0855058, "LET'S GO: Improved Speech Interfaces For The General Public" and "CI-ADDO-NEW: Dialog Research Center (DialRC)". Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

We would like to thank the following researchers for their contributions to the Let's Go system and dataset:

Antoine Raux, Brian Langner, Dan Bohus, Gabriel Parent, Jim Valenti, Gabriel Schubiner, Sungjin Lee, Yulun Du, Alan Black, Maxine Eskenazi