Dataset: miam
Multilinguality: multilingual
Size: 10K<n<100K
Language creators: expert-generated
Annotation creators: expert-generated
Source datasets: original
The Multilingual dIalogAct benchMark (MIAM) is a collection of resources for training, evaluating, and analyzing natural language understanding systems designed specifically for spoken language. The datasets are in English, French, German, Italian and Spanish. They cover a variety of domains, including spontaneous speech, scripted scenarios, and joint task completion. All datasets contain dialogue act labels.
[More Information Needed]
English, French, German, Italian, Spanish.
Dihana Corpus
For the dihana configuration, one example from the dataset is:
{
'Speaker': 'U',
'Utterance': 'Hola , quería obtener el horario para ir a Valencia',
'Dialogue_Act': 9, # 'Pregunta' ('Request')
'Dialogue_ID': '0',
'File_ID': 'B209_BA5c3',
}
iLISTEN Corpus
For the ilisten configuration, one example from the dataset is:
{
'Speaker': 'T_11_U11',
'Utterance': 'ok, grazie per le informazioni',
'Dialogue_Act': 6, # 'KIND-ATTITUDE_SMALL-TALK'
'Dialogue_ID': '0',
}
LORIA Corpus
For the loria configuration, one example from the dataset is:
{
'Speaker': 'Samir',
'Utterance': 'Merci de votre visite, bonne chance, et à la prochaine !',
'Dialogue_Act': 21, # 'quit'
'Dialogue_ID': '5',
'File_ID': 'Dial_20111128_113927',
}
HCRC MapTask Corpus
For the maptask configuration, one example from the dataset is:
{
'Speaker': 'f',
'Utterance': 'is it underneath the rope bridge or to the left',
'Dialogue_Act': 6, # 'query_w'
'Dialogue_ID': '0',
'File_ID': 'q4ec1',
}
VERBMOBIL
For the vm2 configuration, one example from the dataset is:
{
'Utterance': 'ja was sind viereinhalb Stunden Bahngerüttel gegen siebzig Minuten Turbulenzen im Flugzeug',
'Dialogue_Act': 'INFORM', # label name; the integer class ID is not given here
'Speaker': 'A',
'Dialogue_ID': '66',
}
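The example records above each hold a single utterance, so rebuilding conversations means grouping records by `Dialogue_ID`. The following is a minimal sketch, assuming `Dialogue_ID` alone identifies a conversation within a split; if IDs restart per source file, key on the pair `(File_ID, Dialogue_ID)` instead. The first record is the dihana example; the other two, including the speaker `'M'`, the label `0`, and the file name, are invented for illustration.

```python
from collections import defaultdict

# Flat records, one utterance per row, in the shape of the dihana example.
# Only the first row comes from the dataset card; the rest are hypothetical.
records = [
    {"Speaker": "U",
     "Utterance": "Hola , quería obtener el horario para ir a Valencia",
     "Dialogue_Act": 9, "Dialogue_ID": "0", "File_ID": "B209_BA5c3"},
    {"Speaker": "M", "Utterance": "(hypothetical system reply)",
     "Dialogue_Act": 0, "Dialogue_ID": "0", "File_ID": "B209_BA5c3"},
    {"Speaker": "U", "Utterance": "(hypothetical utterance, other dialogue)",
     "Dialogue_Act": 9, "Dialogue_ID": "1", "File_ID": "hypothetical_file"},
]


def group_by_dialogue(rows):
    """Collect utterances per Dialogue_ID, preserving the original row order."""
    dialogues = defaultdict(list)
    for row in rows:
        dialogues[row["Dialogue_ID"]].append(row)
    return dict(dialogues)


dialogues = group_by_dialogue(records)
for dialogue_id, turns in dialogues.items():
    # Print the speaker and label sequence of each reconstructed dialogue.
    print(dialogue_id, [(t["Speaker"], t["Dialogue_Act"]) for t in turns])
```

Keeping the rows in their original order matters here, because dialogue act models usually rely on the preceding turns as context.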
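For the dihana configuration, the different fields are: Speaker, Utterance, Dialogue_Act, Dialogue_ID, File_ID.
For the ilisten configuration, the different fields are: Speaker, Utterance, Dialogue_Act, Dialogue_ID.
For the loria configuration, the different fields are: Speaker, Utterance, Dialogue_Act, Dialogue_ID, File_ID.
For the maptask configuration, the different fields are: Speaker, Utterance, Dialogue_Act, Dialogue_ID, File_ID.
For the vm2 configuration, the different fields are: Utterance, Dialogue_Act, Speaker, Dialogue_ID.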
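Because the configurations do not all share the same fields, a small check can catch records that were paired with the wrong configuration. This is a sketch only: the field sets are read off the example records in this card rather than taken from an official schema, and `EXPECTED_FIELDS` and `check_record` are hypothetical names.

```python
# Field names per configuration, as they appear in the example records.
# Assumption: every record in a configuration carries exactly these keys.
EXPECTED_FIELDS = {
    "dihana": {"Speaker", "Utterance", "Dialogue_Act", "Dialogue_ID", "File_ID"},
    "ilisten": {"Speaker", "Utterance", "Dialogue_Act", "Dialogue_ID"},
    "loria": {"Speaker", "Utterance", "Dialogue_Act", "Dialogue_ID", "File_ID"},
    "maptask": {"Speaker", "Utterance", "Dialogue_Act", "Dialogue_ID", "File_ID"},
    "vm2": {"Speaker", "Utterance", "Dialogue_Act", "Dialogue_ID"},
}


def check_record(config, record):
    """Return (missing, unexpected) field names for a record of `config`."""
    expected = EXPECTED_FIELDS[config]
    present = set(record)
    return sorted(expected - present), sorted(present - expected)


# The ilisten example from this card.
ilisten_example = {
    "Speaker": "T_11_U11",
    "Utterance": "ok, grazie per le informazioni",
    "Dialogue_Act": 6,
    "Dialogue_ID": "0",
}

print(check_record("ilisten", ilisten_example))
print(check_record("dihana", ilisten_example))
```

The second call reports `File_ID` as missing, since the ilisten example has no such field.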
| Dataset name | Train | Valid | Test |
|---|---|---|---|
| dihana | 19063 | 2123 | 2361 |
| ilisten | 1986 | 230 | 971 |
| loria | 8465 | 942 | 1047 |
| maptask | 25382 | 5221 | 5335 |
| vm2 | 25060 | 2860 | 2855 |
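The split sizes in the table can be turned into totals and proportions with a few lines of arithmetic. The sketch below copies the counts from the table; `SPLITS` and `split_shares` are hypothetical names introduced for illustration.

```python
# (train, valid, test) utterance counts, copied from the table above.
SPLITS = {
    "dihana": (19063, 2123, 2361),
    "ilisten": (1986, 230, 971),
    "loria": (8465, 942, 1047),
    "maptask": (25382, 5221, 5335),
    "vm2": (25060, 2860, 2855),
}


def split_shares(train, valid, test):
    """Return the total count and each split's share, rounded to 3 decimals."""
    total = train + valid + test
    shares = tuple(round(n / total, 3) for n in (train, valid, test))
    return total, shares


for name, sizes in SPLITS.items():
    total, shares = split_shares(*sizes)
    print(f"{name:8s} total={total:6d} train/valid/test={shares}")
```

By these counts, ilisten holds out roughly 30% of its utterances for testing, while the other four configurations hold out between about 9% and 15%.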
[More Information Needed]
[More Information Needed]
Who are the source language producers?
[More Information Needed]
[More Information Needed]
Who are the annotators?
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
Anonymous.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 Unported License.
@inproceedings{colombo-etal-2021-code,
title = "Code-switched inspired losses for spoken dialog representations",
author = "Colombo, Pierre and
Chapuis, Emile and
Labeau, Matthieu and
Clavel, Chlo{\'e}",
booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2021",
address = "Online and Punta Cana, Dominican Republic",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.emnlp-main.656",
doi = "10.18653/v1/2021.emnlp-main.656",
pages = "8320--8337",
abstract = "Spoken dialogue systems need to be able to handle both multiple languages and multilinguality inside a conversation (\textit{e.g} in case of code-switching). In this work, we introduce new pretraining losses tailored to learn generic multilingual spoken dialogue representations. The goal of these losses is to expose the model to code-switched language. In order to scale up training, we automatically build a pretraining corpus composed of multilingual conversations in five different languages (French, Italian, English, German and Spanish) from OpenSubtitles, a huge multilingual corpus composed of 24.3G tokens. We test the generic representations on MIAM, a new benchmark composed of five dialogue act corpora on the same aforementioned languages as well as on two novel multilingual tasks (\textit{i.e} multilingual mask utterance retrieval and multilingual inconsistency identification). Our experiments show that our new losses achieve a better performance in both monolingual and multilingual settings.",
}
Thanks to @eusip and @PierreColombo for adding this dataset.