数据集:

miam

任务:

文本生成

填充掩码

文本分类

子任务:

dialogue-modeling language-modeling masked-language-modeling

语言:

计算机处理:

multilingual

大小:

10K<n<100K

语言创建人:

expert-generated

批注创建人:

expert-generated

源数据集:

original

其他:

dialogue-act-classification

许可:

cc-by-sa-4.0

数据集介绍文件清单

中文

Dataset Card for MIAM

Dataset Summary

Multilingual dIalogAct benchMark is a collection of resources for training, evaluating, and analyzing natural language understanding systems specifically designed for spoken language. Datasets are in English, French, German, Italian and Spanish. They cover a variety of domains including spontaneous speech, scripted scenarios, and joint task completion. All datasets contain dialogue act labels.

Supported Tasks and Leaderboards

[More Information Needed]

Languages

English, French, German, Italian, Spanish.

Dataset Structure

Data Instances

Dihana Corpus

For the dihana configuration one example from the dataset is:

{
  'Speaker': 'U',
  'Utterance': 'Hola , quería obtener el horario para ir a Valencia',
  'Dialogue_Act': 9,  # 'Pregunta' ('Request')
  'Dialogue_ID': '0',
  'File_ID': 'B209_BA5c3',
}

iLISTEN Corpus

For the ilisten configuration one example from the dataset is:

{
  'Speaker': 'T_11_U11',
  'Utterance': 'ok, grazie per le informazioni',
  'Dialogue_Act': 6,  # 'KIND-ATTITUDE_SMALL-TALK'
  'Dialogue_ID': '0',
}

LORIA Corpus

For the loria configuration one example from the dataset is:

{
  'Speaker': 'Samir',
  'Utterance': 'Merci de votre visite, bonne chance, et à la prochaine !',
  'Dialogue_Act': 21,  # 'quit'
  'Dialogue_ID': '5',
  'File_ID': 'Dial_20111128_113927',
}

HCRC MapTask Corpus

For the maptask configuration one example from the dataset is:

{
  'Speaker': 'f',
  'Utterance': 'is it underneath the rope bridge or to the left',
  'Dialogue_Act': 6,  # 'query_w'
  'Dialogue_ID': '0',
  'File_ID': 'q4ec1',
}

VERBMOBIL

For the vm2 configuration one example from the dataset is:

{
  'Utterance': 'ja was sind viereinhalb Stunden Bahngerüttel gegen siebzig Minuten Turbulenzen im Flugzeug',
  'Utterance': 'Utterance',
  'Dialogue_Act': 'Dialogue_Act',  # 'INFORM'
  'Speaker': 'A',
  'Dialogue_ID': '66',
}

Data Fields

For the dihana configuration, the different fields are:

Speaker : identifier of the speaker as a string.
Utterance : Utterance as a string.
Dialogue_Act : Dialog act label of the utterance. It can be one of 'Afirmacion' (0) [Feedback_positive], 'Apertura' (1) [Opening], 'Cierre' (2) [Closing], 'Confirmacion' (3) [Acknowledge], 'Espera' (4) [Hold], 'Indefinida' (5) [Undefined], 'Negacion' (6) [Feedback_negative], 'No_entendido' (7) [Request_clarify], 'Nueva_consulta' (8) [New_request], 'Pregunta' (9) [Request] or 'Respuesta' (10) [Reply].
Dialogue_ID : identifier of the dialogue as a string.
File_ID : identifier of the source file as a string.

For the ilisten configuration, the different fields are:

Speaker : identifier of the speaker as a string.
Utterance : Utterance as a string.
Dialogue_Act : Dialog act label of the utterance. It can be one of 'AGREE' (0), 'ANSWER' (1), 'CLOSING' (2), 'ENCOURAGE-SORRY' (3), 'GENERIC-ANSWER' (4), 'INFO-REQUEST' (5), 'KIND-ATTITUDE_SMALL-TALK' (6), 'OFFER-GIVE-INFO' (7), 'OPENING' (8), 'PERSUASION-SUGGEST' (9), 'QUESTION' (10), 'REJECT' (11), 'SOLICITATION-REQ_CLARIFICATION' (12), 'STATEMENT' (13) or 'TALK-ABOUT-SELF' (14).
Dialogue_ID : identifier of the dialogue as a string.

For the loria configuration, the different fields are:

Speaker : identifier of the speaker as a string.
Utterance : Utterance as a string.
Dialogue_Act : Dialog act label of the utterance. It can be one of 'ack' (0), 'ask' (1), 'find_mold' (2), 'find_plans' (3), 'first_step' (4), 'greet' (5), 'help' (6), 'inform' (7), 'inform_engine' (8), 'inform_job' (9), 'inform_material_space' (10), 'informer_conditioner' (11), 'informer_decoration' (12), 'informer_elcomps' (13), 'informer_end_manufacturing' (14), 'kindAtt' (15), 'manufacturing_reqs' (16), 'next_step' (17), 'no' (18), 'other' (19), 'quality_control' (20), 'quit' (21), 'reqRep' (22), 'security_policies' (23), 'staff_enterprise' (24), 'staff_job' (25), 'studies_enterprise' (26), 'studies_job' (27), 'todo_failure' (28), 'todo_irreparable' (29), 'yes' (30)
Dialogue_ID : identifier of the dialogue as a string.
File_ID : identifier of the source file as a string.

For the maptask configuration, the different fields are:

Speaker : identifier of the speaker as a string.
Utterance : Utterance as a string.
Dialogue_Act : Dialog act label of the utterance. It can be one of 'acknowledge' (0), 'align' (1), 'check' (2), 'clarify' (3), 'explain' (4), 'instruct' (5), 'query_w' (6), 'query_yn' (7), 'ready' (8), 'reply_n' (9), 'reply_w' (10) or 'reply_y' (11).
Dialogue_ID : identifier of the dialogue as a string.
File_ID : identifier of the source file as a string.

For the vm2 configuration, the different fields are:

Utterance : Utterance as a string.
Dialogue_Act : Dialogue act label of the utterance. It can be one of 'ACCEPT' (0), 'BACKCHANNEL' (1), 'BYE' (2), 'CLARIFY' (3), 'CLOSE' (4), 'COMMIT' (5), 'CONFIRM' (6), 'DEFER' (7), 'DELIBERATE' (8), 'DEVIATE_SCENARIO' (9), 'EXCLUDE' (10), 'EXPLAINED_REJECT' (11), 'FEEDBACK' (12), 'FEEDBACK_NEGATIVE' (13), 'FEEDBACK_POSITIVE' (14), 'GIVE_REASON' (15), 'GREET' (16), 'INFORM' (17), 'INIT' (18), 'INTRODUCE' (19), 'NOT_CLASSIFIABLE' (20), 'OFFER' (21), 'POLITENESS_FORMULA' (22), 'REJECT' (23), 'REQUEST' (24), 'REQUEST_CLARIFY' (25), 'REQUEST_COMMENT' (26), 'REQUEST_COMMIT' (27), 'REQUEST_SUGGEST' (28), 'SUGGEST' (29), 'THANK' (30).
Speaker : Speaker as a string.
Dialogue_ID : identifier of the dialogue as a string.

Data Splits

Dataset name	Train	Valid	Test
dihana	19063	2123	2361
ilisten	1986	230	971
loria	8465	942	1047
maptask	25382	5221	5335
vm2	25060	2860	2855

Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

Anonymous.

Licensing Information

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 Unported License .

Citation Information

@inproceedings{colombo-etal-2021-code,
    title = "Code-switched inspired losses for spoken dialog representations",
    author = "Colombo, Pierre  and
      Chapuis, Emile  and
      Labeau, Matthieu  and
      Clavel, Chlo{\'e}",
    booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2021",
    address = "Online and Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.emnlp-main.656",
    doi = "10.18653/v1/2021.emnlp-main.656",
    pages = "8320--8337",
    abstract = "Spoken dialogue systems need to be able to handle both multiple languages and multilinguality inside a conversation (\textit{e.g} in case of code-switching). In this work, we introduce new pretraining losses tailored to learn generic multilingual spoken dialogue representations. The goal of these losses is to expose the model to code-switched language. In order to scale up training, we automatically build a pretraining corpus composed of multilingual conversations in five different languages (French, Italian, English, German and Spanish) from OpenSubtitles, a huge multilingual corpus composed of 24.3G tokens. We test the generic representations on MIAM, a new benchmark composed of five dialogue act corpora on the same aforementioned languages as well as on two novel multilingual tasks (\textit{i.e} multilingual mask utterance retrieval and multilingual inconsistency identification). Our experiments show that our new losses achieve a better performance in both monolingual and multilingual settings.",
}

Contributions

Thanks to @eusip and @PierreColombo for adding this dataset.

作者:

佚名

数据集大小:

54.15 KB