Dataset: silicone
Language:
Multilinguality: monolingual
Language creators: expert-generated
Annotation creators: expert-generated
Source datasets: original
ArXiv: arxiv:2009.11152
License:
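SILICONE (Sequence labellIng evaLuatIon benChmark fOr spoken laNguagE) is a collection of resources for training, evaluating, and analyzing natural language understanding systems designed specifically for spoken language. All datasets are in English and cover a variety of domains, including daily life, scripted scenarios, joint task completion, phone conversations, and television dialogue. Some datasets additionally include emotion and/or sentiment labels.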
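Each corpus in the benchmark is exposed as its own configuration. The following is a minimal loading sketch, assuming the Hugging Face `datasets` library is installed and that the benchmark is reachable under the hub id `silicone`. Both the id and its availability in a given library version are assumptions, and `load_silicone` is a hypothetical helper, not part of any library.

```python
# Configuration names as they appear in this card.
SILICONE_CONFIGS = (
    "dyda_da", "dyda_e", "iemocap", "maptask", "meld_e",
    "meld_s", "mrda", "oasis", "sem", "swda",
)


def load_silicone(config: str, repo_id: str = "silicone"):
    """Load one SILICONE configuration (hypothetical helper).

    `repo_id` is an assumption; adjust it if the hub id differs.
    """
    if config not in SILICONE_CONFIGS:
        raise ValueError(f"unknown configuration: {config!r}")
    # Imported lazily so that validating a name does not require the dependency.
    from datasets import load_dataset
    return load_dataset(repo_id, config)
```

Validating the configuration name first gives a clear error before any download is attempted; a call such as `load_silicone("dyda_da")` needs network access.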
[More Information Needed]
English.
DailyDialog Act Corpus (Dialogue Act)
An example from the dyda_da configuration looks as follows:
{
'Utterance': "the taxi drivers are on strike again .",
'Dialogue_Act': 2, # "inform"
'Dialogue_ID': "2"
}
DailyDialog Act Corpus (Emotion)
An example from the dyda_e configuration looks as follows:
{
'Utterance': "'oh , breaktime flies .'",
'Emotion': 5, # "sadness"
'Dialogue_ID': "997"
}
Interactive Emotional Dyadic Motion Capture (IEMOCAP) database
An example from the iemocap configuration looks as follows:
{
'Dialogue_ID': "Ses04F_script03_2",
'Utterance_ID': "Ses04F_script03_2_F025",
'Utterance': "You're quite insufferable. I expect it's because you're drunk.",
'Emotion': 0, # "ang"
}
HCRC MapTask Corpus
An example from the maptask configuration looks as follows:
{
'Speaker': "f",
'Utterance': "i think that would bring me over the crevasse",
'Dialogue_Act': 4, # "explain"
}
Multimodal EmotionLines Dataset (Emotion)
An example from the meld_e configuration looks as follows:
{
'Utterance': "'Push 'em out , push 'em out , harder , harder .'",
'Speaker': "Joey",
'Emotion': 3, # "joy"
'Dialogue_ID': "1",
'Utterance_ID': "2"
}
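Because each row is a single utterance, sequence labelling models usually need the rows regrouped into dialogues. The sketch below shows one way to do that with the `Dialogue_ID` and `Utterance_ID` fields from the example above. It assumes `Utterance_ID` is always a numeric string, as in that example; `group_by_dialogue` is a hypothetical helper, and the toy rows use placeholder values rather than real corpus data.

```python
from collections import defaultdict


def group_by_dialogue(rows):
    """Group utterance-level rows into dialogues ordered by Utterance_ID."""
    dialogues = defaultdict(list)
    for row in rows:
        dialogues[row["Dialogue_ID"]].append(row)
    for utterances in dialogues.values():
        # IDs are stored as strings, so compare them numerically.
        utterances.sort(key=lambda r: int(r["Utterance_ID"]))
    return dict(dialogues)


# Toy rows: only the field names follow the card; the values are placeholders.
rows = [
    {"Dialogue_ID": "1", "Utterance_ID": "2", "Utterance": "<utt c>", "Emotion": 3},
    {"Dialogue_ID": "1", "Utterance_ID": "0", "Utterance": "<utt a>", "Emotion": 4},
    {"Dialogue_ID": "2", "Utterance_ID": "0", "Utterance": "<utt d>", "Emotion": 0},
    {"Dialogue_ID": "1", "Utterance_ID": "1", "Utterance": "<utt b>", "Emotion": 4},
]

dialogues = group_by_dialogue(rows)
# One label sequence per dialogue, in utterance order.
label_sequences = {d: [r["Emotion"] for r in utts] for d, utts in dialogues.items()}
```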
Multimodal EmotionLines Dataset (Sentiment)
An example from the meld_s configuration looks as follows:
{
'Utterance': "'Okay , y'know what ? There is no more left , left !'",
'Speaker': "Rachel",
'Sentiment': 0, # "negative"
'Dialogue_ID': "2",
'Utterance_ID': "4"
}
ICSI MRDA Corpus
An example from the mrda configuration looks as follows:
{
'Utterance_ID': "Bed006-c2_0073656_0076706",
'Dialogue_Act': 0, # "s"
'Channel_ID': "Bed006-c2",
'Speaker': "mn015",
'Dialogue_ID': "Bed006",
'Utterance': "keith is not technically one of us yet ."
}
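In the example above, `Utterance_ID` appears to embed both `Channel_ID` and `Dialogue_ID`. A minimal parsing sketch under that assumption follows. The card does not document the ID layout, so reading the two trailing numbers as start and end offsets is a guess, and `parse_mrda_utterance_id` is a hypothetical helper.

```python
from typing import NamedTuple


class MrdaUtteranceId(NamedTuple):
    dialogue_id: str
    channel_id: str
    start: int  # assumed to be a start offset; the card does not say
    end: int    # assumed to be an end offset; the card does not say


def parse_mrda_utterance_id(utterance_id: str) -> MrdaUtteranceId:
    """Split an ID shaped like '<dialogue>-<channel>_<number>_<number>'."""
    channel_id, start, end = utterance_id.rsplit("_", 2)
    dialogue_id = channel_id.split("-", 1)[0]
    return MrdaUtteranceId(dialogue_id, channel_id, int(start), int(end))


parsed = parse_mrda_utterance_id("Bed006-c2_0073656_0076706")
```

For the example row, the parsed `dialogue_id` and `channel_id` match the `Dialogue_ID` and `Channel_ID` fields stored alongside the ID.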
BT OASIS Corpus
An example from the oasis configuration looks as follows:
{
'Speaker': "b",
'Utterance': "when i rang up um when i rang to find out why she said oh well your card's been declined",
'Dialogue_Act': 21, # "inform"
}
SEMAINE database
An example from the sem configuration looks as follows:
{
'Utterance': "can you think of somebody who is like that ?",
'NbPairInSession': "11",
'Dialogue_ID': "59",
'SpeechTurn': "674",
'Speaker': "Agent",
'Sentiment': 1, # "Neutral"
}
Switchboard Dialog Act (SwDA) Corpus
An example from the swda configuration looks as follows:
{
'Utterance': "but i 'd probably say that 's roughly right .",
'Dialogue_Act': 33, # "aap_am"
'From_Caller': "1255",
'To_Caller': "1087",
'Topic': "CRIME",
'Dialogue_ID': "818",
'Conv_ID': "sw2836",
}
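The integer stored in `Dialogue_Act` is an index into the configuration's label list. The sketch below decodes the example above using a short excerpt of the swda labels, with indices and names copied from the swda field description in this card. `SWDA_LABELS` and `describe_dialogue_act` are hypothetical names, and the excerpt is deliberately incomplete.

```python
# Excerpt of the swda label list; indices and names are copied from this card.
SWDA_LABELS = {
    0: ("sd", "Statement-non-opinion"),
    1: ("b", "Acknowledge (Backchannel)"),
    2: ("sv", "Statement-opinion"),
    4: ("aa", "Agree/Accept"),
    33: ("aap_am", "Maybe/Accept-part"),
}


def describe_dialogue_act(label_id: int) -> str:
    """Render a label index as 'id -> tag [description]'."""
    tag, description = SWDA_LABELS.get(label_id, ("?", "not in this excerpt"))
    return f"{label_id} -> {tag} [{description}]"


example = {
    "Utterance": "but i 'd probably say that 's roughly right .",
    "Dialogue_Act": 33,
}
summary = describe_dialogue_act(example["Dialogue_Act"])
```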
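For the dyda_da configuration, the different fields are:
For the dyda_e configuration, the different fields are:
For the iemocap configuration, the different fields are:
For the maptask configuration, the different fields are:
For the meld_e configuration, the different fields are:
For the meld_s configuration, the different fields are:
For the mrda configuration, the different fields are:
For the oasis configuration, the different fields are:
For the sem configuration, the different fields are: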
For the swda configuration, the different fields are:
- Utterance: the utterance, as a string.
- Dialogue_Act: the dialogue act label of the utterance. It can be one of "sd" (0) [Statement-non-opinion], "b" (1) [Acknowledge (Backchannel)], "sv" (2) [Statement-opinion], "%" (3) [Uninterpretable], "aa" (4) [Agree/Accept], "ba" (5) [Appreciation], "fc" (6) [Conventional-closing], "qw" (7) [Wh-Question], "nn" (8) [No Answers], "bk" (9) [Response Acknowledgement], "h" (10) [Hedge], "qy^d" (11) [Declarative Yes-No-Question], "bh" (12) [Backchannel in Question Form], "^q" (13) [Quotation], "bf" (14) [Summarize/Reformulate], 'fo_o_fw_" by_bc' (15) [Other], 'fo_o_fw_by_bc "' (16) [Other], "na" (17) [Affirmative Non-yes Answers], "ad" (18) [Action-directive], "^2" (19) [Collaborative Completion], "b^m" (20) [Repeat-phrase], "qo" (21) [Open-Question], "qh" (22) [Rhetorical-Question], "^h" (23) [Hold Before Answer/Agreement], "ar" (24) [Reject], "ng" (25) [Negative Non-no Answers], "br" (26) [Signal-non-understanding], "no" (27) [Other Answers], "fp" (28) [Conventional-opening], "qrr" (29) [Or-Clause], "arp_nd" (30) [Dispreferred Answers], "t3" (31) [3rd-party-talk], "oo_co_cc" (32) [Offers, Options Commits], "aap_am" (33) [Maybe/Accept-part], "t1" (34) [Downplayer], "bd" (35) [Self-talk], "^g" (36) [Tag-Question], "qw^d" (37) [Declarative Wh-Question], "fa" (38) [Apology], "ft" (39) [Thanking], "+" (40) [Unknown], "x" (41) [Unknown], "ny" (42) [Unknown], "sv_fx" (43) [Unknown], "qy_qr" (44) [Unknown] or "ba_fe" (45) [Unknown].
- From_Caller: identifier of the originating caller, as a string.
- To_Caller: identifier of the receiving caller, as a string.
- Topic: the topic, as a string.
- Dialogue_ID: identifier of the dialogue, as a string.
- Conv_ID: identifier of the conversation, as a string.

| Dataset name | Train | Valid | Test |
|---|---|---|---|
| dyda_da | 87170 | 8069 | 7740 |
| dyda_e | 87170 | 8069 | 7740 |
| iemocap | 7213 | 805 | 2021 |
| maptask | 20905 | 2963 | 2894 |
| meld_e | 9989 | 1109 | 2610 |
| meld_s | 9989 | 1109 | 2610 |
| mrda | 83944 | 9815 | 15470 |
| oasis | 12076 | 1513 | 1478 |
| sem | 4264 | 485 | 878 |
| swda | 190709 | 21203 | 2714 |
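The split sizes in the table above vary widely across configurations. The following sketch computes the total number of utterances and the share held by each split, with the counts copied from the table; `SPLIT_SIZES` and `split_fractions` are hypothetical names introduced for illustration.

```python
# (train, valid, test) utterance counts copied from the table in this card.
SPLIT_SIZES = {
    "dyda_da": (87170, 8069, 7740),
    "dyda_e": (87170, 8069, 7740),
    "iemocap": (7213, 805, 2021),
    "maptask": (20905, 2963, 2894),
    "meld_e": (9989, 1109, 2610),
    "meld_s": (9989, 1109, 2610),
    "mrda": (83944, 9815, 15470),
    "oasis": (12076, 1513, 1478),
    "sem": (4264, 485, 878),
    "swda": (190709, 21203, 2714),
}


def split_fractions(name: str) -> dict:
    """Return the total size and the fraction of rows in each split."""
    train, valid, test = SPLIT_SIZES[name]
    total = train + valid + test
    return {
        "total": total,
        "train": train / total,
        "valid": valid / total,
        "test": test / total,
    }


for name in SPLIT_SIZES:
    stats = split_fractions(name)
    print(
        f"{name:8s} total={stats['total']:6d} "
        f"train={stats['train']:.1%} valid={stats['valid']:.1%} test={stats['test']:.1%}"
    )
```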
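[More Information Needed]
[More Information Needed]
Who are the source language producers? [More Information Needed]
[More Information Needed]
Who are the annotators? [More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
[More Information Needed]
Emile Chapuis, Pierre Colombo, Ebenge Usip.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 Unported License.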
@inproceedings{chapuis-etal-2020-hierarchical,
title = "Hierarchical Pre-training for Sequence Labelling in Spoken Dialog",
author = "Chapuis, Emile and
Colombo, Pierre and
Manica, Matteo and
Labeau, Matthieu and
Clavel, Chlo{\'e}",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.findings-emnlp.239",
doi = "10.18653/v1/2020.findings-emnlp.239",
pages = "2636--2648",
abstract = "Sequence labelling tasks like Dialog Act and Emotion/Sentiment identification are a key component of spoken dialog systems. In this work, we propose a new approach to learn generic representations adapted to spoken dialog, which we evaluate on a new benchmark we call Sequence labellIng evaLuatIon benChmark fOr spoken laNguagE benchmark (SILICONE). SILICONE is model-agnostic and contains 10 different datasets of various sizes. We obtain our representations with a hierarchical encoder based on transformer architectures, for which we extend two well-known pre-training objectives. Pre-training is performed on OpenSubtitles: a large corpus of spoken dialog containing over 2.3 billion of tokens. We demonstrate how hierarchical encoders achieve competitive results with consistently fewer parameters compared to state-of-the-art models and we show their importance for both pre-training and fine-tuning.",
}