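Dataset Card for STSb Multi MT

Dataset Summary

The STS Benchmark comprises a selection of the English datasets used in the STS tasks organized in the context of SemEval between 2012 and 2017. The selected datasets include text from image captions, news headlines, and user forums.

This dataset contains several machine-translated versions of the STSbenchmark dataset together with the English original. The translations were produced with deepl.com. The data can be used to train models such as sentence embeddings.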

Usage Examples

Load the German dev dataset:

from datasets import load_dataset
dataset = load_dataset("stsb_multi_mt", name="de", split="dev")

Load the English train dataset:

from datasets import load_dataset
dataset = load_dataset("stsb_multi_mt", name="en", split="train")

Supported Tasks and Leaderboards

[More Information Needed]

Languages

Available languages: de, en, es, fr, it, nl, pl, pt, ru, zh
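
Each language is selected through the name argument, so loading several languages means calling load_dataset once per code. The following is a minimal sketch, assuming the ten codes listed above are the complete set. The helper load_languages and its loader parameter are hypothetical names introduced for illustration; loader exists only so the function can be exercised without network access.

```python
from typing import Any, Callable, Dict, Iterable, Optional

# Language codes as listed in this card.
AVAILABLE_LANGUAGES = ("de", "en", "es", "fr", "it", "nl", "pl", "pt", "ru", "zh")


def load_languages(
    languages: Iterable[str],
    split: str = "train",
    loader: Optional[Callable[..., Any]] = None,
) -> Dict[str, Any]:
    """Load one split of stsb_multi_mt for each requested language.

    `loader` is a hypothetical hook (not part of the datasets library).
    It defaults to datasets.load_dataset and exists to ease offline testing.
    """
    languages = list(languages)
    unknown = [lang for lang in languages if lang not in AVAILABLE_LANGUAGES]
    if unknown:
        raise ValueError(f"Unsupported language code(s): {unknown}")
    if loader is None:
        # Imported lazily so this module can be imported without the library.
        from datasets import load_dataset as loader
    # One call per language, mirroring the usage examples in this card.
    return {
        lang: loader("stsb_multi_mt", name=lang, split=split)
        for lang in languages
    }
```

Calling load_languages(["de", "en"], split="dev") would return a dictionary keyed by language code, with one loaded split per entry.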

Dataset Structure

Data Instances

This dataset provides pairs of sentences and a score of their similarity.

| score | sentence 1 | sentence 2 | explanation |
|---|---|---|---|
| 5 | The bird is bathing in the sink. | Birdie is washing itself in the water basin. | The two sentences are completely equivalent, as they mean the same thing. |
| 4 | Two boys on a couch are playing video games. | Two boys are playing a video game. | The two sentences are mostly equivalent, but some unimportant details differ. |
| 3 | John said he is considered a witness but not a suspect. | “He is not a suspect anymore.” John said. | The two sentences are roughly equivalent, but some important information differs/missing. |
| 2 | They flew out of the nest in groups. | They flew into the nest together. | The two sentences are not equivalent, but share some details. |
| 1 | The woman is playing the violin. | The young lady enjoys listening to the guitar. | The two sentences are not equivalent, but are on the same topic. |
| 0 | The black dog is running through the snow. | A race car driver is driving his car through the mud. | The two sentences are completely dissimilar. |
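
The score levels above are integers, while the scores stored in the dataset are floats, so a small lookup can help when inspecting individual samples. This is a minimal sketch, assuming that rounding to the nearest integer level (halves rounded up) is an acceptable way to pick a description. That rounding rule is an illustration, not part of the dataset definition, and describe_score is a hypothetical helper.

```python
import math

# Explanations copied from the score levels listed in this card.
SCORE_EXPLANATIONS = {
    5: "The two sentences are completely equivalent, as they mean the same thing.",
    4: "The two sentences are mostly equivalent, but some unimportant details differ.",
    3: "The two sentences are roughly equivalent, but some important information differs/missing.",
    2: "The two sentences are not equivalent, but share some details.",
    1: "The two sentences are not equivalent, but are on the same topic.",
    0: "The two sentences are completely dissimilar.",
}


def describe_score(score: float) -> str:
    """Return the explanation of the integer level nearest to `score`."""
    if not 0.0 <= score <= 5.0:
        raise ValueError(f"score must lie in [0.0, 5.0], got {score!r}")
    # Round half up (assumption); avoids Python's round-half-to-even behaviour.
    level = math.floor(score + 0.5)
    return SCORE_EXPLANATIONS[level]
```

Under this rule, a score of 3.8 maps to the level-4 description ("mostly equivalent").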

Example:

{
    "sentence1": "A man is playing a large flute.",
    "sentence2": "A man is playing a flute.",
    "similarity_score": 3.8
}

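Data Fields

  • sentence1: The first sentence, of type str.
  • sentence2: The second sentence, of type str.
  • similarity_score: The similarity score, of type float, ranging from 0.0 to 5.0.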
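
Training objectives for sentence embeddings often compare a model's similarity output, which commonly lies in a bounded range such as [0, 1], against the gold label, so the 0.0 to 5.0 score is frequently rescaled first. This is a minimal sketch, assuming a linear rescaling to [0, 1] by dividing by 5.0. normalize_example is a hypothetical helper that operates on a plain dictionary with the three fields listed above.

```python
from typing import Dict, Union

Example = Dict[str, Union[str, float]]

# Upper bound of similarity_score as stated in this card.
MAX_SCORE = 5.0


def normalize_example(example: Example) -> Example:
    """Return a copy of `example` with similarity_score rescaled to [0, 1]."""
    score = float(example["similarity_score"])
    if not 0.0 <= score <= MAX_SCORE:
        raise ValueError(f"similarity_score out of range: {score!r}")
    return {
        "sentence1": example["sentence1"],
        "sentence2": example["sentence2"],
        # Linear rescaling (assumption): 0.0 stays 0.0, 5.0 becomes 1.0.
        "similarity_score": score / MAX_SCORE,
    }
```

With the datasets library, a function of this shape can be passed to Dataset.map to rescale a whole split.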

Data Splits

  • train with 5749 samples
  • dev with 1500 samples
  • test with 1379 samples
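
The three counts add up to 8628 sentence pairs. They can also serve as a quick sanity check after downloading. This is a minimal sketch: find_split_mismatches is a hypothetical helper that compares observed split sizes against the numbers stated above.

```python
from typing import Dict, List

# Sample counts as stated in this card.
EXPECTED_SPLIT_SIZES = {"train": 5749, "dev": 1500, "test": 1379}


def find_split_mismatches(actual_sizes: Dict[str, int]) -> List[str]:
    """Return one message per split whose size is missing or unexpected."""
    problems = []
    for split, expected in EXPECTED_SPLIT_SIZES.items():
        actual = actual_sizes.get(split)
        if actual is None:
            problems.append(f"{split}: missing")
        elif actual != expected:
            problems.append(f"{split}: expected {expected}, got {actual}")
    return problems
```

The observed sizes could be collected by taking len() of each split returned by load_dataset. An empty result means all three splits match.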

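Dataset Creation

Curation Rationale

[More Information Needed]

Source Data

Data Collection and Normalization

[More Information Needed]

Who are the source language producers?

[More Information Needed]

Annotations

Annotation process

[More Information Needed]

Who are the annotators?

[More Information Needed]

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]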

Licensing Information

See LICENSE and the download page of the original dataset.

Citation Information

@InProceedings{huggingface:dataset:stsb_multi_mt,
title = {Machine translated multilingual STS benchmark dataset.},
author={Philip May},
year={2021},
url={https://github.com/PhilipMay/stsb-multi-mt}
}

Contributions

Thanks to @PhilipMay for adding this dataset.