Datasets:

flax-sentence-embeddings/stackexchange_titlebody_best_voted_answer_jsonl

Tasks:

Question Answering

Languages:

en

Multilinguality:

multilingual

Language Creators:

found

Annotations Creators:

found

Source Datasets:

original

Dataset Card Creation Guide

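Dataset Summary

We automatically extracted question and answer (Q&A) pairs from the Stack Exchange network. Stack Exchange gathers many Q&A communities across 50 online platforms, including the well-known Stack Overflow and other technical sites. 100 million developers consult Stack Exchange every month. The dataset is a parallel corpus in which each question is mapped to its top-rated answer. It is split by community, covering a variety of domains such as 3D printing, economics, Raspberry Pi, and emacs. A complete list of communities is available here.

Languages

Stack Exchange content is mainly in English (en).

Dataset Structure

Data Instances

Each data sample is represented as follows: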

{'title_body': 'How to determine if 3 points on a 3-D graph are collinear? Let the points $A, B$ and $C$ be $(x_1, y_1, z_1), (x_2, y_2, z_2)$ and $(x_3, y_3, z_3)$ respectively. How do I prove that the 3 points are collinear? What is the formula?',
 'upvoted_answer': 'From $A(x_1,y_1,z_1),B(x_2,y_2,z_2),C(x_3,y_3,z_3)$ we can get their position vectors.\n\n$\\vec{AB}=(x_2-x_1,y_2-y_1,z_2-z_1)$ and $\\vec{AC}=(x_3-x_1,y_3-y_1,z_3-z_1)$.\n\nThen $||\\vec{AB}\\times\\vec{AC}||=0\\implies A,B,C$ collinear.'}

This particular example corresponds to the following page.

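Data Fields

The fields in the dataset contain the following information:

  • title_body: the concatenation of the question's title and body
  • upvoted_answer: the body of the most upvoted answer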
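As a minimal sketch of how these two fields might be consumed, the snippet below parses JSON Lines text into (question, answer) pairs. It assumes one JSON object per line, as the `jsonl` suffix of the dataset name suggests. The helper name `read_pairs` and the sample record are hypothetical and are not taken from the dataset.

```python
import json
from typing import Iterable, Iterator, Tuple


def read_pairs(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (title_body, upvoted_answer) tuples from JSON Lines text.

    Hypothetical helper: assumes one JSON object per line carrying the
    two fields described above.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        yield record["title_body"], record["upvoted_answer"]


# Illustrative record, made up for this example.
sample = [
    json.dumps({
        "title_body": "What is a vector? Please explain briefly.",
        "upvoted_answer": "A quantity with magnitude and direction.",
    }),
    "",
]
pairs = list(read_pairs(sample))
```

Each element of `pairs` is a tuple whose first item is the question text and whose second item is the answer text, which is the shape most pair-based training loops expect.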

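Data Splits

We provide multiple splits for this dataset, each corresponding to a specific community channel. The number of pairs in each split is detailed below: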

Community    Number of pairs
apple 92,487
english 100,640
codereview 41,748
dba 71,449
mathoverflow 85,289
electronics 129,494
mathematica 59,895
drupal 67,817
magento 79,241
gaming 82,887
ell 77,892
gamedev 40,154
gis 100,254
askubuntu 267,135
diy 52,896
academia 32,137
blender 54,153
cs 30,010
chemistry 27,061
judaism 26,085
crypto 19,404
android 38,077
ja 17,376
christianity 11,498
graphicdesign 28,083
aviation 18,755
ethereum 26,124
biology 19,277
datascience 20,503
law 16,133
dsp 17,430
japanese 20,948
hermeneutics 9,516
bicycles 15,708
arduino 16,281
history 10,766
bitcoin 22,474
cooking 22,641
hinduism 8,999
codegolf 8,211
boardgames 11,805
emacs 16,830
economics 8,844
gardening 13,246
astronomy 9,086
islam 10,052
german 13,733
fitness 8,297
french 10,578
anime 10,131
craftcms 11,236
cstheory 7,742
engineering 8,649
buddhism 6,787
linguistics 6,843
ai 5,763
expressionengine 10,742
cogsci 5,101
chinese 8,646
chess 6,392
civicrm 10,648
literature 3,539
interpersonal 3,398
health 4,494
avp 6,450
earthscience 4,396
joomla 5,887
homebrew 5,608
expatriates 4,913
latin 3,969
matheducators 2,706
ham 3,501
genealogy 2,895
3dprinting 3,488
elementaryos 5,917
bioinformatics 3,135
devops 3,462
hsm 2,517
italian 3,101
computergraphics 2,306
martialarts 1,737
bricks 3,530
freelancing 1,663
crafts 1,659
lifehacks 2,576
cseducators 902
materials 1,101
hardwarerecs 2,050
iot 1,359
eosio 1,940
languagelearning 948
korean 1,406
coffee 1,188
esperanto 1,466
beer 1,012
ebooks 1,107
iota 775
cardano 248
drones 496
conlang 334
pt 103,277
stats 115,679
unix 155,414
physics 141,230
tex 171,628
serverfault 238,507
salesforce 87,272
wordpress 83,621
softwareengineering 51,326
scifi 54,805
security 51,355
ru 253,289
superuser 352,610
sharepoint 80,420
rpg 40,435
travel 36,533
worldbuilding 26,210
meta 1,000
workplace 24,012
ux 28,901
money 29,404
webmasters 30,370
raspberrypi 24,143
photo 23,204
music 19,936
philosophy 13,114
puzzling 17,448
movies 18,243
quant 12,933
politics 11,047
space 12,893
mechanics 18,613
skeptics 8,145
rus 16,528
writers 9,867
webapps 24,867
softwarerecs 11,761
networkengineering 12,590
parenting 5,998
scicomp 7,036
sqa 9,256
sitecore 7,838
vi 9,000
spanish 7,675
pm 5,435
pets 6,156
sound 8,303
reverseengineering 5,817
outdoors 5,278
tridion 5,907
retrocomputing 3,907
robotics 4,648
quantumcomputing 4,320
sports 4,707
russian 3,937
opensource 3,221
woodworking 2,955
patents 3,573
tor 4,167
ukrainian 1,767
opendata 3,842
monero 3,508
sustainability 1,674
portuguese 1,964
mythology 1,595
musicfans 2,431
or 1,490
poker 1,665
windowsphone 2,807
moderators 504
stackapps 1,518
stellar 1,078
vegetarianism 585
tezos 1,169
total 4,750,619

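Dataset Creation

Curation Rationale

We primarily designed this dataset for training sentence embeddings. Sentence embeddings can be trained in a contrastive learning setup, in which the model learns to associate each sentence with its matching counterpart among multiple candidates. Such models need many examples to be effective, so building a dataset can be tedious. Community networks such as Stack Exchange allow us to build many examples semi-automatically.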
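The contrastive setup described above can be illustrated with an in-batch-negatives loss: for each question, its own answer is the positive and every other answer in the batch is a negative. The sketch below is pure Python on toy vectors and is not the training code used by the dataset authors. The function names, the cosine similarity choice, and the `scale` value are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_contrastive_loss(
    questions: List[Sequence[float]],
    answers: List[Sequence[float]],
    scale: float = 20.0,  # assumed scaling factor, not stated in the card
) -> float:
    """Mean cross-entropy where answers[i] is the positive for questions[i]
    and every other answer in the batch serves as a negative."""
    total = 0.0
    for i, q in enumerate(questions):
        logits = [scale * cosine(q, a) for a in answers]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(
            sum(math.exp(x - max_logit) for x in logits)
        )
        total += log_sum_exp - logits[i]  # negative log-softmax of the positive
    return total / len(questions)


# Toy 2-D embeddings, made up for illustration.
aligned = in_batch_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                    [[1.0, 0.0], [0.0, 1.0]])
swapped = in_batch_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                    [[0.0, 1.0], [1.0, 0.0]])
```

With these toy inputs the loss is close to zero when each question embedding matches its own answer (`aligned`) and large when the answers are swapped (`swapped`), which is the behaviour a contrastive objective is meant to reward and penalize respectively.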

Source Data

The source data comes from the Stack Exchange data dumps.

Initial Data Collection and Normalization

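We collected the data from the math community.

We filtered out questions whose title or body is shorter than 20 characters, as well as questions whose body is longer than 4096 characters. When extracting the most upvoted answer, we filtered out pairs for which there is a gap of at least 100 votes between the most upvoted answer and the downvoted answers.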
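The length thresholds above could be applied with a predicate along the following lines. This is a sketch only: the function name `keep_question` is hypothetical, and it assumes lengths are measured with Python's `len` on the raw strings, which the card does not specify. The vote-based filter is left out because the card does not give enough detail to reproduce it.

```python
MIN_CHARS = 20         # minimum title and body length stated above
MAX_BODY_CHARS = 4096  # maximum body length stated above


def keep_question(title: str, body: str) -> bool:
    """Return True if the question passes the length filters.

    Hypothetical helper; assumes length is counted with len() on the
    unmodified title and body strings.
    """
    if len(title) < MIN_CHARS or len(body) < MIN_CHARS:
        return False  # title or body too short
    if len(body) > MAX_BODY_CHARS:
        return False  # body too long
    return True
```

For instance, `keep_question("short", "y" * 100)` is rejected because the title has fewer than 20 characters, while a 30-character title with a 4096-character body is kept.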

Who are the source language producers?

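Questions and answers are written by the community developers of Stack Exchange.

Additional Information

Licensing Information

Please see the license information at: https://archive.org/details/stackexchange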

Citation Information

@misc{StackExchangeDataset,
  author = {Flax Sentence Embeddings Team},
  title = {Stack Exchange question pairs},
  year = {2021},
  howpublished = {https://huggingface.co/datasets/flax-sentence-embeddings/},
}

Contributions

Thanks to the Flax Sentence Embeddings team for adding this dataset.