Dataset: flax-sentence-embeddings/stackexchange_titlebody_best_voted_answer_jsonl
Tasks:
Sub-tasks: closed-domain-qa
Languages:
Multilinguality: multilingual
Language creators: found
Annotation creators: found
Source datasets: original
License:
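We automatically extracted question and answer (Q&A) pairs from the Stack Exchange network. Stack Exchange gathers many Q&A communities across 50 online platforms, including the well-known Stack Overflow and other technical sites. 100 million developers use Stack Exchange every month. The dataset is a parallel corpus in which each question is mapped to its top-rated answer. It is split by community, and the communities cover domains ranging from 3D printing, economics, and Raspberry Pi to emacs. The complete list of communities is available here.
Stack Exchange content is mainly in English (en).
Each data sample is represented as follows: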
{'title_body': 'How to determine if 3 points on a 3-D graph are collinear? Let the points $A, B$ and $C$ be $(x_1, y_1, z_1), (x_2, y_2, z_2)$ and $(x_3, y_3, z_3)$ respectively. How do I prove that the 3 points are collinear? What is the formula?',
'upvoted_answer': 'From $A(x_1,y_1,z_1),B(x_2,y_2,z_2),C(x_3,y_3,z_3)$ we can get their position vectors.\n\n$\\vec{AB}=(x_2-x_1,y_2-y_1,z_2-z_1)$ and $\\vec{AC}=(x_3-x_1,y_3-y_1,z_3-z_1)$.\n\nThen $||\\vec{AB}\\times\\vec{AC}||=0\\implies A,B,C$ collinear.'}
This particular example corresponds to the following page.
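A record of this shape can be read with the standard library alone. The sketch below assumes the data is stored as JSON Lines (one JSON object per line, as the `_jsonl` suffix of the dataset name suggests) with the two keys shown in the sample. `parse_pairs` is a hypothetical helper written for illustration, not part of any dataset tooling, and the sample text in it is made up.

```python
import json
from typing import Iterable, Iterator, Tuple


def parse_pairs(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (question, answer) tuples from JSON Lines text.

    Assumes each non-empty line is a JSON object with the keys
    'title_body' and 'upvoted_answer', as in the sample record.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        yield record["title_body"], record["upvoted_answer"]


# Tiny in-memory stand-in for a real split file (made-up content).
sample_lines = [
    json.dumps({
        "title_body": "How do I test collinearity? Given three points in 3-D ...",
        "upvoted_answer": "Check that the cross product is zero.",
    }),
    "",
]
pairs = list(parse_pairs(sample_lines))
```

Note that the sample above is printed as a Python dict (single quotes), so it is not itself valid JSON; the sketch only covers the on-disk form under the stated assumption.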
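The fields in the dataset contain the following information:
We provide multiple splits for this dataset, each corresponding to a specific community channel. The number of pairs in each split is detailed below: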
| Split | Number of pairs |
|---|---|
| apple | 92,487 |
| english | 100,640 |
| codereview | 41,748 |
| dba | 71,449 |
| mathoverflow | 85,289 |
| electronics | 129,494 |
| mathematica | 59,895 |
| drupal | 67,817 |
| magento | 79,241 |
| gaming | 82,887 |
| ell | 77,892 |
| gamedev | 40,154 |
| gis | 100,254 |
| askubuntu | 267,135 |
| diy | 52,896 |
| academia | 32,137 |
| blender | 54,153 |
| cs | 30,010 |
| chemistry | 27,061 |
| judaism | 26,085 |
| crypto | 19,404 |
| android | 38,077 |
| ja | 17,376 |
| christianity | 11,498 |
| graphicdesign | 28,083 |
| aviation | 18,755 |
| ethereum | 26,124 |
| biology | 19,277 |
| datascience | 20,503 |
| law | 16,133 |
| dsp | 17,430 |
| japanese | 20,948 |
| hermeneutics | 9,516 |
| bicycles | 15,708 |
| arduino | 16,281 |
| history | 10,766 |
| bitcoin | 22,474 |
| cooking | 22,641 |
| hinduism | 8,999 |
| codegolf | 8,211 |
| boardgames | 11,805 |
| emacs | 16,830 |
| economics | 8,844 |
| gardening | 13,246 |
| astronomy | 9,086 |
| islam | 10,052 |
| german | 13,733 |
| fitness | 8,297 |
| french | 10,578 |
| anime | 10,131 |
| craftcms | 11,236 |
| cstheory | 7,742 |
| engineering | 8,649 |
| buddhism | 6,787 |
| linguistics | 6,843 |
| ai | 5,763 |
| expressionengine | 10,742 |
| cogsci | 5,101 |
| chinese | 8,646 |
| chess | 6,392 |
| civicrm | 10,648 |
| literature | 3,539 |
| interpersonal | 3,398 |
| health | 4,494 |
| avp | 6,450 |
| earthscience | 4,396 |
| joomla | 5,887 |
| homebrew | 5,608 |
| expatriates | 4,913 |
| latin | 3,969 |
| matheducators | 2,706 |
| ham | 3,501 |
| genealogy | 2,895 |
| 3dprinting | 3,488 |
| elementaryos | 5,917 |
| bioinformatics | 3,135 |
| devops | 3,462 |
| hsm | 2,517 |
| italian | 3,101 |
| computergraphics | 2,306 |
| martialarts | 1,737 |
| bricks | 3,530 |
| freelancing | 1,663 |
| crafts | 1,659 |
| lifehacks | 2,576 |
| cseducators | 902 |
| materials | 1,101 |
| hardwarerecs | 2,050 |
| iot | 1,359 |
| eosio | 1,940 |
| languagelearning | 948 |
| korean | 1,406 |
| coffee | 1,188 |
| esperanto | 1,466 |
| beer | 1,012 |
| ebooks | 1,107 |
| iota | 775 |
| cardano | 248 |
| drones | 496 |
| conlang | 334 |
| pt | 103,277 |
| stats | 115,679 |
| unix | 155,414 |
| physics | 141,230 |
| tex | 171,628 |
| serverfault | 238,507 |
| salesforce | 87,272 |
| wordpress | 83,621 |
| softwareengineering | 51,326 |
| scifi | 54,805 |
| security | 51,355 |
| ru | 253,289 |
| superuser | 352,610 |
| sharepoint | 80,420 |
| rpg | 40,435 |
| travel | 36,533 |
| worldbuilding | 26,210 |
| meta | 1,000 |
| workplace | 24,012 |
| ux | 28,901 |
| money | 29,404 |
| webmasters | 30,370 |
| raspberrypi | 24,143 |
| photo | 23,204 |
| music | 19,936 |
| philosophy | 13,114 |
| puzzling | 17,448 |
| movies | 18,243 |
| quant | 12,933 |
| politics | 11,047 |
| space | 12,893 |
| mechanics | 18,613 |
| skeptics | 8,145 |
| rus | 16,528 |
| writers | 9,867 |
| webapps | 24,867 |
| softwarerecs | 11,761 |
| networkengineering | 12,590 |
| parenting | 5,998 |
| scicomp | 7,036 |
| sqa | 9,256 |
| sitecore | 7,838 |
| vi | 9,000 |
| spanish | 7,675 |
| pm | 5,435 |
| pets | 6,156 |
| sound | 8,303 |
| reverseengineering | 5,817 |
| outdoors | 5,278 |
| tridion | 5,907 |
| retrocomputing | 3,907 |
| robotics | 4,648 |
| quantumcomputing | 4,320 |
| sports | 4,707 |
| russian | 3,937 |
| opensource | 3,221 |
| woodworking | 2,955 |
| patents | 3,573 |
| tor | 4,167 |
| ukrainian | 1,767 |
| opendata | 3,842 |
| monero | 3,508 |
| sustainability | 1,674 |
| portuguese | 1,964 |
| mythology | 1,595 |
| musicfans | 2,431 |
| or | 1,490 |
| poker | 1,665 |
| windowsphone | 2,807 |
| moderators | 504 |
| stackapps | 1,518 |
| stellar | 1,078 |
| vegetarianism | 585 |
| tezos | 1,169 |
| total | 4,750,619 |
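We designed this dataset primarily for training sentence embeddings. Sentence embeddings can be trained in a contrastive learning setup, in which the model learns to match each sentence with its corresponding pair among multiple candidates. Such models need many examples to be effective, so building a dataset can be tedious. Community networks such as Stack Exchange allow us to build many examples semi-automatically.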
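To make the contrastive setup concrete, here is a minimal sketch of one common formulation: a softmax cross-entropy over in-batch candidates, where question i should score highest against answer i and every other answer in the batch serves as a negative. The text does not say which loss was actually used, so the function name, the dot-product similarity, and the `scale` parameter are all assumptions for illustration.

```python
import math
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def in_batch_contrastive_loss(
    questions: List[Sequence[float]],
    answers: List[Sequence[float]],
    scale: float = 1.0,
) -> float:
    """Mean cross-entropy of matching question i to answer i.

    Every other answer in the batch acts as a negative for question i.
    `scale` multiplies the similarity scores (an illustrative knob).
    """
    if len(questions) != len(answers) or not questions:
        raise ValueError("need the same non-zero number of questions and answers")
    total = 0.0
    for i, q in enumerate(questions):
        scores = [scale * dot(q, a) for a in answers]
        # log-sum-exp with the max subtracted for numerical stability
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[i]  # equals -log softmax(scores)[i]
    return total / len(questions)


# Toy 2-D "embeddings": aligned pairs give a lower loss than shuffled pairs.
q = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_contrastive_loss(q, a, scale=5.0)
shuffled = in_batch_contrastive_loss(q, a[::-1], scale=5.0)
```

With real embeddings the vectors would come from the model being trained. The toy vectors only show that correctly paired questions and answers yield a smaller loss than mismatched ones.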
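The source data comes from the Stack Exchange data dumps.
Initial Data Collection and Normalization
We collected the data from the math community.
We filtered out questions whose title or body is shorter than 20 characters, as well as questions whose body is longer than 4096 characters. When extracting the most upvoted answer, we kept only pairs for which there is a gap of at least 100 votes between the most upvoted and the most downvoted answers.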
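The length thresholds above can be sketched as follows. The input layout (a dict with `title`, `body`, and a list of `answers` carrying a `score`) is hypothetical, and joining title and body with a single space is an assumption based on how `title_body` looks in the sample record. The vote-gap condition is left out because the text does not say how the gap is computed.

```python
from typing import Dict, List, Optional, Tuple

MIN_CHARS = 20         # minimum title/body length stated in the text
MAX_BODY_CHARS = 4096  # maximum body length stated in the text


def keep_question(title: str, body: str) -> bool:
    """Return True if the question passes the length thresholds."""
    if len(title) < MIN_CHARS or len(body) < MIN_CHARS:
        return False
    return len(body) <= MAX_BODY_CHARS


def best_answer(answers: List[Dict]) -> Optional[Dict]:
    """Return the answer with the highest 'score', or None if there are none."""
    return max(answers, key=lambda ans: ans["score"], default=None)


def build_pair(question: Dict) -> Optional[Tuple[str, str]]:
    """Build a (title_body, upvoted_answer) pair, or None if filtered out."""
    if not keep_question(question["title"], question["body"]):
        return None
    top = best_answer(question["answers"])
    if top is None:
        return None
    # Assumed concatenation: title, one space, then body.
    return question["title"] + " " + question["body"], top["body"]


# Made-up question in the hypothetical input layout.
example = {
    "title": "How to determine if 3 points are collinear?",
    "body": "Let the points A, B and C be given. What is the formula?",
    "answers": [
        {"body": "Plot them.", "score": 1},
        {"body": "Use the cross product.", "score": 42},
    ],
}
pair = build_pair(example)
```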
Who are the source language producers?
The questions and answers were written by community developers on Stack Exchange.
For license information, see: https://archive.org/details/stackexchange
@misc{StackExchangeDataset,
author = {Flax Sentence Embeddings Team},
title = {Stack Exchange question pairs},
year = {2021},
howpublished = {https://huggingface.co/datasets/flax-sentence-embeddings/},
}
Thanks to the Flax Sentence Embeddings team for adding this dataset.