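Datasets: medmcqa
Languages: en
Multilinguality: monolingual
Size: 100K<n<1M
Language creators: expert-generated
Annotations creators: no-annotation
Source datasets: original
License: apache-2.0

MedMCQA is a large-scale Multiple-Choice Question Answering (MCQA) dataset designed to address real-world medical entrance exam questions.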
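MedMCQA has more than 194,000 high-quality AIIMS and NEET PG entrance exam MCQs covering 2,400 healthcare topics and 21 medical subjects, with an average token length of 12.77 and high topical diversity.
Each sample contains a question, the correct answer(s), and other options. Answering requires deeper language understanding, because the questions test 10+ reasoning abilities of a model across a wide range of medical subjects and topics. A detailed explanation of the solution is also provided.
MedMCQA gives the Natural Language Processing community an open-source dataset, and it is expected to facilitate future research toward better question answering systems. Its questions span the 21 medical subjects noted above.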
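Multiple-choice question answering, open-domain question answering: the dataset can be used to train models for multiple-choice QA and for open-domain QA. Questions in these exams are challenging and generally require deeper domain and language understanding, as they test 10+ reasoning abilities of a model across a wide range of medical subjects and topics.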
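As a rough illustration of the multiple-choice framing, a model picks one option per question and can be scored by exact match. The sketch below groups accuracy by subject. Accuracy as the metric, the `accuracy_by_subject` helper, and the sample rows are all assumptions made for illustration and are not taken from the dataset or its paper.

```python
from collections import defaultdict


def accuracy_by_subject(rows):
    """Compute exact-match accuracy per subject.

    rows: iterable of (subject, gold_option, predicted_option) tuples.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for subject, gold, pred in rows:
        totals[subject] += 1
        hits[subject] += int(gold == pred)
    return {subject: hits[subject] / totals[subject] for subject in totals}


# Made-up predictions, for illustration only.
rows = [
    ("Pathology", "a", "a"),
    ("Pathology", "b", "c"),
    ("Anatomy", "d", "d"),
]
scores = accuracy_by_subject(rows)
print(scores)
```

Reporting per-subject scores alongside the overall number can show whether a model is uniformly weak or only weak on particular subjects.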
The questions and answers are available in English.
{
  "question": "A 40-year-old man presents with 5 days of productive cough and fever. Pseudomonas aeruginosa is isolated from a pulmonary abscess. CBC shows an acute effect characterized by marked leukocytosis (50,000 mL) and the differential count reveals a shift to left in granulocytes. Which of the following terms best describes these hematologic findings?",
  "exp": "Circulating levels of leukocytes and their precursors may occasionally reach very high levels (>50,000 WBC mL). These extreme elevations are sometimes called leukemoid reactions because they are similar to the white cell counts observed in leukemia, from which they must be distinguished. The leukocytosis occurs initially because of the accelerated release of granulocytes from the bone marrow (caused by cytokines, including TNF and IL-1) There is a rise in the number of both mature and immature neutrophils in the blood, referred to as a shift to the left. In contrast to bacterial infections, viral infections (including infectious mononucleosis) are characterized by lymphocytosis Parasitic infestations and certain allergic reactions cause eosinophilia, an increase in the number of circulating eosinophils. Leukopenia is defined as an absolute decrease in the circulating WBC count.",
  "cop": 1,
  "opa": "Leukemoid reaction",
  "opb": "Leukopenia",
  "opc": "Myeloid metaplasia",
  "opd": "Neutrophilia",
  "subject_name": "Pathology",
  "topic_name": "Basic Concepts and Vascular changes of Acute Inflammation",
  "id": "4e1715fe-0bc3-494e-b6eb-2d4617245aef",
  "choice_type": "single"
}
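A minimal sketch of how a record like the one above might be turned into a lettered prompt and how the correct option could be looked up. The question text is shortened here, and `format_prompt` and `correct_option` are hypothetical helpers. The 1-based reading of `cop` is an assumption inferred from this sample, where `cop` is 1 and the explanation describes a leukemoid reaction (`opa`); other copies of the data may store 0..3 instead.

```python
import json

# Trimmed copy of the sample record; the long texts are shortened.
raw = """{
  "question": "Which of the following terms best describes these hematologic findings?",
  "cop": 1,
  "opa": "Leukemoid reaction",
  "opb": "Leukopenia",
  "opc": "Myeloid metaplasia",
  "opd": "Neutrophilia",
  "subject_name": "Pathology",
  "choice_type": "single"
}"""

OPTION_KEYS = ("opa", "opb", "opc", "opd")


def format_prompt(record):
    """Render the question followed by its four options as lettered lines."""
    lines = [record["question"]]
    for letter, key in zip("ABCD", OPTION_KEYS):
        lines.append(f"{letter}. {record[key]}")
    return "\n".join(lines)


def correct_option(record, one_based=True):
    """Return the text of the correct option.

    Assumption: "cop" is 1-based in the sample shown (1 -> "opa").
    Pass one_based=False if your copy of the data stores 0..3.
    """
    index = record["cop"] - 1 if one_based else record["cop"]
    return record[OPTION_KEYS[index]]


record = json.loads(raw)
print(format_prompt(record))
print("Answer:", correct_option(record))
```

Checking the indexing convention against a few records with known answers before training or evaluating avoids an off-by-one error in the labels.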
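The goal of MedMCQA is to emulate the rigor of real-world medical exams. To enable that, a predefined split of the dataset is provided. The split is made by exam rather than by individual question, which also supports the reusability and generalization ability of the models.
The training set of MedMCQA consists of all the collected mock and online test series, whereas the test set consists of all AIIMS PG exam MCQs (1991-present). The development set consists of NEET PG exam MCQs (2001-present) to approximate real exam evaluation.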
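The exam-level split described above can be sketched as a simple routing rule: every question inherits the split of the exam it came from. The `source` field, the record values, and the `split_by_exam` helper are illustrative assumptions; the dataset does not document such a column.

```python
from collections import defaultdict

# Hypothetical records: "source" is an assumed field used only for illustration.
records = [
    {"id": "q1", "source": "mock"},
    {"id": "q2", "source": "AIIMS PG"},
    {"id": "q3", "source": "NEET PG"},
    {"id": "q4", "source": "online test series"},
    {"id": "q5", "source": "AIIMS PG"},
]

# Mirrors the description: AIIMS PG -> test, NEET PG -> development,
# everything else (mock and online test series) -> train.
SPLIT_BY_SOURCE = {"AIIMS PG": "test", "NEET PG": "validation"}


def split_by_exam(rows):
    """Assign each question to a split based on the exam it came from."""
    splits = defaultdict(list)
    for row in rows:
        split = SPLIT_BY_SOURCE.get(row["source"], "train")
        splits[split].append(row["id"])
    return dict(splits)


splits = split_by_exam(records)
print(splits)
```

Routing by exam rather than sampling questions at random means no held-out exam contributes questions to training.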
Similar questions across the train, test, and development sets were removed based on similarity. The final split sizes are as follows:
| | Train | Test | Valid |
|---|---|---|---|
| Question # | 182,822 | 6,150 | 4,183 |
| Vocabulary size | 94,231 | 11,218 | 10,800 |
| Max question tokens | 220 | 135 | 88 |
| Max answer tokens | 38 | 21 | 25 |
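The card states that similar questions were removed across splits but names neither the similarity measure nor a threshold. As a hedged sketch of what such filtering could look like, the example below uses token-level Jaccard similarity with a threshold of 0.8 and drops matching questions from the training side. The measure, the threshold, the direction of removal, and the helper names are all assumptions, not the authors' procedure.

```python
import re


def tokens(text):
    """Return the set of lowercase alphanumeric tokens in text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity between the token sets of two strings."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def drop_near_duplicates(train, held_out, threshold=0.8):
    """Keep training questions whose similarity to every held-out question is below threshold."""
    return [q for q in train if all(jaccard(q, h) < threshold for h in held_out)]


# Made-up questions, for illustration only.
train = [
    "Which vitamin deficiency causes scurvy?",
    "Most common site of peptic ulcer is?",
]
test = ["Which vitamin deficiency causes scurvy"]

kept = drop_near_duplicates(train, test)
print(kept)
```

This pairwise comparison is quadratic in the number of questions, so a dataset of this size would likely need an approximate method such as hashing or indexing instead.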
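Before this work, there had been few attempts to build biomedical MCQA datasets (Vilares and Gomez-Rodr, 2019), and those that exist (1) are mostly small, containing up to a few thousand questions, and (2) cover only a limited number of medical topics and subjects. The paper addresses these limitations by introducing MedMCQA, a new large-scale Multiple-Choice Question Answering (MCQA) dataset designed to address real-world medical entrance exam questions.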
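Historical exam questions from official websites: AIIMS and NEET PG (1991-present). The raw data was collected from open websites and books.
Who are the source language producers? The dataset was created by Ankit Pal, Logesh Kumar Umapathi and Malaikannan Sankarasubbu.
The dataset does not contain any additional annotations.
Who are the annotators? [More Information Needed]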
If you find this dataset useful in your research, please consider citing the dataset paper:
@InProceedings{pmlr-v174-pal22a,
  title = {MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering},
  author = {Pal, Ankit and Umapathi, Logesh Kumar and Sankarasubbu, Malaikannan},
  booktitle = {Proceedings of the Conference on Health, Inference, and Learning},
  pages = {248--260},
  year = {2022},
  editor = {Flores, Gerardo and Chen, George H and Pollard, Tom and Ho, Joyce C and Naumann, Tristan},
  volume = {174},
  series = {Proceedings of Machine Learning Research},
  month = {07--08 Apr},
  publisher = {PMLR},
  pdf = {https://proceedings.mlr.press/v174/pal22a/pal22a.pdf},
  url = {https://proceedings.mlr.press/v174/pal22a.html},
  abstract = {This paper introduces MedMCQA, a new large-scale, Multiple-Choice Question Answering (MCQA) dataset designed to address real-world medical entrance exam questions. More than 194k high-quality AIIMS & NEET PG entrance exam MCQs covering 2.4k healthcare topics and 21 medical subjects are collected with an average token length of 12.77 and high topical diversity. Each sample contains a question, correct answer(s), and other options which requires a deeper language understanding as it tests the 10+ reasoning abilities of a model across a wide range of medical subjects & topics. A detailed explanation of the solution, along with the above information, is provided in this study.}
}
Thanks to @monk1337 for adding this dataset.