Dataset:

NLPCoreTeam/mmlu_ru

Languages: English, Russian

MMLU in Russian (Massive Multitask Language Understanding)

Dataset Summary

The MMLU dataset for EN/RU, without an auxiliary train set. The dataset contains dev/val/test splits for both the English and Russian languages. Note that it does not include the auxiliary train set, which was not translated. In total the dataset has ~16k samples per language: 285 dev, 1531 val, 14042 test.
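
As an illustration of the split layout, one can load a single subject configuration and inspect its splits (the subject name below is only an example; the totals above are summed over all subject configurations):

import datasets

# "anatomy" is just one of the available subject configurations.
ds = datasets.load_dataset("NLPCoreTeam/mmlu_ru", name="anatomy")
for split in ["dev", "val", "test"]:
    print(split, len(ds[split]))  # per-subject sizes; summed over all subjects they give 285/1531/14042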

Original MMLU Description

The MMLU dataset covers 57 different tasks. Each task requires choosing the correct answer out of four options for a given question. Paper: "Measuring Massive Multitask Language Understanding", https://arxiv.org/abs/2009.03300v3 . It is also known as the "hendrycks_test".

Dataset Creation

The translation was done via the Yandex.Translate API. There are some translation mistakes, mostly in terms and formulas; no fixes were applied. The initial dataset was taken from: https://people.eecs.berkeley.edu/~hendrycks/data.tar

Example

{
    "question_en": "Why doesn't Venus have seasons like Mars and Earth do?",
    "choices_en": [
        "Its rotation axis is nearly perpendicular to the plane of the Solar System.",
        "It does not have an ozone layer.",
        "It does not rotate fast enough.",
        "It is too close to the Sun."
    ],
    "answer": 0,
    "question_ru": "Почему на Венере нет времен года, как на Марсе и Земле?",
    "choices_ru": [
        "Ось его вращения почти перпендикулярна плоскости Солнечной системы.",
        "У него нет озонового слоя.",
        "Он вращается недостаточно быстро.",
        "Это слишком близко к Солнцу."
    ]
}
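
For reference, one way to access the fields of such a record (field names exactly as in the JSON above; the subject name here is only an illustrative choice):

import datasets

ds = datasets.load_dataset("NLPCoreTeam/mmlu_ru", name="astronomy")

ex = ds["test"][0]
print(ex["question_en"])
print(ex["question_ru"])
# `answer` is an integer index into both choices lists.
print(ex["choices_en"][ex["answer"]])
print(ex["choices_ru"][ex["answer"]])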

Usage

Merge all subsets into one dataframe per split:

from collections import defaultdict

import datasets
import pandas as pd


subjects = ["abstract_algebra", "anatomy", "astronomy", "business_ethics", "clinical_knowledge", "college_biology", "college_chemistry", "college_computer_science", "college_mathematics", "college_medicine", "college_physics", "computer_security", "conceptual_physics", "econometrics", "electrical_engineering", "elementary_mathematics", "formal_logic", "global_facts", "high_school_biology", "high_school_chemistry", "high_school_computer_science", "high_school_european_history", "high_school_geography", "high_school_government_and_politics", "high_school_macroeconomics", "high_school_mathematics", "high_school_microeconomics", "high_school_physics", "high_school_psychology", "high_school_statistics", "high_school_us_history", "high_school_world_history", "human_aging", "human_sexuality", "international_law", "jurisprudence", "logical_fallacies", "machine_learning", "management", "marketing", "medical_genetics", "miscellaneous", "moral_disputes", "moral_scenarios", "nutrition", "philosophy", "prehistory", "professional_accounting", "professional_law", "professional_medicine", "professional_psychology", "public_relations", "security_studies", "sociology", "us_foreign_policy", "virology", "world_religions"]

splits = ["dev", "val", "test"]

all_datasets = {x: datasets.load_dataset("NLPCoreTeam/mmlu_ru", name=x) for x in subjects}

res = defaultdict(list)
for subject in subjects:
    for split in splits:
        dataset = all_datasets[subject][split]
        df = dataset.to_pandas()
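        # Map the integer `answer` label to its string name via the ClassLabel's int2str.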
        int2str = dataset.features['answer'].int2str
        df['answer'] = df['answer'].map(int2str)
        df.insert(loc=0, column='subject_en', value=subject)
        res[split].append(df)

res = {k: pd.concat(v) for k, v in res.items()}

df_dev = res['dev']
df_val = res['val']
df_test = res['test']
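
As a quick sanity check, the merged dataframes should match the split sizes from the overview:

print(len(df_dev), len(df_val), len(df_test))   # expected: 285, 1531, 14042 per the overview above
print(df_test["subject_en"].nunique())          # expected: 57 subjects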

Evaluation

This dataset is intended for evaluating LLMs in few-shot / zero-shot settings.

Evaluation code: https://github.com/NLP-Core-Team/mmlu_ru
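
The repository above implements the full evaluation; the sketch below only illustrates one possible k-shot loop over the dataframes built in the Usage section. The function score_choice is a hypothetical placeholder for whichever model scoring call you use (e.g. the log-likelihood of an answer letter as a continuation of the prompt), and the prompt format is an assumption, not the repository's exact template.

# Rough k-shot evaluation sketch; assumes the `answer` column holds the letters "A"-"D"
# (as produced by the int2str mapping in the Usage code above).
LETTERS = ["A", "B", "C", "D"]

def make_prompt(row, lang="en"):
    lines = [row[f"question_{lang}"]]
    lines += [f"{letter}. {choice}" for letter, choice in zip(LETTERS, row[f"choices_{lang}"])]
    lines.append("Answer:")
    return "\n".join(lines)

def evaluate(df_test, df_dev, score_choice, k=5, lang="en"):
    correct = 0
    for _, row in df_test.iterrows():
        # k-shot context: dev examples from the same subject, each followed by its gold answer.
        shots = df_dev[df_dev["subject_en"] == row["subject_en"]].head(k)
        context = "\n\n".join(
            make_prompt(shot, lang) + " " + shot["answer"] for _, shot in shots.iterrows()
        )
        prompt = (context + "\n\n" if k > 0 else "") + make_prompt(row, lang)
        # Score each candidate letter and pick the highest-scoring one.
        pred = max(LETTERS, key=lambda letter: score_choice(prompt, letter))
        correct += pred == row["answer"]
    return correct / len(df_test)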

Resources that might also be useful:

  • https://github.com/hendrycks/test
  • https://github.com/openai/evals/blob/main/examples/mmlu.ipynb
  • https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/tasks/hendrycks_test.py

Contributions

Dataset added by the NLP Core Team RnD Telegram channel.